A Study on Modeling Affective Content of Music and Speech

Bibliographic Details
Main Authors: YU-HAO CHIN, 秦餘皞
Other Authors: JIA-CHING WANG
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/uw3mnm
Description
Summary: Doctoral dissertation === National Central University === Department of Computer Science and Information Engineering === Academic year 105 === Affective computing (AC) is an active topic in machine learning. One of its goals is to model the affective content of sources such as music, facial expressions, body language, and speech with mathematical methods; a recognition approach is then applied to recognize the modeled affect. Among these emotional cues, this study focuses on emotion recognition from audio. Music and speech are two important audio sources that convey affect, and both are investigated in this study.

Computationally modeling the affective content of music has been studied intensively in recent years because of its wide applications in music retrieval and recommendation. Although significant progress has been made, the task remains challenging because the emotion of a music piece is difficult to characterize properly. Music emotion as perceived by listeners is subjective by nature, which complicates both the collection of emotion annotations and the development of predictive models. Instead of assuming that people can reach a consensus on the emotion of music, in this work we propose a novel machine learning approach that characterizes music emotion as a probability distribution in the valence-arousal (VA) emotion space, which not only accommodates this subjectivity but also describes the emotion of a music piece more precisely. Specifically, we represent the emotion of a music piece as a probability density function (PDF) in the VA space, estimated from the human annotations by kernel density estimation. To associate emotion with the audio features extracted from music pieces, we learn combination coefficients by optimizing objective functions of the audio features, and then predict the emotion of an unseen piece by linearly combining the PDFs of the training pieces with these coefficients. Several algorithms for learning the coefficients are studied. Evaluations on the NTUMIR and MediaEval2013 datasets validate the effectiveness of the proposed methods in predicting the probability distribution of emotion from audio features. We also demonstrate how to use the proposed approach in emotion-based music retrieval.

It has been recognized that music emotion is influenced by multiple factors, and the singing voice and the accompaniment may sometimes express different emotions. However, most existing work on music emotion recognition (MER) treats the music audio as a single source for feature extraction, even though the audio of most songs can be separated into a singing voice and an accompaniment played by various instruments. The separated sources may provide additional information that helps improve MER performance, but they have seldom been explored. This study aims to fill this gap by investigating whether considering the singing voice and the accompaniment separately helps predict the dynamic VA values of music. Specifically, a deep recurrent neural network (DRNN)-based singing-voice separation algorithm is applied to separate the two sources. Rhythm, timbre, tonality, energy, and pitch-related features are then extracted from both sources and combined to predict the VA values of the original, unseparated music. For combining the sources, four variants of DRNN-based approaches are proposed and evaluated, and different combinations of the features extracted from the different sources are compared. Experiments on the MediaEval2013 dataset indicate that this method improves prediction performance.
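The distribution-based modeling described above can be illustrated with a short sketch. The following Python snippet is a minimal illustration rather than the dissertation's implementation: it builds a VA-space PDF per training piece with kernel density estimation and predicts an unseen piece's PDF as a nonnegative linear combination of the training PDFs. The annotation data, the evaluation grid, and the weights (which the thesis learns from audio features) are hypothetical placeholders.

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical annotations: each training piece has (valence, arousal) ratings
    # from 20 listeners, with both dimensions scaled to [-1, 1].
    rng = np.random.default_rng(0)
    train_annotations = [rng.uniform(-1, 1, size=(20, 2)) for _ in range(5)]

    # One kernel density estimate per training piece (gaussian_kde expects (dims, samples)).
    train_kdes = [gaussian_kde(a.T) for a in train_annotations]

    # Evaluate every training PDF on a shared grid over the VA plane.
    grid = np.mgrid[-1:1:50j, -1:1:50j].reshape(2, -1)        # (2, 2500) grid points
    train_pdfs = np.stack([kde(grid) for kde in train_kdes])  # (5, 2500)

    def predict_pdf(weights, pdfs):
        """Predicted PDF of an unseen piece: convex combination of training PDFs."""
        w = np.clip(weights, 0.0, None)
        w = w / w.sum()
        return w @ pdfs

    # In the dissertation the coefficients are learned from audio features by
    # optimizing an objective function; here they are purely illustrative.
    example_weights = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
    unseen_pdf = predict_pdf(example_weights, train_pdfs)

Because each training PDF integrates to one, any convex combination of them is again a valid density over the VA plane, which is what makes the linear-combination prediction well defined.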
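For the source-separation pipeline described above, the sketch below is again only an assumption-laden illustration: it extracts frame-level features from a separated singing-voice stem and an accompaniment stem with librosa and concatenates them. The stem file names are hypothetical, the separation step itself (a DRNN in the thesis) is assumed to have already been run, and the feature set only loosely stands in for the rhythm, timbre, tonality, energy, and pitch categories named in the abstract.

    import numpy as np
    import librosa

    def frame_features(y, sr):
        """Frame-level timbre, tonality, energy, and noisiness descriptors."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # tonality
        rms = librosa.feature.rms(y=y)                       # energy
        zcr = librosa.feature.zero_crossing_rate(y)          # rough voicing/noisiness cue
        return np.vstack([mfcc, chroma, rms, zcr])

    # Hypothetical file names for the two stems produced by a separation front end.
    voice, sr = librosa.load("voice_stem.wav", sr=22050)
    accomp, _ = librosa.load("accompaniment_stem.wav", sr=22050)

    feats_voice = frame_features(voice, sr)
    feats_accomp = frame_features(accomp, sr)

    # Concatenate the per-source features frame by frame; a sequence regressor
    # (such as the DRNN variants studied in the thesis) would map these to dynamic VA values.
    n_frames = min(feats_voice.shape[1], feats_accomp.shape[1])
    combined = np.vstack([feats_voice[:, :n_frames], feats_accomp[:, :n_frames]])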
For the speech emotion part, this dissertation develops a system for speech-based emotion verification built on emotion variance modeling and discriminant scale-frequency maps. The proposed system consists of two parts: feature extraction and emotion verification. In the first part, for each sound frame, important atoms are selected from a Gabor dictionary with the matching pursuit algorithm. The scale, frequency, and magnitude of the atoms are used to construct a scale-frequency map, which supports auditory discriminability through the analysis of critical bands. Sparse representation is then used to transform the scale-frequency maps into sparse coefficients, enhancing robustness against emotion variance. In the second part, emotion verification, two scores are calculated. A novel sparse representation verification approach based on Gaussian-modeled residual errors generates the first score from the sparse coefficients, and the second score is the Emotional Agreement Index (EAI) computed from the same coefficients. The two scores are combined to obtain the final detection result. Experiments were conducted on an emotional speech database; the proposed approach achieves an average equal error rate (EER) as low as 6.61%, and a comparison with other approaches shows that the proposed method is effective.
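As a rough illustration of the feature-extraction stage described above, the following sketch runs a plain greedy matching pursuit over a toy Gabor dictionary and accumulates the magnitudes of the selected atoms into a scale-frequency map. The dictionary parameters, frame length, and atom count are made up for the example, and the critical-band analysis and the subsequent sparse-coding step of the thesis are not reproduced.

    import numpy as np

    def gabor_atom(length, scale, freq, shift):
        """Unit-norm Gabor atom: Gaussian envelope times a cosine carrier."""
        t = np.arange(length)
        g = np.exp(-np.pi * ((t - shift) / scale) ** 2) * np.cos(2 * np.pi * freq * (t - shift))
        return g / (np.linalg.norm(g) + 1e-12)

    frame_len = 256
    scales = [8, 16, 32, 64]
    freqs = np.linspace(0.01, 0.45, 12)        # normalized frequencies
    shifts = range(0, frame_len, 32)

    # Dictionary: one column per (scale, frequency, shift) triple.
    params = [(s, f, u) for s in scales for f in freqs for u in shifts]
    D = np.stack([gabor_atom(frame_len, s, f, u) for s, f, u in params], axis=1)

    def matching_pursuit_map(x, D, params, n_atoms=30):
        """Greedy MP: pick the best-correlated atom, subtract it, repeat; bin by (scale, freq)."""
        residual = x.astype(float).copy()
        sf_map = np.zeros((len(scales), len(freqs)))
        for _ in range(n_atoms):
            corr = D.T @ residual
            k = int(np.argmax(np.abs(corr)))
            s, f, _ = params[k]
            sf_map[scales.index(s), int(np.where(freqs == f)[0][0])] += abs(corr[k])
            residual -= corr[k] * D[:, k]
        return sf_map

    frame = np.random.default_rng(1).standard_normal(frame_len)   # stand-in for one speech frame
    scale_frequency_map = matching_pursuit_map(frame, D, params)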
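The first verification score could be sketched along the lines below: the flattened scale-frequency map of a test utterance is sparsely coded against a class dictionary with orthogonal matching pursuit, and its reconstruction residual is scored under a Gaussian fitted to the residuals of genuine training samples. The dictionary, residual model, and data are synthetic placeholders, and the thesis' exact formulation, the EAI score, and the score fusion are not reproduced here.

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)

    # Hypothetical class dictionary: 30 flattened training scale-frequency maps (48-dim each).
    D_class = rng.random((48, 30))
    D_class /= np.linalg.norm(D_class, axis=0)

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5, fit_intercept=False)

    def residual_error(x, D):
        """Sparse-code x against D and return the reconstruction residual norm."""
        omp.fit(D, x)
        return np.linalg.norm(x - D @ omp.coef_)

    def synthetic_genuine_sample():
        """A noisy sparse combination of dictionary atoms, standing in for a genuine map."""
        c = np.zeros(30)
        c[rng.choice(30, size=5, replace=False)] = rng.random(5)
        return D_class @ c + 0.05 * rng.standard_normal(48)

    # Gaussian model of the residuals produced by genuine (target-emotion) samples.
    genuine_residuals = np.array([residual_error(synthetic_genuine_sample(), D_class)
                                  for _ in range(20)])
    mu, sigma = genuine_residuals.mean(), genuine_residuals.std() + 1e-9

    def verification_score(x, D):
        """Higher when the residual looks like those of genuine samples."""
        r = residual_error(x, D)
        return float(np.exp(-0.5 * ((r - mu) / sigma) ** 2))

    print(verification_score(rng.random(48), D_class))   # an impostor-like test map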