Summary: | Ph.D. === National Central University === Department of Computer Science and Information Engineering === 105 === Robustness against noise is a critical characteristic of an audio recognition (AR) system. To develop a robust AR system, this dissertation proposes two front-end processing methods. To suppress the effects of background noise on the target sound, a speech enhancement method based on compressive sensing (CS) is proposed. A quasi-SNR criterion is first utilized to determine whether each frequency bin in the spectrogram is reliable, and a corresponding mask is designed. The spectral components extracted by the mask are regarded as a partial observation, and CS theory is used to reconstruct the components that are missing from it. The noise component is further removed by multiplying the imputed spectrum by an optimized gain. To separate the target sound from interference, a source separation method based on a complex-valued deep recurrent neural network (C-DRNN) is developed. A key aspect of the C-DRNN is that both its activations and weights are complex-valued; phase estimation is thereby integrated into the C-DRNN through the construction of a deep complex-valued regression model in the time-frequency domain. This dissertation also develops two novel back-end recognition methods. The first is a joint kernel dictionary learning (JKDL) method for sound event classification. JKDL learns a collaborative representation instead of a sparse representation; the learned representation is thus "denser" than the sparse representation learned by K-SVD. Moreover, discriminative ability is improved by adding a classification error term to the objective function. The second is a hierarchical Dirichlet process mixture model (HDPMM), whose components can be shared among the models of the audio categories. The proposed emotion models therefore better capture the relationships among real-world emotional states.
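The mask-then-impute idea in the front end can be sketched in a few lines. The snippet below is a minimal illustration, not the dissertation's implementation: the quasi-SNR threshold, the dictionary, and the use of orthogonal matching pursuit (OMP) as the CS recovery algorithm are all assumptions made here for concreteness; the demo uses a random orthonormal dictionary and a random reliability mask in place of real spectra.

```python
import numpy as np

def quasi_snr_mask(noisy_mag, noise_est, threshold_db=0.0):
    # Hypothetical quasi-SNR criterion: a frequency bin is "reliable"
    # when its estimated local SNR exceeds a threshold (the exact
    # criterion in the dissertation may differ).
    snr_db = 20 * np.log10(np.maximum(noisy_mag, 1e-12) /
                           np.maximum(noise_est, 1e-12))
    return snr_db > threshold_db

def omp_impute(partial, mask, dictionary, n_atoms=8):
    # Treat the reliable bins as a partial observation and recover the
    # missing bins via OMP, assuming the frame is sparse in `dictionary`.
    A = dictionary[mask]                      # observed rows of the dictionary
    y = partial[mask]                         # reliable spectral components
    residual, support = y.copy(), []
    norms = np.linalg.norm(A, axis=0) + 1e-12
    coef = np.zeros(0)
    for _ in range(n_atoms):
        idx = int(np.argmax(np.abs(A.T @ residual) / norms))
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    full = dictionary[:, support] @ coef      # re-synthesize the full frame
    return np.where(mask, partial, full)      # keep reliable bins as-is

# Synthetic demo: a 3-sparse frame, 48 of 64 bins marked reliable.
rng = np.random.default_rng(0)
F = 64
basis, _ = np.linalg.qr(rng.standard_normal((F, F)))  # orthonormal dictionary
s = np.zeros(F)
s[[5, 20, 41]] = [1.0, -0.7, 0.5]                      # sparse coefficients
x = basis @ s                                          # "clean" spectrum frame
mask = np.zeros(F, dtype=bool)
mask[rng.choice(F, 48, replace=False)] = True          # stand-in for quasi_snr_mask
recon = omp_impute(x, mask, basis, n_atoms=5)
```

In the full pipeline described in the abstract, `recon` would then be scaled by the optimized gain to attenuate the residual noise component.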
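The distinguishing feature of the C-DRNN, complex-valued weights and activations operating directly on time-frequency data, can be illustrated with a single recurrent step. This is a sketch under stated assumptions: the magnitude-tanh activation (squash the modulus, keep the phase), the layer sizes, and the random weights are choices made here for illustration, not the dissertation's actual architecture.

```python
import numpy as np

def mod_tanh(z):
    # Assumed complex activation: apply tanh to the magnitude while
    # preserving the phase, so phase information passes through the net.
    return np.tanh(np.abs(z)) * np.exp(1j * np.angle(z))

def c_drnn_step(x_t, h_prev, W, U, b):
    # One recurrent step with complex-valued weights and state; the
    # network regresses complex T-F values, so phase is estimated jointly.
    return mod_tanh(W @ x_t + U @ h_prev + b)

# Run the step over a short sequence of complex "STFT frames".
rng = np.random.default_rng(1)
n_in, n_hid, T = 8, 16, 5
cplx = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
W = 0.1 * cplx(n_hid, n_in)     # illustrative random complex weights
U = 0.1 * cplx(n_hid, n_hid)
b = 0.1 * cplx(n_hid)
X = cplx(T, n_in)               # e.g. STFT frames of the noisy mixture
h = np.zeros(n_hid, dtype=complex)
for x_t in X:
    h = c_drnn_step(x_t, h, W, U, b)
```

Because the state is complex, a separation mask or regression target built from `h` can carry phase estimates, rather than reusing the noisy phase as magnitude-only methods do.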
|