Summary: | PhD === National Taiwan University === Graduate Institute of Communication Engineering === 106 === Speech is the most essential communication interface for human-human and human-computer interaction. In real-world scenarios, communication effectiveness can be seriously degraded by environmental noise. To address this issue, this thesis investigates the use of the discrete wavelet packet transform (DWPT) for speech enhancement (SE) and feature compression (FC) to attain better human-human and human-computer interaction.
For the first part of this thesis, we apply the DWPT to design an advanced SE approach. Most conventional SE methods use a sequence of spectral features as a compact representation of the raw waveform. However, a major problem with conventional SE is that the phase of the noisy speech is used directly as the phase of the enhanced speech when the enhanced waveform is reconstructed. Because the phases of noisy and clean speech can differ, this process can distort the reconstructed waveform. To address this issue, we propose applying the DWPT to form a different type of feature representation for SE, and we combine it with two SE approaches: nonnegative matrix factorization (NMF) and robust principal component analysis (RPCA). In brief, the DWPT first splits a time-domain speech signal into a series of subband signals without introducing any distortion. We then use either NMF or RPCA to highlight the speech component in each subband. Finally, the enhanced subband signals are recombined via the inverse DWPT to reconstruct a noise-reduced time-domain signal. We evaluate the proposed method on the Mandarin hearing in noise test (MHINT) task. Experimental results show that the new method performs well in improving speech quality and intelligibility and outperforms conventional methods based on the short-time Fourier transform (STFT).
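The split, enhance, and rejoin pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes a Haar wavelet (the abstract does not name the wavelet), a signal length divisible by 2**levels, and a hypothetical placeholder `enhance_subband` standing in for the NMF or RPCA step. All function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_split(x):
    """One analysis step: split an even-length signal into low and high halves."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high

def haar_merge(low, high):
    """One synthesis step: exact inverse of haar_split."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def dwpt(x, levels):
    """Full wavelet packet tree: returns 2**levels subbands in natural tree order."""
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            low, high = haar_split(band)
            next_bands.extend([low, high])
        bands = next_bands
    return bands

def idwpt(bands):
    """Inverse DWPT: merge sibling subbands pairwise until one signal remains."""
    bands = [list(b) for b in bands]
    while len(bands) > 1:
        bands = [haar_merge(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]

def enhance_subband(band):
    """Hypothetical placeholder for the per-subband NMF or RPCA enhancement."""
    return band  # identity: no enhancement is performed in this sketch

def enhance(signal, levels=2):
    """Split into subbands, process each one, and rejoin in the time domain."""
    return idwpt([enhance_subband(b) for b in dwpt(signal, levels)])
```

With the identity placeholder, `enhance` returns the input up to floating-point rounding, which illustrates the claim that the decomposition itself introduces no distortion; any change in the output would come only from the per-subband processing.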
For the second part of this thesis, we apply the DWPT to derive an advanced FC approach for robust distributed speech recognition (DSR). DSR splits the processing of data between a mobile device and a network server. In the front-end, features are extracted, compressed, and transmitted over a wireless channel to a back-end server, where the incoming stream is received and reconstructed for recognition. We propose an FC algorithm for DSR, termed suppression by selecting wavelets (SSW), that minimizes memory and device requirements while maintaining or even improving recognition performance. At the client terminal, SSW first applies the DWPT to filter the incoming speech feature sequence into two temporal sub-sequences. FC is achieved by keeping the low (modulation) frequency sub-sequence and discarding its high-frequency counterpart. The low-frequency sub-sequence is then transmitted across the network to the server, where feature statistics normalization is applied. Wavelets are well suited to resolving the temporal properties of the feature sequence, and the down-sampling step in the DWPT reduces the amount of data at the terminal before transmission, which can be interpreted as data compression. Once the compressed features arrive at the server, the feature sequence is enhanced by statistics normalization, reconstructed with the inverse DWPT, and compensated with a simple post filter to alleviate any over-smoothing introduced by the compression stage. Results on a standard robustness task (Aurora-4) and on a Mandarin Chinese news corpus (MATBN) show that SSW outperforms conventional noise-robustness techniques while also providing a compression rate of nearly 50% during the transmission stage of DSR systems.
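The client and server steps of SSW described above can be illustrated with a rough sketch for a single feature trajectory (one feature dimension over time). Several details here are assumptions rather than the thesis method: a one-level Haar filter is used, the statistics normalization is taken to be mean and variance normalization, the discarded high-frequency band is replaced by zeros at reconstruction, and the post filter is omitted because the abstract does not specify it. All function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def compress(trajectory):
    """Client side: keep only the low (modulation) frequency half.

    One-level Haar low-pass filtering with down-sampling by 2; the
    high-frequency half is never computed or transmitted.
    """
    return [(trajectory[i] + trajectory[i + 1]) / SQRT2
            for i in range(0, len(trajectory) - 1, 2)]

def normalize(low):
    """Server side: mean and variance normalization (an assumed choice)."""
    n = len(low)
    mean = sum(low) / n
    var = sum((v - mean) ** 2 for v in low) / n
    std = math.sqrt(var) or 1.0  # guard against a constant sequence
    return [(v - mean) / std for v in low]

def reconstruct(low):
    """Server side: inverse one-level Haar with the high band set to zero."""
    out = []
    for a in low:
        out.extend([a / SQRT2, a / SQRT2])
    return out

def ssw_sketch(trajectory):
    """Compress at the client, then normalize and reconstruct at the server."""
    transmitted = compress(trajectory)
    return reconstruct(normalize(transmitted))
```

For an even-length trajectory, `compress` sends half as many values as it receives, which corresponds to the roughly 50% compression mentioned above. Reconstructing with a zero high band yields a smoothed sequence in which each pair of frames shares one value, which is the kind of over-smoothing the post filter in the thesis is meant to compensate.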
|