Summary: | Master's === National Tsing Hua University === Department of Electrical Engineering === 97 === Harmonic-temporal structured clustering (HTC) is a method for sound separation in a single-channel recording. In this work, HTC is applied to speech enhancement by separating single-speaker speech from additive background music played by one or two musical instruments. Separation is performed in the short-time Fourier transform (STFT) time-frequency domain, and the separated sounds are reconstructed by overlap-and-add (OLA). HTC essentially fits a Gaussian mixture model (GMM) to the mixture's power spectrum. The GMM is composed of a speech sound model and a music sound model that respectively approximate the speech and music contributions to the power spectrum. A masking function is derived from each sound model and applied to the observed power spectrum to extract an estimate of that source's clean power spectrum. The speech and music power spectra are assumed additive, and HTC estimates only the STFT magnitude; the mixture's phase is reused in OLA to reconstruct an estimate of each clean waveform.
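As a concrete illustration of the masking and OLA reconstruction step, here is a minimal sketch in Python, assuming the fitted speech and music model power spectrograms are already available and share the shape of the mixture STFT (the names `S_speech` and `S_music` and all settings are illustrative, not the thesis's):

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, fs, S_speech, S_music, nperseg=1024):
    """Soft masking in the STFT domain, reusing the mixture phase.

    S_speech, S_music: nonnegative model power spectrograms with the same
    shape as the mixture's STFT (in HTC these would come from the fitted
    speech and music sound models).
    """
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)   # complex mixture STFT
    power = np.abs(X) ** 2
    eps = 1e-12
    total = S_speech + S_music + eps

    # Soft masks: each model's share of the total modeled power.
    mask_speech = S_speech / total
    mask_music = S_music / total

    # Apply the masks to the observed power spectrum; keep the mixture phase.
    phase = np.exp(1j * np.angle(X))
    X_speech = np.sqrt(mask_speech * power) * phase
    X_music = np.sqrt(mask_music * power) * phase

    # istft performs the overlap-and-add (OLA) reconstruction.
    _, speech = istft(X_speech, fs=fs, nperseg=nperseg)
    _, music = istft(X_music, fs=fs, nperseg=nperseg)
    return speech, music
```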
The speech and music sound models are sums of weighted Gaussians placed at time-frequency locations that follow a harmonic and temporal structure. This structure models the pitch (F0), onset time, timbre, and duration of musical notes or of voicing in phones. The sound models approximate the notes and phones in the observation by finding the components of the power spectrum that best fit this structure.
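A sketch of one such harmonic-temporal kernel for a single note or phone, under simplified assumptions (the Gaussian widths, grid layout, and all parameter names here are illustrative choices, not the thesis's definitions):

```python
import numpy as np

def htc_kernel(freqs, times, f0, onset, duration, amps, sigma_f=20.0):
    """One note/phone model: weighted Gaussians on a harmonic-temporal grid.

    freqs, times : 1-D grids of frequency bins (Hz) and frame times (s).
    f0           : fundamental frequency of the note (Hz).
    amps         : amps[n, k] weights harmonic n at temporal position k;
                   these weights encode the timbre and power envelope.
    """
    n_harm, n_env = amps.shape
    sigma_t = duration / n_env                      # temporal Gaussian width
    model = np.zeros((len(freqs), len(times)))
    for n in range(n_harm):                         # harmonics at (n+1)*f0
        g_f = np.exp(-((freqs - (n + 1) * f0) ** 2) / (2 * sigma_f ** 2))
        for k in range(n_env):                      # Gaussians along time
            center = onset + (k + 0.5) * sigma_t
            g_t = np.exp(-((times - center) ** 2) / (2 * sigma_t ** 2))
            model += amps[n, k] * np.outer(g_f, g_t)
    return model
```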
The speech sound model is distinguished from the music sound model by the type and initialization of its F0 contours. The music model estimates the F0 of each note as a line of constant frequency. The speech model estimates the F0 of voiced speech as segments along a continuous, slowly varying contour that stays near the speaker's average voicing pitch. The speaker's average F0 is assumed to be given, so it can be used to initialize the speech model's parameters and help the model fit voiced speech whose pitch lies near that average. The speech model relies on voicing to find phones, but it can also roughly estimate unvoiced speech. These differing types and initializations of the F0 estimates are the foundation of the two sound models' ability to distinguish speech from music and thus separate the two sounds.
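An initialization along these lines might look as follows; the semitone note grid and the flat speech contour are assumptions made for the sketch, not the thesis's actual settings:

```python
import numpy as np

def init_f0_tracks(n_frames, avg_speech_f0, n_notes=12, fmin=110.0):
    """Illustrative initialization of the two models' F0 contours.

    Music notes start as constant-frequency lines on a semitone grid;
    the speech contour starts flat at the speaker's given average F0
    and is refined frame by frame during fitting.
    """
    # One constant F0 line per candidate note, spaced by semitones.
    note_f0s = fmin * 2.0 ** (np.arange(n_notes) / 12.0)
    music_f0 = np.tile(note_f0s[:, None], (1, n_frames))

    # Speech F0 contour: one smooth track, initialized at the average.
    speech_f0 = np.full(n_frames, avg_speech_f0)
    return music_f0, speech_f0
```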
The sound models are fitted to the observed spectrogram by iteratively updating their parameters with the expectation-maximization (EM) algorithm. The algorithm seeks to maximize an objective function that measures the similarity between the total GMM and the observed power spectrum. The objective function is biased to favor models with certain preferred characteristics.
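The alternation between assigning power to models and refitting them could be sketched as follows; the `evaluate`/`update` interface is an assumed abstraction for illustration, not the thesis's implementation:

```python
def em_fit(observed_power, models, n_iter=30):
    """Generic EM-style fitting loop for HTC-like models (a sketch).

    observed_power : the mixture power spectrogram.
    models         : objects with .evaluate() -> model spectrogram and
                     .update(masked_power) -> refit parameters to the
                     portion of the observation assigned to them.
    """
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: each model's share of the total modeled power gives a
        # membership (masking) function over time-frequency bins.
        spectra = [m.evaluate() for m in models]
        total = sum(spectra) + eps
        # M-step: refit each model to its masked share of the observation.
        for m, s in zip(models, spectra):
            m.update((s / total) * observed_power)
    return models
```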
An experiment demonstrates that HTC is an effective method for speech enhancement. When HTC is applied to mixtures at low signal-to-noise ratio (SNR), the speech improves significantly; at high SNR there is little improvement. HTC also provides an estimate of the music signal, but the results show that the estimated music has good quality only when the mixture is dominated by music, and poor quality otherwise. HTC speech-music separation is therefore more applicable to speech enhancement than to music enhancement.
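One common way to quantify such improvement is the SNR gain of the separated speech over the raw mixture, both measured against the clean speech; this is a generic metric given for illustration, not necessarily the thesis's exact evaluation:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimate against a clean reference, in dB."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# SNR improvement = output SNR of the separated speech minus input SNR
# of the raw mixture (both against the clean speech):
# gain = snr_db(clean_speech, separated_speech) - snr_db(clean_speech, mixture)
```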