Summary: | Master's === National Taiwan University === Graduate Institute of Communication Engineering === 99 === Query by humming (QBH) systems have many applications. A QBH system combines feature selection, MIDI number analysis, and melody matching for the 1-D voice signal. Its core techniques include signal transform theory, feature analysis, and voice signal segmentation, which help us understand and classify the voice signal for further applications. Applying these analysis techniques in a QBH system allows similar songs to be retrieved from the database. Moreover, related applications such as speech-to-text, speech translation, and multilingual transcription can be built on top of speech recognition.
The QBH system can be divided into several stages. First, it emphasizes the features in the spectrum and removes irrelevant noise. Onsets are then obtained by classifying the segments according to their pitch features. Next, the pitches are converted into MIDI numbers, which form a code sequence. The output of the QBH system is obtained by comparing the pitches of the humming signal with those of the songs in our database. This step, called melody matching, uses dynamic programming, hidden Markov models, and other techniques for alignment and similarity measurement. The other improvements we propose are described below.
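As an illustration of melody matching by dynamic programming, the following minimal Python sketch ranks database songs by their dynamic time warping (DTW) cost against a hummed MIDI-number sequence. The function names, the absolute-difference cost, and the toy database are assumptions made for illustration, not the thesis implementation, and the sketch does not handle key transposition.

```python
def dtw_distance(query, reference):
    """Dynamic time warping cost between two MIDI-number sequences."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j] = best alignment cost of query[:i] against reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])  # semitone difference
            cost[i][j] = d + min(cost[i - 1][j],      # query note repeated
                                 cost[i][j - 1],      # reference note repeated
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]


def rank_songs(query, database):
    """Return song names sorted from most to least similar to the query."""
    return sorted(database, key=lambda name: dtw_distance(query, database[name]))


# Hypothetical toy database: song name -> MIDI-number sequence.
database = {"song_a": [60, 62, 64], "song_b": [70, 72, 74]}
print(rank_songs([60, 62, 62, 64], database))
```

DTW tolerates notes that are held longer or shorter than in the reference, which is why the repeated 62 in the query adds no cost against `song_a`.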
(Filter design)
Focusing on humming signal restoration, we propose a new adaptive algorithm for filter design. It offers high analysis efficiency, high SNR, small MSE, and reliable stability. Compared with conventional signal restoration approaches, such as the Wiener filter and the Butterworth filter, it improves the SNR and reduces the reconstruction error.
Much research in telecommunication engineering focuses on signal and noise analysis, transformation, and feature extraction for voice signals. The Fourier transform (FT) maps a signal into the frequency domain, where the noise can be removed. Drawing instead on psychoacoustics, we propose a new mathematical model for representing the humming voice and use it for signal restoration. After implementing our algorithms, we present a variety of simulation results and compare the performance with that of existing filters in Chapter 4.
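The two figures of merit named above, SNR and MSE, can be computed as in the following minimal Python sketch. It assumes that a clean reference signal is available for comparison; the function names are hypothetical and the thesis's own evaluation code is not shown in the abstract.

```python
import math


def mse(reference, restored):
    """Mean squared error between a clean reference and a restored signal."""
    return sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)


def snr_db(reference, restored):
    """Signal-to-noise ratio in dB, treating (reference - restored) as noise."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - s) ** 2 for r, s in zip(reference, restored))
    if noise_power == 0:
        return float("inf")  # perfect reconstruction
    return 10.0 * math.log10(signal_power / noise_power)


clean = [1.0, 1.0, 1.0, 1.0]
restored = [0.9, 0.9, 0.9, 0.9]
print(mse(clean, restored), snr_db(clean, restored))
```

A better restoration filter yields a lower `mse` and a higher `snr_db` on the same test signal.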
(Onset detection)
The second improvement concerns the onset detection architecture. Segmentation methods based on “amplitude”, “frequency”, and “phase” have been proposed in many papers. According to our implementation and the many comparisons reported in those papers, these methods tend to over-detect because of specific noise characteristics. Moreover, amplitude fluctuation may cause under-detection when background noise interferes or when successive sounds run together. To overcome these problems, we propose a new onset architecture that uses features in both the spectral and time domains. It improves accuracy so that the results better match human perception, and it is less complex to implement. The complete test results and comparisons are shown in Chapter 4.
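For context, a conventional amplitude-based onset detector of the kind discussed above can be sketched in Python as follows. This is only the baseline idea (short-time energy crossing a threshold), not the proposed architecture; the frame length, the threshold, and the function names are assumed values chosen for illustration.

```python
import math


def frame_energies(signal, frame_len):
    """Mean energy of consecutive, non-overlapping frames."""
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def detect_onsets(signal, fs, frame_len=256, threshold=0.01):
    """Return onset times (seconds) where frame energy rises above threshold."""
    onsets = []
    prev_active = False
    for idx, energy in enumerate(frame_energies(signal, frame_len)):
        active = energy > threshold
        if active and not prev_active:  # silence -> sound transition
            onsets.append(idx * frame_len / fs)
        prev_active = active
    return onsets


fs = 8000
tone = [math.sin(2 * math.pi * 220 * k / fs) for k in range(2048)]
silence = [0.0] * 1024
print(detect_onsets(silence + tone + silence + tone, fs))
```

A fixed threshold like this is sensitive to background noise and to notes that follow each other without a pause, which is the weakness described in the text.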
(Pitch estimation)
The third improvement concerns instantaneous frequency detection. The pitch extraction method is crucial to the entire system, because the pitch feature can be used for speaker identification, classification, onset detection, and voice tracking. Therefore, in Chapter 6, we propose a new improvement based on sub-harmonic summation that achieves high accuracy under noise interference.
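The basic idea behind sub-harmonic summation can be sketched in Python as follows: each candidate pitch is scored by a weighted sum of the spectral magnitudes at its harmonics, and the best-scoring candidate is taken as the pitch. This is a simplified harmonic-sum formulation rather than the improved method of Chapter 6; the decay factor, search range, step size, and function names are all assumptions made for illustration.

```python
import cmath
import math


def magnitude_at(signal, freq_hz, fs):
    """Normalized DFT magnitude of the signal at an arbitrary frequency."""
    w = -2j * math.pi * freq_hz / fs
    return abs(sum(x * cmath.exp(w * k) for k, x in enumerate(signal))) / len(signal)


def shs_pitch(signal, fs, fmin=80.0, fmax=400.0, step=2.0, n_harm=5, decay=0.84):
    """Pick the candidate pitch with the largest decayed sum over its harmonics."""
    best_f, best_score = fmin, -1.0
    f = fmin
    while f <= fmax:
        score = sum(decay ** (n - 1) * magnitude_at(signal, n * f, fs)
                    for n in range(1, n_harm + 1) if n * f < fs / 2)
        if score > best_score:
            best_f, best_score = f, score
        f += step
    return best_f


fs = 8000
# Synthetic "hummed" note: 220 Hz fundamental plus two weaker harmonics.
note = [math.sin(2 * math.pi * 220 * k / fs)
        + 0.5 * math.sin(2 * math.pi * 440 * k / fs)
        + 0.3 * math.sin(2 * math.pi * 660 * k / fs) for k in range(800)]
print(shs_pitch(note, fs))
```

Because several harmonics contribute to each score, the estimate is less affected by noise that corrupts any single spectral peak.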
(Adaptive MIDI)
The fourth improvement concerns adaptive pitch representation. The most common way to represent pitch is the MIDI number. It divides each octave into 12 notes, so the instantaneous frequency can easily be mapped to its corresponding number. However, standard MIDI numbers were designed for communication among different musical instruments, and there is a difference between standard MIDI numbers and auditory perception. The new pitch estimation method achieves an accuracy of 95%, and the adaptive MIDI numbers adjust the measurement to each individual, constructing an adaptive MIDI mapping that prevents off-key cases.
We also focused on improving the entire onset detection system so that the correct pitches are retrieved. Tests across a variety of aspects show that the proposed method has high onset detection accuracy, lower complexity, and high stability. Therefore, the proposed algorithms can improve QBH systems, voice signal analysis systems, and music signal coding systems. They may also further improve speech recognition systems in the future.
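The standard frequency-to-MIDI mapping places A4 (440 Hz) at number 69 and spaces the 12 notes of each octave equally on a logarithmic scale. The Python sketch below shows this mapping together with one hypothetical way to adapt it to a singer: estimating that singer's average tuning offset and removing it before rounding. The offset-based adaptation and the function names are assumptions made for illustration, not the adaptive mapping defined in the thesis.

```python
import math


def freq_to_midi(freq_hz, a4_hz=440.0):
    """Standard mapping: fractional MIDI number for a frequency in Hz."""
    return 69.0 + 12.0 * math.log2(freq_hz / a4_hz)


def estimate_offset(freqs_hz):
    """Hypothetical per-singer offset: mean deviation (semitones) from the nearest note."""
    devs = [freq_to_midi(f) - round(freq_to_midi(f)) for f in freqs_hz]
    return sum(devs) / len(devs)


def adaptive_midi(freqs_hz):
    """Round to MIDI numbers after removing the singer's average offset."""
    offset = estimate_offset(freqs_hz)
    return [round(freq_to_midi(f) - offset) for f in freqs_hz]


# A singer who is consistently 40 cents sharp on A4, B4, C#5.
sung = [440.0 * 2 ** ((k + 0.4) / 12) for k in (0, 2, 4)]
print([round(freq_to_midi(f), 2) for f in sung])
print(adaptive_midi(sung))
```

This simple offset estimate only works when the singer's deviation stays within half a semitone; larger or drifting deviations would need a more careful model.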
|