Summary: | Master's === National Taipei University of Technology === Graduate Institute of Computer Science and Information Engineering === 105 === In general, vocal segments must be extracted for song search or singer identification, so detecting the singing voice is an important problem. Many studies apply machine learning or deep learning to identify vocal segments. We apply various deep learning algorithms to discriminate vocals from pure background instrumental music, aiming to feed raw data into neural networks without preprocessing and then measure accuracy on the test set. MCNN (MFCC+CNN) and FCNN (FFT+CNN), which use frequency-domain preprocessing, are compared with the USCL (uni-size convolutional layer) and MSCL (multi-size convolutional layer) architectures proposed in this thesis. The USCL architecture is evaluated in three variants: end-to-end training, fixed Sin/Cos wave weights, and fixed waveforms at four angles. The MSCL architecture is evaluated with end-to-end training, training with four different kernel lengths, and fixed Sin/Cos waveform weights of different lengths. The highest average accuracy is 92% with preprocessing and 91% without preprocessing, achieved by the FCNN (FFT+CNN) architecture and by the USCL architecture with fixed Sin/Cos wave weights, respectively.
|
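The abstract describes a USCL front end whose convolution weights are fixed to Sin/Cos waves instead of being trained end-to-end. The sketch below is a hypothetical illustration of that idea in PyTorch; the framework, the kernel count and length (64 frequencies, 512 samples, hop 256), the 16 kHz sample rate, the abs() rectification, and the names FixedSinCosCNN and make_sincos_kernels are all assumptions, not details taken from the thesis.

import math
import torch
import torch.nn as nn

def make_sincos_kernels(num_freqs: int, kernel_size: int, sample_rate: int = 16000) -> torch.Tensor:
    # Build fixed sine/cosine kernels (a DFT-like filterbank) for a Conv1d layer.
    # Returns a tensor of shape (2 * num_freqs, 1, kernel_size): one sine and one
    # cosine kernel per frequency bin, linearly spaced up to the Nyquist frequency.
    t = torch.arange(kernel_size, dtype=torch.float32) / sample_rate
    kernels = []
    for k in range(1, num_freqs + 1):
        freq = k * (sample_rate / 2.0) / num_freqs
        kernels.append(torch.sin(2 * math.pi * freq * t))
        kernels.append(torch.cos(2 * math.pi * freq * t))
    return torch.stack(kernels).unsqueeze(1)

class FixedSinCosCNN(nn.Module):
    # Raw-waveform classifier: a frozen Sin/Cos conv front end plus a small trainable CNN.
    def __init__(self, num_freqs: int = 64, kernel_size: int = 512, hop: int = 256):
        super().__init__()
        self.front = nn.Conv1d(1, 2 * num_freqs, kernel_size, stride=hop, bias=False)
        self.front.weight.data = make_sincos_kernels(num_freqs, kernel_size)
        self.front.weight.requires_grad = False  # fixed weights, not trained end-to-end
        self.body = nn.Sequential(
            nn.Conv1d(2 * num_freqs, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 2),  # two classes: vocal vs. pure instrumental
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw audio, no MFCC/FFT preprocessing.
        # abs() is used here as a simple magnitude-like nonlinearity on the filterbank output.
        return self.body(torch.abs(self.front(waveform)))

if __name__ == "__main__":
    model = FixedSinCosCNN()
    logits = model(torch.randn(4, 1, 16000))  # four one-second clips at 16 kHz
    print(logits.shape)                       # torch.Size([4, 2])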