Overlapped speech and music segmentation using singular spectrum analysis and random forests

Bibliographic Details
Main Author: Mohammed, D. Y.
Published: University of Salford 2017
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.736411
id ndltd-bl.uk-oai-ethos.bl.uk-736411
record_format oai_dc
collection NDLTD
sources NDLTD
description Recent years have seen ever-increasing volumes of digital media archives and an enormous amount of user-contributed content. As demand for indexing and searching these resources has grown, and new technologies such as multimedia content management systems, enhanced digital broadcasting, and the semantic web have emerged, audio information mining and automated metadata generation have received much attention. Manual indexing and metadata tagging are time-consuming and subject to the biases of individual workers. An automated architecture able to extract information from audio signals, generate content-related text descriptors or metadata, and enable further information mining and searching would be a tangible and valuable solution.

In the field of audio classification, audio signals are broadly divided into speech or music. Most studies, however, neglect the fact that real audio soundtracks may contain speech, music, or a combination of the two. This overlap is the major hurdle to achieving high performance in automatic audio classification, since it can contaminate relevant characteristics and features, causing misclassification or information loss. This research undertakes an extensive review of the state of the art, outlining the well-established audio features and machine learning techniques that have been applied across a broad range of audio segmentation and recognition areas. Audio classification systems and the proposed solutions to the mixed-soundtrack problem are presented. The proposed solutions are: developing augmented and modified features for recognising audio classes even when they overlap; robust segmentation of an overlapped soundtrack stream based on an innovative decomposition method, Singular Spectrum Analysis (SSA), a time-series decomposition technique with many applications that has been studied extensively and has received increasing attention over the past two decades; adoption and development of data-driven classification methods; and finally a technique for continuous time-series tasks.

In this study, SSA was investigated and found to be an efficient way to discriminate speech from music in mixed soundtracks via two methods, each of which was developed and validated in this research. The first method mitigates the overlap ratio between speech and music in the mixed soundtrack by generating two new soundtracks with a lower level of overlap. A feature space is then calculated for the output audio streams, which are classified into speech or music using random forests. A distinct characteristic of this method is the separation of the key speech/music features, which improves classification performance. Nevertheless, the method has drawbacks, including excessively long processing time and increased storage requirements (each frame is represented by two outputs), all of which lead to a greater computational load than before. The second method employs SSA to decompose a given audio signal into a series of Principal Components (PCs), where each PC corresponds to a particular pattern of oscillation. A transformed, well-established feature is then measured for each PC in order to classify it as speech or music using the baseline classification system built on a random forest (RF) machine learning technique.

The classification performance on real-world soundtracks is effectively improved, as demonstrated by comparing speech/music recognition using conventional classification methods against the proposed SSA method. The second proposed and developed method can detect pure speech, pure music, and mixed segments with a much lower complexity level.
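The SSA decomposition used by both methods can be illustrated with a short sketch. Below is a minimal, generic implementation of basic SSA in Python/NumPy (embedding into a trajectory matrix, SVD, and diagonal averaging). It is an illustration under stated assumptions, not the thesis code; the window length and the one-component-per-eigentriple grouping are choices made here for clarity.

import numpy as np

def ssa_decompose(x, window):
    """Split a 1-D signal into SSA reconstructed components.

    The components sum back to the original signal; low-order ones
    capture the dominant oscillatory patterns.
    """
    n = len(x)
    k = n - window + 1
    # Embedding: trajectory (Hankel) matrix, shape (window, k).
    traj = np.column_stack([x[i:i + window] for i in range(k)])
    # SVD gives one eigentriple (s, u, v) per elementary component.
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    comps = []
    for i in range(len(s)):
        elem = s[i] * np.outer(u[:, i], vt[i])  # rank-one elementary matrix
        # Diagonal averaging (Hankelisation) back to a time series:
        # average the entries whose row + column index equals t.
        flipped = elem[::-1]
        comps.append(np.array([flipped.diagonal(t - window + 1).mean()
                               for t in range(n)]))
    return np.array(comps)

Each reconstructed component can then be treated as one "pattern of oscillation" and passed to the downstream feature-extraction and classification stages.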
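For the classification stage, the following hedged sketch shows how per-component feature vectors might be fed to a random forest using scikit-learn. The zero-crossing-rate and log-energy statistics are placeholder features chosen for illustration, not the augmented feature set developed in the thesis; the labelled training data is likewise assumed.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def component_features(sig, frame=1024):
    """Toy per-component features: zero-crossing rate and log-energy
    statistics over fixed-length frames (placeholder feature set)."""
    frames = sig[:len(sig) // frame * frame].reshape(-1, frame)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    energy = np.log1p(np.mean(frames ** 2, axis=1))
    return np.array([zcr.mean(), zcr.std(), energy.mean(), energy.std()])

# Assumed training data: one feature vector per labelled SSA component,
# with y[i] = 0 for speech and 1 for music.
# X_train = np.array([component_features(c) for c in labelled_components])
# rf = RandomForestClassifier(n_estimators=200, random_state=0)
# rf.fit(X_train, y)
# prediction = rf.predict([component_features(new_component)])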
author Mohammed, D. Y.
spellingShingle Mohammed, D. Y.
Overlapped speech and music segmentation using singular spectrum analysis and random forests
author_facet Mohammed, D. Y.
author_sort Mohammed, D. Y.
title Overlapped speech and music segmentation using singular spectrum analysis and random forests
title_short Overlapped speech and music segmentation using singular spectrum analysis and random forests
title_full Overlapped speech and music segmentation using singular spectrum analysis and random forests
title_fullStr Overlapped speech and music segmentation using singular spectrum analysis and random forests
title_full_unstemmed Overlapped speech and music segmentation using singular spectrum analysis and random forests
title_sort overlapped speech and music segmentation using singular spectrum analysis and random forests
publisher University of Salford
publishDate 2017
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.736411
work_keys_str_mv AT mohammeddy overlappedspeechandmusicsegmentationusingsingularspectrumanalysisandrandomforests
_version_ 1718691843966763008
url http://usir.salford.ac.uk/43773/
format Electronic Thesis or Dissertation