Image processing and forward propagation using binary representations, and robust audio analysis using deep learning

The work presented in this thesis consists of three main topics: document segmentation and classification into text and score, efficient computation with binary representations, and deep learning architectures for polyphonic music transcription and classification. In the case...

Bibliographic Details
Main Author:	Pedersoli, Fabrizio
Other Authors:	Tzanetakis, George
Format:	Others
Language:	English en
Published:	2019
Subjects:	Document segmentation; Optimized implementations; Deep learning; Music classification
Online Access:	http://hdl.handle.net/1828/10653

id	ndltd-uvic.ca-oai-dspace.library.uvic.ca-1828-10653
record_format	oai_dc
spelling	ndltd-uvic.ca-oai-dspace.library.uvic.ca-1828-10653 2019-03-16T18:24:40Z Graduate 2019-03-15T23:29:45Z 2019-03-15T23:29:45Z 2019 2019-03-15 Thesis http://hdl.handle.net/1828/10653 English en Available to the World Wide Web application/pdf
collection	NDLTD
language	English en
format	Others
sources	NDLTD
topic	Document segmentation; Optimized implementations; Deep learning; Music classification
description	The work presented in this thesis covers three main topics: document segmentation and classification into text and score, efficient computation with binary representations, and deep learning architectures for polyphonic music transcription and classification. In the case of musical documents, an important problem is separating text from musical score by detecting the corresponding bounding boxes. A new algorithm is proposed for pixel-wise classification of digital documents into musical score and text. It is based on a bag-of-visual-words approach and random forest classification. A robust technique for identifying bounding boxes of text and music score from the pixel-wise classification is also proposed. For efficient processing of learned models, we turn our attention to binary representations. When dealing with binary data, bit-packing and bit-wise computation can considerably reduce computation time and memory requirements. Efficiency is a key factor when processing large-scale datasets and in industrial applications. We propose a bit-packed representation for binary images that encodes both pixels and square neighborhoods, and design SPmat, an optimized framework for binary image processing, around it. Bit-packing and bit-wise computation can also be used for efficient forward propagation in deep neural networks. Quantized deep neural networks have recently been proposed with the goal of reducing computation time and memory requirements while preserving classification performance as much as possible. A particular type of quantized neural network is the binary neural network, in which the weights and activations are constrained to $-1$ and $+1$. In this thesis, we describe and evaluate Espresso, a novel optimized framework for fast inference of binary neural networks that takes advantage of bit-packing and bit-wise computations. Espresso is self-contained, written in C/CUDA, and provides optimized implementations of all the building blocks needed to perform forward propagation. Following their recent success, we further investigate deep neural networks. They have achieved state-of-the-art results and outperformed traditional machine learning methods in many applications, such as computer vision, speech recognition, and machine translation. However, in music information retrieval (MIR) and audio analysis, shallow neural networks are commonly used. The effectiveness of deep and very deep architectures for MIR and audio tasks has not been explored in detail, and it is not clear what the best input representation is for a particular task. We therefore investigate deep neural networks for the following audio analysis tasks: polyphonic music transcription, musical genre classification, and urban sound classification. We analyze the performance of common classification network architectures using different input representations, paying specific attention to residual networks. We also evaluate the robustness of these models on degraded audio using different combinations of training and testing data. Through experimental evaluation, we show that residual networks provide consistent performance improvements when analyzing degraded audio across different representations and tasks. Finally, we present a convolutional architecture based on U-Net that can improve the polyphonic music transcription performance of different baseline transcription networks. (Graduate thesis.)
author2	Tzanetakis, George
author	Pedersoli, Fabrizio
title	Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
publishDate	2019
url	http://hdl.handle.net/1828/10653
_version_	1719004090463158272
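
Illustrative note (not part of the catalogued record): the abstract states that bit-packing and bit-wise computation can speed up forward propagation when weights and activations are constrained to $-1$ and $+1$. A minimal sketch of the general idea follows. It is not taken from Espresso or SPmat; the function names `pack_signs` and `binary_dot` are hypothetical, and the encoding (bit set to 1 for $+1$, 0 for $-1$) is an assumption. With that encoding, the dot product of two length-n sign vectors equals 2 * popcount(XNOR(a, b)) - n.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into one int (assumed encoding: bit i is 1 when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("expected only -1 and +1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(packed_a, packed_b, n):
    """Dot product of two length-n {-1, +1} vectors computed from their packed forms."""
    mask = (1 << n) - 1
    agree = ~(packed_a ^ packed_b) & mask  # XNOR restricted to the n valid bits
    matches = bin(agree).count("1")        # popcount: positions where the signs agree
    return 2 * matches - n                 # agreements contribute +1, disagreements -1


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))               # ordinary multiply-accumulate
result = binary_dot(pack_signs(a), pack_signs(b), len(a))  # bit-wise equivalent
```

In this sketch `result` and `reference` agree, which shows that the multiply-accumulate loop can be replaced by one XOR, one complement, and one population count per packed word. An optimized C/CUDA implementation would presumably apply the same identity to fixed-width machine words, but that is an assumption about implementation detail not stated in the record.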
