Image processing and forward propagation using binary representations, and robust audio analysis using deep learning

The work presented in this thesis consists of three main topics: document segmentation and classification into text and score, efficient computation with binary representations, and deep learning architectures for polyphonic music transcription and classification. In the case...

Bibliographic Details
Main Author: Pedersoli, Fabrizio
Other Authors: Tzanetakis, George
Format: Others
Language: English
en
Published: 2019
Subjects: Document segmentation; Optimized implementations; Deep learning; Music classification
Online Access: http://hdl.handle.net/1828/10653
id ndltd-uvic.ca-oai-dspace.library.uvic.ca-1828-10653
record_format oai_dc
spelling ndltd-uvic.ca-oai-dspace.library.uvic.ca-1828-106532019-03-16T18:24:40Z Image processing and forward propagation using binary representations, and robust audio analysis using deep learning Pedersoli, Fabrizio Tzanetakis, George Document segmentation Optimized implementations Deep learning Music classification The work presented in this thesis consists of three main topics: document segmentation and classification into text and score, efficient computation with binary representations, and deep learning architectures for polyphonic music transcription and classification. In the case of musical documents, an important problem is separating text from musical score by detecting the corresponding bounding boxes. A new algorithm is proposed for pixel-wise classification of digital documents into musical score and text. It is based on a bag-of-visual-words approach and random forest classification. A robust technique for identifying bounding boxes of text and music score from the pixel-wise classification is also proposed. For efficient processing of learned models, we turn our attention to binary representations. When dealing with binary data, the use of bit-packing and bit-wise computation can reduce computational time and memory requirements considerably. Efficiency is a key factor when processing large-scale datasets and in industrial applications. We propose a bit-packed representation for binary images that encodes both pixels and square neighborhoods, and design SPmat, an optimized framework for binary image processing, around it. Bit-packing and bit-wise computation can also be used for efficient forward propagation in deep neural networks. Quantized deep neural networks have recently been proposed with the goal of improving computational time performance and memory requirements while maintaining classification performance as much as possible. 
A particular type of quantized neural network is the binary neural network, in which the weights and activations are constrained to $-1$ and $+1$. In this thesis, we describe and evaluate Espresso, a novel optimized framework for fast inference of binary neural networks that takes advantage of bit-packing and bit-wise computations. Espresso is self-contained, written in C/CUDA, and provides optimized implementations of all the building blocks needed to perform forward propagation. Following their recent success, we further investigate deep neural networks. They have achieved state-of-the-art results and outperformed traditional machine learning methods in many applications, such as computer vision, speech recognition, and machine translation. However, in the case of music information retrieval (MIR) and audio analysis, shallow neural networks are commonly used. The effectiveness of deep and very deep architectures for MIR and audio tasks has not been explored in detail. It is also not clear what the best input representation is for a particular task. We therefore investigate deep neural networks for the following audio analysis tasks: polyphonic music transcription, musical genre classification, and urban sound classification. We analyze the performance of common classification network architectures using different input representations, paying specific attention to residual networks. We also evaluate the robustness of these models in the case of degraded audio using different combinations of training/testing data. Through experimental evaluation we show that residual networks provide consistent performance improvements when analyzing degraded audio across different representations and tasks. Finally, we present a convolutional architecture based on U-Net that can improve the polyphonic music transcription performance of different baseline transcription networks. 
Graduate 2019-03-15T23:29:45Z 2019-03-15T23:29:45Z 2019 2019-03-15 Thesis http://hdl.handle.net/1828/10653 English en Available to the World Wide Web application/pdf
collection NDLTD
language English
en
format Others
sources NDLTD
topic Document segmentation
Optimized implementations
Deep learning
Music classification
spellingShingle Document segmentation
Optimized implementations
Deep learning
Music classification
Pedersoli, Fabrizio
Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
description The work presented in this thesis consists of three main topics: document segmentation and classification into text and score, efficient computation with binary representations, and deep learning architectures for polyphonic music transcription and classification. In the case of musical documents, an important problem is separating text from musical score by detecting the corresponding bounding boxes. A new algorithm is proposed for pixel-wise classification of digital documents into musical score and text. It is based on a bag-of-visual-words approach and random forest classification. A robust technique for identifying bounding boxes of text and music score from the pixel-wise classification is also proposed. For efficient processing of learned models, we turn our attention to binary representations. When dealing with binary data, the use of bit-packing and bit-wise computation can reduce computational time and memory requirements considerably. Efficiency is a key factor when processing large-scale datasets and in industrial applications. We propose a bit-packed representation for binary images that encodes both pixels and square neighborhoods, and design SPmat, an optimized framework for binary image processing, around it. Bit-packing and bit-wise computation can also be used for efficient forward propagation in deep neural networks. Quantized deep neural networks have recently been proposed with the goal of improving computational time performance and memory requirements while maintaining classification performance as much as possible. A particular type of quantized neural network is the binary neural network, in which the weights and activations are constrained to $-1$ and $+1$. In this thesis, we describe and evaluate Espresso, a novel optimized framework for fast inference of binary neural networks that takes advantage of bit-packing and bit-wise computations. 
Espresso is self-contained, written in C/CUDA, and provides optimized implementations of all the building blocks needed to perform forward propagation. Following their recent success, we further investigate deep neural networks. They have achieved state-of-the-art results and outperformed traditional machine learning methods in many applications, such as computer vision, speech recognition, and machine translation. However, in the case of music information retrieval (MIR) and audio analysis, shallow neural networks are commonly used. The effectiveness of deep and very deep architectures for MIR and audio tasks has not been explored in detail. It is also not clear what the best input representation is for a particular task. We therefore investigate deep neural networks for the following audio analysis tasks: polyphonic music transcription, musical genre classification, and urban sound classification. We analyze the performance of common classification network architectures using different input representations, paying specific attention to residual networks. We also evaluate the robustness of these models in the case of degraded audio using different combinations of training/testing data. Through experimental evaluation we show that residual networks provide consistent performance improvements when analyzing degraded audio across different representations and tasks. Finally, we present a convolutional architecture based on U-Net that can improve the polyphonic music transcription performance of different baseline transcription networks. === Graduate
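The bit-packed arithmetic that the description attributes to binary neural networks can be illustrated with a minimal sketch. This is not Espresso's code (the record only states that Espresso is written in C/CUDA); the function names `pack_signs` and `binary_dot` are hypothetical, and Python is used purely for brevity. Assuming +1 is stored as bit 1 and -1 as bit 0, the dot product of two length-n vectors of -1/+1 values equals n minus twice the number of positions where the bits differ, so it reduces to one XOR and one population count.

```python
def pack_signs(values):
    """Pack a sequence of -1/+1 values into one integer.

    Assumed encoding (illustrative, not taken from the thesis):
    +1 -> bit 1, -1 -> bit 0, element i stored at bit position i.
    """
    word = 0
    for i, v in enumerate(values):
        if v not in (-1, 1):
            raise ValueError("values must be -1 or +1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed -1/+1 vectors of length n.

    Matching positions contribute +1 and differing positions -1,
    so the result is n - 2 * (number of differing bits).
    """
    mask = (1 << n) - 1                      # ignore bits beyond length n
    mismatches = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
    plain = sum(x * y for x, y in zip(a, b))
    print(packed, plain)
```

In a C/CUDA setting the same idea would typically use fixed-width machine words and a hardware population-count instruction; the arbitrary-precision Python integer here only stands in for such a word.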
author2 Tzanetakis, George
author_facet Tzanetakis, George
Pedersoli, Fabrizio
author Pedersoli, Fabrizio
author_sort Pedersoli, Fabrizio
title Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
title_short Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
title_full Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
title_fullStr Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
title_full_unstemmed Image processing and forward propagation using binary representations, and robust audio analysis using deep learning
title_sort image processing and forward propagation using binary representations, and robust audio analysis using deep learning
publishDate 2019
url http://hdl.handle.net/1828/10653
work_keys_str_mv AT pedersolifabrizio imageprocessingandforwardpropagationusingbinaryrepresentationsandrobustaudioanalysisusingdeeplearning
_version_ 1719004090463158272