Computational neuroscience of speech recognition

The physical variability of speech combined with its perceptual constancy makes speech recognition a challenging task. The human auditory brain, however, performs speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain t...

Full description

Bibliographic Details
Main Author: Higgins, Irina
Other Authors: Stringer, Simon ; Schnupp, Jan
Published: University of Oxford 2015
Subjects:
Online Access:https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056
id ndltd-bl.uk-oai-ethos.bl.uk-714056
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-714056 2018-09-05T03:34:53Z
Computational neuroscience of speech recognition
Higgins, Irina
Stringer, Simon ; Schnupp, Jan
2015
The physical variability of speech combined with its perceptual constancy makes speech recognition a challenging task. The human auditory brain, however, performs speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain to do so. In particular, we look for the minimal subset of sub-cortical auditory brain areas that allow the primary auditory cortex to learn 'good representations' of speech-like auditory objects through spike-timing-dependent plasticity (STDP) learning mechanisms as described by Bi & Poo (1998). A 'good representation' is defined as one that is informative of the stimulus class regardless of the variability in the raw input, while being less redundant and more compressed than the representations within the auditory nerve, which provides the firing inputs to the rest of the auditory brain hierarchy (Barlow 1961). Neurophysiological studies have provided insights into the architecture and response properties of different areas within the auditory brain hierarchy. We use these insights to guide the development of an unsupervised spiking neural network grounded in the neurophysiology of the auditory brain and equipped with STDP learning (Bi & Poo 1998). The model was exposed to simple controlled speech-like stimuli (artificially synthesised phonemes and naturally spoken words) to investigate how stable representations that are invariant to within- and between-speaker differences can emerge in the output area of the model, which is roughly equivalent to the primary auditory cortex.
The first part of the thesis investigated the minimal taxonomy necessary for such representations to emerge through the interactions of the spiking dynamics of the network neurons, their ability to learn through STDP, and the statistics of the auditory input stimuli. It was found that sub-cortical pre-processing within the ventral cochlear nucleus and inferior colliculus was necessary to remove jitter inherent to the auditory nerve spike rasters, which would otherwise disrupt STDP learning in the primary auditory cortex. The second half of the thesis investigated the nature of the neural encoding used within the primary auditory cortex stage of the model to represent the learnt auditory object categories. It was found that single-cell binary encoding (DeWeese & Zador 2003) was sufficient to represent two synthesised vowel classes; however, more complex population encoding using precisely timed spikes within polychronous chains (Izhikevich 2006) represented the more complex naturally spoken words in a speaker-invariant manner.
612.8
University of Oxford
https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056
https://ora.ox.ac.uk/objects/uuid:daa8d096-6534-4174-b63e-cc4161291c90
Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 612.8
spellingShingle 612.8
Higgins, Irina
Computational neuroscience of speech recognition
description The physical variability of speech combined with its perceptual constancy makes speech recognition a challenging task. The human auditory brain, however, performs speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain to do so. In particular, we look for the minimal subset of sub-cortical auditory brain areas that allow the primary auditory cortex to learn 'good representations' of speech-like auditory objects through spike-timing-dependent plasticity (STDP) learning mechanisms as described by Bi & Poo (1998). A 'good representation' is defined as one that is informative of the stimulus class regardless of the variability in the raw input, while being less redundant and more compressed than the representations within the auditory nerve, which provides the firing inputs to the rest of the auditory brain hierarchy (Barlow 1961). Neurophysiological studies have provided insights into the architecture and response properties of different areas within the auditory brain hierarchy. We use these insights to guide the development of an unsupervised spiking neural network grounded in the neurophysiology of the auditory brain and equipped with STDP learning (Bi & Poo 1998). The model was exposed to simple controlled speech-like stimuli (artificially synthesised phonemes and naturally spoken words) to investigate how stable representations that are invariant to within- and between-speaker differences can emerge in the output area of the model, which is roughly equivalent to the primary auditory cortex.
The first part of the thesis investigated the minimal taxonomy necessary for such representations to emerge through the interactions of the spiking dynamics of the network neurons, their ability to learn through STDP, and the statistics of the auditory input stimuli. It was found that sub-cortical pre-processing within the ventral cochlear nucleus and inferior colliculus was necessary to remove jitter inherent to the auditory nerve spike rasters, which would otherwise disrupt STDP learning in the primary auditory cortex. The second half of the thesis investigated the nature of the neural encoding used within the primary auditory cortex stage of the model to represent the learnt auditory object categories. It was found that single-cell binary encoding (DeWeese & Zador 2003) was sufficient to represent two synthesised vowel classes; however, more complex population encoding using precisely timed spikes within polychronous chains (Izhikevich 2006) represented the more complex naturally spoken words in a speaker-invariant manner.
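The STDP rule referenced throughout the abstract (Bi & Poo 1998) pairs pre- and postsynaptic spike times: a presynaptic spike shortly before a postsynaptic one strengthens the synapse, the reverse ordering weakens it. A minimal sketch of that exponential learning window is given below; the amplitude and time-constant values are illustrative placeholders, not parameters taken from the thesis model.

```python
import numpy as np

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one spike pair, with dt = t_post - t_pre in ms.

    Exponential STDP window in the style of Bi & Poo (1998):
    potentiation when the presynaptic spike precedes the postsynaptic
    spike (dt > 0), depression for the opposite ordering (dt < 0).
    Parameter values are illustrative only.
    """
    if dt > 0:
        # pre before post -> long-term potentiation, decaying with |dt|
        return a_plus * np.exp(-dt / tau_plus)
    elif dt < 0:
        # post before pre -> long-term depression, decaying with |dt|
        return -a_minus * np.exp(dt / tau_minus)
    return 0.0
```

Because the window decays exponentially with the spike-time difference, millisecond-scale jitter in the input rasters scrambles the sign and size of the updates, which is consistent with the abstract's finding that sub-cortical jitter removal is needed before STDP can learn stable cortical representations.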
author2 Stringer, Simon ; Schnupp, Jan
author_facet Stringer, Simon ; Schnupp, Jan
Higgins, Irina
author Higgins, Irina
author_sort Higgins, Irina
title Computational neuroscience of speech recognition
title_short Computational neuroscience of speech recognition
title_full Computational neuroscience of speech recognition
title_fullStr Computational neuroscience of speech recognition
title_full_unstemmed Computational neuroscience of speech recognition
title_sort computational neuroscience of speech recognition
publisher University of Oxford
publishDate 2015
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056
work_keys_str_mv AT higginsirina computationalneuroscienceofspeechrecognition
_version_ 1718730568830550016