Computational neuroscience of speech recognition
Physical variability of speech combined with its perceptual constancy make speech recognition a challenging task. The human auditory brain, however, is able to perform speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain t...
Main Author: | |
---|---|
Other Authors: | |
Published: |
University of Oxford
2015
|
Subjects: | |
Online Access: | https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056 |
id |
ndltd-bl.uk-oai-ethos.bl.uk-714056 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bl.uk-oai-ethos.bl.uk-7140562018-09-05T03:34:53ZComputational neuroscience of speech recognitionHiggins, IrinaStringer, Simon ; Schnupp, Jan2015Physical variability of speech combined with its perceptual constancy make speech recognition a challenging task. The human auditory brain, however, is able to perform speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain to do so. In particular, we look for the minimal subset of sub-cortical auditory brain areas that allow the primary auditory cortex to learn 'good representations' of speech-like auditory objects through spike-timing dependent plasticity (STDP) learning mechanisms as described by Bi & Poo (1998). A 'good representation' is defined as that which is informative of the stimulus class regardless of the variability in the raw input, while being less redundant and more compressed than the representations within the auditory nerve, which provides the firing inputs to the rest of the auditory brain hierarchy (Barlow 1961). Neurophysiological studies have provided insights into the architecture and response properties of different areas within the auditory brain hierarchy. We use these insights to guide the development of an unsupervised spiking neural network grounded in the neurophysiology of the auditory brain and equipped with spike-time dependent plasticity (STDP) learning (Bi & Poo 1998). The model was exposed to simple controlled speech- like stimuli (artificially synthesised phonemes and naturally spoken words) to investigate how stable representations that are invariant to the within- and between-speaker differences can emerge in the output area of the model. The output of the model is roughly equivalent to the primary auditory cortex. The aim of the first part of the thesis was to investigate what was the minimal taxonomy necessary for such representations to emerge through the interactions of spiking dynamics of the network neurons, their ability to learn through STDP learning and the statistics of the auditory input stimuli. It was found that sub-cortical pre-processing within the ventral cochlear nucleus and inferior colliculus was necessary to remove jitter inherent to the auditory nerve spike rasters, which would disrupt STDP learning in the primary auditory cortex otherwise. The second half of the thesis investigated the nature of neural encoding used within the primary auditory cortex stage of the model to represent the learnt auditory object categories. It was found that single cell binary encoding (DeWeese & Zador 2003) was sufficient to represent two synthesised vowel classes, however more complex population encoding using precisely timed spikes within polychronous chains (Izhikevich 2006) represented more complex naturally spoken words in speaker-invariant manner.612.8University of Oxfordhttps://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056https://ora.ox.ac.uk/objects/uuid:daa8d096-6534-4174-b63e-cc4161291c90Electronic Thesis or Dissertation |
collection |
NDLTD |
sources |
NDLTD |
topic |
612.8 |
spellingShingle |
612.8 Higgins, Irina Computational neuroscience of speech recognition |
description |
Physical variability of speech combined with its perceptual constancy make speech recognition a challenging task. The human auditory brain, however, is able to perform speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain to do so. In particular, we look for the minimal subset of sub-cortical auditory brain areas that allow the primary auditory cortex to learn 'good representations' of speech-like auditory objects through spike-timing dependent plasticity (STDP) learning mechanisms as described by Bi & Poo (1998). A 'good representation' is defined as that which is informative of the stimulus class regardless of the variability in the raw input, while being less redundant and more compressed than the representations within the auditory nerve, which provides the firing inputs to the rest of the auditory brain hierarchy (Barlow 1961). Neurophysiological studies have provided insights into the architecture and response properties of different areas within the auditory brain hierarchy. We use these insights to guide the development of an unsupervised spiking neural network grounded in the neurophysiology of the auditory brain and equipped with spike-time dependent plasticity (STDP) learning (Bi & Poo 1998). The model was exposed to simple controlled speech- like stimuli (artificially synthesised phonemes and naturally spoken words) to investigate how stable representations that are invariant to the within- and between-speaker differences can emerge in the output area of the model. The output of the model is roughly equivalent to the primary auditory cortex. The aim of the first part of the thesis was to investigate what was the minimal taxonomy necessary for such representations to emerge through the interactions of spiking dynamics of the network neurons, their ability to learn through STDP learning and the statistics of the auditory input stimuli. It was found that sub-cortical pre-processing within the ventral cochlear nucleus and inferior colliculus was necessary to remove jitter inherent to the auditory nerve spike rasters, which would disrupt STDP learning in the primary auditory cortex otherwise. The second half of the thesis investigated the nature of neural encoding used within the primary auditory cortex stage of the model to represent the learnt auditory object categories. It was found that single cell binary encoding (DeWeese & Zador 2003) was sufficient to represent two synthesised vowel classes, however more complex population encoding using precisely timed spikes within polychronous chains (Izhikevich 2006) represented more complex naturally spoken words in speaker-invariant manner. |
author2 |
Stringer, Simon ; Schnupp, Jan |
author_facet |
Stringer, Simon ; Schnupp, Jan Higgins, Irina |
author |
Higgins, Irina |
author_sort |
Higgins, Irina |
title |
Computational neuroscience of speech recognition |
title_short |
Computational neuroscience of speech recognition |
title_full |
Computational neuroscience of speech recognition |
title_fullStr |
Computational neuroscience of speech recognition |
title_full_unstemmed |
Computational neuroscience of speech recognition |
title_sort |
computational neuroscience of speech recognition |
publisher |
University of Oxford |
publishDate |
2015 |
url |
https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.714056 |
work_keys_str_mv |
AT higginsirina computationalneuroscienceofspeechrecognition |
_version_ |
1718730568830550016 |