Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos

We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple...

Bibliographic Details
Main Author:	Zylich, Brian Matthew
Other Authors:	Gillian Smith, Reader
Format:	Others
Published:	Digital WPI 2019
Subjects:	automated feedback; classroom observation; multitask learning; speech recognition
Online Access:	https://digitalcommons.wpi.edu/etd-theses/1289 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2288&context=etd-theses

id	ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2288
record_format	oai_dc
spelling	ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2288 2019-06-05T04:42:49Z Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos Zylich, Brian Matthew We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.
2019-04-25T07:00:00Z text application/pdf https://digitalcommons.wpi.edu/etd-theses/1289 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2288&context=etd-theses Masters Theses (All Theses, All Years) Digital WPI Gillian Smith, Reader Jacob Whitehill, Advisor Craig E. Wills, Department Head automated feedback classroom observation multitask learning speech recognition
collection	NDLTD
format	Others
sources	NDLTD
topic	automated feedback; classroom observation; multitask learning; speech recognition
spellingShingle	automated feedback classroom observation multitask learning speech recognition Zylich, Brian Matthew Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
description	We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.
author2	Gillian Smith, Reader
author_facet	Gillian Smith, Reader Zylich, Brian Matthew
author	Zylich, Brian Matthew
author_sort	Zylich, Brian Matthew
title	Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
title_short	Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
title_full	Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
title_fullStr	Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
title_full_unstemmed	Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
title_sort	training noise-robust spoken phrase detectors with scarce and private data: an application to classroom observation videos
publisher	Digital WPI
publishDate	2019
url	https://digitalcommons.wpi.edu/etd-theses/1289 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2288&context=etd-theses
work_keys_str_mv	AT zylichbrianmatthew trainingnoiserobustspokenphrasedetectorswithscarceandprivatedataanapplicationtoclassroomobservationvideos
_version_	1719199873622867968

Similar Items