Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks

In recent years, acoustic word embeddings (AWEs) have gained significant interest in the research community, particularly for Query-by-Example Spoken Term Detection (QbE-STD) search and related word discrimination tasks. It has been shown tha...


Bibliographic Details
Main Authors:	Denis Shitov, Elena Pirogova, Tadeusz A. Wysocki, Margaret Lech
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Acoustic word embedding; dynamic time warping; triplet network; query-by-example
Online Access:	https://ieeexplore.ieee.org/document/9104974/

id	doaj-854e1fc0a1b74bea9c398c13dd402f67
record_format	Article
spelling	doaj-854e1fc0a1b74bea9c398c13dd402f67; 2021-03-30T02:13:03Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; vol. 8; pp. 103327-103338; 10.1109/ACCESS.2020.2999055; 9104974; Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks; Denis Shitov (https://orcid.org/0000-0003-0009-0985), School of Engineering, RMIT University, Melbourne, VIC, Australia; Elena Pirogova, School of Engineering, RMIT University, Melbourne, VIC, Australia; Tadeusz A. Wysocki, College of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA; Margaret Lech (https://orcid.org/0000-0002-7860-7289), School of Engineering, RMIT University, Melbourne, VIC, Australia; https://ieeexplore.ieee.org/document/9104974/; Acoustic word embedding; dynamic time warping; triplet network; query-by-example
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Denis Shitov; Elena Pirogova; Tadeusz A. Wysocki; Margaret Lech
author_sort	Denis Shitov
title	Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2020-01-01
description	In recent years, acoustic word embeddings (AWEs) have gained significant interest in the research community, particularly for Query-by-Example Spoken Term Detection (QbE-STD) search and related word discrimination tasks. It has been shown that AWEs learned for word or phone classification in one or several languages can outperform approaches that use dynamic time warping (DTW). In this paper, a new method of learning AWEs within the DTW framework is proposed. It employs a multitask triplet neural network to generate the AWEs. The triplet network learns acoustic representations of words through a comparison of DTW distances. In addition, a multitask objective is proposed that combines a conventional word classification component with a triplet loss component. The triplet loss component applies the DTW distance to the word discrimination task. The multitask objective ensures that the embeddings can be used with DTW directly. Experimental validation shows that the proposed approach is well suited to, but not necessarily restricted to, QbE-STD search. A comparison with several baseline methods shows that the new method significantly improves results on the word discrimination task. An evaluation of word clustering in the learned embedding space is also presented.
topic	Acoustic word embedding; dynamic time warping; triplet network; query-by-example
url	https://ieeexplore.ieee.org/document/9104974/


Similar Items