Speaker diarization system using HXLPS and deep neural network
In general, speaker diarization is defined as the process of segmenting the input speech signal and grouping the homogeneous regions with regard to speaker identity. The main idea behind this system is that it is able to discriminate the speaker signal by assigning a label to each speaker si...
Main Authors: | V. Subba Ramaiah, R. Rajeswara Rao |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2018-03-01 |
Series: | Alexandria Engineering Journal |
Online Access: | http://www.sciencedirect.com/science/article/pii/S1110016816303416 |
id |
doaj-3644fea7c2974a1caf051cef12ebb7f1 |
---|---|
record_format |
Article |
spelling |
doaj-3644fea7c2974a1caf051cef12ebb7f1; 2021-06-02T02:30:16Z; eng; Elsevier; Alexandria Engineering Journal; ISSN 1110-0168; 2018-03-01; vol. 57, no. 1, pp. 255-266; Speaker diarization system using HXLPS and deep neural network; V. Subba Ramaiah (Mahatma Gandhi Institute of Technology, Kokapet, Hyderabad, Telangana 500075, India; corresponding author); R. Rajeswara Rao (JNTUK-UCEV, Kakinada, Andhra Pradesh 535002, India); http://www.sciencedirect.com/science/article/pii/S1110016816303416 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
V. Subba Ramaiah R. Rajeswara Rao |
title |
Speaker diarization system using HXLPS and deep neural network |
publisher |
Elsevier |
series |
Alexandria Engineering Journal |
issn |
1110-0168 |
publishDate |
2018-03-01 |
description |
Speaker diarization is generally defined as the process of segmenting an input speech signal and grouping the homogeneous regions according to speaker identity. The main idea is to discriminate between speaker signals by assigning a label to each one. Owing to the rapid growth of broadcast and meeting recordings, speaker diarization, which is used to enhance the readability of speech transcriptions, has become burdensome. To address this issue, a speaker diarization system is proposed that combines Holoentropy with the eXtended Linear Prediction using autocorrelation Snapshot (HXLPS) and a deep neural network (DNN). The HXLPS feature extraction method is newly developed by incorporating holoentropy into XLPS. Once the features are extracted, speech and non-speech segments are detected by a Voice Activity Detection (VAD) method. An i-vector representation of every segmented signal is then obtained using a Universal Background Model (UBM). Next, the DNN assigns a label to each speaker signal, and the signals are clustered according to speaker label. Performance is analysed using evaluation metrics such as tracking distance, false alarm rate and diarization error rate (DER). The proposed method achieves better diarization performance, with a lowest DER of 1.36% when the lambda value is varied and 2.23% when the frame length is varied. Keywords: Speaker diarization, HXLPS feature extraction, Voice activity detection, Deep neural network, Speaker clustering, Diarization Error Rate (DER) |
url |
http://www.sciencedirect.com/science/article/pii/S1110016816303416 |
_version_ |
1721409268460027904 |
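
The description in this record reports results as a diarization error rate (DER). As a rough illustration only, the sketch below computes a frame-level DER under the commonly used definition: missed speech, false alarm and speaker confusion, divided by the total reference speech. The function name `frame_level_der`, the per-frame label encoding (`None` for non-speech) and the sample labels are hypothetical and do not come from the article. The sketch also assumes that hypothesis labels have already been mapped to reference speakers, and it ignores overlapping speech and scoring collars, so it may differ from the scoring used by the authors.

```python
from typing import Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Return a simplified frame-level diarization error rate.

    Each sequence holds one label per frame; None marks non-speech.
    Assumption: hypothesis labels are already mapped to reference speakers.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = 0        # reference speech, hypothesis says non-speech
    false_alarm = 0   # reference non-speech, hypothesis says speech
    confusion = 0     # both speech, but different speaker labels
    speech = 0        # total reference speech frames (the denominator)

    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1
            elif hyp != ref:
                confusion += 1
        elif hyp is not None:
            false_alarm += 1

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return (missed + false_alarm + confusion) / speech


# Hypothetical 8-frame example: one missed frame, one false alarm,
# one confused frame, and six reference speech frames.
ref_labels = ["A", "A", "A", "B", "B", None, None, "B"]
hyp_labels = ["A", "A", "B", "B", None, None, "A", "B"]
der = frame_level_der(ref_labels, hyp_labels)
print(f"DER = {der:.2%}")
```

Because false alarms are counted in the numerator but not in the denominator, a DER computed this way can exceed 100% on recordings with little reference speech.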