Speaker diarization system using HXLPS and deep neural network

In general, speaker diarization is defined as the process of segmenting the input speech signal and grouping the homogeneous regions by speaker identity. The main idea behind this system is that it discriminates the speaker signals by assigning a label to each speaker si...


Bibliographic Details
Main Authors: V. Subba Ramaiah, R. Rajeswara Rao
Format: Article
Language: English
Published: Elsevier 2018-03-01
Series: Alexandria Engineering Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016816303416
id doaj-3644fea7c2974a1caf051cef12ebb7f1
record_format Article
spelling doaj-3644fea7c2974a1caf051cef12ebb7f1 | 2021-06-02T02:30:16Z | eng | Elsevier | Alexandria Engineering Journal | ISSN 1110-0168 | 2018-03-01 | 57(1), 255-266 | Speaker diarization system using HXLPS and deep neural network | V. Subba Ramaiah (Mahatma Gandhi Institute of Technology, Kokapet, Hyderabad, Telangana 500075, India; corresponding author) | R. Rajeswara Rao (JNTUK-UCEV, Kakinada, Andhra Pradesh 535002, India) | http://www.sciencedirect.com/science/article/pii/S1110016816303416
collection DOAJ
language English
format Article
sources DOAJ
author V. Subba Ramaiah
R. Rajeswara Rao
spellingShingle V. Subba Ramaiah
R. Rajeswara Rao
Speaker diarization system using HXLPS and deep neural network
Alexandria Engineering Journal
author_facet V. Subba Ramaiah
R. Rajeswara Rao
author_sort V. Subba Ramaiah
title Speaker diarization system using HXLPS and deep neural network
title_short Speaker diarization system using HXLPS and deep neural network
title_full Speaker diarization system using HXLPS and deep neural network
title_fullStr Speaker diarization system using HXLPS and deep neural network
title_full_unstemmed Speaker diarization system using HXLPS and deep neural network
title_sort speaker diarization system using hxlps and deep neural network
publisher Elsevier
series Alexandria Engineering Journal
issn 1110-0168
publishDate 2018-03-01
description In general, speaker diarization is defined as the process of segmenting the input speech signal and grouping the homogeneous regions by speaker identity. The main idea behind this system is that it discriminates the speaker signals by assigning a label to each speaker signal. With the rapid growth of broadcast and meeting recordings, speaker diarization, which enhances the readability of speech transcriptions, has become burdensome. To address this issue, Holoentropy with the eXtended Linear Prediction using autocorrelation Snapshot (HXLPS) and a deep neural network (DNN) are proposed for the speaker diarization system. The HXLPS feature extraction method is newly developed by incorporating Holoentropy into XLPS. Once the features are obtained, speech and non-speech signals are detected by the Voice Activity Detection (VAD) method. Then, an i-vector representation of every segmented signal is obtained using a Universal Background Model (UBM). Subsequently, the DNN assigns a label to each speaker signal, and the signals are then clustered according to their speaker labels. Performance is analysed using evaluation metrics such as tracking distance, false alarm rate and diarization error rate (DER). The proposed method achieves better diarization performance, with a lower DER of 1.36% based on the lambda value and a DER of 2.23% based on the frame length. Keywords: Speaker diarization, HXLPS feature extraction, Voice activity detection, Deep neural network, Speaker clustering, Diarization Error Rate (DER)
url http://www.sciencedirect.com/science/article/pii/S1110016816303416
work_keys_str_mv AT vsubbaramaiah speakerdiarizationsystemusinghxlpsanddeepneuralnetwork
AT rrajeswararao speakerdiarizationsystemusinghxlpsanddeepneuralnetwork
_version_ 1721409268460027904