Clustering Semantically Related Questions

Bibliographic Details
Main Author:	Karagkiozis, Nikolaos
Format:	Others
Language:	English
Published:	Örebro universitet, Institutionen för naturvetenskap och teknik 2019
Subjects:	Computer Sciences Datavetenskap (datalogi)
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604

id	ndltd-UPSALLA1-oai-DiVA.org-oru-76604
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-oru-766042019-09-21T04:26:22ZClustering Semantically Related QuestionsengKaragkiozis, NikolaosÖrebro universitet, Institutionen för naturvetenskap och teknik2019Computer SciencesDatavetenskap (datalogi)There has been a vast increase of users that use the internet in order to communicate and interact, and as a result, the amount of data created follows the same upward trend making data handling overwhelming. Users are often asked to submit their questions on various topics of their interest, and usually, that itself creates an information overload that is difficult to organize and process. This research addresses the problem of extracting information contained in a large set of questions by selecting the most representative ones from the total number of questions asked. The proposed framework attempts to find semantic similarities between questions and group them in clusters. It then selects the most relevant question from each cluster. In this way, the questions selected will be the most representative questions from all the submitted ones. To obtain the semantic similarities between the questions, two sentence embedding approaches, Universal Sentence Encoder (USE) and InferSent, are applied. Moreover, to achieve the clusters, k-means algorithm is used. The framework is evaluated on two large labelled data sets, called SQuAD and House of Commons Written Questions. These data sets include ground truth information that is used to distinctly evaluate the effectiveness of the proposed approach. The results in both data sets show that Universal Sentence Encoder (USE) achieves better outcomes in the produced clusters, which match better with the class labels of the data sets, compared to InferSent. Student thesisinfo:eu-repo/semantics/bachelorThesistexthttp://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604application/pdfinfo:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Computer Sciences Datavetenskap (datalogi)
spellingShingle	Computer Sciences Datavetenskap (datalogi) Karagkiozis, Nikolaos Clustering Semantically Related Questions
description	The number of users who communicate and interact over the internet has increased vastly, and the amount of data created follows the same upward trend, making data handling overwhelming. Users are often asked to submit questions on topics of interest to them, and this alone usually creates an information overload that is difficult to organize and process. This research addresses the problem of extracting the information contained in a large set of questions by selecting the most representative ones from all the questions asked. The proposed framework attempts to find semantic similarities between questions and group them into clusters. It then selects the most relevant question from each cluster, so that the selected questions are the most representative of all those submitted. To obtain the semantic similarities between the questions, two sentence embedding approaches, Universal Sentence Encoder (USE) and InferSent, are applied, and the k-means algorithm is used to form the clusters. The framework is evaluated on two large labelled data sets, SQuAD and House of Commons Written Questions. These data sets include ground truth information that is used to evaluate the effectiveness of the proposed approach. The results on both data sets show that, compared to InferSent, Universal Sentence Encoder (USE) produces clusters that match the class labels of the data sets more closely.
author	Karagkiozis, Nikolaos
author_facet	Karagkiozis, Nikolaos
author_sort	Karagkiozis, Nikolaos
title	Clustering Semantically Related Questions
title_short	Clustering Semantically Related Questions
title_full	Clustering Semantically Related Questions
title_fullStr	Clustering Semantically Related Questions
title_full_unstemmed	Clustering Semantically Related Questions
title_sort	clustering semantically related questions
publisher	Örebro universitet, Institutionen för naturvetenskap och teknik
publishDate	2019
url	http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604
work_keys_str_mv	AT karagkiozisnikolaos clusteringsemanticallyrelatedquestions
_version_	1719254051121528832
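
The framework summarized in the description field (embed each question, cluster the embeddings with k-means, keep one question per cluster) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the hand-made 2-D vectors stand in for real USE or InferSent embeddings, the farthest-point initialization is a simple deterministic choice, and "most relevant" is interpreted as "closest to the cluster centroid", which the record itself does not define. The function names are hypothetical and not taken from the thesis.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def farthest_point_init(vectors, k):
    """Deterministic initialization (an assumption, not the thesis method):
    start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until the
    assignment stops changing."""
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels, centroids


def pick_representatives(questions, vectors, labels, centroids):
    """For each cluster, return the question whose vector lies closest to
    the centroid (one possible reading of 'most relevant')."""
    representatives = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            best = min(members, key=lambda i: squared_distance(vectors[i], centroid))
            representatives[c] = questions[best]
    return representatives


# Toy data: the vectors are placeholders for real sentence embeddings.
questions = [
    "What is the capital of France?",
    "Which city is France's capital?",
    "Name the capital city of France.",
    "How do plants make food?",
    "What is photosynthesis?",
    "How does photosynthesis work?",
]
embeddings = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],
]

labels, centroids = kmeans(embeddings, k=2)
representatives = pick_representatives(questions, embeddings, labels, centroids)
for cluster_id, question in sorted(representatives.items()):
    print(cluster_id, question)
```

In practice the toy vectors would be replaced by the output of a sentence encoder, and the cluster labels could be compared against the ground-truth class labels that the description mentions for SQuAD and House of Commons Written Questions.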