Clustering Semantically Related Questions
The number of people who use the internet to communicate and interact has grown enormously, and the volume of data they create has grown with it, making that data overwhelming to handle. Users are often asked to submit questions on topics that interest them, and us...
Main Author: | Karagkiozis, Nikolaos |
---|---|
Format: | Others |
Language: | English |
Published: | Örebro universitet, Institutionen för naturvetenskap och teknik, 2019 |
Subjects: | Computer Sciences; Datavetenskap (datalogi) |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604 |
id |
ndltd-UPSALLA1-oai-DiVA.org-oru-76604 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-oru-76604; 2019-09-21T04:26:22Z; Clustering Semantically Related Questions; eng; Karagkiozis, Nikolaos; Örebro universitet, Institutionen för naturvetenskap och teknik; 2019; Computer Sciences; Datavetenskap (datalogi); Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Computer Sciences Datavetenskap (datalogi) |
description |
The number of people who use the internet to communicate and interact has grown enormously, and the volume of data they create has grown with it, making that data overwhelming to handle. Users are often asked to submit questions on topics that interest them, and the submissions themselves usually create an information overload that is difficult to organize and process. This research addresses the problem of extracting the information contained in a large set of questions by selecting the most representative ones among all the questions asked. The proposed framework finds semantic similarities between questions and groups them into clusters. It then selects the most relevant question from each cluster, so that the selected questions are the most representative of all those submitted. To capture the semantic similarities between questions, two sentence embedding approaches are applied: Universal Sentence Encoder (USE) and InferSent. The clusters are produced with the k-means algorithm. The framework is evaluated on two large labelled data sets, SQuAD and House of Commons Written Questions. These data sets include ground truth information that is used to evaluate the effectiveness of the proposed approach. On both data sets, USE produces clusters that match the class labels better than those produced with InferSent. |
author |
Karagkiozis, Nikolaos |
title |
Clustering Semantically Related Questions |
publisher |
Örebro universitet, Institutionen för naturvetenskap och teknik |
publishDate |
2019 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604 |
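
The pipeline outlined in the description (embed each question, cluster the embeddings with k-means, then keep one representative question per cluster) can be illustrated with a minimal Python sketch. This is not the thesis code. The `embed` function is a toy bag-of-words stand-in for a real sentence encoder such as USE or InferSent, the farthest-point initialisation is an assumption made only to keep the run reproducible, and choosing the question closest to the centroid is one plausible reading of "most relevant"; the sample questions and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(question):
    """Lowercase, drop question marks, split on whitespace."""
    return question.lower().replace("?", "").split()


def embed(question, vocab):
    """Toy stand-in for a sentence encoder (ASSUMPTION, not USE/InferSent):
    an L2-normalised bag-of-words count vector over `vocab`."""
    counts = Counter(tokenize(question))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation
    (the initialisation is an assumption chosen for reproducibility)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(questions, vectors, labels, centroids):
    """For each non-empty cluster, pick the question nearest its centroid."""
    reps = {}
    for j, centroid in enumerate(centroids):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if idx:
            best = min(idx, key=lambda i: sq_dist(vectors[i], centroid))
            reps[j] = questions[best]
    return reps


# Hypothetical sample questions on two topics.
questions = [
    "How much does the bus ticket cost?",
    "What does a bus ticket cost?",
    "How much is the bus ticket?",
    "When does the library open?",
    "What time does the library open?",
    "When is the library open?",
]
vocab = sorted({w for q in questions for w in tokenize(q)})
vectors = [embed(q, vocab) for q in questions]
labels, centroids = kmeans(vectors, k=2)
reps = representatives(questions, vectors, labels, centroids)

for cluster_id, question in sorted(reps.items()):
    print(cluster_id, question)
```

With a real encoder, only `embed` would change: it would return the encoder's vector for each question, and the clustering and selection steps would stay the same. Comparing `labels` against known class labels, as the description says was done for SQuAD and House of Commons Written Questions, would be a separate evaluation step not shown here.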