Clustering Semantically Related Questions

Bibliographic Details
Main Author: Karagkiozis, Nikolaos
Format: Others
Language: English
Published: Örebro universitet, Institutionen för naturvetenskap och teknik 2019
Subjects: Computer Sciences; Datavetenskap (datalogi)
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604
id ndltd-UPSALLA1-oai-DiVA.org-oru-76604
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-oru-76604; 2019-09-21T04:26:22Z; Clustering Semantically Related Questions; eng; Karagkiozis, Nikolaos; Örebro universitet, Institutionen för naturvetenskap och teknik; 2019; Computer Sciences; Datavetenskap (datalogi); Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604; application/pdf; info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Computer Sciences
Datavetenskap (datalogi)
description The number of people who communicate and interact over the internet has grown sharply, and the volume of data they create has grown with it, making that data difficult to handle. Users are often asked to submit questions on topics that interest them, and the submissions alone can create an information overload that is difficult to organize and process. This research addresses the problem of extracting the information contained in a large set of questions by selecting the most representative ones. The proposed framework finds semantic similarities between questions, groups the questions into clusters, and then selects the most relevant question from each cluster, so that the selected questions are the most representative of all those submitted. Two sentence embedding approaches, Universal Sentence Encoder (USE) and InferSent, are applied to obtain the semantic similarities between questions, and the k-means algorithm is used to form the clusters. The framework is evaluated on two large labelled data sets, SQuAD and House of Commons Written Questions, whose ground-truth information is used to evaluate the effectiveness of the proposed approach. On both data sets, the clusters produced with USE match the class labels better than those produced with InferSent.
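The pipeline outlined in the description (embed the questions, cluster with k-means, pick one question per cluster, compare clusters with class labels) can be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation. Several things here are assumptions that the record does not state: the hand-made 2-D vectors stand in for USE or InferSent embeddings, the representative question is taken to be the one nearest its cluster centroid, centroids are seeded with a farthest-first rule, and purity is used as the agreement measure. All function names are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def farthest_first(vectors, k):
    """Assumed seeding: start at the first vector, then repeatedly add
    the vector farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors,
                  key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_first(vectors, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


def representatives(questions, vectors, labels, centroids):
    """Assumed selection rule: the question nearest its cluster centroid."""
    best = {}
    for question, vector, label in zip(questions, vectors, labels):
        d = squared_distance(vector, centroids[label])
        if label not in best or d < best[label][0]:
            best[label] = (d, question)
    return {label: question for label, (_, question) in best.items()}


def purity(labels, truth):
    """Share of questions whose cluster's majority class is their own."""
    per_cluster = {}
    for label, cls in zip(labels, truth):
        per_cluster.setdefault(label, Counter())[cls] += 1
    majority = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return majority / len(labels)


# Toy data: made-up questions, vectors, and class labels (not from the thesis).
questions = [
    "How do I reset my password?",
    "I forgot my password, what now?",
    "Password reset help",
    "When does the library open?",
    "What are the library opening hours?",
    "Is the library open on Sunday?",
]
vectors = [[0.9, 0.1], [1.0, 0.0], [1.1, 0.1],
           [0.0, 1.0], [0.1, 0.9], [0.0, 1.1]]
topics = ["account"] * 3 + ["library"] * 3

labels, centroids = kmeans(vectors, k=2)
for label, question in sorted(representatives(questions, vectors,
                                              labels, centroids).items()):
    print(label, question)
print("purity:", purity(labels, topics))
```

In a real run the toy vectors would be replaced by sentence embeddings from whichever encoder is being compared, and the same clustering and scoring steps would be repeated per encoder so that the resulting scores can be set side by side.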
author Karagkiozis, Nikolaos
title Clustering Semantically Related Questions
publisher Örebro universitet, Institutionen för naturvetenskap och teknik
publishDate 2019
url http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76604