Automatisk extraktion av nyckelord ur ett kundforum (Automatic keyword extraction from a customer forum)

Conversations in a customer forum range across different topics, and the language is inconsistent. The texts do not meet the requirements usually placed on material for automatic keyword extraction. The thesis examines how keywords can be automatically extracted from a customer forum despite these difficulties. Focus...
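
The record lists TF*IDF as the established baseline method for this thesis. As a rough illustration of that baseline only, here is a minimal Python sketch of TF*IDF keyword ranking. The toy threads, the whitespace tokenizer, the function name `tfidf_keywords`, and the `log_tf` switch between raw and log-scaled term frequency are all assumptions made for the example. The record does not say which frequency variants or preprocessing the thesis actually used.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3, log_tf=False):
    """Rank each document's words by TF*IDF and return the top_n per document.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    and log_tf switches between raw counts and 1 + log(count).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {}
        for word, count in counts.items():
            tf = 1 + math.log(count) if log_tf else count
            idf = math.log(n_docs / df[word])  # 0 for words found in every document
            scores[word] = tf * idf
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results


# Invented example threads, not data from the thesis.
threads = [
    "router restarts after update router firmware update",
    "invoice shows wrong amount after plan change",
    "router loses wifi after plan change",
]
keywords = tfidf_keywords(threads, top_n=2)
print(keywords)
```

Words that occur in every thread (here "after") get an inverse document frequency of zero and so never rank as keywords, which is the property that makes TF*IDF a common baseline.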

Bibliographic Details
Main Author:	Ekman, Sara
Format:	Others
Language:	Swedish
Published:	Stockholms universitet, Avdelningen för datorlingvistik 2018
Subjects:	Automatic keyword extraction; Information extraction; Noisy text; TF*IDF; User generated text; Användargenererad text; Automatisk nyckelordsextraktion; Brusig text; Informationsextraktion; General Language Studies and Linguistics; Jämförande språkvetenskap och allmän lingvistik
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-160686

id	ndltd-UPSALLA1-oai-DiVA.org-su-160686
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-su-160686; 2018-10-13T06:14:40Z; Automatisk extraktion av nyckelord ur ett kundforum; swe; Automatic keyword extraction from a customer forum; Ekman, Sara; Stockholms universitet, Avdelningen för datorlingvistik; 2018; Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-160686; application/pdf; info:eu-repo/semantics/openAccess
collection	NDLTD
language	Swedish
format	Others
sources	NDLTD
topic	Automatic keyword extraction; Information extraction; Noisy text; TF*IDF; User generated text; Användargenererad text; Automatisk nyckelordsextraktion; Brusig text; Informationsextraktion; General Language Studies and Linguistics; Jämförande språkvetenskap och allmän lingvistik
description	Conversations in a customer forum range across different topics, and the language is inconsistent. The texts do not meet the requirements usually placed on material for automatic keyword extraction. This thesis examines how keywords can be automatically extracted from a customer forum despite these difficulties. The study focuses on three aspects of keyword extraction. The first concerns how the established keyword extraction method TF*IDF performs compared with four methods designed with the material's unusual structure in mind. The second is whether different ways of counting word frequency affect the result. The third is how the methods perform when they use only the posts, only the titles, or both text types in their extractions. Non-parametric tests were used to evaluate the extractions. A number of Friedman tests show that the methods in some cases differ in their ability to identify relevant keywords. In post-hoc tests between the highest-performing methods, one of the new methods in one case performs significantly better than the other new methods, but not better than TF*IDF. No difference was found between the use of different text types or between ways of counting word frequency. For future research, reliability testing of manually annotated keywords is recommended. A larger sample than in the current study should be used, and several suggestions are given for improving the scoring of extracted keywords.
author	Ekman, Sara
title	Automatisk extraktion av nyckelord ur ett kundforum
publisher	Stockholms universitet, Avdelningen för datorlingvistik
publishDate	2018
url	http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-160686
_version_	1718773293689864192

Similar Items