Searching and Recommending Texts Related to Climate Change

This project considers the design of a machine learning system to efficiently search a database of texts related to climate change. Efficient search and navigation of such a database makes it easier to find actionable information, detect trends, or derive other useful information. A key feature...


Bibliographic Details
Main Author:	Gjöthlén, Karolin
Format:	Others
Language:	English
Published:	Uppsala universitet, Institutionen för informationsteknologi 2021
Subjects:	Engineering and Technology Teknik och teknologier
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-443535

id	ndltd-UPSALLA1-oai-DiVA.org-uu-443535
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-uu-443535 | 2021-08-18T05:24:07Z | Searching and Recommending Texts Related to Climate Change | eng | Gjöthlén, Karolin | Uppsala universitet, Institutionen för informationsteknologi | 2021 | Engineering and Technology | Teknik och teknologier | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-443535 | UPTEC IT, 1401-5749 ; 21006 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Engineering and Technology Teknik och teknologier
spellingShingle	Engineering and Technology Teknik och teknologier Gjöthlén, Karolin Searching and Recommending Texts Related to Climate Change
description	This project considers the design of a machine learning system to efficiently search a database of texts related to climate change. Efficient search and navigation of such a database makes it easier to find actionable information, detect trends, or derive other useful information. A key feature of such an information retrieval system is the numerical representation of a text. This project implements and compares three ways to represent a text in a vector space: Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec. The reported results cover two cases. In the first, the query contains part of a text with a small modification, and the expected result is the text itself. Here all three embeddings outperform a naive (fixed, expert rule-based) retrieval method, and the Bag-of-Words approach performs best. In the second, the query is a random string, and the desired result is determined by a manual comparison of the results. Here the Doc2Vec approach performs best. When the random queries resemble abstracts, the Bag-of-Words approach performs almost as well.
=== This project considers the design of a machine learning system for efficiently searching a database of texts related to climate change. Efficient search and navigation of such a database makes it easier to detect trends or find useful information. A key feature of such an information retrieval system is the numerical representation of a text. This project implements and compares three ways to represent a text in a vector space. Specifically, we compare Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec in this context. The reported results cover two cases. In the first case, we observe that all three implementations outperform a naive method for finding a text. Here the query contains part of the text with a minor modification, while the result should be the text itself. The Bag-of-Words method turns out to be best suited to this task. In the second case, the query is a random string, while the desired result is based on a manual comparison of the results. Here we observe that the Doc2Vec method is best. If the query resembles an expected result, the Bag-of-Words method works almost as well.
author	Gjöthlén, Karolin
author_facet	Gjöthlén, Karolin
author_sort	Gjöthlén, Karolin
title	Searching and Recommending Texts Related to Climate Change
title_short	Searching and Recommending Texts Related to Climate Change
title_full	Searching and Recommending Texts Related to Climate Change
title_fullStr	Searching and Recommending Texts Related to Climate Change
title_full_unstemmed	Searching and Recommending Texts Related to Climate Change
title_sort	searching and recommending texts related to climate change
publisher	Uppsala universitet, Institutionen för informationsteknologi
publishDate	2021
url	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-443535
work_keys_str_mv	AT gjothlenkarolin searchingandrecommendingtextsrelatedtoclimatechange
_version_	1719460625167417344


Similar Items