Analysis of similarity and differences between articles using semantics

Adding semantic analysis to the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we compared, implemented, and evaluated three commonly used approaches to document-level similarity. The three similarity measurem...


Bibliographic Details
Main Author:	Bihi, Ahmed
Format:	Others
Language:	English
Published:	Mälardalens högskola, Akademin för innovation, design och teknik 2017
Subjects:	Natural language processing; similarity; semantic analysis; computer science; Computer Sciences; Datavetenskap (datalogi); Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843

id	ndltd-UPSALLA1-oai-DiVA.org-mdh-34843
record_format	oai_dc
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Natural language processing; similarity; semantic analysis; computer science; Computer Sciences; Datavetenskap (datalogi); Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)
description	Adding semantic analysis to the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we compared, implemented, and evaluated three commonly used approaches to document-level similarity. The three similarity measures selected were keyword matching, TF-IDF vector distance, and Latent Semantic Indexing. Each method was evaluated on a coherent set of news articles, most of which were written about Donald Trump and the American election on the 9th of November 2016; the set also contained several control articles on random topics. TF-IDF vector distance combined with cosine similarity, and Latent Semantic Indexing, gave the best results on the set by separating the control articles from the Trump articles. Keyword matching and TF-IDF with Euclidean distance did not separate the Trump articles from the control articles. We also implemented sentiment analysis that classified the news articles as positive, negative, or neutral, and validated the results against classifications made by human readers. The sentiment analysis implementation showed a high correlation with the human readers (100%).
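The difference between the two TF-IDF distance measures named in the description can be illustrated with a minimal sketch. This is not the thesis implementation: the tokenizer, the smoothed IDF formula, the function names, and the three toy documents are all assumptions chosen for illustration. The sketch shows only one general property, namely that cosine similarity ignores vector length while Euclidean distance does not.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one TF-IDF vector per document over a shared, sorted vocabulary.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF (an assumed variant; other formulas are common).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


def cosine_similarity(a, b):
    # Cosine of the angle between a and b; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    # Straight-line distance between a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy documents (invented): the second repeats the first, the third is a control.
docs = [
    "trump wins election",
    "trump wins election trump wins election",
    "recipe for apple pie",
]
short_doc, long_doc, control = tfidf_vectors(docs)

print("cosine(short, long):     ", cosine_similarity(short_doc, long_doc))
print("cosine(short, control):  ", cosine_similarity(short_doc, control))
print("euclidean(short, long):  ", euclidean_distance(short_doc, long_doc))
print("euclidean(short, control):", euclidean_distance(short_doc, control))
```

In this toy setup the repeated document points in the same direction as the original, so the cosine measure treats the two as identical, while the Euclidean measure reports a nonzero distance purely because of the length difference. Whether this property explains the results reported in the description is not stated in the record.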
author	Bihi, Ahmed
title	Analysis of similarity and differences between articles using semantics
publisher	Mälardalens högskola, Akademin för innovation, design och teknik
publishDate	2017
url	http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843
work_keys_str_mv	AT bihiahmed analysisofsimilarityanddifferencesbetweenarticlesusingsemantics
_version_	1718609901683474432


Similar Items