Analysis of similarity and differences between articles using semantics
Adding semantic analysis in the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we have compared, implemented, and evaluated three commonly used approaches for document-level similarity. The three similarity measurem...
Main Author: | Bihi, Ahmed |
---|---|
Format: | Others |
Language: | English |
Published: | Mälardalens högskola, Akademin för innovation, design och teknik, 2017 |
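Subjects: | Natural language processing; similarity; semantic analysis; computer science; Computer Sciences; Datavetenskap (datalogi); Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling) |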
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843 |
id |
ndltd-UPSALLA1-oai-DiVA.org-mdh-34843 |
---|---|
record_format |
oai_dc |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Natural language processing; similarity; semantic analysis; computer science; Computer Sciences; Datavetenskap (datalogi); Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling) |
description |
Adding semantic analysis to the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we compared, implemented, and evaluated three commonly used approaches to document-level similarity. The three similarity measures selected were keyword matching, TF-IDF vector distance, and Latent Semantic Indexing. Each method was evaluated on a coherent set of news articles, the majority of which were written about Donald Trump and the American election on the 9th of November 2016; the set also contained several control articles about random topics. TF-IDF vector distance combined with cosine similarity, and Latent Semantic Indexing, gave the best results on the set of articles by separating the control articles from the Trump articles. Keyword matching and TF-IDF distance using Euclidean distance did not separate the Trump articles from the control articles. We also implemented sentiment analysis and applied it to the set of news articles, using the classes positive, negative, and neutral, and then validated the results against human readers who classified the same articles. The sentiment analysis implementation showed a high correlation with the human readers (100%). |
author |
Bihi, Ahmed |
title |
Analysis of similarity and differences between articles using semantics |
publisher |
Mälardalens högskola, Akademin för innovation, design och teknik |
publishDate |
2017 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843 |
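
The description above reports that TF-IDF vectors compared with cosine similarity separated the control articles from the Trump articles, while the same vectors compared with Euclidean distance did not. The record does not say why. The following is a minimal sketch, not the thesis implementation, of one plausible contributing factor: Euclidean distance is sensitive to document length, whereas cosine similarity only compares vector direction. The toy documents, the smoothed IDF formula, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF vector per tokenized document.

    Assumption: raw term counts for TF and a smoothed IDF,
    log((1 + n) / (1 + df)) + 1. The thesis may use another weighting.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy corpus: a short and a long article on the same topic,
# plus one unrelated control article.
short_doc = "trump wins election".split()
long_doc = short_doc * 5
control_doc = "recipe for apple pie".split()

short_vec, long_vec, control_vec = tfidf_vectors([short_doc, long_doc, control_doc])

print("cosine(short, long)      =", cosine_similarity(short_vec, long_vec))
print("cosine(short, control)   =", cosine_similarity(short_vec, control_vec))
print("euclid(short, long)      =", euclidean_distance(short_vec, long_vec))
print("euclid(short, control)   =", euclidean_distance(short_vec, control_vec))
```

In this toy corpus the two same-topic articles point in the same direction, so cosine similarity rates them as near-identical and rates the control article as unrelated. Euclidean distance instead places the short article closer to the control article than to the longer same-topic article, because the length difference dominates the distance. Normalizing the vectors to unit length before measuring Euclidean distance would remove that effect.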