Analysis of similarity and differences between articles using semantics

Adding semantic analysis to the comparison of news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we implemented, compared, and evaluated three commonly used approaches to document-level similarity. The three similarity measurem...


Bibliographic Details
Main Author: Bihi, Ahmed
Format: Others
Language: English
Published: Mälardalens högskola, Akademin för innovation, design och teknik 2017
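Subjects: Natural language processing; similarity; semantic analysis; computer science; Computer Sciences; Datavetenskap (datalogi); Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)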
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843
id ndltd-UPSALLA1-oai-DiVA.org-mdh-34843
record_format oai_dc
collection NDLTD
language English
format Others
sources NDLTD
topic Natural language processing
similarity
semantic analysis
computer science
Computer Sciences
Datavetenskap (datalogi)
Language Technology (Computational Linguistics)
Språkteknologi (språkvetenskaplig databehandling)
description Adding semantic analysis to the comparison of news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor’s thesis, we implemented, compared, and evaluated three commonly used approaches to document-level similarity: keyword matching, TF-IDF vector distance, and Latent Semantic Indexing. Each method was evaluated on a coherent set of news articles, the majority of which were written about Donald Trump and the American election on the 9th of November 2016; the set also contained several control articles on random topics. TF-IDF vector distance combined with cosine similarity, and Latent Semantic Indexing, gave the best results on this set by separating the control articles from the Trump articles. Keyword matching and TF-IDF with Euclidean distance did not separate the Trump articles from the control articles. We also implemented sentiment analysis, classifying the news articles as positive, negative, or neutral, and validated the results against classifications made by human readers. The sentiment analysis implementation showed a high correlation with the human readers (100%).
author Bihi, Ahmed
author_facet Bihi, Ahmed
author_sort Bihi, Ahmed
title Analysis of similarity and differences between articles using semantics
title_short Analysis of similarity and differences between articles using semantics
title_full Analysis of similarity and differences between articles using semantics
title_fullStr Analysis of similarity and differences between articles using semantics
title_full_unstemmed Analysis of similarity and differences between articles using semantics
title_sort analysis of similarity and differences between articles using semantics
publisher Mälardalens högskola, Akademin för innovation, design och teknik
publishDate 2017
url http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843
work_keys_str_mv AT bihiahmed analysisofsimilarityanddifferencesbetweenarticlesusingsemantics
_version_ 1718609901683474432
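
The abstract reports that TF-IDF with cosine similarity separated the control articles from the Trump articles, while TF-IDF with Euclidean distance did not. The abstract does not say why. One common explanation, offered here only as an assumption, is that Euclidean distance is sensitive to document length whereas cosine similarity is not. The sketch below illustrates that effect on three invented toy documents; the texts, the function names, and the smoothed IDF formula are all hypothetical choices for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF vector (list of floats) per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so common terms keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; grows with vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented toy documents: two on the same topic with very different lengths,
# and one unrelated control document.
short_article = "trump wins election".split()
long_article = ("trump wins election " * 10).split()
control_article = "local bakery wins award".split()

v_short, v_long, v_control = tfidf_vectors(
    [short_article, long_article, control_article]
)

# Cosine ranks the same-topic pair as more similar than the control pair.
print(cosine_similarity(v_short, v_long) > cosine_similarity(v_short, v_control))
# Euclidean places the short article closer to the control than to the long one.
print(euclidean_distance(v_short, v_long) > euclidean_distance(v_short, v_control))
```

In this toy setting the long article is the short one repeated ten times, so the two vectors point in the same direction and differ only in magnitude. Cosine similarity treats them as near-identical, while Euclidean distance treats the length gap as dissimilarity. Whether this is what drove the result reported in the abstract cannot be determined from this record.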