News Feeds Clustering Research Study
With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate through the Internet. Many applications try to help users achieve this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to...
Main Author: | Abuel-Futuh, Haytham |
---|---|
Format: | Others |
Published: | NSUWorks 2015 |
Subjects: | Computer Science; Information technology; Clustering; News feeds; Computer Sciences |
Online Access: | http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd |
id |
ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051 2016-04-25T19:34:37Z News Feeds Clustering Research Study Abuel-Futuh, Haytham 2015-04-01T07:00:00Z text application/pdf http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd CEC Theses and Dissertations NSUWorks Computer Science Information technology Clustering News feeds Computer Sciences |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
Computer Science Information technology Clustering News feeds Computer Sciences |
description |
With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate the Internet unaided. Many applications try to help users with this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to browse topics of interest on different structured sites. One problem that arises for these applications, and for others with similar goals, is identifying documents with similar contents. Solving it lets the applications show users documents with unique contents and group similar documents under common topics. Considerable effort has gone into algorithms for this task. Prior research includes Yang, Pierce & Carbonell (1998), who identified news events by exploiting chronological order; Nallapati et al. (2004), who built a dependency model for news events; and Shah & Elbahesh (2004), who used the Jaccard coefficient to generate a flat list of topics.
This research will identify training and testing datasets, then train and evaluate the Pera & Ng algorithm, a hierarchical clustering algorithm that incorporates many of the ideas from the earlier research. In the evaluation phase, error will be measured as the ratio of miscategorized documents to the total number of documents. The research will show that the error can be as low as 0.03 with a model built on a single node processing 1,000 random distinct documents. The experiments will also show that Pera & Ng's fuzzy equivalence algorithm produces acceptable results when compared with Google News as a reference. The algorithm, however, requires a large amount of memory to hold the trained model, which makes it unsuitable for portable devices. |
author |
Abuel-Futuh, Haytham |
title |
News Feeds Clustering Research Study |
publisher |
NSUWorks |
publishDate |
2015 |
url |
http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd |
work_keys_str_mv |
AT abuelfutuhhaytham newsfeedsclusteringresearchstudy |
_version_ |
1718248471699390464 |
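
The description above mentions two simple quantities: the Jaccard coefficient (used in the cited prior work by Shah & Elbahesh) and an error measured as the ratio of miscategorized documents to the total number of documents. The following is a minimal sketch of both, not the thesis's implementation. The function names `jaccard` and `error_ratio` are hypothetical, and the sketch assumes that documents are represented as sets of terms and that predicted topic labels are already aligned with the reference labels.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def error_ratio(predicted: list, reference: list) -> float:
    """Ratio of miscategorized documents to the total number of documents.

    Assumes predicted[i] and reference[i] are topic labels for the same
    document, and that both label sets use the same naming (an assumption
    of this sketch, not something stated in the record).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one document is required")
    wrong = sum(1 for p, r in zip(predicted, reference) if p != r)
    return wrong / len(reference)


# Illustrative data only: 30 of 1000 documents placed under the wrong topic.
predicted = ["politics"] * 970 + ["sports"] * 30
reference = ["politics"] * 1000
print(error_ratio(predicted, reference))
print(jaccard({"news", "feed", "cluster"}, {"feed", "cluster", "topic"}))
```

With 30 of 1000 documents miscategorized, the ratio matches the 0.03 figure quoted in the description; the data here is made up purely to illustrate the arithmetic.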