News Feeds Clustering Research Study

With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate through the Internet. Many applications try to help users with this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to...

Bibliographic Details
Main Author: Abuel-Futuh, Haytham
Format: Others
Published: NSUWorks 2015
Subjects: Computer Science; Information technology; Clustering; News feeds; Computer Sciences
Online Access: http://nsuworks.nova.edu/gscis_etd/52
http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd
id ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051
record_format oai_dc
spelling ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051 2016-04-25T19:34:37Z News Feeds Clustering Research Study Abuel-Futuh, Haytham
2015-04-01T07:00:00Z text application/pdf http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd CEC Theses and Dissertations NSUWorks Computer Science Information technology Clustering News feeds Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic Computer Science
Information technology
Clustering
News feeds
Computer Sciences
description With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate through the Internet. Many applications try to help users with this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to browse topics of interest on different structured sites. One problem that arises for these applications, and for others with similar goals, is identifying documents with similar content. Solving it lets applications show users documents with unique content and group similar documents under common topics. Considerable effort has gone into algorithms for this task. Prior research includes Yang, Pierce & Carbonell (1998), who looked at the problem of identifying news events by exploiting chronological order; Nallapati et al. (2004), who built a dependency model for news events; and Shah & Elbahesh (2004), who used the Jaccard coefficient to generate a flat list of topics. This research will identify training and testing datasets, and it will train and evaluate the Pera & Ng algorithm. The chosen algorithm is a hierarchical clustering algorithm that incorporates many of the ideas researched earlier. In the evaluation phase, error will be measured as the ratio of miscategorized documents to the total number of documents. The research will show that error can be as low as 0.03 with a model built on a single node processing 1000 random distinct documents. In evaluating the algorithm, the experiments will show that Pera & Ng's fuzzy equivalence algorithm does produce acceptable results when compared to Google News as a reference. The algorithm, however, requires a huge amount of memory to hold the trained model, which makes it unsuitable for portable devices.
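Note: the abstract names two quantities that are easy to state concretely, the Jaccard coefficient and the error ratio of miscategorized documents to total documents. The Python sketch below illustrates both under stated assumptions. It is not the thesis's implementation or the Pera & Ng algorithm; the function names `jaccard` and `error_ratio`, the token sets, and the label lists are hypothetical, and the error ratio assumes each document's predicted topic label can be compared directly with a reference label.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def error_ratio(predicted: list, reference: list) -> float:
    """Ratio of miscategorized documents to the total number of documents."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one document is required")
    # A document counts as miscategorized when its label differs from the reference.
    wrong = sum(1 for p, r in zip(predicted, reference) if p != r)
    return wrong / len(reference)


# Illustrative data only (not taken from the study).
doc_a = {"election", "results", "senate"}
doc_b = {"results", "senate", "vote"}
similarity = jaccard(doc_a, doc_b)  # 2 shared tokens out of 4 distinct tokens

reference_labels = ["politics"] * 970 + ["sports"] * 30
predicted_labels = ["politics"] * 1000
error = error_ratio(predicted_labels, reference_labels)  # 30 of 1000 differ
```

With these made-up inputs, 30 mismatches among 1000 documents give an error ratio of 0.03, the same magnitude the abstract reports; the count of 30 is chosen for the arithmetic and is not a figure from the study.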
author Abuel-Futuh, Haytham
title News Feeds Clustering Research Study
publisher NSUWorks
publishDate 2015
url http://nsuworks.nova.edu/gscis_etd/52
http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd