News Feeds Clustering Research Study

With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate the Internet. Many applications try to help users with this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to...

Bibliographic Details
Main Author:	Abuel-Futuh, Haytham
Format:	Others
Published:	NSUWorks 2015
Subjects:	Computer Science; Information technology; Clustering; News feeds; Computer Sciences
Online Access:	http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd

id	ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051
record_format	oai_dc
spelling	ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1051 2016-04-25T19:34:37Z News Feeds Clustering Research Study Abuel-Futuh, Haytham 2015-04-01T07:00:00Z text application/pdf http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd CEC Theses and Dissertations NSUWorks Computer Science Information technology Clustering News feeds Computer Sciences
collection	NDLTD
format	Others
sources	NDLTD
topic	Computer Science; Information technology; Clustering; News feeds; Computer Sciences
spellingShingle	Computer Science Information technology Clustering News feeds Computer Sciences Abuel-Futuh, Haytham News Feeds Clustering Research Study
description	With over 0.25 billion web pages hosted on the World Wide Web, it is virtually impossible to navigate the Internet. Many applications try to help users with this task. For example, search engines build indexes to make the entire World Wide Web searchable, and news curators allow users to browse topics of interest on different structured sites. One problem that arises for these applications, and for others with similar goals, is identifying documents with similar content. Solving it lets applications show users documents with unique content and group similar documents under common topics. Considerable effort has gone into algorithms for this task. Prior research includes Yang, Pierce & Carbonell (1998), who examined the problem of identifying news events by exploiting chronological order; Nallapati et al. (2004), who built a dependency model for news events; and Shah & Elbahesh (2004), who used the Jaccard coefficient to generate a flat list of topics. This research will identify training and testing datasets, and it will train and evaluate the Pera & Ng algorithm. The chosen algorithm is a hierarchical clustering algorithm that incorporates many of the ideas researched earlier. In the evaluation phase, error will be measured as the ratio of miscategorized documents to the total number of documents. The research will show that error can be as low as 0.03 with a model built on a single node processing 1000 random distinct documents. In evaluating the algorithm, the experiments will show that Pera & Ng's fuzzy equivalence algorithm produces acceptable results when compared to Google News as a reference. The algorithm, however, requires a huge amount of memory to hold the trained model, which makes it unsuitable for portable devices.
author	Abuel-Futuh, Haytham
author_facet	Abuel-Futuh, Haytham
author_sort	Abuel-Futuh, Haytham
title	News Feeds Clustering Research Study
title_short	News Feeds Clustering Research Study
title_full	News Feeds Clustering Research Study
title_fullStr	News Feeds Clustering Research Study
title_full_unstemmed	News Feeds Clustering Research Study
title_sort	news feeds clustering research study
publisher	NSUWorks
publishDate	2015
url	http://nsuworks.nova.edu/gscis_etd/52 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1051&context=gscis_etd
work_keys_str_mv	AT abuelfutuhhaytham newsfeedsclusteringresearchstudy
_version_	1718248471699390464
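
The abstract above mentions two quantities that are easy to illustrate: the Jaccard coefficient between documents, and an error measured as the ratio of miscategorized documents to the total number of documents. The following is a minimal sketch of both, not the thesis's implementation and not the Pera & Ng fuzzy equivalence algorithm. The function names, the whitespace tokenization, and the assumption that predicted and reference topic labels are already aligned are all hypothetical choices made for illustration.

```python
def jaccard(text_a, text_b):
    """Jaccard coefficient between the word sets of two texts.

    Assumption: documents are compared as sets of lowercased,
    whitespace-separated words; the cited work may tokenize differently.
    """
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def error_ratio(predicted, reference):
    """Miscategorized documents divided by the total number of documents.

    Assumption: predicted[i] and reference[i] are topic labels for the
    same document, and both lists use the same label names.
    """
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("at least one document is required")
    wrong = sum(1 for p, r in zip(predicted, reference) if p != r)
    return wrong / len(reference)


if __name__ == "__main__":
    print(jaccard("stocks fall sharply", "stocks rise sharply"))
    # 1000 documents with 30 placed under the wrong topic
    reference = ["economy"] * 1000
    predicted = ["economy"] * 970 + ["sports"] * 30
    print(error_ratio(predicted, reference))
```

Under these assumptions, 30 miscategorized documents out of 1000 correspond to the error of 0.03 quoted in the abstract. In practice, cluster labels produced by an algorithm would first need to be matched to the reference topics before this ratio is meaningful.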

Similar Items