Profile-Based Focused Crawling for Social Media-Sharing Websites

We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user's profile into two...

Bibliographic Details
Main Authors: Zhang Zhiyong, Nasraoui Olfa
Format: Article
Language: English
Published: SpringerOpen 2009-01-01
Series: EURASIP Journal on Image and Video Processing
Online Access: http://jivp.eurasipjournals.com/content/2009/856037
id doaj-3bd5c2fa3d764c44aff14b7cf85365ab
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Zhang Zhiyong
Nasraoui Olfa
author_facet Zhang Zhiyong
Nasraoui Olfa
author_sort Zhang Zhiyong
title Profile-Based Focused Crawling for Social Media-Sharing Websites
publisher SpringerOpen
series EURASIP Journal on Image and Video Processing
issn 1687-5176
1687-5281
publishDate 2009-01-01
description We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user's profile into two parts, an internal part, which comes from the user's own contribution, and an external part, which comes from the user's social contacts. In order to expand the crawling topic, a cotagging topic-discovery scheme was adopted for social media-sharing websites. In order to efficiently and effectively extract data for the focused crawling, a path string-based page classification method is first developed for identifying list pages, detail pages, and profile pages. The identification of the correct type of page is essential for our crawling, since we want to distinguish between list, profile, and detail pages in order to extract the correct information from each type of page, and subsequently estimate a reasonable ranking for each link that is encountered while crawling. Our experiments prove the robustness of our profile-based focused crawler, as well as a significant improvement in harvest ratio, compared to breadth-first and online page importance computation (OPIC) crawlers, when crawling the Flickr website for two different topics.
url http://jivp.eurasipjournals.com/content/2009/856037
work_keys_str_mv AT zhangzhiyong profilebasedfocusedcrawlingforsocialmediasharingwebsites
AT nasraouiolfa profilebasedfocusedcrawlingforsocialmediasharingwebsites
_version_ 1725233607474151424