RCrawler: An R package for parallel web crawling and scraping

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applicat...


Bibliographic Details
Main Authors: Salim Khalil, Mohamed Fakir
Format: Article
Language: English
Published: Elsevier 2017-01-01
Series: SoftwareX
Online Access: http://www.sciencedirect.com/science/article/pii/S2352711017300110
id doaj-dc7250627556498dba12dd3f2ce84e13
record_format Article
spelling doaj-dc7250627556498dba12dd3f2ce84e13; 2020-11-25T01:33:53Z; eng; Elsevier; SoftwareX; ISSN 2352-7110; 2017-01-01; vol. 6, pp. 98-106
RCrawler: An R package for parallel web crawling and scraping
Salim Khalil (corresponding author), Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco
Mohamed Fakir, Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco
collection DOAJ
language English
format Article
sources DOAJ
author Salim Khalil
Mohamed Fakir
title RCrawler: An R package for parallel web crawling and scraping
publisher Elsevier
series SoftwareX
issn 2352-7110
publishDate 2017-01-01
description RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robots.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.
Keywords: Web crawler, Web scraper, R package, Parallel crawling, Web mining, Data collection
url http://www.sciencedirect.com/science/article/pii/S2352711017300110