RCrawler: An R package for parallel web crawling and scraping

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applicat...

Bibliographic Details
Main Authors:	Salim Khalil, Mohamed Fakir
Format:	Article
Language:	English
Published:	Elsevier 2017-01-01
Series:	SoftwareX
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352711017300110

id	doaj-dc7250627556498dba12dd3f2ce84e13
record_format	Article
spelling	doaj-dc7250627556498dba12dd3f2ce84e13 | 2020-11-25T01:33:53Z | eng | Elsevier | SoftwareX | 2352-7110 | 2017-01-01 | 6 | 98 | 106 | RCrawler: An R package for parallel web crawling and scraping | Salim Khalil 0 | Mohamed Fakir 1 | Corresponding author.; Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco | Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco | RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results. | Keywords: Web crawler, Web scraper, R package, Parallel crawling, Web mining, Data collection | http://www.sciencedirect.com/science/article/pii/S2352711017300110
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Salim Khalil Mohamed Fakir
spellingShingle	Salim Khalil Mohamed Fakir RCrawler: An R package for parallel web crawling and scraping SoftwareX
author_facet	Salim Khalil Mohamed Fakir
author_sort	Salim Khalil
title	RCrawler: An R package for parallel web crawling and scraping
title_short	RCrawler: An R package for parallel web crawling and scraping
title_full	RCrawler: An R package for parallel web crawling and scraping
title_fullStr	RCrawler: An R package for parallel web crawling and scraping
title_full_unstemmed	RCrawler: An R package for parallel web crawling and scraping
title_sort	rcrawler: an r package for parallel web crawling and scraping
publisher	Elsevier
series	SoftwareX
issn	2352-7110
publishDate	2017-01-01
description	RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results. Keywords: Web crawler, Web scraper, R package, Parallel crawling, Web mining, Data collection
url	http://www.sciencedirect.com/science/article/pii/S2352711017300110
work_keys_str_mv	AT salimkhalil rcrawleranrpackageforparallelwebcrawlingandscraping AT mohamedfakir rcrawleranrpackageforparallelwebcrawlingandscraping
_version_	1725075171327344640

Similar Items