RCrawler: An R package for parallel web crawling and scraping
RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applicat...
Main Authors: | Salim Khalil, Mohamed Fakir |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2017-01-01 |
Series: | SoftwareX |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352711017300110 |
id |
doaj-dc7250627556498dba12dd3f2ce84e13 |
---|---|
record_format |
Article |
spelling |
doaj-dc7250627556498dba12dd3f2ce84e13; 2020-11-25T01:33:53Z; eng; Elsevier; SoftwareX; 2352-7110; 2017-01-01; 6; 98; 106; RCrawler: An R package for parallel web crawling and scraping; Salim Khalil 0; Mohamed Fakir 1; Corresponding author.; Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco; Department of Informatics, Faculty of Sciences and Technics Beni Mellal, Morocco; RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results. Keywords: Web crawler, Web scraper, R package, Parallel crawling, Web mining, Data collection; http://www.sciencedirect.com/science/article/pii/S2352711017300110 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Salim Khalil Mohamed Fakir |
spellingShingle |
Salim Khalil Mohamed Fakir RCrawler: An R package for parallel web crawling and scraping SoftwareX |
author_facet |
Salim Khalil Mohamed Fakir |
author_sort |
Salim Khalil |
title |
RCrawler: An R package for parallel web crawling and scraping |
title_short |
RCrawler: An R package for parallel web crawling and scraping |
title_full |
RCrawler: An R package for parallel web crawling and scraping |
title_fullStr |
RCrawler: An R package for parallel web crawling and scraping |
title_full_unstemmed |
RCrawler: An R package for parallel web crawling and scraping |
title_sort |
rcrawler: an r package for parallel web crawling and scraping |
publisher |
Elsevier |
series |
SoftwareX |
issn |
2352-7110 |
publishDate |
2017-01-01 |
description |
RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results. Keywords: Web crawler, Web scraper, R package, Parallel crawling, Web mining, Data collection |
url |
http://www.sciencedirect.com/science/article/pii/S2352711017300110 |
work_keys_str_mv |
AT salimkhalil rcrawleranrpackageforparallelwebcrawlingandscraping AT mohamedfakir rcrawleranrpackageforparallelwebcrawlingandscraping |
_version_ |
1725075171327344640 |