Corpulyzer: A Novel Framework for Building Low Resource Language Corpora
The rapid proliferation of artificial intelligence has led to the development of sophisticated, cutting-edge systems in the natural language processing and computational linguistics domains. These systems rely heavily on high-quality datasets and corpora for training deep-learning algorithms to develop...
Main Authors: | Bilal Tahir, Muhammad Amir Mehmood |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2021-01-01 |
Series: | IEEE Access |
Subjects: | Common crawl, web crawling, text corpus, corpus analysis, regional languages corpora |
Online Access: | https://ieeexplore.ieee.org/document/9316706/ |
id |
doaj-f69a413df6dc444faa54be072686a47b |
---|---|
record_format |
Article |
spelling |
doaj-f69a413df6dc444faa54be072686a47b; 2021-03-30T15:29:41Z; eng; IEEE; IEEE Access; 2169-3536; 2021-01-01; 9; 8546; 8563; 10.1109/ACCESS.2021.3049793; 9316706; Corpulyzer: A Novel Framework for Building Low Resource Language Corpora; Bilal Tahir (0), https://orcid.org/0000-0002-4907-0988; Muhammad Amir Mehmood (1), https://orcid.org/0000-0002-6652-5104; (0) Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan; (1) Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan; https://ieeexplore.ieee.org/document/9316706/; Common crawl; web crawling; text corpus; corpus analysis; regional languages corpora |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Bilal Tahir, Muhammad Amir Mehmood |
spellingShingle |
Bilal Tahir; Muhammad Amir Mehmood; Corpulyzer: A Novel Framework for Building Low Resource Language Corpora; IEEE Access; Common crawl; web crawling; text corpus; corpus analysis; regional languages corpora |
author_facet |
Bilal Tahir, Muhammad Amir Mehmood |
author_sort |
Bilal Tahir |
title |
Corpulyzer: A Novel Framework for Building Low Resource Language Corpora |
title_sort |
corpulyzer: a novel framework for building low resource language corpora |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2021-01-01 |
description |
The rapid proliferation of artificial intelligence has led to the development of sophisticated, cutting-edge systems in the natural language processing and computational linguistics domains. These systems rely heavily on high-quality datasets and corpora for training deep-learning algorithms to develop precise models. Preparing a high-quality, gold-standard corpus for natural language processing on a large scale is challenging because it requires huge computational resources, accurate language identification models, and precise content parsing tools. The task is even harder for regional languages because of the scarcity of web content. In this article, we propose Corpulyzer (Corpus Analyzer), a novel generic framework for building low-resource language corpora. Our framework consists of a corpus generation module and a corpus analyzer module. We demonstrate the efficacy of our framework by creating a high-quality, large-scale corpus for the Urdu language as a case study. First, leveraging data from the Common Crawl Corpus (CCC), we prepare a list of seed URLs by filtering for Urdu-language webpages. Next, we use Corpulyzer to crawl the World Wide Web (WWW) over a period of four years (2016-2020). We build the Urdu web corpus “UrduWeb20”, which consists of 8.0 million Urdu webpages crawled from 6,590 websites. In addition, we propose a Low-Resource Language (LRL) website scoring algorithm and a content-size filter for language-focused crawling to achieve optimal use of computational resources. Moreover, we analyze UrduWeb20 using a variety of traditional metrics, such as web-traffic rank, URL depth, duplicate documents, and vocabulary distribution, along with our newly defined content-richness metrics. Furthermore, we compare different characteristics of our corpus with three datasets of CCC. In general, we observe that, whereas CCC focuses on crawling a limited number of webpages from highly ranked Urdu websites, Corpulyzer performs in-depth crawling of content-rich Urdu websites. Finally, we have made the Corpulyzer framework available to the research community for corpus building. |
topic |
Common crawl, web crawling, text corpus, corpus analysis, regional languages corpora |
url |
https://ieeexplore.ieee.org/document/9316706/ |
work_keys_str_mv |
AT bilaltahir corpulyzeranovelframeworkforbuildinglowresourcelanguagecorpora AT muhammadamirmehmood corpulyzeranovelframeworkforbuildinglowresourcelanguagecorpora |
_version_ |
1724179478242918400 |