Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Cross-lingual plagiarism occurs when the source (or original) text(s) is in one language and the plagiarized text is in another language. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community because a large amount of digital text is easily accessi...

Bibliographic Details
Main Authors:	Israr Haneef, Rao Muhammad Adeel Nawab, Ehsan Ullah Munir, Imran Sarwar Bajwa
Format:	Article
Language:	English
Published:	Hindawi Limited 2019-01-01
Series:	Scientific Programming
Online Access:	http://dx.doi.org/10.1155/2019/2962040

id	doaj-532e6f315cc44e72a9a401da52591432
record_format	Article
spelling	doaj-532e6f315cc44e72a9a401da52591432; 2021-07-02T10:49:34Z; eng; Hindawi Limited; Scientific Programming; 1058-9244; 1875-919X; 2019-01-01; 2019; 10.1155/2019/2962040; 2962040; Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair; Israr Haneef (0); Rao Muhammad Adeel Nawab (1); Ehsan Ullah Munir (2); Imran Sarwar Bajwa (3); Department of Computer Science, COMSATS Institute of Information Technology, Wah Campus, Wah Cantonment, Pakistan; Department of Computer Science, COMSATS Institute of Information Technology, Lahore Campus, Lahore, Pakistan; Department of Computer Science, COMSATS Institute of Information Technology, Wah Campus, Wah Cantonment, Pakistan; Department of Computer Science, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Israr Haneef; Rao Muhammad Adeel Nawab; Ehsan Ullah Munir; Imran Sarwar Bajwa
author_facet	Israr Haneef; Rao Muhammad Adeel Nawab; Ehsan Ullah Munir; Imran Sarwar Bajwa
author_sort	Israr Haneef
title	Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair
publisher	Hindawi Limited
series	Scientific Programming
issn	1058-9244 1875-919X
publishDate	2019-01-01
description	Cross-lingual plagiarism occurs when the source (or original) text is in one language and the plagiarized text is in another language. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community because a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available, making it easier to commit cross-lingual plagiarism and harder to detect it. To develop and evaluate cross-lingual plagiarism detection systems, standard evaluation resources are needed. The majority of earlier studies have developed cross-lingual plagiarism corpora for English and other European language pairs. However, for the Urdu-English language pair, the problem of cross-lingual plagiarism detection has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and it is spoken in many countries (particularly Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 automatic translations, 539 artificially paraphrased, 508 manually paraphrased, and 808 nonplagiarized). Furthermore, the proposed corpus contains three types of cross-lingual examples: artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), which have not been previously reported in the development of cross-lingual corpora. Detailed analysis of the proposed corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help to foster research on Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. Our proposed corpus is free and publicly available for research purposes.
url	http://dx.doi.org/10.1155/2019/2962040
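
The description above mentions n-gram overlap and longest common subsequence (LCS) as the measures used to analyse the corpus. The record does not give the exact formulas, so the following is only a minimal sketch under stated assumptions: n-gram overlap is taken to be the containment measure (shared n-grams divided by the n-grams of the suspicious text), LCS is normalised by the shorter text, tokenisation is plain whitespace splitting, and both texts are assumed to already be in the same language. All function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious: str, source: str, n: int = 1) -> float:
    """Assumed overlap measure: share of suspicious n-grams also found in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(suspicious: str, source: str) -> float:
    """Assumed normalisation: LCS length divided by the length of the shorter text."""
    a, b = suspicious.lower().split(), source.lower().split()
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0
```

With these definitions, a suspicious text whose words all occur in the source scores 1.0 on word unigrams, while heavier paraphrasing lowers both scores. The ordering of the mean scores reported in the description (1.00, 0.68, 0.52, 0.22) is consistent with that behaviour, though the paper's own implementation may differ.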

Similar Items