A SCALABLE SHALLOW LEARNING APPROACH FOR TAGGING ARABIC NEWS ARTICLES

Text classification is the process of automatically tagging a textual document with the most relevant set of labels. The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this goal, two large datasets have been constructed from various Arabic new...


Bibliographic Details
Main Authors: Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, Ashraf Elnagar
Format: Article
Language: English
Published: Scientific Research Support Fund of Jordan (SRSF) and Princess Sumaya University for Technology (PSUT) 2020-09-01
Series: Jordanian Journal of Computers and Information Technology
Online Access: https://jjcit.org/downloadfile/106
id doaj-b6cdb70484fd43088067f27595edbaf3
timestamp 2020-11-25T03:50:07Z
volume 06
issue 03
pages 263-280
doi 10.5455/jjcit.71-1585409230
affiliation Computer Science Department, University of Sharjah, UAE (all four authors)
collection DOAJ
issn 2413-9351
2415-1076
description Text classification is the process of automatically tagging a textual document with the most relevant set of labels. The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this goal, two large datasets were constructed from various Arabic news portals. The first dataset consists of 90k single-labeled articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged articles. The datasets will be made freely available to the research community on Arabic computational linguistics. To examine the usefulness of both datasets, we implemented an array of ten shallow learning classifiers. In addition, we implemented an ensemble model that combines the best classifiers in a majority-voting classifier. The performance of the classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label, as opposed to single-label, categorization for better classification results. We used classifiers that are compatible with multi-labeling tasks, such as Logistic Regression and XGBoost, and tested them on the second, larger dataset. A custom accuracy metric, designed for the multi-labeling task, was developed for performance evaluation, along with the Hamming loss metric. XGBoost proved to be the best multi-labeling classifier, scoring an accuracy of 91.3%, higher than the Logistic Regression score of 87.6%.
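The description names two generic techniques, majority voting over several classifiers and the Hamming loss for multi-label output. The sketch below illustrates both in plain Python. It is not the authors' implementation: the function names, the tie-breaking rule, and the toy labels are assumptions made for illustration, and the paper's custom accuracy metric is not reproduced because this record does not define it.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> list[str]:
    """Combine single-label predictions from several classifiers.

    predictions[i][j] is the label that classifier i assigns to document j.
    Assumption: a tie goes to the label that appears first among the votes.
    """
    n_docs = len(predictions[0])
    combined = []
    for j in range(n_docs):
        votes = Counter(clf_preds[j] for clf_preds in predictions)
        # max() returns the first maximal key in insertion order.
        combined.append(max(votes, key=votes.get))
    return combined


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of (document, label) slots where prediction and truth differ.

    Each row is a binary indicator vector with one entry per possible tag.
    """
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# Toy data (hypothetical): three classifiers voting on three articles.
votes = [
    ["Sports", "Business", "Technology"],
    ["Sports", "Technology", "Technology"],
    ["Business", "Business", "Middle East"],
]
ensemble_labels = majority_vote(votes)

# Toy multi-label data (hypothetical): two articles, four possible tags.
truth = [[1, 0, 1, 0], [0, 1, 0, 0]]
guess = [[1, 0, 0, 0], [0, 1, 0, 1]]
loss = hamming_loss(truth, guess)
```

A lower Hamming loss is better. It counts every wrongly set or wrongly missed tag, so a prediction that gets most of an article's tags right is penalized less than one that gets them all wrong.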
topic arabic text classification
single-label classification
multi-label classification
arabic datasets
shallow learning classifiers