Features for Forming Text Corpus of Kazakhstan Electronic News

The culture of online-news consumption continues to take shape and is gaining popularity, increasing the audience of readers. At the same time, the number of those who fall under the negative influence of false news is growing. Researchers are faced with the task of analyzing mass media. One of the...

Full description

Bibliographic Details
Main Authors: Ulzhan Ospanova, Mukhit Baimakhanbetov, Inessa Akoyeva, Timur Buldybayev, Miraim Atanayeva
Format: Article
Language:Russian
Published: The Fund for Promotion of Internet media, IT education, human development «League Internet Media» 2020-05-01
Series:Современные информационные технологии и IT-образование
Subjects:
Online Access:http://sitito.cs.msu.ru/index.php/SITITO/article/view/612
Description
Summary:The culture of online-news consumption continues to take shape and is gaining popularity, increasing the audience of readers. At the same time, the number of those who fall under the negative influence of false news is growing. Researchers are faced with the task of analyzing mass media. One of the areas of news content analysis is thematic modelling, recognition of fake news, sentiment analysis. However, to research these areas, there is a need in a labelled corpus. This paper presents the methodological foundations of the corpus formation. It describes the process of data collection and the selection of sources to form the corpus. It also presents a description of the theoretical foundations of representativeness and balance and explains compliance of the corpus with the requirements. In the course of the composite work, authors gained a corpus of 1.9 million news texts from 22 news sources. They conducted corpus markup and carried-up the analysis of the thematic structure of the formed corps using the LDA model. The formed corpus will allow testing machine learning algorithms aimed at recognizing individual informative features and identifying patterns that are present in the array of news publications. Also, the corpus will be useful to machine learning and NLP researchers to test machine learning algorithms according to their own goals.
ISSN:2411-1473