METHOD OF RARE TERM CONTRASTIVE EXTRACTION FROM NATURAL LANGUAGE TEXTS

The paper considers the problem of automatically extracting domain terms from a document corpus by means of a contrast collection. Existing contrastive methods successfully extract frequently used terms but mishandle rare ones, which can impoverish the resulting thesaurus. Assessment of point-wise mutua...

Bibliographic Details
Main Authors: I. A. Bessmertny, A. B. Nugumanova, M. Y. Mansurova, Y. M. Baiburin
Format: Article
Language: English
Published: Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) 2017-01-01
Series: Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
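Subjects: contrastive term extraction; termhood; mutual information; semantic connections; rare term extraction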
Online Access: http://ntv.ifmo.ru/file/article/16409.pdf
id doaj-a4913997103449ae83edd69c5d54b71e
record_format Article
spelling doaj-a4913997103449ae83edd69c5d54b71e 2020-11-24T22:56:43Z eng
volume 17, issue 1, pages 81-91
doi 10.17586/2226-1494-2017-17-1-81-91
author affiliations
I. A. Bessmertny: D.Sc., Associate Professor, ITMO University, Saint Petersburg, 197101, Russian Federation
A. B. Nugumanova: PhD, Senior Lecturer, S. Amanzholov East Kazakhstan State University, Ust Kamenogorsk, 070004, Republic of Kazakhstan
M. Y. Mansurova: PhD, Associate Professor, Al-Farabi Kazakh National University, Almaty, 050040, Republic of Kazakhstan
Y. M. Baiburin: software engineer, S. Amanzholov East Kazakhstan State University, Ust Kamenogorsk, 070004, Republic of Kazakhstan
collection DOAJ
sources DOAJ
issn 2226-1494
2500-0373
publishDate 2017-01-01
description The paper considers the problem of automatically extracting domain terms from a document corpus by means of a contrast collection. Existing contrastive methods successfully extract frequently used terms but mishandle rare ones, which can impoverish the resulting thesaurus. Point-wise mutual information is a well-known statistical measure for term extraction that finds rare terms successfully, but it also extracts many false terms. The proposed approach applies point-wise mutual information to extract rare terms and then filters the candidates by their joint occurrence with the other candidates. We build a “documents-by-terms” matrix and subject it to singular value decomposition to eliminate noise and reveal strong interconnections. We then pass to the resulting “terms-by-terms” matrix, which reflects the strength of interconnections between words. The approach was tested on a collection of documents from the “Geology” domain, with contrast documents drawn from some Internet resources on such topics as “Politics”, “Culture”, “Economics” and “Accidents”. The experimental results demonstrate that the method is workable for rare term extraction.
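The two stages named in the description (point-wise mutual information against a contrast collection, then a terms-by-terms matrix obtained through singular value decomposition) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the toy corpora, the function names `pmi_scores` and `term_term_matrix`, the exact PMI estimate, the "PMI > 0" candidate rule, and the rank `k` are all assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

import numpy as np


def pmi_scores(target_docs, contrast_docs):
    """PMI between each word and the target domain (assumed formulation).

    PMI(w, target) = log2( P(w, target) / (P(w) * P(target)) ),
    with probabilities estimated from token counts over both collections.
    """
    target = Counter(w for doc in target_docs for w in doc)
    contrast = Counter(w for doc in contrast_docs for w in doc)
    n_target = sum(target.values())
    n_total = n_target + sum(contrast.values())
    p_target = n_target / n_total
    scores = {}
    for w, c in target.items():
        p_joint = c / n_total                # word seen inside the target domain
        p_w = (c + contrast[w]) / n_total    # word seen anywhere (Counter gives 0 if absent)
        scores[w] = math.log2(p_joint / (p_w * p_target))
    return scores


def term_term_matrix(docs, terms, k):
    """Rank-k SVD approximation X_k of the documents-by-terms matrix,
    followed by the terms-by-terms matrix X_k^T X_k."""
    index = {t: j for j, t in enumerate(terms)}
    x = np.zeros((len(docs), len(terms)))
    for i, doc in enumerate(docs):
        for w in doc:
            if w in index:
                x[i, index[w]] += 1.0
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    k = min(k, len(s))
    x_k = (u[:, :k] * s[:k]) @ vt[:k, :]     # keep only the k largest singular values
    return x_k.T @ x_k


# Hypothetical toy data: already tokenized "Geology" and contrast documents.
target_docs = [
    ["granite", "basalt", "rock", "the"],
    ["basalt", "magma", "rock", "the"],
    ["granite", "magma", "the", "news"],
]
contrast_docs = [
    ["the", "election", "news", "the"],
    ["the", "market", "news", "price"],
]

scores = pmi_scores(target_docs, contrast_docs)
# Assumed candidate rule: keep words that are positively associated with the domain.
candidates = sorted(w for w, v in scores.items() if v > 0)
assoc = term_term_matrix(target_docs, candidates, k=2)

for j, term in enumerate(candidates):
    row = assoc[j].copy()
    row[j] = -np.inf                         # ignore self-association
    print(term, "-> strongest partner:", candidates[int(np.argmax(row))])
```

A filtering rule would then discard candidates whose associations with every other candidate stay below some threshold; the threshold and the choice of `k` would need tuning on real data and are left open here.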
topic contrastive term extraction
termhood
mutual information
semantic connections
rare term extraction