METHOD OF RARE TERM CONTRASTIVE EXTRACTION FROM NATURAL LANGUAGE TEXTS
The paper considers the problem of automatically extracting domain terms from a document corpus by means of a contrast collection. Existing contrastive methods extract frequently used terms successfully but mishandle rare terms, which can impoverish the resulting thesaurus. Assessment of point-wise mutua...
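
The abstract names point-wise mutual information (PMI) as the measure used to find rare terms but does not give the formula. The sketch below is therefore only one common formulation, assumed here for illustration: PMI(w, target) = log2(P(w | target) / P(w)), with P(w) estimated over the target and contrast collections together. The function name `pmi_scores` and the toy token lists are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def pmi_scores(target_tokens, contrast_tokens):
    """Score each word of the target collection by PMI with that collection.

    Assumed formulation (not taken from the paper):
    PMI(w, target) = log2(P(w | target) / P(w)),
    where P(w) is estimated over target and contrast tokens together.
    """
    target = Counter(target_tokens)
    contrast = Counter(contrast_tokens)
    n_target = sum(target.values())
    n_total = n_target + sum(contrast.values())

    scores = {}
    for word, count in target.items():
        p_word_given_target = count / n_target
        p_word = (count + contrast[word]) / n_total  # Counter returns 0 if absent
        scores[word] = math.log2(p_word_given_target / p_word)
    return scores


# Toy data, invented for illustration only.
target_tokens = "the basalt and the granite and the rare olivine".split()
contrast_tokens = "the election and the budget and the vote and the law".split()

scores = pmi_scores(target_tokens, contrast_tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:+.3f}")
```

In this toy run every word that occurs only in the target collection receives the same top score, so the ordinary word "rare" ties with "basalt". That is the kind of false candidate the abstract says PMI produces, and the reason a second filtering step is needed.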
Main Authors: | I. A. Bessmertny, A. B. Nugumanova, M. Y. Mansurova, Y. M. Baiburin |
---|---|
Format: | Article |
Language: | English |
Published: | Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University), 2017-01-01 |
Series: | Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki |
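Subjects: | contrastive term extraction; termhood; mutual information; semantic connections; rare term extraction |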
Online Access: | http://ntv.ifmo.ru/file/article/16409.pdf |
id |
doaj-a4913997103449ae83edd69c5d54b71e |
---|---|
record_format |
Article |
spelling |
doaj-a4913997103449ae83edd69c5d54b71e; 2020-11-24T22:56:43Z; eng; Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University); Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki; ISSN 2226-1494, 2500-0373; 2017-01-01; vol. 17, no. 1, pp. 81-91; DOI 10.17586/2226-1494-2017-17-1-81-91; METHOD OF RARE TERM CONTRASTIVE EXTRACTION FROM NATURAL LANGUAGE TEXTS; I. A. Bessmertny (D.Sc., Associate Professor, ITMO University, Saint Petersburg, 197101, Russian Federation), A. B. Nugumanova (PhD, Senior Lecturer, Amanzholov East Kazakhstan State University, Ust Kamenogorsk, 070004, Republic of Kazakhstan), M. Y. Mansurova (PhD, Associate Professor, Al-Farabi Kazakh National University, Almaty, 050040, Republic of Kazakhstan), Y. M. Baiburin (software engineer, S. Amanzholov East Kazakhstan State University, Ust Kamenogorsk, 070004, Republic of Kazakhstan); The paper considers the problem of automatically extracting domain terms from a document corpus by means of a contrast collection. Existing contrastive methods extract frequently used terms successfully but mishandle rare terms, which can impoverish the resulting thesaurus. Point-wise mutual information is a well-known statistical measure for term extraction that finds rare terms successfully, but it also extracts many false terms. The proposed approach applies point-wise mutual information to extract rare terms and then filters the candidates by the criterion of joint occurrence with the other candidates. We build a “documents-by-terms” matrix and subject it to singular value decomposition to eliminate noise and reveal strong interconnections. We then pass to the resulting “terms-by-terms” matrix, which reflects the strength of interconnections between words. The approach was tested on a document collection from the “Geology” domain, using contrast documents on topics such as “Politics”, “Culture”, “Economics” and “Accidents” drawn from Internet resources. The experimental results demonstrate that the method is workable for rare term extraction.; http://ntv.ifmo.ru/file/article/16409.pdf; contrastive term extraction, termhood, mutual information, semantic connections, rare term extraction |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
I. A. Bessmertny, A. B. Nugumanova, M. Y. Mansurova, Y. M. Baiburin |
author_sort |
I. A. Bessmertny |
title |
METHOD OF RARE TERM CONTRASTIVE EXTRACTION FROM NATURAL LANGUAGE TEXTS |
publisher |
Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) |
series |
Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki |
issn |
2226-1494, 2500-0373 |
publishDate |
2017-01-01 |
description |
The paper considers the problem of automatically extracting domain terms from a document corpus by means of a contrast collection. Existing contrastive methods extract frequently used terms successfully but mishandle rare terms, which can impoverish the resulting thesaurus. Point-wise mutual information is a well-known statistical measure for term extraction that finds rare terms successfully, but it also extracts many false terms. The proposed approach applies point-wise mutual information to extract rare terms and then filters the candidates by the criterion of joint occurrence with the other candidates. We build a “documents-by-terms” matrix and subject it to singular value decomposition to eliminate noise and reveal strong interconnections. We then pass to the resulting “terms-by-terms” matrix, which reflects the strength of interconnections between words. The approach was tested on a document collection from the “Geology” domain, using contrast documents on topics such as “Politics”, “Culture”, “Economics” and “Accidents” drawn from Internet resources. The experimental results demonstrate that the method is workable for rare term extraction. |
topic |
contrastive term extraction; termhood; mutual information; semantic connections; rare term extraction |
url |
http://ntv.ifmo.ru/file/article/16409.pdf |
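
The filtering step outlined in the description field (a documents-by-terms matrix, singular value decomposition, then a terms-by-terms matrix of interconnection strengths) can be sketched roughly as below. The record does not specify the weighting, the retained rank, the similarity measure, or the filtering rule, so every such choice here is an assumption: raw counts, cosine similarity between rank-reduced term vectors, and keeping a candidate only if its strongest link to another candidate reaches a threshold. The names `term_term_matrix` and `filter_candidates`, the values `rank=2` and `threshold=0.5`, and the toy matrix are hypothetical.

```python
import numpy as np


def term_term_matrix(doc_term, rank):
    """Return cosine similarities between terms after rank reduction by SVD."""
    # doc_term has one row per document and one column per term.
    _, s, vt = np.linalg.svd(doc_term, full_matrices=False)
    # Rows of V_k * S_k are term vectors; their Gram matrix equals A_k^T A_k,
    # the terms-by-terms matrix of the rank-k reconstruction A_k.
    term_vectors = vt[:rank].T * s[:rank]
    norms = np.linalg.norm(term_vectors, axis=1, keepdims=True)
    # Terms that vanish in the reduced space get zero similarity to everything.
    safe_norms = np.where(norms > 1e-12, norms, np.inf)
    unit = term_vectors / safe_norms
    return unit @ unit.T


def filter_candidates(terms, doc_term, rank=2, threshold=0.5):
    """Keep candidates whose strongest link to another candidate >= threshold."""
    sim = term_term_matrix(np.asarray(doc_term, dtype=float), rank)
    np.fill_diagonal(sim, -np.inf)  # ignore each term's similarity to itself
    strongest = sim.max(axis=1)
    return [t for t, value in zip(terms, strongest) if value >= threshold]


# Toy counts, invented for illustration: rows are documents, columns are terms.
terms = ["basalt", "granite", "olivine", "yesterday"]
doc_term = [
    [2, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 1],
]

kept = filter_candidates(terms, doc_term, rank=2, threshold=0.5)
print(kept)
```

In this deliberately contrived example the three geological words co-occur across documents and survive the filter, while "yesterday", which never occurs together with another candidate, is dropped. On real data the rank and threshold would have to be tuned, and the authors' actual criterion may differ from this one.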