Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents
Biomedicine is a pillar of the collective, scientific effort of human self-discovery, as well as a major source of humanistic data codified primarily in biomedical documents. Despite their rigid structure, maintaining and updating a considerably-sized collection of such documents is a task of overwh...
Main Authors: | Georgios Drakopoulos, Andreas Kanavos, Ioannis Karydis, Spyros Sioutas, Aristidis G. Vrahatis |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2017-07-01 |
Series: | Computation |
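Subjects: | humanistic data; higher order data; medical information retrieval; topic clustering; PubMed; MeSH Ontology; tensor algebra; Tucker factorization |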
Online Access: | https://www.mdpi.com/2079-3197/5/3/34 |
id |
doaj-2d1caeaada5c4b8e8ec495f1c9f03b4f |
---|---|
record_format |
Article |
spelling |
doaj-2d1caeaada5c4b8e8ec495f1c9f03b4f; 2020-11-24T23:40:14Z; eng; MDPI AG; Computation; ISSN 2079-3197; 2017-07-01; vol. 5, no. 3, art. 34; DOI 10.3390/computation5030034; Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents; Georgios Drakopoulos (Department of Informatics, Ionian University, Tsirigoti Square 7, Kerkyra 49100, Greece); Andreas Kanavos (Computer Engineering and Informatics Department, University of Patras, Patras 26504, Greece); Ioannis Karydis (Department of Informatics, Ionian University, Tsirigoti Square 7, Kerkyra 49100, Greece); Spyros Sioutas (Department of Informatics, Ionian University, Tsirigoti Square 7, Kerkyra 49100, Greece); Aristidis G. Vrahatis (Computer Engineering and Informatics Department, University of Patras, Patras 26504, Greece); https://www.mdpi.com/2079-3197/5/3/34 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Georgios Drakopoulos; Andreas Kanavos; Ioannis Karydis; Spyros Sioutas; Aristidis G. Vrahatis |
title |
Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents |
publisher |
MDPI AG |
series |
Computation |
issn |
2079-3197 |
publishDate |
2017-07-01 |
description |
Biomedicine is a pillar of the collective, scientific effort of human self-discovery, as well as a major source of humanistic data codified primarily in biomedical documents. Despite their rigid structure, maintaining and updating a considerably-sized collection of such documents is a task of overwhelming complexity mandating efficient information retrieval for the purpose of the integration of clustering schemes. The latter should work natively with inherently multidimensional data and higher order interdependencies. Additionally, past experience indicates that clustering should be semantically enhanced. Tensor algebra is the key to extending the current term-document model to more dimensions. In this article, an alternative keyword-term-document strategy, based on scientometric observations that keywords typically possess more expressive power than ordinary text terms, whose algorithmic cornerstones are third order tensors and MeSH ontological functions, is proposed. This strategy has been compared against a baseline using two different biomedical datasets, the TREC (Text REtrieval Conference) genomics benchmark and a large custom set of cognitive science articles from PubMed. |
topic |
humanistic data; higher order data; medical information retrieval; topic clustering; PubMed; MeSH Ontology; tensor algebra; Tucker factorization |
url |
https://www.mdpi.com/2079-3197/5/3/34 |