Hierarchical Dirichlet Multinomial Allocation Model for Multi-Source Document Clustering

Mining a document structure from multiple data sources in terms of their underlying topics has become an important task of document clustering. The traditional document clustering approach cannot be applied directly to the multi-source document clustering problem. There are three typical difficultie...

Full description

Bibliographic Details
Main Authors: Ruizhang Huang, Weijia Xu, Yongbin Qin, Yanping Chen
Format: Article
Language:English
Published: IEEE 2020-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/9115621/
Description
Summary:Mining a document structure from multiple data sources in terms of their underlying topics has become an important task of document clustering. The traditional document clustering approach cannot be applied directly to the multi-source document clustering problem. There are three typical difficulties: 1) The topics of different data sources are related but not the same. 2) Usually, each data source has its own focus on topics. 3) The number of clusters of the data sources are not necessarily the same and are not known beforehand. In this paper, based on our previous research, we design a novel multi-source document clustering model, namely, the hierarchical Dirichlet multinomial allocation (HDMA) model, to solve all the above problems. The HDMA model is investigated with a two-step hierarchical topic generation process. Topics are learnt to share their general characteristics across data source, while at the same time preserve the local characteristics of the data source. Each data source is applied with an exclusive topic partition to learn the source-level topic emphasis. A Gibbs sampling algorithm is then used to learn the number of clusters for each data source as well as the parameters of the HDMA model at the same time. Experimental results demonstrate that the HDMA model is effective.
ISSN:2169-3536