BTM and GloVe Similarity Linear Fusion-Based Short Text Clustering Algorithm for Microblog Hot Topic Discovery

Microblog hot topic discovery is an active research area in text mining. The distance function of traditional K-means yields low clustering accuracy, which in turn degrades hot topic discovery. Three definitions are proposed in this paper: title words and body words, positional cont...

Bibliographic Details
Main Authors: Di Wu, Mengtian Zhang, Chao Shen, Zhuyun Huang, Mingxing Gu
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects:
BTM
WMD
Online Access: https://ieeexplore.ieee.org/document/8995502/
id doaj-b63adf97f3b9413c81629b5030fd4cc2
record_format Article
spelling doaj-b63adf97f3b9413c81629b5030fd4cc2
2021-03-30T01:09:45Z
eng
IEEE
IEEE Access
2169-3536
2020-01-01
8
32215
32225
10.1109/ACCESS.2020.2973430
8995502
BTM and GloVe Similarity Linear Fusion-Based Short Text Clustering Algorithm for Microblog Hot Topic Discovery
Di Wu (0) https://orcid.org/0000-0002-1118-4736
Mengtian Zhang (1) https://orcid.org/0000-0001-6812-4753
Chao Shen (2) https://orcid.org/0000-0002-8443-1224
Zhuyun Huang (3) https://orcid.org/0000-0002-9161-7764
Mingxing Gu (4) https://orcid.org/0000-0002-5635-2262
(0)-(4) Department of Information and Electronic Engineering, Hebei University of Engineering, Handan, China
https://ieeexplore.ieee.org/document/8995502/
BTM
GloVe
microblog hot topic discovery
similarity linear fusion
WMD
collection DOAJ
language English
format Article
sources DOAJ
author Di Wu
Mengtian Zhang
Chao Shen
Zhuyun Huang
Mingxing Gu
title BTM and GloVe Similarity Linear Fusion-Based Short Text Clustering Algorithm for Microblog Hot Topic Discovery
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description Microblog hot topic discovery is an active research area in text mining. The distance function of traditional K-means yields low clustering accuracy, which in turn degrades hot topic discovery. This paper proposes three definitions: title words and body words, positional contribution-based weight, and fusion similarity-based distance. It further proposes a short text clustering algorithm based on linear fusion of BTM and GloVe similarities (BG & SLF-Kmeans). BTM and GloVe are used to model the preprocessed microblog short texts. JS divergence measures text similarity under BTM topic modeling, and WMD with improved word weights (IWMD) measures text similarity under GloVe word vector modeling. The two similarities are then linearly fused and used as the distance function for K-means clustering. The algorithm yields specific word sets for 6 hot topics, from which microblog hot topics can be discovered. The experimental results show that BG & SLF-Kmeans significantly improves clustering accuracy compared with TF-IDF & K-means, BTM & K-means, and BTF & SLF-Kmeans.
topic BTM
GloVe
microblog hot topic discovery
similarity linear fusion
WMD
url https://ieeexplore.ieee.org/document/8995502/
_version_ 1724187624727379968
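
The description above says that a JS-divergence measure over BTM topic distributions and an IWMD measure over GloVe word vectors are linearly fused into a single distance for K-means. The following is a minimal sketch of that fusion step only, not the authors' implementation. It assumes each text already has a topic distribution and that the word-vector distance has been computed elsewhere and normalized to [0, 1]. The names `fused_distance` and `word_distance` and the weight `lam` are hypothetical; the record does not give the paper's actual fusion weight or the IWMD weighting scheme.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def fused_distance(topic_p, topic_q, word_distance, lam=0.5):
    """Linear fusion of a topic-level and a word-level distance.

    topic_p, topic_q: per-text topic distributions (e.g. from a topic model).
    word_distance:    stand-in for a normalized word-vector distance
                      such as IWMD (assumed to be in [0, 1]).
    lam:              hypothetical fusion weight, not taken from the paper.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * js_divergence(topic_p, topic_q) + (1 - lam) * word_distance


if __name__ == "__main__":
    doc_a = [0.7, 0.2, 0.1]  # illustrative topic distributions
    doc_b = [0.1, 0.3, 0.6]
    print(fused_distance(doc_a, doc_b, word_distance=0.4, lam=0.5))
```

Because the fused value is no longer a Euclidean distance, a clustering loop built on it would need a center-update rule that works with arbitrary distances; how the paper handles this is not stated in this record.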