Short Text Embedding Autoencoders With Attention-Based Neighborhood Preservation


Bibliographic Details
Main Authors: Chao Wei, Lijun Zhu, Jiaoxiang Shi
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9284562/
Description
Summary: Shortness and sparsity often plague short text representations for clustering and classification. A popular remedy is to extract meaningful low-dimensional embeddings as short text representations via dimensionality reduction techniques. However, existing methods, such as topic models and neural networks, discover low-dimensional embeddings from the whole training set without considering the geometrical information of the short text manifold, and thus cannot provide discriminative embeddings of short texts. In this paper, we propose a manifold-regularized method, Short Texts Embedding AutoEncoders (STE-AEs), which incorporates semantics from the neighborhood into the regularized training of AutoEncoders (AEs) to extract discriminative low-dimensional short text embeddings. STE-AEs first determines semantic neighborhoods via an attention-based weighted matching distance and then preserves the local geometrical structure by adding a minimization of the weighted cross-entropy between nearby texts' embeddings to the regularized AE training objective. The trained encoder then acts as a parametrized mapping from observations to embeddings. Furthermore, based on the encoder's activation values on the training set, STE-AEs fits a Random Forest (RF) regression model to determine feature importance and thereby identify informative, readable words for embedding interpretation. Extensive experiments on three real-world short text corpora demonstrate that STE-AEs captures the intrinsic discriminative explanatory factors, improving short text clustering and classification performance. Moreover, understandable words can be efficiently discovered, promoting the interpretability of the low-dimensional embeddings.
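The objective described in the abstract, an AE reconstruction term plus a weighted cross-entropy regularizer over neighboring texts' embeddings, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `ste_ae_loss`, the `(i, j, w)` neighbor-pair format, and the regularization weight `lam` are assumptions, and the attention-based matching distance that produces the weights `w` is left outside the sketch.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, turning embedding rows into distributions."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ste_ae_loss(X, X_rec, Z, neighbor_pairs, lam=0.1):
    """Sketch of a neighborhood-regularized autoencoder objective.

    X             : original short text vectors (n_texts x n_features)
    X_rec         : decoder reconstructions of X
    Z             : encoder embeddings (n_texts x n_dims)
    neighbor_pairs: iterable of (i, j, w), where w is the similarity
                    weight of texts i and j (in the paper this would come
                    from an attention-based weighted matching distance)
    lam           : trade-off between reconstruction and regularization
    """
    # Standard AE reconstruction term.
    recon = np.mean((X - X_rec) ** 2)
    # Treat each embedding as a distribution and penalize the weighted
    # cross-entropy between embeddings of neighboring texts, which pulls
    # semantically close texts together in the embedding space.
    P = softmax(Z)
    reg = sum(-w * np.sum(P[i] * np.log(P[j] + 1e-12))
              for i, j, w in neighbor_pairs)
    return recon + lam * reg
```

For the interpretation step, one would then fit an RF regressor from the bag-of-words inputs to each embedding dimension and read off the learned feature importances to surface informative words; that part is omitted here to keep the sketch self-contained.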
ISSN:2169-3536