Cluster Identification : Topic Models, Matrix Factorization And Concept Association Networks

The problem of identifying clusters arising in the context of topic models and related approaches is important in the area of machine learning. The problem concerning traversals on Concept Association Networks is of great interest in the area of cognitive modelling. Cluster identification is the pro...


Bibliographic Details
Main Author:	Arun, R
Other Authors:	Veni Madhavan, C E
Language:	en_US
Published:	2013
Subjects:	Machine Learning; Clustering (Concepts); Association Networks; Concept Association Networks (CAN); Latent Dirichlet Allocation (LDA); Matrix Factorization; Concept Association Networks - Clustering; Concept Association Networks - Traversals; Entropy (Information Theory); Cognitive Clustering; Cluster Identification; Non-negative Matrix Factorization (NMF); Computer Science
Online Access:	http://etd.iisc.ernet.in/handle/2005/2247 http://etd.ncsi.iisc.ernet.in/abstracts/2861/G24683-Abs.pdf

id	ndltd-IISc-oai-etd.ncsi.iisc.ernet.in-2005-2247
record_format	oai_dc
spelling	ndltd-IISc-oai-etd.ncsi.iisc.ernet.in-2005-2247 | 2018-01-10T03:36:31Z | Arun, R | Veni Madhavan, C E | 2013-09-17T07:31:21Z | 2013-09-17T07:31:21Z | 2013-09-17 | 2010-07 | Thesis | http://etd.iisc.ernet.in/handle/2005/2247 | http://etd.ncsi.iisc.ernet.in/abstracts/2861/G24683-Abs.pdf | en_US | G24683
collection	NDLTD
language	en_US
sources	NDLTD
topic	Machine Learning; Clustering (Concepts); Association Networks; Concept Association Networks (CAN); Latent Dirichlet Allocation (LDA); Matrix Factorization; Concept Association Networks - Clustering; Concept Association Networks - Traversals; Entropy (Information Theory); Cognitive Clustering; Cluster Identification; Non-negative Matrix Factorization (NMF); Computer Science
description	The problem of identifying clusters arising in the context of topic models and related approaches is important in machine learning, and the problem of traversals on Concept Association Networks is of great interest in cognitive modelling. Cluster identification is the problem of finding the right number of clusters in a given set of points (or a dataset) in different settings, including topic models and matrix factorization algorithms. Traversals in Concept Association Networks provide useful insights into cognitive modelling and performance. First, we consider the problem of authorship attribution in stylometry and the problem of cluster identification for topic models. For authorship attribution, we show empirically that, when stop-words are used as stylistic features of an author, vectors obtained from Latent Dirichlet Allocation (LDA) outperform other classifiers. Topics obtained by this method are generally abstract, and it may not be possible to judge the cohesiveness of the words falling in the same topic by manual inspection alone. Hence it is difficult to determine whether the chosen number of topics is optimal. We address this issue next. We propose a new measure for topics arising out of LDA, based on the divergence between the singular value distribution and the L1 norm distribution of the document-topic and topic-word matrices, respectively. It is shown that, under certain assumptions, this measure can be used to find the right number of topics. Next we consider the Non-negative Matrix Factorization (NMF) approach for clustering documents. We propose entropy-based regularization for a variant of NMF with row-stochastic constraints on the component matrices. It is shown that when topic-splitting occurs (i.e., when an extra topic is required), an existing topic vector splits into two and the divergence term in the cost function decreases whereas the entropy term increases, leading to a regularization. Next we consider the problem of clustering in Concept Association Networks (CAN). CANs are generic graph models of relationships between abstract concepts. We propose a simple clustering algorithm that takes into account the complex network properties of CAN. The performance of the algorithm is compared with that of the graph-cut based spectral clustering algorithm. In addition, we study the properties of traversals by human participants on CAN. We obtain experimental results contrasting these traversals with those obtained from (i) random walk simulations and (ii) shortest path algorithms.
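Illustration (not part of the catalog record): the abstract describes a measure based on the divergence between a singular value distribution and an L1 norm distribution taken from the document-topic and topic-word matrices. The sketch below is one possible reading of that sentence, not the thesis's actual definition. The function names `symmetric_kl` and `topic_count_score`, the use of a symmetric Kullback-Leibler divergence, the sorting step, and the normalization are all assumptions made for illustration.

```python
import numpy as np


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two non-negative vectors.

    Both vectors are smoothed by eps and normalized to sum to 1
    (assumption: the record does not say which divergence is used).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def topic_count_score(doc_topic, topic_word):
    """Hypothetical score for a model with k topics.

    doc_topic:  (n_docs, k) non-negative document-topic matrix
    topic_word: (k, vocab)  non-negative topic-word matrix
    Lower values mean the two distributions agree more closely.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    topic_word = np.asarray(topic_word, dtype=float)
    k = topic_word.shape[0]
    if doc_topic.shape[1] != k or doc_topic.shape[0] < k:
        raise ValueError("expected doc_topic of shape (n_docs >= k, k)")

    # k singular values of the document-topic matrix (returned in descending order).
    singular_values = np.linalg.svd(doc_topic, compute_uv=False)
    # L1 norm of each topic's row in the topic-word matrix, sorted to match.
    l1_norms = np.sort(np.abs(topic_word).sum(axis=1))[::-1]

    return symmetric_kl(singular_values, l1_norms)
```

Under these assumptions, one would fit a model for each candidate number of topics, compute `topic_count_score` on the resulting pair of matrices, and compare the scores across candidates. Unnormalized topic-word counts are the more informative input here, since row-normalized topics all have L1 norm 1.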
author2	Veni Madhavan, C E
author_facet	Veni Madhavan, C E Arun, R
author	Arun, R
author_sort	Arun, R
title	Cluster Identification : Topic Models, Matrix Factorization And Concept Association Networks
publishDate	2013
url	http://etd.iisc.ernet.in/handle/2005/2247 http://etd.ncsi.iisc.ernet.in/abstracts/2861/G24683-Abs.pdf
work_keys_str_mv	AT arunr clusteridentificationtopicmodelsmatrixfactorizationandconceptassociationnetworks
_version_	1718603707518550016


Similar Items