Scalable Community Detection in Massive Networks using Aggregated Relational Data

The analysis of networks is used in many fields of study including statistics, social science, computer sciences, physics, and biology. The interest in networks is diverse as it usually depends on the field of study. For instance, social scientists are interested in interpreting how edges arise, whi...


Bibliographic Details
Main Author:	Jones, Timothy
Language:	English
Published:	2019
Subjects:	Statistics; System analysis; Relational databases
Online Access:	https://doi.org/10.7916/d8-pxvq-j271

id	ndltd-columbia.edu-oai-academiccommons.columbia.edu-10.7916-d8-pxvq-j271
record_format	oai_dc
collection	NDLTD
language	English
sources	NDLTD
topic	Statistics; System analysis; Relational databases
spellingShingle	Statistics System analysis Relational databases Jones, Timothy Scalable Community Detection in Massive Networks using Aggregated Relational Data
description	The analysis of networks is used in many fields of study, including statistics, social science, computer science, physics, and biology. Interest in networks is diverse and usually depends on the field of study. For instance, social scientists are interested in interpreting how edges arise, while biologists seek to understand underlying biological processes. Among the problems explored in network analysis, community detection stands out as one of the most important. Community detection seeks to find groups of nodes with a large concentration of links within groups but few between them. Inferring groups is important in many applications because the groups are used for further downstream analysis. For example, identifying clusters of consumers with similar purchasing behavior in a customer and product network can be used to create better recommendation systems. Finding a node with a high concentration of its edges to other nodes in the community may give insight into how the community formed. Many statistical models for networks implicitly define the notion of a community. Statistical inference aims to fit a model that posits how vertices are connected to each other. One of the most common models for community detection is the stochastic block model (SBM) [Holland et al., 1983]. Although simple, it is a highly expressive family of random graphs. However, it does have drawbacks. First, it does not capture the degree distribution of real-world networks. Second, it allows each node to belong to only one community. In many applications, it is useful to consider overlapping communities. The Mixed Membership Stochastic Blockmodel (MMSB) is a Bayesian extension of the SBM that allows nodes to belong to multiple communities. Fitting large Bayesian network models quickly becomes computationally infeasible when the number of nodes grows into the hundreds of thousands or millions. In particular, the number of parameters in the MMSB grows as the square of the number of nodes.
This thesis introduces an efficient method for fitting a Bayesian model to massive networks through the use of aggregated relational data. Our inference method converges faster than existing methods by leveraging nodal information that often accompanies real-world networks. Conditioning on this extra information leads to a model that admits a parallel variational inference algorithm. We apply our method to a citation network with over three million nodes and 25 million edges. Our method converges faster than existing posterior inference algorithms for the MMSB and recovers parameters better on simulated networks generated according to the MMSB.
author	Jones, Timothy
author_facet	Jones, Timothy
author_sort	Jones, Timothy
title	Scalable Community Detection in Massive Networks using Aggregated Relational Data
title_short	Scalable Community Detection in Massive Networks using Aggregated Relational Data
title_full	Scalable Community Detection in Massive Networks using Aggregated Relational Data
title_fullStr	Scalable Community Detection in Massive Networks using Aggregated Relational Data
title_full_unstemmed	Scalable Community Detection in Massive Networks using Aggregated Relational Data
title_sort	scalable community detection in massive networks using aggregated relational data
publishDate	2019
url	https://doi.org/10.7916/d8-pxvq-j271
work_keys_str_mv	AT jonestimothy scalablecommunitydetectioninmassivenetworksusingaggregatedrelationaldata
_version_	1719241307545665536

