Statistical analysis of network data motivated by problems in online social media

Bibliographic Details
Main Author: Zhang, Yaonan
Language: en_US
Published: 2016
Online Access: https://hdl.handle.net/2144/16270
Description
Summary: Networks have been widely used to represent and analyze systems of connected elements. Online social media networks, as a result of the expansion of the Internet and an increased need for communication, have become an increasingly important part of people's lives. This thesis focuses on the statistical analysis of network data motivated by problems in online social media. It discusses problems arising from both explicit network data and implicit network data. Explicit network data are data in which the network structure is observable; implicit network data are data that do not have a network structure themselves but arise under the influence of an underlying network. For the explicit network data analysis, we develop a novel method of recovering a fundamental characteristic -- the network degree distribution -- under sampling. We formulate the problem of estimating the degree distribution as an inverse problem. We show that this problem is ill-conditioned for many sampling methods used in practice, and accordingly propose a constrained, penalized weighted least-squares approach to solve it. We demonstrate the ability of our method to accurately reconstruct degree distributions from simulated network data and real-world social network data. We also propose practical uses of the estimates relevant to marketing and advertising. For the implicit network data analysis, we look at review data from popular review websites. Motivated by articles from the popular press and the research community reporting that the average rating on top review sites is above 4 out of 5 stars, we study the phenomena of review rating trends and convergence using restaurant review data from TripAdvisor. We analyze the trend at two levels -- a rough analysis of the characteristics of the ratings, and a subtler statistical model based on ordinal logistic regression.
Taking into account the implicit network underlying the review data, we suggest that the upward trend observed in restaurant review ratings may be explained by social influence on an individual's perception of quality. We use the intensity of review postings as an indicator of how popular a restaurant is, and test to what extent the increase in review intensity explains increases in average rating. We then consider a more nuanced approach, the joint modeling of ratings and review intensity, which allows for interaction between the two rather than having intensity serve only as an explanatory variable for ratings. Specifically, a state-space model is used to test the interaction between review intensity and review ratings.
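
Illustration: the degree-distribution estimation described in the summary can be sketched as a small inverse problem. The sketch below is not the thesis's implementation. It assumes induced-subgraph sampling in which each vertex is kept independently with probability p, so that a sampled vertex of true degree j shows an observed degree that is Binomial(j, p). The function names, the diagonal weights 1/(observed + 1), and the second-difference smoothness penalty are all illustrative assumptions, and the only constraint imposed is non-negativity.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.special import comb


def sampling_operator(max_degree, p):
    """Hypothetical linear operator P with E[observed] = P @ true_counts.

    Assumes induced-subgraph sampling: a vertex is kept with probability p,
    and each of its j edges survives with probability p, so
    P[k, j] = p * C(j, k) * p**k * (1 - p)**(j - k) for k <= j, else 0.
    """
    d = np.arange(max_degree + 1)
    K, J = np.meshgrid(d, d, indexing="ij")  # K = observed, J = true degree
    # comb(J, K) is 0 when K > J, which zeroes the lower triangle.
    return p * comb(J, K) * p**K * (1 - p) ** np.maximum(J - K, 0)


def estimate_degree_counts(observed, P, lam=0.0):
    """Non-negative, penalized, weighted least-squares estimate.

    Minimizes sum_k w_k^2 * (P @ n - observed)_k^2 + lam * ||D @ n||^2
    subject to n >= 0, where D is a second-difference matrix and
    w_k = 1 / sqrt(observed_k + 1) is an assumed, Poisson-like weighting.
    """
    observed = np.asarray(observed, dtype=float)
    w = 1.0 / np.sqrt(observed + 1.0)
    D = np.diff(np.eye(P.shape[1]), n=2, axis=0)
    # Stack the weighted data-fit rows on top of the penalty rows so the
    # whole objective becomes one non-negative least-squares problem.
    A = np.vstack([w[:, None] * P, np.sqrt(lam) * D])
    b = np.concatenate([w * observed, np.zeros(D.shape[0])])
    estimate, _ = nnls(A, b)
    return estimate


if __name__ == "__main__":
    # Made-up degree counts for degrees 0..5, used only for illustration.
    true_counts = np.array([50.0, 40.0, 30.0, 20.0, 10.0, 5.0])
    P = sampling_operator(max_degree=5, p=0.5)
    expected_observed = P @ true_counts  # noise-free sampled counts
    print(estimate_degree_counts(expected_observed, P, lam=0.0))
    print(estimate_degree_counts(expected_observed, P, lam=0.1))
```

With noise-free input and no penalty, the estimate should match the made-up true counts up to numerical error. In realistic settings the operator becomes badly conditioned as the maximum degree grows or p shrinks, which is the motivation for the penalty and constraints mentioned in the summary.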