Complexity penalized methods for structured and unstructured data

A fundamental goal of statisticians is to make inferences from the sample about characteristics of the underlying population. This is an inverse problem, since we are trying to recover a feature of the input with the availability of observations on an output. Towards this end, we consider complexi...


Bibliographic Details
Main Author:	Goeva, Aleksandrina
Language:	en_US
Published:	2018
Subjects:	Statistics Complexity penalized Entropy Inverse problem Network inference Stochastic simulation Text mining
Online Access:	https://hdl.handle.net/2144/27072

id	ndltd-bu.edu-oai-open.bu.edu-2144-27072
record_format	oai_dc
spelling	ndltd-bu.edu-oai-open.bu.edu-2144-27072 2019-12-22T15:11:40Z 2018-02-16T18:19:16Z 2018-02-16T18:19:16Z 2017 2017-11-08T20:16:46Z Thesis/Dissertation https://hdl.handle.net/2144/27072 en_US Attribution 4.0 International http://creativecommons.org/licenses/by/4.0/
collection	NDLTD
language	en_US
sources	NDLTD
topic	Statistics Complexity penalized Entropy Inverse problem Network inference Stochastic simulation Text mining
description	A fundamental goal of statisticians is to make inferences from the sample about characteristics of the underlying population. This is an inverse problem, since we are trying to recover a feature of the input from observations on an output. Towards this end, we consider complexity penalized methods, because they balance goodness of fit and generalizability of the solution. The data from the underlying population may come in diverse formats - structured or unstructured - such as probability distributions, text tokens, or graph characteristics. Depending on the defining features of the problem, we can choose the appropriate complexity penalized approach and assess the quality of the estimate it produces. Favorable characteristics are strong theoretical guarantees of closeness to the true value and interpretability. Our work fits within this framework and spans the areas of simulation optimization, text mining and network inference. The first problem we consider is model calibration under the assumption that, given a hypothesized input model, we can use stochastic simulation to obtain its corresponding output observations. We formulate it as a stochastic program by maximizing the entropy of the input distribution subject to moment matching. We then propose an iterative scheme via simulation to approximately solve it. We prove convergence of the proposed algorithm under appropriate conditions and demonstrate the performance via numerical studies. The second problem we consider is summarizing text documents through an inferred set of topics. We propose a frequentist reformulation of a Bayesian regularization scheme. Through our complexity penalized perspective we lend further insight into the nature of the loss function and the regularization achieved through the priors in the Bayesian formulation. The third problem is concerned with the impact of sampling on the degree distribution of a network. Under many sampling designs, we have a linear inverse problem characterized by an ill-conditioned matrix. We investigate the theoretical properties of an approximate solution for the degree distribution found by regularizing the solution of the ill-conditioned least squares objective. In particular, we study the rate at which the penalized solution tends to the true value as a function of network size and sampling rate.
author	Goeva, Aleksandrina
title	Complexity penalized methods for structured and unstructured data
publishDate	2018
url	https://hdl.handle.net/2144/27072
work_keys_str_mv	AT goevaaleksandrina complexitypenalizedmethodsforstructuredandunstructureddata
_version_	1719306384592338944


Similar Items