Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning

Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and the machine learning communities. Copulas are stochastic models that capture the full dependence...

Full description

Bibliographic Details
Main Author: Karra, Kiran
Other Authors: Electrical Engineering
Format: Others
Published: Virginia Tech 2018
Subjects:
Online Access:http://hdl.handle.net/10919/85110
id ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-85110
record_format oai_dc
collection NDLTD
format Others
sources NDLTD
topic copula
machine learning
big data
stochastic
probability
spellingShingle copula
machine learning
big data
stochastic
probability
Karra, Kiran
Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
description Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and the machine learning communities. Copulas are stochastic models that capture the full dependence structure between random variables and allow flexible modeling of multivariate joint distributions. Elidan was the first to recognize this disconnect, and introduced copula based models to the ML community that demonstrated magnitudes of order better performance than the non copula-based models Elidan [2013]. However, the limitation of these is that they are only applicable for continuous random variables and real world data is often naturally modeled jointly as continuous and discrete. This report details our work in bridging this gap of modeling and analyzing data that is jointly continuous and discrete using copulas. Our first research contribution details modeling of jointly continuous and discrete random variables using the copula framework with Bayesian networks, termed Hybrid Copula Bayesian Networks (HCBN) [Karra and Mili, 2016], a continuation of Elidan’s work on Copula Bayesian Networks Elidan [2010]. In this work, we extend the theorems proved by Neslehov ˇ a [2007] from bivariate ´ to multivariate copulas with discrete and continuous marginal distributions. Using the multivariate copula with discrete and continuous marginal distributions as a theoretical basis, we construct an HCBN that can model all possible permutations of discrete and continuous random variables for parent and child nodes, unlike the popular conditional linear Gaussian network model. Finally, we demonstrate on numerous synthetic datasets and a real life dataset that our HCBN compares favorably, from a modeling and flexibility viewpoint, to other hybrid models including the conditional linear Gaussian and the mixture of truncated exponentials models. Our second research contribution then deals with the analysis side, and discusses how one may use copulas for exploratory data analysis. To this end, we introduce a nonparametric copulabased index for detecting the strength and monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals. Our index, termed Copula Index for Detecting Dependence and Monotonicity (CIM), satisfies several desirable properties of measures of association, including Renyi’s properties, the data processing inequality (DPI), and ´ consequently self-equitability. Synthetic data simulations reveal that the statistical power of CIM compares favorably to other state-of-the-art measures of association that are proven to satisfy the DPI. Simulation results with real-world data reveal CIM’s unique ability to detect the monotonicity structure among stochastic signals to find interesting dependencies in large datasets. Additionally, simulations show that CIM shows favorable performance to estimators of mutual information when discovering Markov network structure. Our third research contribution deals with how to assess an estimator’s performance, in the scenario where multiple estimates of the strength of association between random variables need to be rank ordered. More specifically, we introduce a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal distribution agnostic way to assess an estimator’s performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency. We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator’s performance compares favorably to kNN, vME, AP, and HMI estimators of mutual information. The estimators which were identified to be suboptimal, according to the estimator response curve, perform worse than the more optimal estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes. === Ph. D.
author2 Electrical Engineering
author_facet Electrical Engineering
Karra, Kiran
author Karra, Kiran
author_sort Karra, Kiran
title Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
title_short Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
title_full Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
title_fullStr Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
title_full_unstemmed Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
title_sort modeling and analysis of non-linear dependencies using copulas, with applications to machine learning
publisher Virginia Tech
publishDate 2018
url http://hdl.handle.net/10919/85110
work_keys_str_mv AT karrakiran modelingandanalysisofnonlineardependenciesusingcopulaswithapplicationstomachinelearning
_version_ 1719417474752970752
spelling ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-851102021-07-17T05:27:54Z Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning Karra, Kiran Electrical Engineering Mili, Lamine M. Clancy, Thomas Charles III Ramakrishnan, Naren Yu, Guoqiang Raman, Sanjay copula machine learning big data stochastic probability Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and the machine learning communities. Copulas are stochastic models that capture the full dependence structure between random variables and allow flexible modeling of multivariate joint distributions. Elidan was the first to recognize this disconnect, and introduced copula based models to the ML community that demonstrated magnitudes of order better performance than the non copula-based models Elidan [2013]. However, the limitation of these is that they are only applicable for continuous random variables and real world data is often naturally modeled jointly as continuous and discrete. This report details our work in bridging this gap of modeling and analyzing data that is jointly continuous and discrete using copulas. Our first research contribution details modeling of jointly continuous and discrete random variables using the copula framework with Bayesian networks, termed Hybrid Copula Bayesian Networks (HCBN) [Karra and Mili, 2016], a continuation of Elidan’s work on Copula Bayesian Networks Elidan [2010]. In this work, we extend the theorems proved by Neslehov ˇ a [2007] from bivariate ´ to multivariate copulas with discrete and continuous marginal distributions. Using the multivariate copula with discrete and continuous marginal distributions as a theoretical basis, we construct an HCBN that can model all possible permutations of discrete and continuous random variables for parent and child nodes, unlike the popular conditional linear Gaussian network model. Finally, we demonstrate on numerous synthetic datasets and a real life dataset that our HCBN compares favorably, from a modeling and flexibility viewpoint, to other hybrid models including the conditional linear Gaussian and the mixture of truncated exponentials models. Our second research contribution then deals with the analysis side, and discusses how one may use copulas for exploratory data analysis. To this end, we introduce a nonparametric copulabased index for detecting the strength and monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals. Our index, termed Copula Index for Detecting Dependence and Monotonicity (CIM), satisfies several desirable properties of measures of association, including Renyi’s properties, the data processing inequality (DPI), and ´ consequently self-equitability. Synthetic data simulations reveal that the statistical power of CIM compares favorably to other state-of-the-art measures of association that are proven to satisfy the DPI. Simulation results with real-world data reveal CIM’s unique ability to detect the monotonicity structure among stochastic signals to find interesting dependencies in large datasets. Additionally, simulations show that CIM shows favorable performance to estimators of mutual information when discovering Markov network structure. Our third research contribution deals with how to assess an estimator’s performance, in the scenario where multiple estimates of the strength of association between random variables need to be rank ordered. More specifically, we introduce a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal distribution agnostic way to assess an estimator’s performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency. We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator’s performance compares favorably to kNN, vME, AP, and HMI estimators of mutual information. The estimators which were identified to be suboptimal, according to the estimator response curve, perform worse than the more optimal estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes. Ph. D. 2018-09-22T08:01:11Z 2018-09-22T08:01:11Z 2018-09-21 Dissertation vt_gsexam:16853 http://hdl.handle.net/10919/85110 In Copyright http://rightsstatements.org/vocab/InC/1.0/ ETD application/pdf Virginia Tech