Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring
This dissertation is the compilation of two major innovations that rely on a common technique known as multidimensional scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data and creates low-dimensional versions. Project 1: Visualizations are useful when learning from h...
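Main Author: | Kodali, Lata |
---|---|
Other Authors: | Statistics |
Format: | Others |
Published: | Virginia Tech 2020 |
Subjects: | Uncertainty Bayesian Multidimensional Scaling Visualizations Anomaly Detection Dynamic Networks |
Online Access: | http://hdl.handle.net/10919/99911 |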
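
As a rough illustration of the dimension reduction that MDS performs, the sketch below embeds a small synthetic dataset into two dimensions with classical (Torgerson) MDS. It is not the method developed in the dissertation: the function name `classical_mds`, the synthetic two-group data, and the choice of Euclidean distances are all assumptions made for this example.

```python
import numpy as np


def classical_mds(X, n_components=2):
    """Embed the rows of X in n_components dimensions via classical MDS.

    Hypothetical helper for illustration; assumes Euclidean distances.
    """
    # Squared Euclidean distances between all pairs of observations.
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)  # guard against tiny negative round-off

    # Double-center the squared distances to recover an inner-product matrix.
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J

    # Coordinates come from the leading eigenvectors, scaled by sqrt(eigenvalue).
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[idx], 0.0, None)
    return eigvecs[:, idx] * np.sqrt(lam)


rng = np.random.default_rng(0)
# Two hypothetical groups of 20 observations, each with p = 10 variables.
group_a = rng.normal(0.0, 1.0, size=(20, 10))
group_b = rng.normal(5.0, 1.0, size=(20, 10))
X = np.vstack([group_a, group_b])

Z = classical_mds(X)  # 40 x 2 coordinates, ready for a scatterplot
```

Plotting the two columns of `Z` against each other should show the two groups as separate clusters, which matches the usual reading of an MDS display: points that are close together are relatively more similar.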
id |
ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-99911 |
---|---|
record_format |
oai_dc |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
Uncertainty Bayesian Multidimensional Scaling Visualizations Anomaly Detection Dynamic Networks |
description |
This dissertation is the compilation of two major innovations that rely on a common technique known as multidimensional scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data and creates low-dimensional versions.
Project 1: Visualizations are useful when learning from high-dimensional data. However, visualizations, just like any data summary, can be misleading when they do not incorporate measures of uncertainty, e.g., uncertainty from the data or from the dimension-reduction algorithm used to create the visual display. We incorporate uncertainty into visualizations created by a weighted version of MDS called WMDS. Uncertainty in these visualizations arises in the variable weights, the coordinates of the display, and the fit of WMDS. We quantify these uncertainties using Bayesian models in a method we call Informative Probabilistic WMDS (IP-WMDS). Visually, we display estimated uncertainty in the form of color and ellipses; practically, these uncertainties reflect trust in WMDS. Our results show that these displays of uncertainty highlight different aspects of the visualization, which can help inform analysts.
Project 2: Analysis of network data has emerged as an active research area in statistics. Much of the ongoing research has focused on static networks that represent a single snapshot or aggregated historical data that do not change over time. However, most networks result from temporally evolving systems that exhibit intrinsic dynamic behavior. Monitoring such temporally varying networks to detect anomalous changes has applications in both the social and physical sciences. In this work, we simulate data from models that rely on MDS, and we evaluate the use of summary statistics for anomaly detection by incorporating principles from statistical process monitoring. In contrast to most previous studies, we deliberately incorporate temporal autocorrelation in our study. Other considerations in our comprehensive assessment include the type and duration of the anomaly, the model type, and sparsity in temporally evolving networks. We conclude that summary statistics can be valuable tools for network monitoring and often perform better than more involved techniques.
=== Doctor of Philosophy ===
In this work, two main ideas in data visualization and anomaly detection in dynamic networks are further explored. For both ideas, a connecting theme is extensions of a method called Multidimensional Scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data (all $p$ dimensions) and creates a low-dimensional projection of the data. That is, relationships in a dataset with a presumably large number of dimensions or variables can be summarized in a smaller number of dimensions, e.g., two. For a given dataset, an analyst could initially use a scatterplot to observe the relationship between 2 variables. Then, by coloring points, changing the size of the points, or using different shapes for the points, perhaps another 3 to 4 variables (in total around 7 variables) may be shown in the scatterplot. An advantage of MDS (or any dimension-reduction technique) is that relationships among the data can be viewed easily in a scatterplot regardless of the number of variables in the data. The interpretation of any MDS plot is that observations that are close together are relatively more similar than observations that are farther apart, i.e., proximity in the scatterplot indicates relative similarity.
In the first project, we use a weighted version of MDS called Weighted Multidimensional Scaling (WMDS), in which weights that indicate importance are placed on the variables of the data. The problem with any WMDS plot is that inaccuracies of the method are not included in the plot. For example, is an observation that appears to be an outlier really an outlier? An analyst cannot confirm this without further context. Thus, we created a model to calculate, visualize, and interpret such inaccuracy or uncertainty in WMDS plots. Such modeling efforts help analysts carry out exploratory data analysis.
In the second project, the theme of MDS is extended to an application with dynamic networks. Dynamic networks are multiple snapshots of pairwise interactions (represented as edges) among a set of nodes (observations). Over time, changes may appear in some of the snapshots. We aim to detect such changes using a process monitoring approach on dynamic networks. Statistical monitoring approaches determine thresholds for in-control, or expected, behavior that are calculated from data with no signal. Then, the in-control thresholds are used to monitor newly collected data. We applied this approach to dynamic network data, and we used a detailed simulation study to better understand the performance of such monitoring. For the simulation study, data are generated from dynamic network models that use MDS. We found that monitoring summary statistics of the network was quite effective on data generated from these models. Thus, simple tools may be used as a first step toward anomaly detection in dynamic networks. |
author2 |
Statistics |
author_facet |
Statistics Kodali, Lata |
author |
Kodali, Lata |
author_sort |
Kodali, Lata |
title |
Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring |
publisher |
Virginia Tech |
publishDate |
2020 |
url |
http://hdl.handle.net/10919/99911 |
work_keys_str_mv |
AT kodalilata extensionsofweightedmultidimensionalscalingwithstatisticsfordatavisualizationandprocessmonitoring |
_version_ |
1719341410366259200 |
spelling |
ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-99911 2020-09-26T05:32:32Z Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring Kodali, Lata Statistics House, Leanna L. Sengupta, Srijan Woodall, William H. Higdon, David Uncertainty Bayesian Multidimensional Scaling Visualizations Anomaly Detection Dynamic Networks 2020-09-05T08:01:07Z 2020-09-05T08:01:07Z 2020-09-04 Dissertation vt_gsexam:27455 http://hdl.handle.net/10919/99911 In Copyright http://rightsstatements.org/vocab/InC/1.0/ ETD application/pdf Virginia Tech |
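
The process-monitoring idea described in the abstract (estimate in-control thresholds from data with no signal, then use them to monitor new snapshots) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's simulation design: edges are drawn as independent Bernoulli variables rather than from MDS-based network models, there is no temporal autocorrelation, the monitored summary statistic is network density, and the thresholds are simple three-sigma limits. The helper names `simulate_snapshot` and `network_density` are hypothetical.

```python
import numpy as np


def simulate_snapshot(n_nodes, edge_prob, rng):
    """Return an undirected adjacency matrix with independent Bernoulli edges."""
    upper_tri = np.triu(rng.random((n_nodes, n_nodes)) < edge_prob, k=1)
    return (upper_tri | upper_tri.T).astype(int)


def network_density(adj):
    """Fraction of possible undirected edges that are present."""
    rows, cols = np.triu_indices(adj.shape[0], k=1)
    return adj[rows, cols].mean()


rng = np.random.default_rng(1)
n_nodes = 50

# Phase I: snapshots with no signal, used only to set in-control thresholds.
phase1 = np.array(
    [network_density(simulate_snapshot(n_nodes, 0.10, rng)) for _ in range(100)]
)
center = phase1.mean()
sigma = phase1.std(ddof=1)
lcl, ucl = center - 3.0 * sigma, center + 3.0 * sigma  # assumed 3-sigma limits

# Phase II: 20 in-control snapshots followed by 10 with a denser network.
edge_probs = [0.10] * 20 + [0.20] * 10
phase2 = np.array(
    [network_density(simulate_snapshot(n_nodes, p, rng)) for p in edge_probs]
)

# A snapshot signals when its density falls outside the in-control limits.
signals = (phase2 < lcl) | (phase2 > ucl)
```

In this setup the shift in edge probability is large relative to the in-control variation, so the later snapshots should signal. Smaller or shorter-lived anomalies, sparser networks, and autocorrelated snapshots would be harder to detect, which is the kind of question the simulation study in the abstract is described as addressing.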