Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring

This dissertation is the compilation of two major innovations that rely on a common technique known as multidimensional scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data and creates low-dimensional versions. Project 1: Visualizations are useful when learning from h...

Bibliographic Details
Main Author: Kodali, Lata
Other Authors: Statistics
Format: Others
Published: Virginia Tech 2020
Subjects:
Online Access: http://hdl.handle.net/10919/99911
id ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-99911
record_format oai_dc
collection NDLTD
format Others
sources NDLTD
topic Uncertainty
Bayesian
Multidimensional Scaling
Visualizations
Anomaly Detection
Dynamic Networks
description This dissertation compiles two major innovations that rely on a common technique known as multidimensional scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data and creates low-dimensional versions of it. Project 1: Visualizations are useful when learning from high-dimensional data. However, visualizations, like any data summary, can be misleading when they do not incorporate measures of uncertainty, e.g., uncertainty from the data or from the dimension-reduction algorithm used to create the visual display. We incorporate uncertainty into visualizations created by a weighted version of MDS called WMDS. Uncertainty exists in these visualizations in the variable weights, the coordinates of the display, and the fit of WMDS. We quantify these uncertainties using Bayesian models in a method we call Informative Probabilistic WMDS (IP-WMDS). Visually, we display estimated uncertainty in the form of color and ellipses; practically, these uncertainties reflect trust in WMDS. Our results show that these displays of uncertainty highlight different aspects of the visualization, which can help inform analysts. Project 2: Analysis of network data has emerged as an active research area in statistics. Much of the ongoing research has focused on static networks that represent a single snapshot or aggregated historical data unchanging over time. However, most networks result from temporally-evolving systems that exhibit intrinsic dynamic behavior. Monitoring such temporally-varying networks to detect anomalous changes has applications in both the social and physical sciences. In this work, we simulate data from models that rely on MDS, and we perform an evaluation study of the use of summary statistics for anomaly detection by incorporating principles from statistical process monitoring. In contrast to most previous studies, we deliberately incorporate temporal auto-correlation in our study. Other considerations in our comprehensive assessment include the type and duration of the anomaly, the model type, and sparsity in temporally-evolving networks. We conclude that summary statistics can be valuable tools for network monitoring and often perform better than more involved techniques.
=== Doctor of Philosophy === In this work, two main ideas in data visualization and anomaly detection in dynamic networks are further explored. For both ideas, a connecting theme is extensions of a method called Multidimensional Scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data (all $p$ dimensions) and creates a low-dimensional projection of the data. That is, relationships in a dataset with a presumably large number of dimensions or variables can be summarized in a smaller number of dimensions, e.g., two. For a given dataset, an analyst could initially use a scatterplot to observe the relationship between 2 variables. Then, by coloring points, changing the size of the points, or using different shapes for the points, perhaps another 3 to 4 variables (in total around 7 variables) may be shown in the scatterplot. An advantage of MDS (or any dimension-reduction technique) is that relationships among the data can be viewed easily in a scatterplot regardless of the number of variables in the data. The interpretation of any MDS plot is that observations that are close together are relatively more similar than observations that are farther apart, i.e., proximity in the scatterplot indicates relative similarity. In the first project, we use a weighted version of MDS called Weighted Multidimensional Scaling (WMDS), where weights, which indicate a sense of importance, are placed on the variables of the data. The problem with any WMDS plot is that inaccuracies of the method are not included in the plot. For example, is an observation that appears to be an outlier really an outlier? An analyst cannot confirm this without further context. Thus, we created a model to calculate, visualize, and interpret such inaccuracy or uncertainty in WMDS plots. Such modeling efforts facilitate exploratory data analysis.
In the second project, the theme of MDS is extended to an application with dynamic networks. Dynamic networks are multiple snapshots of pairwise interactions (represented as edges) among a set of nodes (observations). Over time, changes may appear in some of the snapshots. We aim to detect such changes using a process monitoring approach on dynamic networks. Statistical monitoring approaches determine thresholds for in-control or expected behavior that are calculated from data with no signal. Then, the in-control thresholds are used to monitor newly collected data. We applied this approach to dynamic network data, and we used a detailed simulation study to better understand the performance of such monitoring. For the simulation study, data are generated from dynamic network models that use MDS. We found that monitoring summary statistics of the network was quite effective on data generated from these models. Thus, simple tools may be used as a first step toward anomaly detection in dynamic networks.
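The two-phase monitoring idea in the second project (estimate in-control thresholds from signal-free data, then apply them to new snapshots) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the summary statistic (network density), the three-sigma limits, the independent-edge random graphs, and the function names `simulate_snapshot`, `density`, and `control_limits` are not taken from the dissertation, whose simulations use MDS-based network models with temporal auto-correlation.

```python
import numpy as np

def simulate_snapshot(n, p, rng):
    """Undirected random graph on n nodes with independent edge probability p
    (a simple stand-in, not the MDS-based models of the dissertation)."""
    upper = np.triu(rng.random((n, n)) < p, k=1)   # strictly upper triangle
    return (upper | upper.T).astype(int)           # symmetric adjacency matrix

def density(A):
    """Summary statistic: fraction of possible edges that are present."""
    n = A.shape[0]
    return A.sum() / (n * (n - 1))

def control_limits(stats, width=3.0):
    """Lower and upper limits as mean +/- width * standard deviation."""
    m, s = np.mean(stats), np.std(stats, ddof=1)
    return m - width * s, m + width * s

rng = np.random.default_rng(1)
n = 50

# Phase I: estimate limits from snapshots assumed to contain no signal.
phase1 = [density(simulate_snapshot(n, 0.10, rng)) for _ in range(100)]
lcl, ucl = control_limits(phase1)

# Phase II: monitor new snapshots; the last five have a higher edge probability.
phase2_p = [0.10] * 10 + [0.20] * 5
phase2 = [density(simulate_snapshot(n, p, rng)) for p in phase2_p]
signals = [t for t, d in enumerate(phase2) if d < lcl or d > ucl]
print("limits:", (lcl, ucl), "signalled snapshots:", signals)
```

Limits of this kind assume roughly independent in-control statistics; with auto-correlated snapshots, as studied in the dissertation, the thresholds would need to account for that dependence.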
author2 Statistics
author_facet Statistics
Kodali, Lata
author Kodali, Lata
author_sort Kodali, Lata
title Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring
publisher Virginia Tech
publishDate 2020
url http://hdl.handle.net/10919/99911
spelling ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-99911 2020-09-26T05:32:32Z Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring Kodali, Lata Statistics House, Leanna L. Sengupta, Srijan Woodall, William H. Higdon, David Uncertainty Bayesian Multidimensional Scaling Visualizations Anomaly Detection Dynamic Networks
2020-09-05T08:01:07Z 2020-09-05T08:01:07Z 2020-09-04 Dissertation vt_gsexam:27455 http://hdl.handle.net/10919/99911 In Copyright http://rightsstatements.org/vocab/InC/1.0/ ETD application/pdf Virginia Tech
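
The weighted MDS idea described in the abstract (weights on the variables, then a low-dimensional display in which proximity indicates similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the IP-WMDS method: it assumes a weighted Euclidean distance with fixed, nonnegative weights and uses classical (Torgerson) MDS as a stand-in solver. The function names `weighted_distances` and `classical_mds`, the random data, and the weight values are hypothetical.

```python
import numpy as np

def weighted_distances(X, w):
    """Pairwise weighted Euclidean distances between the rows of X:
    d_ij = sqrt(sum_k w_k * (x_ik - x_jk)^2)."""
    diff = X[:, None, :] - X[None, :, :]           # shape (n, n, p)
    return np.sqrt((w * diff ** 2).sum(axis=2))    # shape (n, n)

def classical_mds(D, k=2):
    """Embed a distance matrix D in k dimensions via double centering
    and an eigendecomposition (classical / Torgerson MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]               # indices of the k largest
    vals_k = np.clip(vals[idx], 0.0, None)         # guard against tiny negatives
    return vecs[:, idx] * np.sqrt(vals_k)          # n x k coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                       # 30 observations, p = 5 variables
w = np.array([0.6, 0.1, 0.1, 0.1, 0.1])            # hypothetical variable weights
D = weighted_distances(X, w)
Y = classical_mds(D, k=2)                          # 2-D coordinates for a scatterplot
print(Y[:3])
```

Changing `w` changes which variables dominate the distances and therefore the layout; the dissertation's contribution is to quantify the uncertainty in such weights, coordinates, and fit, which this sketch does not attempt.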