A scale-independent clustering method with automatic variable selection based on trees

Approved for public release; distribution is unlimited. Clustering is the process of putting observations into groups based on their distance, or dissimilarity, from one another. Measuring distance for continuous variables often requires scaling or monotonic transformation. Determining dissimila...

Bibliographic Details
Main Author: Lynch, Sarah K.
Other Authors: Buttrey, Samuel E.
Published: Monterey, California: Naval Postgraduate School 2014
Online Access: http://hdl.handle.net/10945/41412
id ndltd-nps.edu-oai-calhoun.nps.edu-10945-41412
record_format oai_dc
spelling ndltd-nps.edu-oai-calhoun.nps.edu-10945-41412
2014-11-27T16:19:44Z
Lynch, Sarah K.
Buttrey, Samuel E.
Whitaker, Lyn R.
Operations Research
2014-05-23T15:19:34Z
2014-05-23T15:19:34Z
2014-03
Thesis
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. As such, it is in the public domain, and under the provisions of Title 17, United States Code, Section 105, it may not be copyrighted.
collection NDLTD
sources NDLTD
description Approved for public release; distribution is unlimited. Clustering is the process of putting observations into groups based on their distance, or dissimilarity, from one another. Measuring distance for continuous variables often requires scaling or monotonic transformation. Determining dissimilarity when observations have both continuous and categorical measurements can be difficult because each type of measurement must be approached differently. We introduce a new clustering method that uses one of three new distance metrics. In a dataset with p variables, we create p trees, one with each variable as the response. Distance is measured by determining on which leaf an observation falls in each tree. Two observations are similar if they tend to fall on the same leaf and dissimilar if they are usually on different leaves. The distance metrics are not affected by scaling or transformations of the variables and easily determine distances in datasets with both continuous and categorical variables. This method is tested on several well-known datasets, both with and without added noise variables, and performs very well in the presence of noise due in part to automatic variable selection. The new distance metrics outperform several existing clustering methods in a large number of scenarios.
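The leaf-based idea in the description can be illustrated with a minimal sketch. It assumes the p trees have already been fitted and that each observation's leaf in each tree is known. The function name `leaf_disagreement_distance` and the leaf assignments are hypothetical. The distance used here (the fraction of trees in which two observations land on different leaves) is only the simplest reading of "similar if they tend to fall on the same leaf"; it is not claimed to be any of the three metrics defined in the thesis.

```python
from itertools import combinations


def leaf_disagreement_distance(leaf_ids):
    """Pairwise distances from leaf assignments.

    leaf_ids[i][t] is the leaf on which observation i falls in tree t.
    The distance between two observations is the fraction of trees in
    which they fall on different leaves (an illustrative assumption,
    not one of the thesis's three metrics).
    """
    n = len(leaf_ids)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        differing = sum(a != b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        dist[i][j] = dist[j][i] = differing / len(leaf_ids[i])
    return dist


# Hypothetical leaf assignments: 4 observations, p = 3 trees.
leaves = [
    [1, 2, 1],
    [1, 2, 2],
    [3, 4, 2],
    [3, 4, 2],
]
D = leaf_disagreement_distance(leaves)
```

Because only leaf membership enters the calculation, a change to a variable that leaves every observation on the same leaves leaves this distance matrix unchanged, and continuous and categorical variables are treated alike. The resulting matrix could then be passed to any clustering routine that accepts precomputed dissimilarities.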