Nearest neighbor classification using a density sensitive distance measurement [electronic resource]


Bibliographic Details
Main Author: Burkholder, Joshua Jeremy.
Other Authors: Squire, Kevin
Published: Monterey, California. Naval Postgraduate School 2012
Online Access: http://hdl.handle.net/10945/4315
Description
Summary: Approved for public release, distribution unlimited. This work proposes a density sensitive distance measurement that takes into account the density of an underlying dataset to better represent the shape of the data when measuring distance. Kernel density estimation, with kernel bandwidths determined by k-nearest neighbor distances, is used to approximate the density of the underlying dataset. A scale is applied to the resulting kernel density estimate, and a line integral is performed along its surface, resulting in a density sensitive distance. This work tests the utility of the proposed density sensitive distance measurement using supervised learning. k-Nearest Neighbor classification using the proposed density sensitive distance measurement is compared with k-Nearest Neighbor classification using Euclidean distance on the Wisconsin Diagnostic Breast Cancer dataset and the MNIST Database of Handwritten Digits. For perspective, these classifiers are also compared to Support Vector Machine and Random Forests classifiers. Stratified 10-fold cross validation is used to estimate the generalization error of each classifier. In all comparisons, k-Nearest Neighbor classification using the proposed density sensitive distance measurement had lower generalization error than k-Nearest Neighbor classification using Euclidean distance. For the MNIST dataset, k-Nearest Neighbor classification using the density sensitive distance measurement also had lower generalization error than both Support Vector Machine and Random Forests classification.
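
The pipeline described in the summary (k-nearest neighbor bandwidths, a kernel density estimate, a scale, and a line integral along the resulting surface) can be illustrated with a minimal sketch. This is one possible reading of the summary, not the thesis's actual formulation: the Gaussian kernel, the straight-line path between the two points, the interpretation of the line integral as arc length along the scaled density surface, and all names (`knn_bandwidths`, `adaptive_kde`, `density_sensitive_distance`, `scale`, `steps`) are assumptions made for illustration.

```python
import numpy as np


def knn_bandwidths(X, k):
    """Per-sample bandwidth: distance from each sample to its k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # After sorting each row, column 0 is the sample itself (distance 0),
    # so column k is the k-th nearest neighbor.
    h = np.sort(d, axis=1)[:, k]
    # Guard against zero bandwidths caused by duplicate samples (assumption).
    return np.maximum(h, 1e-12)


def adaptive_kde(points, X, h):
    """Gaussian kernel density estimate at `points`, one bandwidth h[i] per sample X[i]."""
    dim = X.shape[1]
    sq = np.sum((points[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (dim / 2.0)
    return np.mean(np.exp(-0.5 * sq / h ** 2) / norm, axis=1)


def density_sensitive_distance(a, b, X, h, scale=1.0, steps=64):
    """Arc length of the straight segment a -> b lifted onto the surface
    z = scale * density(x), approximated by a polyline (assumed reading)."""
    t = np.linspace(0.0, 1.0, steps + 1)[:, None]
    path = (1.0 - t) * a + t * b                  # samples along the segment
    height = scale * adaptive_kde(path, X, h)     # scaled density at each sample
    base = np.linalg.norm(np.diff(path, axis=0), axis=1)  # horizontal step lengths
    rise = np.diff(height)                        # change in surface height per step
    return float(np.sum(np.hypot(base, rise)))


# Usage on synthetic data (hypothetical values, not from the thesis).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
h = knn_bandwidths(X, k=5)
print(density_sensitive_distance(X[0], X[1], X, h, scale=10.0))
```

Under this reading, a scale of zero reduces the measurement to Euclidean distance, and larger scales lengthen paths that cross regions where the estimated density changes. A function like this could be passed as a custom metric to a k-Nearest Neighbor classifier, though the pairwise distance computation above is quadratic in the number of samples and would need a more economical neighbor search for a dataset the size of MNIST.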