Bounded Expectation of Label Assignment: Dataset Annotation by Supervised Splitting with Bias-Reduction Techniques
Annotating large unlabeled datasets can be a major bottleneck for machine learning applications. We introduce a scheme for inferring labels of unlabeled data at a fraction of the cost of labeling the entire dataset. We refer to the scheme as Bounded Expectation of Label Assignment (BELA). BELA greedily queries an oracle (or human labeler) and partitions a dataset to find data subsets that have mostly the same label. BELA can then infer labels by majority vote of the known labels in each subset. BELA makes the decision to split or label from a subset by maximizing a lower bound on the expected number of correctly labeled examples. BELA improves upon existing hierarchical labeling schemes by using supervised models to partition the data, therefore avoiding reliance on unsupervised clustering methods that may not accurately group data by label. We design BELA with strategies to avoid bias that could be introduced through this adaptive partitioning. We evaluate BELA on labeling of four datasets and find that it outperforms existing strategies for adaptive labeling.
Main Author: | Herbst, Alyssa Kathryn |
---|---|
Other Authors: | Huang, Bert; Raghvendra, Sharath; Barnette, Noah D. |
Format: | Others |
Published: | Virginia Tech, 2020 |
Subjects: | Active Learning; Machine Learning; Dataset Annotation |
Online Access: | http://hdl.handle.net/10919/96517 |
Master of Science thesis. Plain-language summary:

Most machine learning classifiers require data with both features and labels. The features of the data may be the pixel values of an image, the words in a text sample, the audio of a voice clip, and more. The labels of a dataset define the data: they place each example into one of several categories, such as determining whether an image is of a cat or a dog, or adding subtitles to YouTube videos. Labeling a dataset can be expensive and usually requires human annotation. Human-labeled data can be even more expensive if the data requires an expert labeler, as in the labeling of medical images, or when labeling is particularly time consuming. We introduce a scheme that aims to lessen the cost of human-labeled data by labeling only a subset of a dataset and making an educated guess at the labels of the remaining unlabeled data. The labeled data generated by our approach may then be used to train a classifier, an algorithm that maps the features of data to a guessed label. This is based on the intuition that data with similar features will also have similar labels. Our approach uses a game-like process of choosing, at any point, between one of two possible actions: we may either label a new data point, thus learning more about the dataset, or we may split the dataset into multiple subsets. We eventually guess the labels of the unlabeled data by assigning each unlabeled data point the majority label of the data subset it belongs to. The novelty of our approach is that we use supervised classifiers, that is, splitting techniques that use both the features and the labels of the data, to split a dataset into new subsets, and we use bias-reduction techniques that make this supervised splitting possible.

Date Issued: 2020-01-20
Rights: In Copyright (http://rightsstatements.org/vocab/InC/1.0/)
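The summary above describes a split-or-label loop with majority-vote inference. As a rough, hypothetical sketch only (not the thesis's actual BELA algorithm), the loop might look like the following Python; the impurity test, the median split, and the `split_threshold` parameter are invented stand-ins for BELA's lower-bound criterion and its supervised splitting models.

```python
import random
from collections import Counter

def majority_label(labels):
    """Majority vote over a list of labels."""
    return Counter(labels).most_common(1)[0][0]

def annotate(points, oracle, budget, split_threshold=4):
    """Toy split-or-label loop inspired by the record's description.

    `points` are scalar features and `oracle` returns a point's true label.
    Instead of maximizing BELA's lower bound on the expected number of
    correctly labeled examples, this sketch splits a subset at its median
    feature value once enough queried labels disagree.
    """
    subsets = [list(points)]   # current partition of the dataset
    known = {}                 # point -> label obtained from the oracle
    queries = 0
    while queries < budget:
        # only subsets that still contain unlabeled points can be queried
        candidates = [s for s in subsets if any(p not in known for p in s)]
        if not candidates:
            break
        # greedily query the subset we know least about
        target = min(candidates, key=lambda s: sum(p in known for p in s))
        p = random.choice([q for q in target if q not in known])
        known[p] = oracle(p)
        queries += 1
        labels = [known[q] for q in target if q in known]
        # stand-in for BELA's split decision: split on observed impurity
        if len(labels) >= split_threshold and len(set(labels)) > 1:
            pivot = sorted(target)[len(target) // 2]
            left = [q for q in target if q < pivot]
            right = [q for q in target if q >= pivot]
            if left and right:
                subsets.remove(target)
                subsets.extend([left, right])
    # infer labels: queried points keep their true label; the rest get the
    # majority vote of their subset (or the global majority if the subset
    # has no queried labels)
    inferred = {}
    for s in subsets:
        votes = [known[q] for q in s if q in known] or list(known.values())
        m = majority_label(votes)
        for q in s:
            inferred[q] = known.get(q, m)
    return inferred
```

Every queried point keeps its oracle label, so only the unqueried points carry guessed labels, mirroring the educated-guess step in the summary.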