Active learning for data streams

With the exponential growth in the amount and number of sources of data, access to large collections of data has become easier and cheaper. However, data is generally unlabelled, and labels are often difficult, expensive, and time-consuming to obtain. Two learning paradigms have been used by the machine learning communit...

Full description

Bibliographic Details
Main Author: Mohamad, Saad
Published: Bournemouth University 2017
Subjects:
Online Access:https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.725478
id ndltd-bl.uk-oai-ethos.bl.uk-725478
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-725478 2019-04-03T06:15:23Z Active learning for data streams Mohamad, Saad 2017 371.39 Bournemouth University https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.725478 http://eprints.bournemouth.ac.uk/29901/ Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 371.39
spellingShingle 371.39
Mohamad, Saad
Active learning for data streams
description With the exponential growth in the amount and number of sources of data, access to large collections of data has become easier and cheaper. However, data is generally unlabelled, and labels are often difficult, expensive, and time-consuming to obtain. Two learning paradigms have been used by the machine learning community to diminish the need for labels in training data: semi-supervised learning (SSL) and active learning (AL). AL is a reliable way to efficiently build up training sets with minimal supervision. By querying the class (label) of the most interesting samples, based on previously seen data and some selection criterion, AL can produce a nearly optimal hypothesis while requiring the minimum possible quantity of labelled data. SSL, on the other hand, takes advantage of both labelled and unlabelled data to address the challenge of learning from a small number of labelled samples and a large amount of unlabelled data. In this thesis, we borrow the concept of SSL by allowing AL algorithms to make use of abundant unlabelled data, so that both labelled and unlabelled data are used in their querying criteria. Another common tradition within the AL community is to assume that data samples are already gathered in a pool, so that AL has the luxury of exhaustively searching that pool for the samples worth labelling. In this thesis, we go beyond that by applying AL to data streams. In a stream, data may grow infinitely, making its storage prior to processing impractical. Due to its dynamic nature, the underlying distribution of the data stream may change over time, resulting in the so-called concept drift, or classes may emerge and fade, which is known as concept evolution. Another challenge associated with AL in general is sampling bias, where the sampled training set does not reflect the underlying data distribution.
In the presence of concept drift, sampling bias is more likely to occur, as the training set needs to represent the underlying distribution of the evolving data. Given these challenges, the research questions that the thesis addresses are: can AL improve learning given that data comes in streams? Is it possible to harness AL to handle changes in streams (i.e., concept drift and concept evolution) by querying selected samples? How can sampling bias be attenuated while maintaining the advantages of AL? Finally, applying AL to sequential data streams (like time series) requires new approaches, especially in the presence of concept drift and concept evolution. Hence, the question is how to handle concept drift and concept evolution in sequential data online, and can AL be useful in such a case? In this thesis, we develop a set of stream-based AL algorithms to answer these questions in line with the aforementioned challenges. The core idea of these algorithms is to query the samples that give the largest reduction of an expected loss function that measures the learning performance. Two types of AL are proposed: decision-theoretic AL, whose losses involve the prediction error, and information-theoretic AL, whose losses involve the model parameters. Although our work focuses on classification problems, AL algorithms for other problems, such as regression and parameter estimation, can be derived from the proposed algorithms. Several experiments have been performed to evaluate the performance of the proposed algorithms. The obtained results show that our algorithms outperform other state-of-the-art algorithms.
author Mohamad, Saad
author_facet Mohamad, Saad
author_sort Mohamad, Saad
title Active learning for data streams
title_short Active learning for data streams
title_full Active learning for data streams
title_fullStr Active learning for data streams
title_full_unstemmed Active learning for data streams
title_sort active learning for data streams
publisher Bournemouth University
publishDate 2017
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.725478
work_keys_str_mv AT mohamadsaad activelearningfordatastreams
_version_ 1719012332137349120