Sampling Algorithms for Evolving Datasets

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database...

Bibliographic Details
Main Author: Gemulla, Rainer
Other Authors: Technische Universität Dresden, Informatik
Format: Doctoral Thesis
Language: English
Published: Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden 2008
Subjects:
Online Access: http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644
http://www.qucosa.de/fileadmin/data/qucosa/documents/4/1224861856184-1164.pdf
http://www.qucosa.de/fileadmin/data/qucosa/documents/4/kurzfassung.pdf
id ndltd-DRESDEN-oai-qucosa.de-bsz-14-ds-1224861856184-11644
record_format oai_dc
spelling ndltd-DRESDEN-oai-qucosa.de-bsz-14-ds-1224861856184-11644; 2013-01-07T19:48:01Z; Sampling Algorithms for Evolving Datasets; Gemulla, Rainer; Technische Universität Dresden, Informatik; Prof. Dr.-Ing. Wolfgang Lehner; Dr. Peter Haas; Prof. Dr.-Ing. Dr. h.c. Theo Härder; 2008-10-24; doc-type:doctoralThesis; application/pdf; application/zip; urn:nbn:de:bsz:14-ds-1224861856184-11644; PPN293790795; eng
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Uniform sampling
incremental sample maintenance
set sampling
multiset sampling
distinct-item sampling
data stream sampling
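Einfache Zufallsstichproben (simple random samples)
inkrementelle Stichprobenwartung (incremental sample maintenance)
Stichprobenerhebung von Mengen/Multimengen/Projektionen/Datenströmen (sampling from sets/multisets/projections/data streams)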
ddc:004
rvk:ST 274
description Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. 
More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
author2 Technische Universität Dresden, Informatik
author_facet Technische Universität Dresden, Informatik
Gemulla, Rainer
author Gemulla, Rainer
author_sort Gemulla, Rainer
title Sampling Algorithms for Evolving Datasets
publisher Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden
publishDate 2008
url http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644
http://www.qucosa.de/fileadmin/data/qucosa/documents/4/1224861856184-1164.pdf
http://www.qucosa.de/fileadmin/data/qucosa/documents/4/kurzfassung.pdf
work_keys_str_mv AT gemullarainer samplingalgorithmsforevolvingdatasets
_version_ 1716470594036301824
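
The description above speaks of incrementally maintaining a uniform random sample without accessing the base data. As a rough illustration of that idea only, the sketch below shows classic reservoir sampling (often called Algorithm R) for an insertion-only stream. It is not taken from the thesis, whose algorithms also cover updates, deletions, multisets, and sliding windows; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from an insertion-only stream.

    Illustrative assumption: items only arrive and are never updated or deleted.
    The stream is read once and earlier items are never revisited.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # The n-th item enters with probability k/n,
            # replacing a uniformly chosen current member.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Sample 5 items from a stream of 1000 integers.
    print(reservoir_sample(range(1000), 5, random.Random(42)))
```

After n insertions, each item seen so far is in the sample with probability k/n, so the sample stays uniform while only the sample itself and the counter n are stored. Deletions break this simple scheme, which is the kind of gap the thesis says it addresses.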