Improving Histogram Approximation with Line Representation
Master's === National Tsing Hua University === Department of Computer Science === 94 === Histograms are commonly used to approximate a data distribution with a small number of step functions. Maintaining histograms for every single attribute in databases helps us estimate the cost of database operations. The query optimizers usually require such estimati...
Main Authors: | Hung-Yuan Chen, 陳鴻元 |
---|---|
Other Authors: | Arbee L. P. Chen |
Format: | Others |
Language: | en_US |
Published: | 2006 |
Online Access: | http://ndltd.ncl.edu.tw/handle/96749204682141130073 |
id |
ndltd-TW-094NTHU5392151 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-094NTHU5392151 2015-12-16T04:42:37Z http://ndltd.ncl.edu.tw/handle/96749204682141130073 Improving Histogram Approximation with Line Representation 使用線段估計改進長條圖近似 Hung-Yuan Chen 陳鴻元 Master's National Tsing Hua University Department of Computer Science 94 Arbee L. P. Chen 陳良弼 2006 thesis 42 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Tsing Hua University === Department of Computer Science === 94 === Histograms are commonly used to approximate a data distribution with a small number of step functions. Maintaining a histogram for every attribute in a database helps estimate the cost of database operations, and query optimizers usually need such cost estimates to choose an efficient query plan. Histograms are also widely used in approximate query answering systems and data mining. Storing precomputed histograms in the database incurs memory overhead. The problem of compressing a histogram into a fixed amount of space with the least error is therefore considered important and has been investigated by researchers for many years.
The most common way to compress a histogram is to divide it into buckets and estimate every bucket with a uniform representation. The problem then becomes how to choose the bucket boundaries so as to minimize the estimation error for a given number of buckets. The previous approach has provided a desirable solution to this bucket boundary selection problem. However, many real-life data distributions are well known to be extremely skewed, and the previous algorithms do not perform well on real data because they do not consider the tendency of the data distribution. In this paper, we propose an algorithm that uses a line segment, in place of the uniform representation, to estimate each bucket. The algorithm constructs the histogram more precisely when the data distribution is skewed and is more suitable for real-world data. We performed a series of experiments, and the results show that our method is more accurate when the data distribution is skewed.
|
author2 |
Arbee L. P. Chen |
author_facet |
Arbee L. P. Chen Hung-Yuan Chen 陳鴻元 |
author |
Hung-Yuan Chen 陳鴻元 |
author_sort |
Hung-Yuan Chen |
title |
Improving Histogram Approximation with Line Representation |
publishDate |
2006 |
url |
http://ndltd.ncl.edu.tw/handle/96749204682141130073 |
_version_ |
1718152353771683840 |
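The contrast drawn in the abstract, between summarizing each bucket by a uniform value and summarizing it by a line segment, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the equal-width bucket split, the least-squares line fit, the sum-of-squared-errors measure, all function names, and the sample data are assumptions chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def uniform_sse(freqs: Sequence[float]) -> float:
    """Squared error when a bucket is summarized by its mean frequency."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def line_fit(freqs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line slope * i + intercept over positions 0..n-1."""
    n = len(freqs)
    x_mean = (n - 1) / 2
    y_mean = sum(freqs) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    if sxx == 0:  # single-value bucket: fall back to a flat line
        return 0.0, y_mean
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(freqs))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def line_sse(freqs: Sequence[float]) -> float:
    """Squared error when a bucket is summarized by its fitted line."""
    slope, intercept = line_fit(freqs)
    return sum((f - (slope * i + intercept)) ** 2 for i, f in enumerate(freqs))


def split_equal_width(freqs: Sequence[float], num_buckets: int) -> List[List[float]]:
    """Equal-width buckets: an assumed stand-in for optimized boundary selection."""
    n = len(freqs)
    bounds = [round(k * n / num_buckets) for k in range(num_buckets + 1)]
    return [list(freqs[a:b]) for a, b in zip(bounds, bounds[1:]) if b > a]


def total_error(
    freqs: Sequence[float],
    num_buckets: int,
    bucket_error: Callable[[Sequence[float]], float],
) -> float:
    """Sum the per-bucket error over all buckets."""
    return sum(bucket_error(b) for b in split_equal_width(freqs, num_buckets))


# Hypothetical skewed frequency vector (made-up data, not from the thesis).
skewed = [100, 60, 40, 25, 15, 10, 6, 4, 2, 1, 1, 1]
uniform_total = total_error(skewed, 3, uniform_sse)
line_total = total_error(skewed, 3, line_sse)
print(f"uniform SSE: {uniform_total:.2f}, line SSE: {line_total:.2f}")
```

Because a flat line is a special case of a fitted line, the per-bucket line error in this sketch can never exceed the uniform error. Note that the comparison holds the bucket count fixed rather than the storage: a line needs two numbers per bucket where a uniform value needs one, so a comparison under a fixed space budget would have to give the uniform histogram more buckets.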