Improving Histogram Approximation with Line Representation

Master's === National Tsing Hua University === Department of Computer Science === 94 === Histograms are commonly used to approximate a data distribution with a small number of step functions. Maintaining a histogram for every attribute in a database helps estimate the cost of database operations. Query optimizers usually require such estimati...

Bibliographic Details
Main Authors:	Hung-Yuan Chen, 陳鴻元
Other Authors:	Arbee L. P. Chen
Format:	Others
Language:	en_US
Published:	2006
Online Access:	http://ndltd.ncl.edu.tw/handle/96749204682141130073

id	ndltd-TW-094NTHU5392151
record_format	oai_dc
spelling	ndltd-TW-094NTHU5392151 2015-12-16T04:42:37Z http://ndltd.ncl.edu.tw/handle/96749204682141130073 Improving Histogram Approximation with Line Representation 使用線段估計改進長條圖近似 Hung-Yuan Chen 陳鴻元 Master's, National Tsing Hua University, Department of Computer Science 94 Histograms are commonly used to approximate a data distribution with a small number of step functions. Maintaining a histogram for every attribute in a database helps estimate the cost of database operations. Query optimizers usually require such cost estimates to choose an efficient query plan. Histograms are also widely used in approximate query answering systems and data mining. Storing precomputed histograms in the database incurs memory overhead. The problem of compressing a histogram into a fixed amount of space with the least error is therefore considered important and has been investigated by researchers for many years. The most common approach to compressing a histogram is to divide it into buckets and approximate every bucket with a uniform representation. The problem then becomes how to choose the bucket boundaries to minimize the estimation error for a given number of buckets. Previous approaches provide a desirable solution to this bucket boundary selection problem. However, many real-life data distributions are well known to be extremely skewed. The previous algorithms do not perform well on real data because they do not consider the trend of the data distribution. In this paper, we propose an algorithm that uses a line segment to approximate each bucket in place of the uniform representation. The algorithm constructs the histogram more precisely when the data distribution is skewed and is better suited to real-world data. We performed a series of experiments, and the results show that our method is more accurate when the data distribution is skewed. Arbee L. P. Chen 陳良弼 2006 degree thesis ; thesis 42 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Tsing Hua University === Department of Computer Science === 94 === Histograms are commonly used to approximate a data distribution with a small number of step functions. Maintaining a histogram for every attribute in a database helps estimate the cost of database operations. Query optimizers usually require such cost estimates to choose an efficient query plan. Histograms are also widely used in approximate query answering systems and data mining. Storing precomputed histograms in the database incurs memory overhead. The problem of compressing a histogram into a fixed amount of space with the least error is therefore considered important and has been investigated by researchers for many years. The most common approach to compressing a histogram is to divide it into buckets and approximate every bucket with a uniform representation. The problem then becomes how to choose the bucket boundaries to minimize the estimation error for a given number of buckets. Previous approaches provide a desirable solution to this bucket boundary selection problem. However, many real-life data distributions are well known to be extremely skewed. The previous algorithms do not perform well on real data because they do not consider the trend of the data distribution. In this paper, we propose an algorithm that uses a line segment to approximate each bucket in place of the uniform representation. The algorithm constructs the histogram more precisely when the data distribution is skewed and is better suited to real-world data. We performed a series of experiments, and the results show that our method is more accurate when the data distribution is skewed.
author2	Arbee L. P. Chen
author_facet	Arbee L. P. Chen Hung-Yuan Chen 陳鴻元
author	Hung-Yuan Chen 陳鴻元
spellingShingle	Hung-Yuan Chen 陳鴻元 Improving Histogram Approximation with Line Representation
author_sort	Hung-Yuan Chen
title	Improving Histogram Approximation with Line Representation
title_short	Improving Histogram Approximation with Line Representation
title_full	Improving Histogram Approximation with Line Representation
title_fullStr	Improving Histogram Approximation with Line Representation
title_full_unstemmed	Improving Histogram Approximation with Line Representation
title_sort	improving histogram approximation with line representation
publishDate	2006
url	http://ndltd.ncl.edu.tw/handle/96749204682141130073
work_keys_str_mv	AT hungyuanchen improvinghistogramapproximationwithlinerepresentation AT chénhóngyuán improvinghistogramapproximationwithlinerepresentation AT hungyuanchen shǐyòngxiànduàngūjìgǎijìnzhǎngtiáotújìnshì AT chénhóngyuán shǐyòngxiànduàngūjìgǎijìnzhǎngtiáotújìnshì
_version_	1718152353771683840
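
The abstract contrasts two ways of summarizing a bucket: a uniform (constant) value and a line segment. The record does not include the thesis's algorithm, so the following is only a minimal sketch of that contrast under stated assumptions: bucket boundaries are fixed by hand rather than optimized, error is measured as the sum of squared errors, and the line is fitted by ordinary least squares. All function names are hypothetical.

```python
from typing import Callable, List, Sequence


def uniform_sse(freqs: Sequence[float]) -> float:
    """Squared error when a bucket is summarized by its mean frequency."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def line_sse(freqs: Sequence[float]) -> float:
    """Squared error when a bucket is summarized by a least-squares line.

    Assumption: positions inside the bucket are 0, 1, ..., n-1.
    """
    n = len(freqs)
    if n < 2:
        return 0.0  # a single point is matched exactly
    mean_x = (n - 1) / 2
    mean_y = sum(freqs) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(freqs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(freqs))


def total_error(
    freqs: Sequence[float],
    starts: List[int],
    bucket_error: Callable[[Sequence[float]], float],
) -> float:
    """Sum the per-bucket error for buckets beginning at the given indices.

    `starts` must be sorted and begin with 0; boundaries are NOT optimized here.
    """
    ends = starts[1:] + [len(freqs)]
    return sum(bucket_error(freqs[s:e]) for s, e in zip(starts, ends))


# Hypothetical skewed data: frequencies rise steadily inside each bucket.
data = [1, 2, 3, 4, 10, 20, 30, 40]
starts = [0, 4]
print("uniform:", total_error(data, starts, uniform_sse))
print("line:   ", total_error(data, starts, line_sse))
```

On data that trends within a bucket, the line fit can only match or reduce the squared error of the mean, since a constant is a special case of a line. The trade-off in this sketch is that each bucket stores two numbers (slope and intercept) instead of one.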

Similar Items