Improving Histogram Approximation with Line Representation

Bibliographic Details
Main Authors: Hung-Yuan Chen, 陳鴻元
Other Authors: Arbee L. P. Chen
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/96749204682141130073
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 94 === Histograms are widely used to approximate a data distribution with a small number of step functions. Maintaining a histogram for every attribute in a database helps estimate the cost of database operations, and query optimizers usually need such cost estimates to choose an efficient query plan. Histograms are also widely used in approximate query answering systems and in data mining. Storing precomputed histograms in the database incurs memory overhead. The problem of compressing a histogram into a fixed amount of space with the least error is therefore an important one and has been studied by researchers for many years. The most common way to compress a histogram is to divide it into buckets and approximate every bucket with a uniform representation. The problem then becomes how to choose the bucket boundaries so as to minimize the estimation error for a given number of buckets. Previous work has provided a good solution to this bucket boundary selection problem. However, many real-life data distributions are well known to be extremely skewed, and the previous algorithms do not perform well on real data because they do not consider the trend of the data distribution. In this paper, we propose an algorithm that uses a line segment to approximate each bucket in place of the uniform representation. The algorithm constructs the histogram more precisely when the data distribution is skewed and is better suited to real-world data. We performed a series of experiments, and the results show that our method is more accurate when the data distribution is skewed.
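
The contrast between the two bucket representations described in the summary can be illustrated with a small Python sketch. This is not the thesis's algorithm: it assumes fixed, hand-picked bucket boundaries, uses the sum of squared errors as the error measure, and fits each line by ordinary least squares. The function names (`uniform_sse`, `line_sse`, `total_error`) and the sample frequencies are hypothetical and chosen only for illustration.

```python
def uniform_sse(freqs):
    """Squared error when a bucket is summarized by its mean frequency."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def line_sse(freqs):
    """Squared error when a bucket is summarized by a least-squares line."""
    n = len(freqs)
    if n < 2:
        return 0.0  # a single value is always matched exactly
    x_mean = (n - 1) / 2
    y_mean = sum(freqs) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (f - y_mean) for x, f in enumerate(freqs))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return sum((f - (intercept + slope * x)) ** 2 for x, f in enumerate(freqs))


def total_error(freqs, boundaries, bucket_error):
    """Sum a per-bucket error over buckets split at the given indices."""
    edges = [0, *boundaries, len(freqs)]
    return sum(bucket_error(freqs[a:b]) for a, b in zip(edges, edges[1:]))


# Hypothetical skewed frequency vector, split into two buckets at index 4.
freqs = [100, 60, 35, 20, 12, 7, 4, 2]
boundaries = [4]
print("uniform:", total_error(freqs, boundaries, uniform_sse))
print("line:   ", total_error(freqs, boundaries, line_sse))
```

For the same boundaries, the line fit can never have a larger squared error than the mean, because a constant is a special case of a line. A line does need two stored values per bucket rather than one, so a comparison at equal space would give the line-based histogram fewer buckets; the sketch ignores that trade-off and also leaves out the boundary selection step.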