Selectivity estimation of approximate predicates on text

Bibliographic Details
Main Author: Lee, Hongrae
Language: English
Published: University of British Columbia 2010
Online Access: http://hdl.handle.net/2429/28645
Description
Summary: This dissertation studies selectivity estimation of approximate predicates on text. Intuitively, we aim to count the number of strings that are similar to a given query string. This type of problem is crucial for handling text in RDBMSs in an error-tolerant way. A common difficulty with textual data is that it may contain typographical errors or use similar but different textual representations for the same real-world entity. To handle such data in databases, approximate text processing has gained extensive interest, and commercial databases have begun to incorporate such functionality. One of the key components in the successful integration of approximate text processing into RDBMSs is the selectivity estimation module, which is central to optimizing queries involving such predicates. However, these developments are relatively new, and ad hoc approaches, e.g., using a constant, have been employed. This dissertation studies reliable selectivity estimation techniques for approximate predicates on text. Among many possible predicates, we focus on two types that are fundamental building blocks of SQL queries: selections and joins. We study two different semantics for each type of operator. We propose a set of related summary structures and algorithms to estimate the selectivity of selection and join operators with approximate matching. A common challenge is that there can be a huge number of variants to consider. The proposed data structures enable efficient counting by considering a group of similar variants together rather than each one separately. A lattice-based framework is proposed to account for overlapping counts among the groups. We performed an extensive evaluation of the proposed techniques using real-world and synthetic data sets. Our techniques support popular similarity measures, including edit distance, Jaccard similarity, and cosine similarity, and we show how to extend the techniques to other measures. The proposed solutions are compared with state-of-the-art and baseline methods. Experimental results show that the proposed techniques deliver accurate estimates with small space overhead.