Clustering by Correlations and Similarity Search over Multiple Data Streams

Bibliographic Details
Main Author: Mi-Yen Yeh (葉彌妍)
Other Authors: Ming-Syan Chen
Format: Others
Language: en_US
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/75671437948173948854
Description
Summary: Doctoral dissertation === National Taiwan University === Graduate Institute of Electrical Engineering === 97 === Processing data streams has become increasingly important as more and more emerging applications must handle large amounts of data in the form of rapidly arriving streams. The sheer volume and evolving nature of such streams make them challenging to process. Moreover, in many cases more than one data stream must be analyzed simultaneously. To discover knowledge from multiple data streams, it is useful to first know the cross-relationships among them. This dissertation therefore focuses on finding relationships between streams, covering both clustering by correlations and similarity search over many streams.

First, we devise a framework for Clustering Over Multiple Evolving sTreams by CORrelations and Events, abbreviated as COMET-CORE, which monitors the distribution of clusters over multiple data streams based on their correlations. In a multiple-data-stream environment, where streams evolve as time advances, some streams may behave similarly at one moment but dissimilarly at the next. Information about these evolving clusters is valuable for supporting online decisions. Instead of directly re-clustering the streams periodically, COMET-CORE applies efficient cluster split and merge operations only when significant cluster evolution happens. Accordingly, we devise an event-detection mechanism to signal cluster adjustments. Incoming streams are smoothed into sequences of end points by piecewise linear approximation, and whenever end points are generated, the weighted correlations between streams are updated. End points are good indicators of significant change in a stream, which is a main cause of cluster-evolution events. When an event occurs, the latest clustering results can be reported through split and merge operations. In many real cases, however, streams are collected independently in a decentralized manner.
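As an illustration of the correlation monitoring described above, a time-weighted Pearson correlation between two streams might look as follows. This is a minimal sketch: the exponential `decay` factor and the function name are illustrative assumptions, not details from the dissertation, and COMET-CORE updates weighted correlations incrementally at end points rather than recomputing them from scratch.

```python
import math

def weighted_correlation(x, y, decay=0.95):
    """Pearson correlation of two equal-length streams with exponential
    time-weighting, so recent points count more. The decay value is an
    illustrative assumption, not a parameter from the dissertation."""
    n = len(x)
    # The newest point gets weight 1; older points decay geometrically.
    w = [decay ** (n - 1 - i) for i in range(n)]
    tot = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / tot
    my = sum(wi * yi for wi, yi in zip(w, y)) / tot
    cov = sum(wi * (xi - mx) * (yi - my)
              for wi, xi, yi in zip(w, x, y)) / tot
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / tot
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / tot
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0
```

Two streams that move in lockstep score near +1 and two that move oppositely score near -1, so thresholding this value gives one plausible way to decide when streams should share a cluster.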
Given a reference stream, searching for its most similar streams, which may reside in more than one distributed database, is helpful for many applications. We therefore present LEEWAVE, a bandwidth-efficient approach to searching for range-specified k-nearest neighbors among distributed streams by LEvEl-wise distribution of WAVElet coefficients. This work addresses the case in which all streams are summarized by wavelet-based synopses. To find the k streams most similar to a range-specified reference stream, the relevant wavelet coefficients of the reference stream can be sent to the peer sites, which then compute the similarities. However, bandwidth is wasted unnecessarily if all the relevant coefficients are sent at once. Instead, we present a level-wise approach that leverages the multi-resolution property of wavelet coefficients. Starting from the top level and moving down one level at a time, the query initiator sends only the coefficients of a single level to a progressively shrinking set of candidates. In addition, we derive and maintain a similarity range for each candidate and gradually tighten its bounds as we move from one level to the next. The increasingly tightened similarity ranges enable the query initiator to prune candidates effectively without causing any false dismissals.

Finally, we discuss the case in which each stream is composed of uncertain values. We present PROUD, a PRObabilistic approach to processing similarity queries over Uncertain Data streams. In contrast to streams of certain values, an uncertain stream is an ordered sequence of random variables, and the distance between two uncertain streams is itself a random variable. We use a general uncertain data model in which only the means and deviations of the random variables in an uncertain stream are available. Under this model, we first derive mathematical conditions for progressively pruning candidate streams to reduce the computation cost.
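The similarity-range idea behind LEEWAVE's level-wise transmission can be sketched under an orthonormality assumption: the squared Euclidean distance decomposes across wavelet levels, so after receiving only the first few levels of the reference coefficients (plus the energy of the unseen remainder), a candidate site can bound the full distance via the triangle inequality. The function name, the level-list representation, and the energy-shipping detail are illustrative assumptions, not the paper's exact protocol.

```python
import math

def level_bounds(ref_levels, cand_levels, seen):
    """Given wavelet coefficients grouped by level (coarsest first),
    return [lower, upper] bounds on the full squared Euclidean distance
    after only the first `seen` reference levels have been received.
    Assumes an orthonormal transform, so distance sums across levels."""
    # Exact contribution of the levels seen so far.
    d_seen = sum((a - b) ** 2
                 for la, lb in zip(ref_levels[:seen], cand_levels[:seen])
                 for a, b in zip(la, lb))
    # Norms (energies) of the not-yet-transmitted remainders.
    ref_rem = math.sqrt(sum(a * a for lv in ref_levels[seen:] for a in lv))
    cand_rem = math.sqrt(sum(b * b for lv in cand_levels[seen:] for b in lv))
    # Triangle inequality: | ||a||-||b|| | <= ||a-b|| <= ||a||+||b||.
    lower = d_seen + (ref_rem - cand_rem) ** 2
    upper = d_seen + (ref_rem + cand_rem) ** 2
    return lower, upper
```

As more levels arrive the two norms shrink toward zero and the bounds converge to the exact distance, which is what lets the initiator discard candidates early without false dismissals.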
We then apply PROUD to a streaming environment where only sketches of streams, such as wavelet synopses, are available. PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost. This trade-off is important because in some applications false negatives are more costly, while in others it is more critical to keep the false positives low.
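The uncertain-distance model can be illustrated with a minimal sketch. If the stream values are independent random variables described only by their means and standard deviations, the expected squared distance has a simple closed form; this shows just the distance definition under that independence assumption, not PROUD's probabilistic pruning conditions or threshold mechanism, and the function name is hypothetical.

```python
def expected_sq_distance(mu_x, sd_x, mu_y, sd_y):
    """Expected squared Euclidean distance between two uncertain streams
    given per-point means and standard deviations. For independent X, Y:
    E[(X - Y)**2] = (mu_x - mu_y)**2 + sd_x**2 + sd_y**2, summed pointwise."""
    return sum((mx - my) ** 2 + sx ** 2 + sy ** 2
               for mx, sx, my, sy in zip(mu_x, sd_x, mu_y, sd_y))
```

Because the distance is itself a random variable, a similarity query can then be answered probabilistically, e.g. by asking whether the distance falls below a threshold with sufficient probability, which is where the false-positive/false-negative trade-off above comes from.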