Lazy Sampling for Weighted MinHash Algorithm



Bibliographic Details
Main Author: Yung-Hsien Chung (鍾詠先)
Other Authors: Pu-Jen Cheng
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/dy25vv
Description
Summary: Master's thesis === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 107 === The computation of data similarity is a fundamental topic in data mining and machine learning. However, as data sets grow larger, exact computation becomes time-consuming and impractical. To ameliorate this situation, several locality-sensitive hashing (LSH) techniques have been proposed; among the most popular of them, the minHash algorithm has been widely used to sketch the Jaccard similarity of sets. Recently, many studies have adapted the minHash algorithm to other metric spaces, including the weighted Jaccard distance and the l1 distance. We propose lazy sampling to asymptotically accelerate the weighted minHash algorithm in both the offline and online cases, and provide a brief discussion and comparisons with past research.
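To illustrate the basic minHash idea the abstract builds on (not the thesis's lazy-sampling or weighted variant), here is a minimal sketch in Python: each set is summarized by the minimum hash value under many independent salted hash functions, and the fraction of matching signature coordinates estimates the Jaccard similarity. The function names and the choice of salted `hash((salt, x))` in place of true random permutations are illustrative assumptions, not the author's implementation.

```python
import random

def minhash_signature(s, num_hashes=128, seed=0):
    # One salt per simulated random permutation; for each salt, keep the
    # minimum hash value over the set's elements (illustrative scheme,
    # standing in for true random permutations).
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # Pr[min-hash collision] equals the Jaccard similarity, so the
    # fraction of agreeing coordinates is an unbiased estimator.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two overlapping sets with true Jaccard similarity 60/100 = 0.6.
a = set(range(0, 80))
b = set(range(20, 100))
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

With 128 hash functions the standard error of the estimate is roughly sqrt(0.6 * 0.4 / 128) ≈ 0.04, so `est` lands near 0.6; the weighted minHash algorithms the thesis accelerates generalize this collision property to weighted sets.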