A Study on Efficient Mining of High Utility Sequential Patterns
Doctoral === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Sequential pattern mining is a fundamental research problem in data mining that aims to find all frequent subsequences in a sequence database. Many algorithms have been proposed to address this problem, and it has wide real world applicatio...
Main Authors: | , |
---|---|
Other Authors: | |
Format: | Others |
Language: | en_US |
Published: |
2016
|
Online Access: | http://ndltd.ncl.edu.tw/handle/gsm5at |
id |
ndltd-TW-105NCTU5394090 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-105NCTU5394090 2019-05-15T23:32:32Z http://ndltd.ncl.edu.tw/handle/gsm5at A Study on Efficient Mining of High Utility Sequential Patterns 高效益循序資料樣式探勘之研究 Wang, Jun-Zhe 王濬哲 Doctoral National Chiao Tung University Institute of Computer Science and Engineering 105 Huang, Jiun-Long Chen, Yi-Cheng 黃俊龍 陳以錚 2016 thesis 96 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others
|
sources |
NDLTD |
description |
Doctoral === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Sequential pattern mining is a fundamental research problem in data mining that aims to find all frequent subsequences in a sequence database. Many algorithms have been proposed to address this problem, and it has a wide range of real-world applications. However, frequent sequential patterns may not be informative to some users or businesses, who may instead be interested in rare patterns with high revenue, cost, etc.
In view of this, the concept of high utility sequential pattern mining has recently been introduced; it aims to identify sequences with high utilities (e.g., profits) but possibly low frequencies in a sequence database. Because this problem lacks the downward closure property, most existing algorithms first generate candidate sequences with high Sequence-Weighted Utilities (SWUs), where the SWU is an upper bound on the utilities of a sequence and all its supersequences, and then calculate the actual utilities of these candidates. This generates a large number of candidates, since the SWU is usually much larger than the real utilities of a sequence and all its supersequences. To address this, we propose two tight utility upper bounds, Prefix Extension Utility (PEU) and Reduced Sequence Utility (RSU), together with two companion pruning strategies, and devise the HUS-Span algorithm, which identifies high utility sequential patterns by employing these two pruning strategies. Experimental results on real and synthetic datasets show that HUS-Span generates fewer candidate sequences and thus outperforms prior algorithms in terms of mining efficiency.
In addition, since setting a proper utility threshold is usually difficult for users, we also propose the TKHUS-Span algorithm, which identifies the top-$k$ high utility sequential patterns by using the same two pruning strategies. Three search strategies, guided Depth-First Search (GDFS), Best-First Search (BFS), and a hybrid of BFS and GDFS, are proposed to improve the efficiency of TKHUS-Span. Experimental results on real and synthetic datasets show that TKHUS-Span with the BFS strategy explores fewer candidate sequences and thus outperforms prior algorithms in terms of mining efficiency.
In practice, most sequence databases grow over time, and it is inefficient for existing algorithms to mine high utility sequential patterns (HUSPs) from scratch each time a database receives a small portion of updates. We therefore propose the IncUSP-Miner algorithm to mine HUSPs incrementally. Specifically, to avoid redundant re-computations, we propose a tighter upper bound on the utility of a sequence, called Tight Sequence Utility (TSU), and design a novel data structure, called the candidate pattern tree, to buffer the sequences whose TSU values are greater than or equal to the minimum utility threshold in the original database. To avoid keeping a large amount of utility information for each sequence, each tree node stores only a concise set of utility information. Moreover, several strategies are proposed to reduce the amount of computation for utility updates and the scope of database scans, thereby improving mining efficiency. Experimental results on real and synthetic datasets show that IncUSP-Miner efficiently mines high utility sequential patterns incrementally.
|
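As a rough illustration of the abstract's point about SWU, the following Python sketch computes the actual utility and the SWU of a few patterns on a toy database. The data, the profit table, the single-item events, and all function names are hypothetical and are not taken from the thesis; the utility of a pattern in a sequence is assumed here to be the maximum over its occurrences, and the SWU is assumed to be the sum of the total utilities of the sequences that contain the pattern.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int]          # (item, quantity); one item per event (simplifying assumption)
Sequence = List[Event]

# Hypothetical toy data, not from the thesis.
PROFIT: Dict[str, int] = {"a": 2, "b": 5, "c": 1}
DB: List[Sequence] = [
    [("a", 1), ("b", 2), ("a", 3), ("c", 1)],
    [("b", 1), ("c", 4)],
    [("a", 2), ("c", 2), ("b", 1)],
]


def sequence_utility(seq: Sequence) -> int:
    """Total utility of a whole sequence."""
    return sum(PROFIT[item] * qty for item, qty in seq)


def pattern_utility(pattern: List[str], seq: Sequence) -> Optional[int]:
    """Maximum utility over all occurrences of pattern in seq, or None if absent."""
    # best[j] is the best utility of matching pattern[:j] with the events seen so far.
    best: List[Optional[int]] = [0] + [None] * len(pattern)
    for item, qty in seq:
        u = PROFIT[item] * qty
        # Go right to left so one event fills at most one pattern position per match.
        for j in range(len(pattern), 0, -1):
            prev = best[j - 1]
            if pattern[j - 1] == item and prev is not None:
                if best[j] is None or prev + u > best[j]:
                    best[j] = prev + u
    return best[len(pattern)]


def utility(pattern: List[str], db: List[Sequence]) -> int:
    """Actual utility of a pattern: sum of its per-sequence maximum utilities."""
    per_seq = (pattern_utility(pattern, seq) for seq in db)
    return sum(u for u in per_seq if u is not None)


def swu(pattern: List[str], db: List[Sequence]) -> int:
    """Sequence-weighted utility: total utility of every sequence containing the pattern."""
    return sum(sequence_utility(seq) for seq in db
               if pattern_utility(pattern, seq) is not None)


if __name__ == "__main__":
    for p in (["a"], ["a", "b"], ["a", "c"]):
        print(p, "utility =", utility(p, DB), "SWU =", swu(p, DB))
```

On this toy data the pattern `["a", "b"]` has a higher utility than its subsequence `["a"]`, so utility alone cannot be used for downward-closure pruning, while the SWU of each pattern is well above its actual utility, which is the looseness the abstract describes.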
author2 |
Huang, Jiun-Long |
author_facet |
Huang, Jiun-Long Wang, Jun-Zhe 王濬哲 |
author |
Wang, Jun-Zhe 王濬哲 |
spellingShingle |
Wang, Jun-Zhe 王濬哲 A Study on Efficient Mining of High Utility Sequential Patterns |
author_sort |
Wang, Jun-Zhe |
title |
A Study on Efficient Mining of High Utility Sequential Patterns |
title_short |
A Study on Efficient Mining of High Utility Sequential Patterns |
title_full |
A Study on Efficient Mining of High Utility Sequential Patterns |
title_fullStr |
A Study on Efficient Mining of High Utility Sequential Patterns |
title_full_unstemmed |
A Study on Efficient Mining of High Utility Sequential Patterns |
title_sort |
study on efficient mining of high utility sequential patterns |
publishDate |
2016 |
url |
http://ndltd.ncl.edu.tw/handle/gsm5at |
work_keys_str_mv |
AT wangjunzhe astudyonefficientminingofhighutilitysequentialpatterns AT wángjùnzhé astudyonefficientminingofhighutilitysequentialpatterns AT wangjunzhe gāoxiàoyìxúnxùzīliàoyàngshìtànkānzhīyánjiū AT wángjùnzhé gāoxiàoyìxúnxùzīliàoyàngshìtànkānzhīyánjiū AT wangjunzhe studyonefficientminingofhighutilitysequentialpatterns AT wángjùnzhé studyonefficientminingofhighutilitysequentialpatterns |
_version_ |
1719149565476601856 |
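The top-$k$ setting described in the abstract replaces a user-supplied threshold with one that rises as better patterns are found. The Python sketch below shows one generic way a best-first search can do this with two heaps. It is not the TKHUS-Span algorithm: the search tree, its utilities and upper bounds, and the function name `top_k_best_first` are hypothetical, and each upper bound is simply assumed to dominate the utility of the node and of all its descendants.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical search tree: pattern -> (utility, upper bound, child patterns).
# Each upper bound is assumed valid for the node and its whole subtree.
TREE: Dict[str, Tuple[int, int, List[str]]] = {
    "": (0, 40, ["a", "b", "c"]),
    "a": (10, 40, ["ab", "ac"]),
    "b": (8, 12, ["bc"]),
    "c": (3, 5, []),
    "ab": (21, 30, ["abc"]),
    "ac": (13, 13, []),
    "bc": (9, 9, []),
    "abc": (25, 25, []),
}


def top_k_best_first(tree: Dict[str, Tuple[int, int, List[str]]],
                     k: int) -> Tuple[List[Tuple[int, str]], int]:
    """Return the k highest-utility patterns and the number of nodes explored."""
    top: List[Tuple[int, str]] = []  # min-heap of (utility, pattern)
    # Max-heap on the upper bound, emulated by negating the key.
    frontier = [(-tree[child][1], child) for child in tree[""][2]]
    heapq.heapify(frontier)
    explored = 0
    while frontier:
        neg_bound, pattern = heapq.heappop(frontier)
        # The threshold is the k-th best utility so far (0 until k patterns are known).
        threshold = top[0][0] if len(top) == k else 0
        if -neg_bound < threshold:
            break  # every remaining candidate has an even smaller upper bound
        explored += 1
        node_utility, _, children = tree[pattern]
        if len(top) < k:
            heapq.heappush(top, (node_utility, pattern))
        elif node_utility > top[0][0]:
            heapq.heapreplace(top, (node_utility, pattern))
        for child in children:
            heapq.heappush(frontier, (-tree[child][1], child))
    return sorted(top, reverse=True), explored


if __name__ == "__main__":
    patterns, explored = top_k_best_first(TREE, k=2)
    print(patterns, "explored", explored, "of", len(TREE) - 1, "candidates")
```

Expanding candidates in descending order of upper bound lets the search stop as soon as the best remaining bound falls below the current k-th utility; on this toy tree it visits 3 of the 7 candidates.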