Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
PhD dissertation === National Chiao Tung University === Institute of Computer Science and Engineering === 104 (academic year)
Main Author: Chou, Chien-Li (周建利)
Other Authors: Lee, Suh-Yin; Tsai, Wen-Jiin; Chen, Hua-Tsung
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/09952688601791502832
id: ndltd-TW-104NCTU5394077
record_format: oai_dc
spelling: ndltd-TW-104NCTU5394077 2017-09-06T04:22:12Z http://ndltd.ncl.edu.tw/handle/09952688601791502832 Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation 基於樣式之近似複製影片檢索、定位與註解 Chou, Chien-Li (周建利). PhD, National Chiao Tung University, Institute of Computer Science and Engineering, 104. Lee, Suh-Yin (李素瑛), Tsai, Wen-Jiin (蔡文錦), Chen, Hua-Tsung (陳華總). 2015, degree thesis, 103, en_US.
collection: NDLTD
language: en_US
format: Others
sources: NDLTD
description:
PhD === National Chiao Tung University === Institute of Computer Science and Engineering === 104 === With the exponential growth of web multimedia content, the Internet is rife with near-duplicate videos: copies of a video to which visual/temporal transformations and/or post-production effects have been applied. These numerous copies lead to copyright infringement, redundant search results, wasted storage, and related problems. To address these issues, we propose a spatiotemporal pattern-based approach under a hierarchical filter-and-refine framework for efficient and effective near-duplicate video retrieval and localization. First, non-near-duplicate videos are quickly filtered out by a computationally efficient Pattern-based Index tree (PI-tree). Then, m-Pattern-based Dynamic Programming (mPDP) localizes near-duplicate segments and re-ranks the retrieved videos. For more effective retrieval and localization, a multi-feature framework based on a pattern indexing technique is also proposed. A Modified Pattern-based prefix tree (MP-tree) indexes the patterns of reference videos for fast pattern matching. To estimate how likely a query video and a reference video are to be near-duplicates, a novel data structure, termed the Multi-Feature Temporal Relation forest (MFTR-forest), is proposed to discover the temporal relations among matched patterns and to evaluate the near-duplicate degree between the query video and each reference video. Comprehensive experiments on public datasets verify the effectiveness and efficiency of the two proposed frameworks for near-duplicate video retrieval and localization. Experimental results demonstrate that both frameworks outperform the compared state-of-the-art approaches on several evaluation criteria.
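The retrieval pipeline described above is a filter-and-refine design: patterns extracted from reference videos are indexed in a prefix tree (the PI-tree, later the MP-tree) so that most non-near-duplicate candidates can be rejected cheaply, and only the surviving candidates are aligned in detail. This record does not give the exact pattern definition or scoring functions, so the Python sketch below is only an illustration under the assumption that a pattern is a short sequence of quantized frame symbols; the names PatternIndexTree, sliding_patterns, and filter_candidates are hypothetical.

```python
from collections import defaultdict


class PatternIndexTree:
    """Prefix tree over fixed-length symbol patterns: a hypothetical stand-in
    for the PI-tree / MP-tree indexing stage described in the abstract."""

    def __init__(self):
        self.root = {}

    def insert(self, pattern, ref_id, start_frame):
        """Index one pattern (a tuple of quantized frame symbols) of a reference video."""
        node = self.root
        for symbol in pattern:
            node = node.setdefault(symbol, {})
        # Postings at the leaf record which reference video the pattern came
        # from and where it starts, so a later refine stage can align segments.
        node.setdefault("_postings", []).append((ref_id, start_frame))

    def lookup(self, pattern):
        """Return postings of an exactly matching pattern, or an empty list."""
        node = self.root
        for symbol in pattern:
            node = node.get(symbol)
            if node is None:
                return []
        return node.get("_postings", [])


def sliding_patterns(symbols, length=5):
    """Cut a symbol sequence into overlapping fixed-length patterns."""
    return [(i, tuple(symbols[i:i + length]))
            for i in range(len(symbols) - length + 1)]


def filter_candidates(tree, query_symbols, length=5, min_hits=3):
    """Filter stage: keep only reference videos with enough matched patterns."""
    hits = defaultdict(int)
    for _, pattern in sliding_patterns(query_symbols, length):
        for ref_id, _ in tree.lookup(pattern):
            hits[ref_id] += 1
    return {ref_id: n for ref_id, n in hits.items() if n >= min_hits}
```

Reference videos that pass this inexpensive filter would then enter the refine stage (mPDP in the dissertation), which aligns the matched patterns along the time axis to localize near-duplicate segments and re-rank the retrieved videos.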
Beyond mitigating the drawbacks of near-duplicate videos, their characteristics can also be exploited for automatic video annotation. Traditional video annotation approaches focus on annotating keyframes/shots or whole videos with semantic keywords. However, the extraction of keyframes/shots may ignore semantic meaning, and a few keywords can hardly describe the content of a long video covering multiple topics. In this dissertation, near-scenes, which contain similar concepts, topics, or semantic meanings, are defined for better video content understanding and annotation. We propose a novel framework of hierarchical video-to-near-scene (HV2NS) annotation that not only preserves but also purifies the semantic meanings of near-scenes. To detect near-scenes, a pattern-based prefix tree is first constructed to quickly retrieve near-duplicate videos. Then, videos containing similar near-duplicate segments and similar keywords are clustered using multi-modal features, including visual and textual features. To improve the precision of near-scene detection, a pattern-to-intensity-mark (PIM) method is proposed to perform precise frame-level near-duplicate segment alignment. For each near-scene, a video-to-concept distribution model analyzes the representativeness of keywords and the discriminative power of clusters through the proposed potential term frequency and inverse document frequency (potential TFIDF) and entropy. Tags are ranked by their video-to-concept distribution scores, and the tags with the highest scores are propagated to the detected near-scenes. Extensive experiments demonstrate that the proposed PIM outperforms the compared state-of-the-art approaches in terms of quality segments (QS) and quality frames (QF) for near-scene detection. Furthermore, the proposed framework of hierarchical video-to-near-scene annotation achieves high-quality near-scene annotation in terms of mean average precision (MAP).
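The annotation stage ranks candidate tags by a video-to-concept distribution built from the proposed potential TFIDF and an entropy measure. Neither formula is reproduced in this record, so the sketch below only illustrates the general idea using ordinary TF-IDF weighting and Shannon entropy over near-scene clusters: a tag scores highly when it is frequent inside one cluster but not spread evenly across all clusters. The functions tfidf, cluster_entropy, and rank_tags are hypothetical names, not the dissertation's definitions.

```python
import math


def tfidf(tag, cluster_tags, all_clusters):
    """Plain TF-IDF weight of a tag inside one near-scene cluster (a simplified
    stand-in for the dissertation's 'potential TFIDF')."""
    tf = cluster_tags.count(tag) / max(len(cluster_tags), 1)
    df = sum(1 for tags in all_clusters if tag in tags)
    idf = math.log(len(all_clusters) / (1.0 + df)) + 1.0
    return tf * idf


def cluster_entropy(tag, all_clusters):
    """Shannon entropy of a tag's distribution over clusters; a high value means
    the tag appears everywhere and does not discriminate between near-scenes."""
    counts = [tags.count(tag) for tags in all_clusters]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def rank_tags(cluster_tags, all_clusters, top_k=5):
    """Score every candidate tag of one cluster and return the top-ranked ones,
    which would then be propagated to the detected near-scene."""
    scores = {}
    for tag in set(cluster_tags):
        score = tfidf(tag, cluster_tags, all_clusters)
        # Penalize tags whose occurrences are spread evenly over all clusters.
        score /= (1.0 + cluster_entropy(tag, all_clusters))
        scores[tag] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Dividing the TF-IDF weight by (1 + entropy) is just one plausible way to combine the two signals; the dissertation's actual combination of potential TFIDF and entropy may differ.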
author2: Lee, Suh-Yin
author_facet: Lee, Suh-Yin; Chou, Chien-Li (周建利)
author: Chou, Chien-Li (周建利)
spellingShingle: Chou, Chien-Li (周建利); Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
author_sort: Chou, Chien-Li
title: Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_short: Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_full: Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_fullStr: Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_full_unstemmed: Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_sort: pattern-based near-duplicate video retrieval, localization, and annotation
publishDate: 2015
url: http://ndltd.ncl.edu.tw/handle/09952688601791502832
work_keys_str_mv: AT chouchienli patternbasednearduplicatevideoretrievallocalizationandannotation; AT zhōujiànlì patternbasednearduplicatevideoretrievallocalizationandannotation; AT chouchienli jīyúyàngshìzhījìnshìfùzhìyǐngpiànjiǎnsuǒdìngwèiyǔzhùjiě; AT zhōujiànlì jīyúyàngshìzhījìnshìfùzhìyǐngpiànjiǎnsuǒdìngwèiyǔzhùjiě
_version_: 1718527182436827136