Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation

PhD === National Chiao Tung University === Institute of Computer Science and Engineering === 104 === With the exponential growth of web multimedia content, the Internet is rife with near-duplicate videos: video copies that have undergone visual/temporal transformations and/or post-production editing. These redundant copies lead to the issues of copyright infringement, s...

Full description

Bibliographic Details
Main Authors: Chou, Chien-Li, 周建利
Other Authors: Lee, Suh-Yin
Format: Others
Language: en_US
Published: 2015
Online Access:http://ndltd.ncl.edu.tw/handle/09952688601791502832
id ndltd-TW-104NCTU5394077
record_format oai_dc
spelling ndltd-TW-104NCTU5394077 2017-09-06T04:22:12Z http://ndltd.ncl.edu.tw/handle/09952688601791502832 Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation 基於樣式之近似複製影片檢索、定位與註解 Chou, Chien-Li 周建利 PhD National Chiao Tung University Institute of Computer Science and Engineering 104 With the exponential growth of web multimedia content, the Internet is rife with near-duplicate videos: video copies that have undergone visual/temporal transformations and/or post-production editing. These redundant copies lead to copyright infringement, redundant search results, wasted storage, and related problems. To address these issues, we propose a spatiotemporal pattern-based approach under a hierarchical filter-and-refine framework for efficient and effective near-duplicate video retrieval and localization. First, non-near-duplicate videos are quickly filtered out by a computationally efficient Pattern-based Index Tree (PI-tree). Then, m-Pattern-based Dynamic Programming (mPDP) localizes near-duplicate segments and re-ranks the retrieved videos. For more effective retrieval and localization, a multi-feature framework based on pattern indexing is also proposed. A Modified Pattern-based prefix tree (MP-tree) indexes the patterns of reference videos for fast pattern matching. To estimate how likely a query video and a reference video are near-duplicates, a novel data structure, the Multi-Feature Temporal Relation forest (MFTR-forest), is proposed to discover the temporal relations among matched patterns and to evaluate the near-duplicate degree between the query video and each reference video. Comprehensive experiments on public datasets verify the effectiveness and efficiency of the two proposed frameworks for near-duplicate video retrieval and localization. Experimental results demonstrate that both frameworks outperform the compared state-of-the-art approaches on several evaluation criteria.
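The internals of the PI-tree, MP-tree, mPDP, and MFTR-forest are not detailed in this abstract, so the following is only a minimal illustrative sketch of the general filter-and-refine idea, assuming each video has already been reduced to a sequence of quantized frame-pattern symbols; the n-gram length, the min_hits threshold, and the match/penalty scores are hypothetical choices rather than values from the dissertation. It indexes pattern n-grams of reference videos, filters candidates by matched-pattern counts, and then localizes the best-aligned segment with a Smith-Waterman-style dynamic program.

# Illustrative sketch only: generic pattern indexing and DP-based segment
# localization, not the dissertation's actual PI-tree/mPDP algorithms.
from collections import defaultdict

def build_pattern_index(reference_videos, n=3):
    """reference_videos: {video_id: [frame_pattern_symbol, ...]}.
    Maps every length-n pattern to the set of reference videos containing it."""
    index = defaultdict(set)
    for vid, symbols in reference_videos.items():
        for i in range(len(symbols) - n + 1):
            index[tuple(symbols[i:i + n])].add(vid)
    return index

def filter_candidates(query_symbols, index, n=3, min_hits=5):
    """Filter step: keep only reference videos sharing at least min_hits patterns
    with the query, so clearly non-near-duplicate videos are discarded early."""
    hits = defaultdict(int)
    for i in range(len(query_symbols) - n + 1):
        for vid in index.get(tuple(query_symbols[i:i + n]), ()):
            hits[vid] += 1
    return [vid for vid, count in hits.items() if count >= min_hits]

def localize_segment(query_symbols, ref_symbols, match=1, penalty=-1):
    """Refine step: local-alignment DP over the two symbol sequences.
    Returns (score, (query_start, query_end), (ref_start, ref_end)),
    0-based inclusive; a score of 0 means no matching segment was found."""
    m, n = len(query_symbols), len(ref_symbols)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query_symbols[i - 1] == ref_symbols[j - 1] else penalty
            dp[i][j] = max(0, dp[i - 1][j - 1] + s,
                           dp[i - 1][j] + penalty, dp[i][j - 1] + penalty)
            if dp[i][j] > best:
                best, best_pos = dp[i][j], (i, j)
    i, j = best_pos
    while i > 0 and j > 0 and dp[i][j] > 0:   # trace back to the segment start
        s = match if query_symbols[i - 1] == ref_symbols[j - 1] else penalty
        if dp[i][j] == dp[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + penalty:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0] - 1), (j, best_pos[1] - 1)

A query would first be run through filter_candidates, and localize_segment would be applied only to the surviving reference videos, mirroring the hierarchical filter-and-refine strategy described above.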
Beyond mitigating the drawbacks of near-duplicate videos, their characteristics can also be exploited for automatic video annotation. Traditional video annotation approaches focus on annotating keyframes/shots or whole videos with semantic keywords. However, keyframe/shot extraction may not align with semantic boundaries, and a few keywords can hardly describe the content of a long video spanning multiple topics. In this dissertation, near-scenes, which contain similar concepts, topics, or semantic meanings, are defined for better video content understanding and annotation. We propose a novel hierarchical video-to-near-scene (HV2NS) annotation framework that not only preserves but also purifies the semantic meanings of near-scenes. To detect near-scenes, a pattern-based prefix tree is first constructed for fast retrieval of near-duplicate videos. Then, videos containing similar near-duplicate segments and similar keywords are clustered using multi-modal features, including visual and textual features. To enhance the precision of near-scene detection, a pattern-to-intensity-mark (PIM) method is proposed to perform precise frame-level alignment of near-duplicate segments. For each near-scene, a video-to-concept distribution model analyzes the representativeness of keywords and the discrimination among clusters using the proposed potential term frequency and inverse document frequency (potential TFIDF) and entropy. Tags are ranked by their video-to-concept distribution scores, and the highest-scoring tags are propagated to the detected near-scenes. Extensive experiments demonstrate that the proposed PIM outperforms the compared state-of-the-art approaches in terms of quality segments (QS) and quality frames (QF) for near-scene detection. Furthermore, the proposed hierarchical video-to-near-scene annotation framework achieves high-quality near-scene annotation in terms of mean average precision (MAP). Lee, Suh-Yin Tsai, Wen-Jiin Chen, Hua-Tsung 李素瑛 蔡文錦 陳華總 2015 dissertation ; thesis 103 en_US
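The exact formulations of the video-to-concept distribution model and potential TFIDF are likewise not given in this abstract; the sketch below only conveys the general flavor of tag ranking for near-scene clusters, using ordinary TF-IDF damped by a normalized-entropy discrimination term, with the weighting scheme and top_k value assumed for illustration.

# Illustrative sketch only: generic TF-IDF plus entropy tag ranking, not the
# dissertation's potential TFIDF or video-to-concept distribution model.
import math
from collections import Counter

def rank_cluster_tags(clusters, top_k=5):
    """clusters: {cluster_id: [keyword, ...]} with keywords gathered from the
    videos grouped into each near-scene cluster; returns {cluster_id: [tag, ...]}."""
    n_clusters = len(clusters)
    tf = {c: Counter(words) for c, words in clusters.items()}
    df = Counter(w for counts in tf.values() for w in counts)   # cluster-level document frequency
    totals = {w: sum(tf[c][w] for c in clusters) for w in df}   # keyword totals across clusters
    ranked = {}
    for c, counts in tf.items():
        scores = {}
        for w, f in counts.items():
            tfidf = (f / sum(counts.values())) * math.log(n_clusters / df[w] + 1.0)
            # Entropy of the keyword's spread over clusters: a keyword concentrated
            # in few clusters is more discriminative and is penalized less.
            probs = [tf[k][w] / totals[w] for k in clusters if tf[k][w] > 0]
            entropy = -sum(p * math.log(p) for p in probs)
            max_entropy = math.log(n_clusters) if n_clusters > 1 else 1.0
            scores[w] = tfidf * (1.0 - entropy / max_entropy)
        ranked[c] = [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]
    return ranked

The highest-scoring tags per cluster would then be propagated to the near-scenes in that cluster, in the spirit of the tag-propagation step described in the abstract.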
collection NDLTD
language en_US
format Others
sources NDLTD
author2 Lee, Suh-Yin
author_facet Lee, Suh-Yin
Chou, Chien-Li
周建利
author Chou, Chien-Li
周建利
spellingShingle Chou, Chien-Li
周建利
Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
author_sort Chou, Chien-Li
title Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_short Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_full Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_fullStr Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_full_unstemmed Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation
title_sort pattern-based near-duplicate video retrieval, localization, and annotation
publishDate 2015
url http://ndltd.ncl.edu.tw/handle/09952688601791502832
work_keys_str_mv AT chouchienli patternbasednearduplicatevideoretrievallocalizationandannotation
AT zhōujiànlì patternbasednearduplicatevideoretrievallocalizationandannotation
AT chouchienli jīyúyàngshìzhījìnshìfùzhìyǐngpiànjiǎnsuǒdìngwèiyǔzhùjiě
AT zhōujiànlì jīyúyàngshìzhījìnshìfùzhìyǐngpiànjiǎnsuǒdìngwèiyǔzhùjiě
_version_ 1718527182436827136