Multi-cue visual tracking: feature learning and fusion

As an important and active research topic in the computer vision community, visual tracking is a key component in many applications ranging from video surveillance and robotics to human-computer interaction. In this thesis, we propose new appearance models based on multiple visual cues and address several research issues in feature learning and fusion for visual tracking.

Bibliographic Details
Main Author: Lan, Xiangyuan
Format: Others
Language: English
Published: HKBU Institutional Repository 2016
Subjects: Computer vision; Machine learning
Online Access: https://repository.hkbu.edu.hk/etd_oa/319
https://repository.hkbu.edu.hk/cgi/viewcontent.cgi?article=1319&context=etd_oa
id ndltd-hkbu.edu.hk-oai-repository.hkbu.edu.hk-etd_oa-1319
record_format oai_dc
collection NDLTD
language English
format Others
sources NDLTD
topic Computer vision;Machine learning
description As an important and active research topic in the computer vision community, visual tracking is a key component in many applications ranging from video surveillance and robotics to human-computer interaction. In this thesis, we propose new appearance models based on multiple visual cues and address several research issues in feature learning and fusion for visual tracking. Feature extraction and feature fusion are the two key modules for constructing the appearance model of the tracked target from multiple visual cues.

Feature extraction aims to produce informative features for the visual representation of the tracked target, and many kinds of hand-crafted feature descriptors capturing different types of visual information have been developed. However, since large appearance variations, e.g., occlusion and illumination changes, may occur during tracking, the target samples may be contaminated or corrupted, and the extracted raw features may then fail to capture the intrinsic properties of the target appearance. Moreover, without explicitly imposing discriminability, the extracted features may suffer from the background distraction problem. To extract uncontaminated and discriminative features from multiple visual cues, this thesis proposes a novel robust joint discriminative feature learning framework that is capable of 1) simultaneously and optimally removing corrupted features while learning reliable classifiers, and 2) exploiting both the consistent and the feature-specific discriminative information of multiple features. In this way, the features and classifiers learned from potentially corrupted tracking samples can be better utilized for target representation and foreground/background discrimination.

As suggested by the Data Processing Inequality, fusion at the feature level retains more information than fusion at the classifier level. In addition, not all visual cues/features are reliable, so combining all of them may not yield better tracking performance; it is therefore more reasonable to dynamically select and fuse multiple visual cues. Based on these considerations, this thesis proposes a novel joint sparse representation model in which feature selection, fusion, and representation are performed optimally in a unified framework. By taking advantage of sparse representation, unreliable features are detected and removed while reliable features are fused at the feature level for target representation. To capture the non-linear similarity of features, the model is further extended to perform feature fusion in kernel space. Experimental results demonstrate the effectiveness of the proposed model.

Since different visual cues extracted from the same object should share some commonalities in their representations, while each feature should also retain some diversity reflecting its complementary role in appearance modeling, another important problem in feature fusion is how to learn the commonality and diversity in the fused representations of multiple visual cues so as to enhance tracking accuracy. Unlike existing multi-cue sparse trackers, which consider only the commonalities among the sparsity patterns of multiple visual cues, this thesis proposes a novel multiple sparse representation model for multi-cue visual tracking that jointly exploits the underlying commonalities and diversities of different visual cues by decomposing the multiple sparsity patterns.
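To make the sparsity-pattern decomposition concrete, the following is a minimal numpy sketch of jointly coding one candidate over K visual cues with a shared (commonality) part S and a cue-specific (diversity) part E. The abstract does not specify the thesis's actual formulation or solver; the l2,1/l1 penalties, the proximal-gradient loop, and the names (prox_l21, prox_l1, decomposed_joint_code) are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

def prox_l21(C, tau):
    # Row-wise shrinkage: proximal operator of tau * sum of row l2 norms.
    r = np.linalg.norm(C, axis=1, keepdims=True)
    return C * np.maximum(0.0, 1.0 - tau / np.maximum(r, 1e-12))

def prox_l1(C, tau):
    # Element-wise soft-thresholding: proximal operator of tau * ||C||_1.
    return np.sign(C) * np.maximum(np.abs(C) - tau, 0.0)

def decomposed_joint_code(Ds, ys, lam1=0.1, lam2=0.05, n_iter=300):
    """Sketch: code a candidate over K cues with commonality + diversity.

    Ds: list of K dictionaries D_k (d_k x n), columns are target templates.
    ys: list of K feature vectors y_k (d_k,), one per visual cue.
    The code for cue k is s_k + e_k, where S = [s_1 .. s_K] is forced to
    share one row support across cues (commonality, l2,1 penalty) and E is
    entry-wise sparse (diversity, l1 penalty).
    """
    n, K = Ds[0].shape[1], len(Ds)
    S, E = np.zeros((n, K)), np.zeros((n, K))
    # Step size from a Lipschitz bound on the joint (S, E) gradient.
    L = 2.0 * max(np.linalg.norm(D.T @ D, 2) for D in Ds)
    for _ in range(n_iter):
        # Same smooth-term gradient applies to both blocks.
        G = np.stack([Ds[k].T @ (Ds[k] @ (S[:, k] + E[:, k]) - ys[k])
                      for k in range(K)], axis=1)
        S = prox_l21(S - G / L, lam1 / L)
        E = prox_l1(E - G / L, lam2 / L)
    return S, E
```

The row-wise l2,1 shrinkage on S is also the standard device behind joint-sparsity fusion: rows that survive correspond to templates supported consistently across cues, while non-zero entries of E let an individual cue deviate, which is one way to read the commonality/diversity decomposition described above.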
Moreover, this thesis introduces a novel online multiple metric learning algorithm to efficiently and adaptively incorporate the appearance proximity constraint, which ensures that the learned commonalities of the multiple visual cues are more representative. Experimental results on tracking benchmark videos and other challenging videos show that the proposed tracker achieves better performance than existing sparsity-based trackers and other state-of-the-art trackers.
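The abstract gives no details of the online multiple metric learning algorithm, so the following is only a generic, hedged sketch of one way an appearance proximity constraint can be enforced online with a Mahalanobis metric. The class name OnlineMetric, the hinge-style update, the margins u and l, and the learning rate are all assumptions; running one such learner per visual cue would make the metrics "multiple" in the sense used above.

```python
import numpy as np

class OnlineMetric:
    """Toy online Mahalanobis metric learner (illustrative, not the thesis model).

    Enforces a proximity-style constraint on each streamed pair: similar
    pairs should fall within margin u, dissimilar pairs outside margin l.
    """
    def __init__(self, dim, lr=0.05, u=1.0, l=2.0):
        self.M = np.eye(dim)  # Mahalanobis matrix, kept positive semidefinite
        self.lr, self.u, self.l = lr, u, l

    def update(self, x, y, similar):
        d = x - y
        dist = d @ self.M @ d  # squared Mahalanobis distance
        # Hinge-style correction only when the proximity constraint is violated.
        if similar and dist > self.u:
            self.M -= self.lr * np.outer(d, d)   # pull the pair closer
        elif not similar and dist < self.l:
            self.M += self.lr * np.outer(d, d)   # push the pair apart
        # Project back onto the PSD cone by clipping negative eigenvalues.
        w, V = np.linalg.eigh(self.M)
        self.M = (V * np.maximum(w, 0.0)) @ V.T
```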
author Lan, Xiangyuan
title Multi-cue visual tracking: feature learning and fusion
publisher HKBU Institutional Repository
publishDate 2016
url https://repository.hkbu.edu.hk/etd_oa/319
https://repository.hkbu.edu.hk/cgi/viewcontent.cgi?article=1319&context=etd_oa