Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition

Bibliographic Details
Main Authors: Jing-Ying Chang, 張靖瑩
Other Authors: Liang-Gee Chen
Format: Others
Language: en_US
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/65373878866792003632
id ndltd-TW-098NTU05428009
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description PhD === National Taiwan University === Graduate Institute of Electronics Engineering === 98 === Video-based human action recognition enables important computer-vision applications such as multimedia entertainment, surveillance systems, interactive environments, content-based video analysis, and behavioral biometrics. The major challenge for action recognition algorithms and systems is the semantic gap: robust feature extraction and an effective mathematical action/activity model are required for computers to interpret captured video correctly. The first challenge is choosing features that classify actions well in variable environments; factors such as noise, occlusion, and shadows can severely limit the applicability of these features in real-world conditions, and errors in feature extraction easily propagate to higher levels. The second challenge is building models that represent the characteristics of performers and describe actions precisely and distinctively. Effective mathematical modeling of the representation is always a key step toward machine intelligence. Deciding how many regional and global characteristics of an object the models should capture matters for maintaining discriminative power over complex activities or numerous action types; at the same time, the dimensionality of the models must be limited to avoid the "curse of dimensionality," since higher-dimensional models require many more training samples.

In this dissertation, two scenarios, controller-free gaming applications and abandoned-luggage detection systems, are used to analyze how to derive an effective approach to vision-based human action and activity recognition. The thesis contains two parts: the first discusses the processing modules for extracting the trajectory feature, and the second considers model formation for these scenarios.

In the first part, three modules are discussed: an image descriptor, a tracking algorithm, and an object-correspondence method for multiple-camera tracking environments. The image descriptor is the MPEG-7 color structure descriptor, originally designed for image retrieval, which can be modified to capture human trajectories [1]. To use this descriptor in real-time multimedia applications (30 frames per second), several architecture design techniques are applied. Based on an analysis of histogram accumulation, local histogram observing (LHO) buffers the local structure window for data reuse, and three parallel LHOs are implemented to support real-time operation. Chip area is further reduced in the color transformation and the non-linear quantization: the divider in the color transformation is implemented with a lookup table whose area is 36% of that of the original divider, and the 255 comparators in the non-linear quantization are folded into one. The implemented chip occupies 1.37×1.37 mm² in UMC 0.18 µm technology.

A color-based particle filter is implemented for object tracking. Particle filtering is an effective algorithm for vision-based object tracking: with its probabilistic sampling, a particle filter can readily predict an object's position. However, the computation required to measure the similarity of the samples is high, which makes particle filters hard to use in real-time applications.
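As a rough illustration of where that cost comes from, the following Python sketch shows one predict/measure/resample iteration of a generic color-histogram particle filter. This is not the dissertation's design; the window size, hue binning, motion noise, and the Bhattacharyya-coefficient similarity are illustrative assumptions. The point is that every particle needs its own histogram extraction and comparison, which dominates the per-frame workload.

```python
import numpy as np
import cv2

def color_hist(frame, cx, cy, w, h, bins=8):
    """Hue histogram of a candidate window centred at (cx, cy)."""
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    patch = frame[max(y0, 0):y0 + h, max(x0, 0):x0 + w]
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    return hist / (hist.sum() + 1e-9)

def particle_filter_step(frame, particles, weights, target_hist, w=32, h=64):
    """One predict / measure / resample iteration of a colour-histogram particle filter."""
    rows, cols = frame.shape[:2]
    # Predict: random-walk motion model (noise level is an illustrative choice).
    particles = particles + np.random.normal(0.0, 5.0, particles.shape)
    particles[:, 0] = np.clip(particles[:, 0], w / 2, cols - w / 2 - 1)
    particles[:, 1] = np.clip(particles[:, 1], h / 2, rows - h / 2 - 1)
    # Measure: one histogram extraction and comparison per particle dominates the cost.
    for i, (cx, cy) in enumerate(particles):
        cand = color_hist(frame, cx, cy, w, h)
        weights[i] = np.sum(np.sqrt(cand * target_hist))   # Bhattacharyya coefficient
    weights = weights / weights.sum()
    estimate = weights @ particles                          # weighted mean position
    # Resample so that low-weight particles die out.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles)), estimate
```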
An effective approach is proposed that estimates the positions, sizes, and angles of objects to generate a color histogram as the tracking feature. Several design techniques, such as prioritized finite word length, particle-level parallel operation, and content-addressable memory, are employed. The architecture analysis shows that the proposed architecture is highly efficient for vision-based real-time applications; the content-addressable-memory technique alone reduces the storage requirement by 87.5% in terms of chip area. A prototype chip has been designed and verified in UMC 90 nm CMOS technology. Experimental results show that the chip supports tracking three objects at 31.35 frames per second on average over 720×480 sequences, with tracking accuracy above 87%.

The third module is an object-correspondence method. This work proposes a global-optimization approach to spatial object correspondence in distributed surveillance systems. All object-correspondence systems need calibration; however, because the environment varies and the cameras differ from one another, calibration may be imprecise and similarity comparisons between individuals may be incorrect. The concept of the earth mover's distance (EMD) is therefore employed under imprecise feature measurement. The approach enforces mutually exclusive object correspondence, finds the global optimum, allows partial matches, and can be combined with other feature-measurement approaches (a small matching sketch is given below). Global optimization is achieved by exploring all mutually exclusive match candidates and choosing those that yield the global minimum cost. Applying EMD with a geometry-based feature to public surveillance datasets, the precision of the EMD-based method exceeds that of a greedy-based method by 4.3% to 20.8%, with an average of 10.5%.

In the second part, a nonparametric approach and a parametric approach are adopted for controller-free gaming applications, and a knowledge-based approach is proposed for abandoned-luggage detection. The nonparametric approach is a tile-based, motion-vector-based approach that lets players use their whole body to mimic real Volleyball/GoalKeeper actions and thereby control the character in the game (see the tile-feature sketch below). Motion-vector patterns are introduced to action recognition because video compression is now a standard function of camera systems and is therefore an abundant source of motion vectors. Motion vectors carry the dynamics of the scene, and the motion vectors within an object's region can be used to analyze that object's actions. The performance is compared with a temporal-template-based approach: because the feature vectors of the temporal-template-based approach are generated from a person's entire foreground, no regional information about the foreground is captured, and the proposed motion-vector-based approach therefore outperforms it. The other approach for controller-free gaming is a parametric time-series approach that uses joint trajectories extracted by the particle filter as action descriptors. Each trajectory is converted into a symbol sequence, and action recognition is performed using all combinations of two distance measures and two dictionaries built from fixed-size or adaptive-size segments.
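Returning to the object-correspondence module above, here is a small sketch of globally optimal, mutually exclusive matching between the objects seen by two cameras. It uses the Hungarian algorithm (scipy.optimize.linear_sum_assignment) on a pairwise feature-distance matrix, which corresponds to the equal-weight, complete-matching special case of the earth mover's distance; the dissertation's EMD formulation additionally allows partial matches and imprecise feature weights. The feature vectors and the rejection threshold are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def correspond_objects(feats_a, feats_b, reject_thresh=2.0):
    """Globally optimal, mutually exclusive matching of objects between two camera views.

    feats_a, feats_b: (Na, d) and (Nb, d) arrays of per-object feature vectors
    (e.g. geometry-based descriptors).  Returns a list of (i, j) index pairs.
    """
    cost = cdist(feats_a, feats_b)            # pairwise feature distances
    rows, cols = linear_sum_assignment(cost)  # minimises the total matching cost
    # Drop assignments whose cost is implausibly high (a crude stand-in for the
    # partial matches that the EMD formulation handles natively).
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < reject_thresh]

# Hypothetical usage: three objects in camera A, two in camera B.
a = np.array([[0.1, 1.0], [0.9, 0.2], [0.5, 0.5]])
b = np.array([[0.88, 0.25], [0.12, 0.95]])
print(correspond_objects(a, b))   # e.g. [(0, 1), (1, 0)]
```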
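The tile-based motion-vector feature used for the controller-free games can be sketched similarly. Assuming motion vectors are already available from the encoder (here represented as a dense (rows, cols, 2) array), the object region is split into a fixed grid of tiles, each tile contributes a small orientation histogram weighted by motion magnitude, and the concatenated histograms form the action feature; a nearest-neighbour template match stands in for the classifier. The grid size, bin count, and classifier are illustrative choices, not the dissertation's exact configuration.

```python
import numpy as np

def tile_motion_feature(mv, bbox, grid=(4, 4), bins=8):
    """Concatenated per-tile orientation histograms of motion vectors inside bbox.

    mv: (rows, cols, 2) array of motion vectors (dx, dy), e.g. decoded from the
    compressed bitstream.  bbox: (x0, y0, x1, y1) object region in MV units.
    """
    x0, y0, x1, y1 = bbox
    region = mv[y0:y1, x0:x1]
    th, tw = region.shape[0] // grid[0], region.shape[1] // grid[1]
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            tile = region[gy * th:(gy + 1) * th, gx * tw:(gx + 1) * tw].reshape(-1, 2)
            ang = np.arctan2(tile[:, 1], tile[:, 0])     # motion direction per block
            mag = np.hypot(tile[:, 0], tile[:, 1])       # weight by motion strength
            hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
            feats.append(hist / (hist.sum() + 1e-9))
    return np.concatenate(feats)

def classify_action(feature, templates):
    """Nearest-neighbour match against per-action template features (stand-in classifier)."""
    return min(templates, key=lambda name: np.linalg.norm(feature - templates[name]))
```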
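For the parametric time-series approach, one plausible reading of the symbol-sequence step is sketched below: each joint trajectory is cut into fixed-size segments, each segment's dominant motion direction is quantized to one of eight symbols, and two sequences are compared with an edit (Levenshtein) distance. The segment length, the eight-direction alphabet, and the choice of edit distance are assumptions for illustration; the dissertation evaluates two distance measures and both fixed-size and adaptive-size segmentation.

```python
import numpy as np

DIRS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]   # 8-symbol alphabet (assumed)

def trajectory_to_symbols(traj, seg_len=5):
    """Quantise a joint trajectory (T, 2) into one direction symbol per fixed-size segment."""
    traj = np.asarray(traj, dtype=float)
    symbols = []
    for s in range(0, len(traj) - seg_len, seg_len):
        dx, dy = traj[s + seg_len] - traj[s]                  # net displacement of the segment
        k = int(np.round(np.arctan2(dy, dx) / (np.pi / 4))) % 8
        symbols.append(DIRS[k])
    return symbols

def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences (one possible distance measure)."""
    d = np.arange(len(b) + 1, dtype=int)
    for i, sa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, sb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (sa != sb))
    return int(d[-1])

# Hypothetical usage: classify by the nearest labelled sequence in a dictionary.
# label = min(dictionary, key=lambda name: edit_distance(query, dictionary[name]))
```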
Abandoned luggage is a potential threat to public safety. Three main problems must be solved: identifying objects as luggage, identifying the owners of those objects, and determining whether the owners have left the luggage behind. In crowded areas, however, solutions that identify and track every object on the chance that it may be abandoned luggage are computationally very costly and therefore difficult to use in real-time applications. The knowledge-based approach uses two techniques to detect abandoned luggage efficiently: "foreground-mask sampling" detects luggage of arbitrary appearance, and "selective tracking" locates and tracks owners by looking only at the neighborhood of the luggage. A probability model based on maximum a posteriori estimation generates a confidence score and determines whether luggage has been abandoned deliberately. Experimental results show that once an owner abandons luggage and leaves the scene, the alarm fires within a few seconds. The processing speed of the proposed approach is approximately 15 to 20 frames per second, which is sufficient for real-world applications.
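The abandonment decision can be illustrated with a toy maximum a posteriori formulation. The two hypotheses are "abandoned" and "attended", the observed cues are how long the luggage has been static and how far the owner is from it, and the likelihood parameters and priors are invented for the example; the dissertation's actual probability model and cues are not reproduced here.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """1-D Gaussian density used as a toy likelihood."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative likelihood parameters (mean, std) for each cue under each hypothesis.
MODEL = {
    "abandoned": {"static_time": (90.0, 30.0), "owner_distance": (8.0, 3.0), "prior": 0.1},
    "attended":  {"static_time": (20.0, 15.0), "owner_distance": (1.5, 1.0), "prior": 0.9},
}

def abandonment_confidence(static_time_s, owner_distance_m):
    """Posterior P(abandoned | cues) under the toy model; fire an alarm above a threshold."""
    post = {}
    for hyp, p in MODEL.items():
        like = (gaussian(static_time_s, *p["static_time"]) *
                gaussian(owner_distance_m, *p["owner_distance"]))
        post[hyp] = like * p["prior"]
    z = sum(post.values())
    return post["abandoned"] / z if z > 0 else 0.0

# Hypothetical usage: luggage static for 2 minutes, owner 10 m away.
print(abandonment_confidence(120.0, 10.0))   # close to 1.0 -> raise the alarm
```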
author2 Liang-Gee Chen
author_facet Liang-Gee Chen
Jing-Ying Chang
張靖瑩
author Jing-Ying Chang
張靖瑩
spellingShingle Jing-Ying Chang
張靖瑩
Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
author_sort Jing-Ying Chang
title Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
title_short Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
title_full Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
title_fullStr Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
title_full_unstemmed Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition
title_sort algorithm and architecture analysis of video-based human action and activity recognition
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/65373878866792003632
work_keys_str_mv AT jingyingchang algorithmandarchitectureanalysisofvideobasedhumanactionandactivityrecognition
AT zhāngjìngyíng algorithmandarchitectureanalysisofvideobasedhumanactionandactivityrecognition
AT jingyingchang yǐshìjuéwèijīchǔzhīrénlèidòngzuòbiànshídeyǎnsuànfǎjíjiàgòufēnxī
AT zhāngjìngyíng yǐshìjuéwèijīchǔzhīrénlèidòngzuòbiànshídeyǎnsuànfǎjíjiàgòufēnxī
_version_ 1717740669780361216
spelling ndltd-TW-098NTU054280092015-10-13T13:40:19Z http://ndltd.ncl.edu.tw/handle/65373878866792003632 Algorithm and Architecture Analysis of Video-based Human Action and Activity Recognition 以視覺為基礎之人類動作辨識的演算法及架構分析 Jing-Ying Chang 張靖瑩 博士 臺灣大學 電子工程學研究所 98 Liang-Gee Chen 陳良基 2009 學位論文 ; thesis 179 en_US