Summary: | Doctoral === National Chiao Tung University === Institute of Computer Science and Engineering === 106 === Video analytics plays a major role in both academic research and industry because of its commercial benefits and practical applications. The rapid development of video analysis tools has led to a proliferation of content-related statistical data. Video analytics offers users not only efficient video browsing but also comprehensive statistical information about video content. Typically, however, the descriptive content-related data is annotated and interpreted manually by a video analyst, and collecting meaningful and relevant statistics by watching an entire video is laborious. Hence, this dissertation proposes a preliminary construction of automatic systems for video analytics to tackle this challenge.
People locations are among the most informative cues for collecting video analytics. We therefore propose automatic systems capable of locating people under occlusion in crowded scenes. Two scenarios for generating video analytics are considered in this dissertation: sports videos and surveillance videos.
For sports videos, we propose an automatic system capable of localizing players in broadcast volleyball videos. Serve receive-to-attack (SR2A) is the principal way to score points in volleyball games, and the positions of players on the court reveal informative clues about both offensive and defensive formations. However, state-of-the-art supervised learning-based methods for player localization require a large amount of labeled training data, so an automatic system that avoids this labeling cost is needed. Therefore, this dissertation develops and presents a novel 2D histogram-based player localization method that extracts SR2A periods from long broadcast volleyball videos and then locates players under occlusion. The proposed system automatically detects the court lines for camera calibration, extracts players by calculating both the x and y histograms of the extracted player masks, and visualizes the team formations on a real-world court model. Experiments on broadcast volleyball videos show that the method is efficient and effective compared with a traditional object segmentation method (connected component analysis) and a supervised learning approach that uses Histogram of Oriented Gradients features.
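As a rough illustration of the x and y histogram idea mentioned above, the following minimal Python sketch projects a binary player mask onto both axes and finds runs of occupied columns and rows. The function names, the toy mask, and the zero threshold are assumptions made for illustration only; the abstract does not specify how the dissertation's method handles occluded players.

```python
def projection_histograms(mask):
    """Count foreground pixels per column (x) and per row (y) of a binary mask."""
    x_hist = [sum(col) for col in zip(*mask)]  # one count per column
    y_hist = [sum(row) for row in mask]        # one count per row
    return x_hist, y_hist


def runs_above(hist, threshold=0):
    """Return half-open (start, end) index ranges where hist > threshold.

    The threshold is a hypothetical tuning parameter, not a value from the source.
    """
    runs, start = [], None
    for i, value in enumerate(hist):
        if value > threshold and start is None:
            start = i                      # a run of occupied bins begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends just before index i
            start = None
    if start is not None:
        runs.append((start, len(hist)))    # run reaches the end of the histogram
    return runs


# Toy 4x8 mask (1 = player pixel); purely illustrative data.
mask = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
]
x_hist, y_hist = projection_histograms(mask)
x_runs = runs_above(x_hist)
y_runs = runs_above(y_hist)
```

In this toy case the two upper-left and lower-middle blobs merge into a single x run, which hints at why both histograms are needed: candidate regions come from combining x runs with y runs rather than from either axis alone.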
In addition, impressive spikes in volleyball matches, such as delayed spikes or alternate-position spikes, typically result from skillful player movement and jumping. Jump actions usually accompany spikes and indicate significant events in a match. In this dissertation, we also propose an effective system that recognizes jump patterns in player moving trajectories from long broadcast volleyball videos (taken with a pan-tilt-zoom camera). First, the entire video is segmented into rally clips by shot segmentation and whistle detection. Then, camera calibration is used to find the correspondence between coordinates in the video frames and real-world coordinates. Once the homographic transformation matrix is computed, real-world player moving trajectories can be derived from a sequence of tracked player locations in the video frames. We recognize jump patterns in the player moving trajectory using a sliding-window scheme with physics-based validation and a context constraint. Finally, the jump locations are estimated and the jump tracks are separated from the planar moving tracks. Experiments conducted on broadcast volleyball videos show promising results.
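The mapping from tracked image locations to real-world court coordinates described above can be sketched as follows. This is a minimal example assuming a 3x3 homography matrix is already available from calibration; the matrix values, function names, and sample points are hypothetical and are not taken from the dissertation.

```python
def apply_homography(H, point):
    """Map an image point (x, y) to plane coordinates using a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates


def image_track_to_court(H, track):
    """Convert a sequence of tracked image locations into court coordinates."""
    return [apply_homography(H, p) for p in track]


# Hypothetical calibration result: pure scaling plus translation.
H = [
    [0.01, 0.0, -1.0],
    [0.0, 0.02, -2.0],
    [0.0, 0.0, 1.0],
]
court_track = image_track_to_court(H, [(100, 100), (500, 300)])
```

A homography is only valid for points on the calibrated plane, which is consistent with the abstract's need to separate jump tracks from planar moving tracks: a player in the air no longer lies on the court plane.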
Moreover, for surveillance videos, we investigate the possibility of locating people from multiple views. People locations provide rich information for a wide spectrum of applications in intelligent video surveillance systems, such as abnormal event detection, synopsis video generation, and behavior analysis. In addition to localization accuracy, computational efficiency is a significant concern in people localization. As an essential early stage, people localization has to be completed in a very short time to enable further semantic analysis. However, most state-of-the-art people localization methods pay little attention to computational efficiency. We are therefore motivated to propose mechanisms that improve the processing speed of people localization while maintaining high localization accuracy. In this dissertation, we introduce a torso-high reference plane, onto which foreground information from multiple cameras is projected to predict potential people locations, instead of the ground reference plane used in some previous works. Because the torso of a human body is usually more intact than the feet after foreground extraction, the torso-high reference plane yields reliable potential people locations, especially after a foreground line sampling scheme is applied for data reduction. Then, a novel and computationally efficient bitwise-operation scheme is proposed to predict people locations at the intersection regions of foreground line samples from multiple views. After rule-based validation, people locations can be accurately obtained and visualized on a real-world plane.
Experiments on multi-view surveillance videos not only validate the high accuracy of the proposed method in locating people in crowded scenes with severe occlusions, but also demonstrate a high processing speed (over 300 frames per second on average), which is sufficient to meet the real-time requirement of many surveillance applications.
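To give a sense of why bitwise operations make the intersection step cheap, here is a minimal Python sketch. It assumes, purely for illustration, that each camera's projected foreground on the reference plane is encoded as an integer bitmask with one bit per plane cell; the abstract does not describe the dissertation's actual encoding, and the function name and sample masks are hypothetical.

```python
def candidate_cells(view_masks):
    """AND per-view occupancy bitmasks and list the cells set in every view.

    Each mask is an int whose bit i is 1 when that view's projected
    foreground covers reference-plane cell i (an assumed encoding).
    """
    if not view_masks:
        raise ValueError("at least one view mask is required")
    combined = view_masks[0]
    for mask in view_masks[1:]:
        combined &= mask  # a cell survives only if every view sees foreground there
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


# Three hypothetical camera views over a 6-cell strip of the reference plane.
views = [0b101101, 0b001111, 0b101100]
cells = candidate_cells(views)
```

A single AND processes many cells at once, which is the general reason this kind of scheme can be fast; the surviving cells would still need the rule-based validation mentioned above to reject false intersections.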
We also conduct experiments on multi-camera videos of volleyball matches to investigate the applicability of the proposed people localization approach. A player's jump action can also be recognized by employing a higher reference plane in this approach. Finally, we compare the proposed localization method with a deep learning-based detection approach.
|