Summary: | Master's === National Taiwan University of Science and Technology === Department of Electronic Engineering === 105 === This thesis presents an efficient convolutional neural network (CNN)-based approach for detecting multiple spatio-temporal action tubes in videos. First, a new fusion strategy combines the appearance and flow information from the two-stream CNN-based networks with motion saliency to generate the action detection scores. Next, an efficient multiple path search (MPS) algorithm is developed to find multiple paths simultaneously in a single run. In the forward message-passing stage of MPS, each node stores information on a prescribed number of paths, based on the accumulated scores determined in the previous stages. A backward path tracing is then invoked to recover all of the paths at once by fully reusing the information generated in the forward pass, without repeating the search process, which reduces the computational complexity. Moreover, to rectify potentially inaccurate bounding boxes, a video localization refinement (VLR) scheme is proposed to further boost detection accuracy. Experiments show that the proposed MPS outperforms the main state-of-the-art methods on the widely used UCF-101 and J-HMDB datasets, and that combining it with VLR improves performance further.
|
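The forward and backward passes described in the summary can be illustrated with a generic K-best Viterbi-style search. This is only a minimal sketch under stated assumptions, not the thesis's MPS implementation: the function name k_best_paths, the unary and pairwise score inputs, and the additive scoring are all hypothetical, and the paths returned here may share nodes, whereas the thesis may impose further constraints on the tubes.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def k_best_paths(
    unary: Sequence[Sequence[float]],
    pairwise: Callable[[int, int, int], float],
    k: int,
) -> List[Tuple[float, List[int]]]:
    """Return up to k highest-scoring paths through a frame-by-frame trellis.

    unary[t][i] is the (hypothetical) detection score of node i in frame t.
    pairwise(t, i, j) is the (hypothetical) link score from node i in
    frame t-1 to node j in frame t.
    """
    if not unary:
        return []

    # Forward pass: table[t][j] holds up to k entries, best first, each
    # (accumulated score, predecessor node, predecessor rank).
    table = [[[(s, -1, -1)] for s in unary[0]]]
    for t in range(1, len(unary)):
        layer = []
        for j, s in enumerate(unary[t]):
            candidates = [
                (acc + pairwise(t, i, j) + s, i, r)
                for i, entries in enumerate(table[t - 1])
                for r, (acc, _, _) in enumerate(entries)
            ]
            layer.append(heapq.nlargest(k, candidates))
        table.append(layer)

    # Pick the k best (score, end node, rank) entries in the last frame.
    finals = heapq.nlargest(
        k,
        (
            (acc, j, r)
            for j, entries in enumerate(table[-1])
            for r, (acc, _, _) in enumerate(entries)
        ),
    )

    # Backward pass: follow stored pointers; no search is repeated.
    results = []
    for score, node, rank in finals:
        path = []
        for t in range(len(unary) - 1, -1, -1):
            path.append(node)
            _, node, rank = table[t][node][rank]
        path.reverse()
        results.append((score, path))
    return results


# Toy example: 3 frames, 2 candidate boxes per frame, a penalty of 1
# for switching between box indices in consecutive frames.
toy_unary = [[3, 1], [1, 2], [2, 0]]
toy_pairwise = lambda t, i, j: 0 if i == j else -1
print(k_best_paths(toy_unary, toy_pairwise, 2))
```

In this sketch each node keeps at most k entries, so a single forward pass followed by pointer tracing yields all k paths, rather than rerunning a single-best search once per path.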