Summary: | In this paper, we present a contextual relationship-based learning model that uses deep neural networks to recognize the activities performed by a group of people in a video sequence. The proposed model learns context with a bottom-up approach, from individual human actions to group-level activity, and also learns from scene information. We build a deep convolutional neural network to capture human action-pose features from a given input video sequence. To capture group-level temporal flow changes, the aggregated action-pose features of persons within the context area are fed to a deep recurrent neural network, which provides a spatio-temporal group descriptor. Alongside this, we build a scene-level convolutional neural network to extract scene-level features, which improves the performance of group activity recognition. A probabilistic inference model, added as an additional layer in the deep neural network, ensembles the models and provides a unified deep learning framework. Experimental results on the standard benchmark collective activity dataset show the efficiency of the proposed model in group activity recognition. We also report results obtained by varying learning parameters, optimizers, and recurrent neural network variants, in particular long short-term memory and gated recurrent unit, on the same dataset. Keywords: Group activity recognition, Convolutional neural network, Long short-term memory, Gated recurrent unit, Context learning
|
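The summary above names two steps without giving their form: aggregating per-person action-pose features within a context area, and combining the group-level and scene-level models in a probabilistic layer. The following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a circular context area around an anchor person, element-wise max pooling as the aggregation, and a normalized product of class probabilities as the fusion rule. All function names, the radius, and the toy values are hypothetical.

```python
import math


def context_members(centers, anchor, radius):
    """Indices of persons whose center lies within `radius` of the anchor.

    Assumption: the context area is a circle around one anchor person.
    """
    ax, ay = centers[anchor]
    return [i for i, (x, y) in enumerate(centers)
            if math.hypot(x - ax, y - ay) <= radius]


def aggregate(features, members):
    """Element-wise max over the feature vectors of the selected persons.

    Assumption: max pooling stands in for the unspecified aggregation.
    """
    dims = len(features[0])
    return [max(features[i][d] for i in members) for d in range(dims)]


def fuse(p_group, p_scene):
    """Normalized element-wise product of two class distributions.

    Assumption: a product-of-experts rule stands in for the unspecified
    probabilistic inference layer.
    """
    joint = [g * s for g, s in zip(p_group, p_scene)]
    total = sum(joint)
    if total == 0:
        raise ValueError("the two distributions share no supported class")
    return [j / total for j in joint]


# Toy frame: three persons with 2-D positions and 2-D action-pose features.
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
features = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.8]]

members = context_members(centers, anchor=0, radius=2.0)
frame_descriptor = aggregate(features, members)

# Hypothetical class probabilities from the group-level and scene-level models.
fused = fuse([0.6, 0.4], [0.5, 0.5])
```

In the full pipeline described in the summary, one such frame descriptor per time step would be passed to the recurrent network (long short-term memory or gated recurrent unit), and the fusion would operate on the outputs of the trained group and scene models.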