Summary: | Video description has been widely studied in the computer vision community and applied in many scenarios. Typical approaches are based on the encoder-decoder framework: the encoder extracts fixed-length video representation vectors from the upper-layer outputs of pre-trained convolutional neural networks (CNNs), and the decoder uses recurrent neural networks to generate a textual sentence. However, the upper layers of CNNs contain low-resolution but semantically strong features, while the lower layers contain high-resolution but semantically weak features. Existing methods rarely exploit this multi-scale information of CNNs for video description, and ignoring it leads to descriptions that are neither detailed nor comprehensive. This paper applies hierarchical convolutional long short-term memory (ConvLSTM) within the encoder-decoder framework to extract features from both the upper and lower layers of CNNs. Moreover, multiple network structures are designed to explore the spatio-temporal feature extraction performance of ConvLSTM, with the three-layer ConvLSTM approaching the optimal accuracy. To further improve the language quality of the generated descriptions, an attention mechanism is applied to the visual features output by the ConvLSTM. Extensive experimental results demonstrate that the proposed method outperforms existing approaches.
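For readers unfamiliar with ConvLSTM, the sketch below illustrates the general idea of stacking convolutional LSTM cells over a sequence of frame-level CNN feature maps, with a three-layer stack as mentioned above. It is a minimal PyTorch illustration with hypothetical layer sizes and names, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: LSTM gates computed by a 2D convolution."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell state
        h = o * torch.tanh(c)           # update hidden state
        return h, c

def encode(frames, cells):
    """Run a stack of ConvLSTM cells over frames of shape (T, B, C, H, W)."""
    T, B, _, H, W = frames.shape
    states = [(torch.zeros(B, c.hidden_channels, H, W),
               torch.zeros(B, c.hidden_channels, H, W)) for c in cells]
    outputs = []
    for t in range(T):
        x = frames[t]
        for i, cell in enumerate(cells):
            states[i] = cell(x, states[i])
            x = states[i][0]            # hidden state feeds the next layer
        outputs.append(x)               # per-frame visual feature for attention
    return torch.stack(outputs)         # (T, B, hidden_channels, H, W)

# Example usage with hypothetical sizes: 8 frames of 512-channel 7x7 CNN maps.
cells = nn.ModuleList([ConvLSTMCell(512, 256),
                       ConvLSTMCell(256, 256),
                       ConvLSTMCell(256, 256)])
features = encode(torch.randn(8, 2, 512, 7, 7), cells)
```

In such a design, the per-frame outputs would typically be pooled and fed to an attention-based recurrent decoder that weights them when generating each word of the sentence.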