Transforming Visual Attention into Video Summarization


Bibliographic Details
Main Authors: Yen-Ting Liu, 劉彥廷
Other Authors: Yu-Chiang Wang
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/qvk4m5
Description
Summary: Master's Thesis === National Taiwan University === Graduate Institute of Communication Engineering === 107 === Video summarization is among the most challenging tasks in computer vision; it aims at identifying highlight frames or shots within lengthy input videos. In this paper, we propose an attention-based model for video summarization that is able to handle complex video data. A novel deep learning framework of multi-head multi-layer video self-attention (M2VSA) is presented to identify informative regions across spatial and temporal video features, jointly exploiting context diversity over space and time for summarization purposes. Together with the visual concept consistency enforced in our framework, both video recovery and summarization capabilities are preserved. More importantly, our model can be realized in both supervised and unsupervised settings. Finally, quantitative and qualitative experimental results demonstrate the effectiveness of our model and its superiority over state-of-the-art approaches.
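The abstract describes multi-head, multi-layer self-attention over video features followed by frame/shot selection. As a rough illustrative sketch only (not the thesis implementation), the PyTorch snippet below shows one way such temporal self-attention over per-frame features with a per-frame importance head could be set up; the class name, feature dimension, head count, layer count, and scoring head are all assumptions rather than details taken from the paper.

# Hypothetical sketch: multi-head, multi-layer self-attention over per-frame
# video features with a frame-importance head. Names and dimensions are
# assumptions for illustration, not the M2VSA code from the thesis.
import torch
import torch.nn as nn

class FrameSelfAttentionScorer(nn.Module):
    def __init__(self, feat_dim=1024, num_heads=8, num_layers=2):
        super().__init__()
        # Stack of multi-head self-attention layers applied along the temporal axis.
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(feat_dim) for _ in range(num_layers))
        # Per-frame importance score used to pick highlight frames or shots.
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) CNN features of each frame.
        x = frame_feats
        for attn, norm in zip(self.layers, self.norms):
            attended, _ = attn(x, x, x)    # each frame attends to all other frames
            x = norm(x + attended)         # residual connection + layer norm
        # Map each attended frame feature to an importance score in [0, 1].
        return torch.sigmoid(self.score_head(x)).squeeze(-1)

if __name__ == "__main__":
    model = FrameSelfAttentionScorer()
    feats = torch.randn(2, 120, 1024)      # e.g. 120 frames of 1024-d features
    scores = model(feats)                  # frame-level importance, shape (2, 120)
    print(scores.shape)

In a summarization pipeline of this kind, the resulting frame scores would typically be thresholded or used in a shot-level knapsack selection to assemble the final summary; the exact selection procedure used in the thesis is not specified in this record.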