Transforming Visual Attention into Video Summarization

Bibliographic Details
Main Authors: Yen-Ting Liu, 劉彥廷
Other Authors: Yu-Chiang Wang
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/qvk4m5
id ndltd-TW-107NTU05435025
record_format oai_dc
spelling ndltd-TW-107NTU054350252019-11-16T05:27:55Z http://ndltd.ncl.edu.tw/handle/qvk4m5 Transforming Visual Attention into Video Summarization 藉由視覺注意力來處理視頻摘要 Yen-Ting Liu 劉彥廷 Master's thesis, National Taiwan University, Graduate Institute of Communication Engineering, 107. Video summarization is among the most challenging tasks in computer vision, aiming to identify highlight frames or shots in lengthy video inputs. In this paper, we propose an attention-based model for video summarization that handles complex video data. A novel deep learning framework of multi-head multi-layer video self-attention (M2VSA) is presented to identify informative regions across spatial and temporal video features, jointly exploiting context diversity over space and time for summarization purposes. Together with the visual concept consistency enforced in our framework, both video recovery and summarization are preserved. More importantly, our model can be realized in both supervised and unsupervised settings. Finally, quantitative and qualitative experimental results demonstrate the effectiveness of our model and its superiority over state-of-the-art approaches. Yu-Chiang Wang 王鈺強 2019 Degree thesis ; thesis 32 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Taiwan University === Graduate Institute of Communication Engineering === 107 === Video summarization is among the most challenging tasks in computer vision, aiming to identify highlight frames or shots in lengthy video inputs. In this paper, we propose an attention-based model for video summarization that handles complex video data. A novel deep learning framework of multi-head multi-layer video self-attention (M2VSA) is presented to identify informative regions across spatial and temporal video features, jointly exploiting context diversity over space and time for summarization purposes. Together with the visual concept consistency enforced in our framework, both video recovery and summarization are preserved. More importantly, our model can be realized in both supervised and unsupervised settings. Finally, quantitative and qualitative experimental results demonstrate the effectiveness of our model and its superiority over state-of-the-art approaches.
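
The abstract describes M2VSA only at a high level. As a rough, hypothetical sketch (not the authors' code), the Python snippet below illustrates the general idea of stacking multi-head self-attention layers over per-frame features and mapping the attended features to frame-level importance scores; the class name, layer sizes, and the top-k selection step are illustrative assumptions.

import torch
import torch.nn as nn

class FrameSelfAttentionScorer(nn.Module):
    """Illustrative sketch: stacked multi-head self-attention over per-frame features."""
    def __init__(self, feat_dim=1024, num_heads=8, num_layers=2):
        super().__init__()
        # "Multi-layer" part: a stack of multi-head self-attention blocks.
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(feat_dim) for _ in range(num_layers)])
        # Map each attended frame feature to a scalar importance score.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):                      # frames: (batch, T, feat_dim)
        x = frames
        for attn, norm in zip(self.attn_layers, self.norms):
            attended, _ = attn(x, x, x)             # self-attention across time
            x = norm(x + attended)                  # residual connection + layer norm
        return torch.sigmoid(self.score(x)).squeeze(-1)   # (batch, T) scores in [0, 1]

# Usage: score 120 frames of 1024-d features, then keep the top 15% as the summary.
scores = FrameSelfAttentionScorer()(torch.randn(1, 120, 1024))
summary_idx = scores.topk(int(0.15 * 120), dim=1).indices

The residual-plus-norm structure and sigmoid scoring head are common choices for attention-based frame scoring; the actual M2VSA model additionally attends over spatial regions and enforces visual concept consistency, which is not reflected in this sketch.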
author2 Yu-Chiang Wang
author_facet Yu-Chiang Wang
Yen-Ting Liu
劉彥廷
author Yen-Ting Liu
劉彥廷
spellingShingle Yen-Ting Liu
劉彥廷
Transforming Visual Attention into Video Summarization
author_sort Yen-Ting Liu
title Transforming Visual Attention into Video Summarization
title_short Transforming Visual Attention into Video Summarization
title_full Transforming Visual Attention into Video Summarization
title_fullStr Transforming Visual Attention into Video Summarization
title_full_unstemmed Transforming Visual Attention into Video Summarization
title_sort transforming visual attention into video summarization
publishDate 2019
url http://ndltd.ncl.edu.tw/handle/qvk4m5
work_keys_str_mv AT yentingliu transformingvisualattentionintovideosummarization
AT liúyàntíng transformingvisualattentionintovideosummarization
AT yentingliu jíyóushìjuézhùyìlìláichùlǐshìpínzhāiyào
AT liúyàntíng jíyóushìjuézhùyìlìláichùlǐshìpínzhāiyào
_version_ 1719292369327620096