Content-Based Spatio-Temporal Video Summarization for Content Adaptation


Bibliographic Details
Main Authors: Chia-Ming Tsai, 蔡佳銘
Other Authors: Jin-Jang Leou
Format: Others
Language: en_US
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/71221475680957064569
Description
Summary: Ph.D. === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 101 === Universal Multimedia Access (UMA) calls for the provision of different presentations of the same multimedia content, at varying levels of complexity, to suit the different usage environments in which the content is consumed. To support UMA, conventional techniques (including video transcoding and scalable video coding) use video cropping or uniform scaling to downscale an original higher-resolution video to a lower-resolution video, and uniformly downsample video frames at a fixed time interval to condense a full-length video into a significantly shortened version. However, the flexibility and performance of these conventional content adaptation methods are still rather limited for high-quality video presentation, and they may cause critical spatio-temporal visual information loss. To this end, we propose efficient spatio-temporal video summarization schemes for content adaptation that preserve critical spatio-temporal visual information, where “summarization” means abstracting video content in the spatial or temporal domain.

We first propose a mosaic-guided video retargeting method to ensure good spatio-temporal coherence of the downscaled video. Guided by a shot-level panoramic mosaic, we embed the per-frame scaling budgets of a video shot as constraints in the iterative optimization process that determines the shot-level global scaling map. Our experimental results demonstrate the good performance of the proposed method in preserving visual information and maintaining spatio-temporal coherence while resizing a video, even when the video contains significant camera and object motions.

We then propose a low-overhead content-adaptive spatial scalability SVC coder (CASS-SVC) to extend the spatial scalability of scalable video coding. The proposed CASS-SVC coder consists of three main modules: a low-overhead video retargeter, a side-information coder, and a non-homogeneous inter-layer predictive coder. Based on the proposed mosaic-guided video retargeting method, the video retargeter assigns the same scaling factor to all pixels within the same column/row to reduce the bitrate for coding the global scaling map, which is then used to derive the scaling maps of the individual frames in the shot at both the encoder and the decoder. The side information required for the non-homogeneous scaling, including the global scaling maps and the spatial positions of individual frames within the panoramic mosaic, is then efficiently coded by the side-information coder. The non-homogeneous inter-layer prediction coding tools provide good predictions that reduce the bitrate for coding the higher-resolution frames. Our experimental results demonstrate that, compared to existing SVC coders, our method not only well preserves the subjective quality of important content in the lower-resolution sequence, but also significantly improves the coding efficiency of the higher-resolution sequence.
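To make the column/row-uniform scaling idea concrete, the following is a minimal Python sketch of how a shot-level, column-uniform global scaling map might be sliced into per-frame scaling maps using each frame's horizontal position in the panoramic mosaic. The function names, the clipping bound, and the budget renormalization rule are illustrative assumptions of ours, not the exact formulation used in the thesis.

```python
import numpy as np

# Hypothetical, simplified sketch: one scaling factor per mosaic column
# (column-uniform), sliced per frame and renormalized to that frame's budget.

def normalize_budget(factors, target_width):
    """Rescale per-column scaling factors so the retargeted columns
    sum to the target width (the per-frame scaling budget)."""
    factors = np.clip(factors, 0.1, None)           # avoid degenerate columns
    return factors * (target_width / factors.sum())

def per_frame_scaling_maps(global_factors, offsets, frame_width, target_width):
    """Slice the shot-level global scaling map at each frame's mosaic offset
    and renormalize it to satisfy that frame's scaling budget."""
    maps = []
    for x0 in offsets:                              # x0: frame's left edge in the mosaic
        cols = global_factors[x0:x0 + frame_width]  # columns covered by this frame
        maps.append(normalize_budget(cols.copy(), target_width))
    return maps

# Toy example: a 200-column mosaic, 3 frames of width 100, target width 60.
rng = np.random.default_rng(0)
global_factors = 0.3 + rng.random(200)              # importance-weighted factors
offsets = [0, 50, 100]
maps = per_frame_scaling_maps(global_factors, offsets, 100, 60)
for i, m in enumerate(maps):
    print(f"frame {i}: retargeted width = {m.sum():.1f}")
```

Because every pixel in a column shares one factor, only a single vector per shot (plus the per-frame mosaic offsets) needs to be transmitted as side information, which is what keeps the overhead low in this sketch.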
Finally, to preserve the highlights in a movie summary, we propose a two-stage scene-based movie summarization method based on mining the relationships between role-communities. Since the role-communities in earlier scenes are usually used to develop the role relationships in later scenes, in the analysis stage we construct a social network to characterize the interactions between role-communities. The social power of each role-community is then evaluated by the community’s centrality value, and the role-communities are clustered into relevant groups based on their centrality values. In the summarization stage, a set of feasible summary combinations of scenes is identified, and an information-rich summary is selected from these candidates based on social power preservation. Our evaluation results show that, in most test cases, our method achieves better subjective performance than attention-based and role-based summarization methods in terms of semantic content preservation for a movie summary.
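As a rough illustration of the two-stage idea, and not the thesis's exact algorithm, the sketch below builds a role-community interaction graph, uses degree centrality as a stand-in for social power, and exhaustively picks the scene combination that preserves the most social power within a length budget. The scene data, the budget, and the omission of the community clustering step are all simplifications of ours.

```python
import itertools
import networkx as nx

# scenes: scene id -> (length in seconds, role-community co-occurrence pairs)
# (made-up toy data for illustration only)
scenes = {
    1: (120, [("hero", "mentor")]),
    2: (90,  [("hero", "villain"), ("villain", "henchman")]),
    3: (150, [("hero", "mentor"), ("hero", "villain")]),
    4: (60,  [("henchman", "mentor")]),
}

# Analysis stage: social network over role-communities, weighted by co-occurrence.
G = nx.Graph()
for length, pairs in scenes.values():
    for u, v in pairs:
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)
power = nx.degree_centrality(G)                    # "social power" per community

def preserved_power(scene_ids):
    """Total social power of the communities that appear in the selected scenes."""
    kept = {c for s in scene_ids for pair in scenes[s][1] for c in pair}
    return sum(power[c] for c in kept)

# Summarization stage: among feasible scene combinations within the budget,
# keep the one that preserves the most social power.
budget = 240
best = max(
    (c for r in range(1, len(scenes) + 1)
       for c in itertools.combinations(scenes, r)
       if sum(scenes[s][0] for s in c) <= budget),
    key=preserved_power,
)
print("selected scenes:", best, "preserved power:", round(preserved_power(best), 3))
```

In practice the candidate set would be pruned (for example, by the clustered community groups described above) rather than enumerated exhaustively; the brute-force search here is only to keep the sketch short and self-contained.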