Generating Cross-domain Visual Description via Adversarial Learning

Bibliographic Details
Main Authors: Chen, Tseng-Hung, 陳增鴻
Other Authors: Sun, Min
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/r8k45f
Description
Summary: Master's Thesis === National Tsing Hua University === Department of Electrical Engineering === 105 === Impressive image captioning results have been achieved in domains with plenty of paired training images and sentences (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner, namely a domain critic and a multi-modal critic. The domain critic assesses whether the generated sentences are indistinguishable from sentences in the target domain. The multi-modal critic assesses whether an image and its generated sentence form a valid pair. During training, the critics and the captioner act as adversaries: the captioner aims to generate indistinguishable sentences, whereas the critics aim to distinguish them. The critics' assessments improve the captioner through policy-gradient updates. During inference, we further propose a novel critic-based planning method to select high-quality sentences without additional supervision (e.g., tags). To evaluate, we use MSCOCO as the source domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k) as the target domains. Our method consistently performs well on all datasets. Utilizing the learned critic during inference further boosts the overall performance on CUB-200 and Oxford-102. Furthermore, we extend our method to video captioning and observe improvements when adapting between large-scale video captioning datasets such as MSR-VTT, M-VAD, and MPII-MD.
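
The abstract outlines the core training loop: a captioner generates sentences for target-domain images, a domain critic and a multi-modal critic score them, and the scores are fed back to the captioner as a reward through policy-gradient updates. The sketch below is a minimal PyTorch illustration of that loop, not the thesis' actual implementation; the network sizes, reward weighting, sampling scheme, and all class and variable names (Captioner, Critic, src_sent, tgt_img, etc.) are illustrative assumptions.

```python
# Minimal sketch of adversarial cross-domain captioning with two critics.
# All architectures and hyperparameters here are placeholders for illustration.
import torch
import torch.nn as nn
from torch.distributions import Categorical

VOCAB, EMB, HID, IMG_FEAT, MAX_LEN = 1000, 64, 128, 256, 12


class Captioner(nn.Module):
    """Policy network: samples a sentence conditioned on an image feature."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.init_h = nn.Linear(IMG_FEAT, HID)
        self.cell = nn.GRUCell(EMB, HID)
        self.out = nn.Linear(HID, VOCAB)

    def sample(self, img):
        h = torch.tanh(self.init_h(img))
        tok = torch.zeros(img.size(0), dtype=torch.long)  # assume <BOS> index 0
        words, log_probs = [], []
        for _ in range(MAX_LEN):
            h = self.cell(self.embed(tok), h)
            dist = Categorical(logits=self.out(h))
            tok = dist.sample()
            words.append(tok)
            log_probs.append(dist.log_prob(tok))
        return torch.stack(words, 1), torch.stack(log_probs, 1)


class Critic(nn.Module):
    """Scores a sentence (domain critic) or an image-sentence pair (multi-modal critic)."""

    def __init__(self, img_dim=0):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.score = nn.Linear(HID + img_dim, 1)

    def forward(self, sent, img=None):
        _, h = self.rnn(self.embed(sent))            # h: (1, batch, HID)
        feat = h[-1] if img is None else torch.cat([h[-1], img], dim=1)
        return torch.sigmoid(self.score(feat)).squeeze(1)


captioner = Captioner()
domain_critic = Critic()                   # judges sentence style only
mm_critic = Critic(img_dim=IMG_FEAT)       # judges (image, sentence) pairing
opt_cap = torch.optim.Adam(captioner.parameters(), lr=1e-4)
opt_cri = torch.optim.Adam(
    list(domain_critic.parameters()) + list(mm_critic.parameters()), lr=1e-4)
bce = nn.BCELoss()

# Toy data: a paired source batch and an unpaired target batch (random tensors).
src_img = torch.randn(4, IMG_FEAT)
src_sent = torch.randint(0, VOCAB, (4, MAX_LEN))
tgt_img = torch.randn(4, IMG_FEAT)
tgt_sent = torch.randint(0, VOCAB, (4, MAX_LEN))   # unpaired target sentences

for step in range(2):
    # Critic update: target sentences and paired source data are "real",
    # generated sentences for target images are "fake".
    with torch.no_grad():
        fake_sent, _ = captioner.sample(tgt_img)
    d_loss = (bce(domain_critic(tgt_sent), torch.ones(4))
              + bce(domain_critic(fake_sent), torch.zeros(4))
              + bce(mm_critic(src_sent, src_img), torch.ones(4))
              + bce(mm_critic(fake_sent, tgt_img), torch.zeros(4)))
    opt_cri.zero_grad()
    d_loss.backward()
    opt_cri.step()

    # Captioner update: REINFORCE, using the critics' scores as the reward.
    sent, log_probs = captioner.sample(tgt_img)
    reward = 0.5 * domain_critic(sent) + 0.5 * mm_critic(sent, tgt_img)
    pg_loss = -(log_probs.sum(dim=1) * reward.detach()).mean()
    opt_cap.zero_grad()
    pg_loss.backward()
    opt_cap.step()
```

The reward is applied through REINFORCE because the sampled word indices are discrete, so the critics' scores cannot be backpropagated into the captioner directly; this is the role the policy-gradient updates play in the procedure described above.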