Cross-Lingual Image Caption Generation Based on Visual Attention Model

Automatically generating image captions is an interesting and challenging problem that has attracted increasing attention in the natural language processing and computer vision communities. In this paper, we propose an end-to-end deep learning approach for image caption generation. At each time step, we leverage image feature information at specific locations and generate the corresponding caption description through a semantic attention model. The end-to-end framework allows us to introduce an independent recurrent structure as an attention module, derived by calculating the similarity between the image feature sequence and the semantic word sequence. Additionally, our model is designed to transfer the knowledge representation obtained from the English portion to the Chinese portion to achieve cross-lingual image captioning. We evaluate the proposed model on the most popular benchmark datasets. For cross-lingual image captioning on the Flickr8k CN dataset, we report an improvement of 3.9% over existing state-of-the-art approaches on the CIDEr metric. The experimental results demonstrate the effectiveness of our attention model.

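The abstract describes an attention module derived by calculating the similarity between an image feature sequence and a semantic word sequence. The following is a minimal, illustrative sketch of that kind of similarity-based attention; it is not the authors' implementation, and the dot-product similarity, the tensor shapes, and the function name similarity_attention are all assumptions made here for illustration.

# Illustrative sketch only (assumed shapes and similarity function),
# not the code released with the paper.
import torch
import torch.nn.functional as F

def similarity_attention(image_feats, word_embeds):
    """image_feats: (regions, d) CNN region features for one image.
    word_embeds: (words, d) embeddings of the caption words so far.
    Returns one attended visual context vector per word position."""
    # Similarity between every word and every image region (dot product).
    scores = word_embeds @ image_feats.T          # (words, regions)
    # Normalize over regions: each word attends to a distribution of regions.
    alpha = F.softmax(scores, dim=-1)             # (words, regions)
    # Weighted sum of region features gives a visual context per word.
    context = alpha @ image_feats                 # (words, d)
    return context, alpha

# Toy usage with random tensors standing in for real CNN/RNN outputs.
image_feats = torch.randn(49, 512)   # e.g., a 7x7 conv map flattened to 49 regions
word_embeds = torch.randn(12, 512)   # a 12-word partial caption
context, alpha = similarity_attention(image_feats, word_embeds)
print(context.shape, alpha.shape)    # torch.Size([12, 512]) torch.Size([12, 49])

In a captioning decoder, the context vector for the current word position would typically be fed, together with the previous hidden state, into the recurrent unit that predicts the next word.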

Bibliographic Details
Main Authors: Bin Wang, Cungang Wang, Qian Zhang, Ying Su, Yang Wang, Yanyan Xu
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Subjects: Image caption generation; attention model; deep learning
Online Access: https://ieeexplore.ieee.org/document/9107263/
ISSN: 2169-3536
Volume: 8
Pages: 104543-104554
DOI: 10.1109/ACCESS.2020.2999568

Author Affiliations:
Bin Wang (ORCID: 0000-0002-5860-3440): College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Cungang Wang (ORCID: 0000-0002-7591-788X): School of Computer Science, Liaocheng University, Liaocheng, China
Qian Zhang (ORCID: 0000-0003-0760-9241): College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Ying Su (ORCID: 0000-0001-7824-5954): College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yang Wang (ORCID: 0000-0001-8100-9194): College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yanyan Xu (ORCID: 0000-0001-5429-3177): Department of City and Regional Planning, University of California at Berkeley, Berkeley, CA, USA