Cross-Lingual Image Caption Generation Based on Visual Attention Model
Generating image captions automatically is an interesting and challenging problem that has attracted increasing attention in the natural language processing and computer vision communities. In this paper, we propose an end-to-end deep learning approach for image caption generation. At each time step, we leverage image feature information at specific locations and generate the corresponding caption description through a semantic attention model. The end-to-end framework allows us to introduce an independent recurrent structure as an attention module, derived by calculating the similarity between the image feature sequence and the semantic word sequence. Additionally, our model is designed to transfer the knowledge representation obtained from the English portion to the Chinese portion to achieve cross-lingual image captioning. We evaluate the proposed model on the most popular benchmark datasets. We report an improvement of 3.9% over existing state-of-the-art approaches for cross-lingual image captioning on the Flickr8k CN dataset on the CIDEr metric. The experimental results demonstrate the effectiveness of our attention model.
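The abstract describes an attention module derived from the similarity between the image feature sequence and the semantic word sequence. As a rough illustration only (not the authors' implementation; the function name, shapes, and toy data below are invented for the sketch), a dot-product attention over region features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_attention(image_feats, word_embs):
    """Attend over image regions for each caption word.

    image_feats: (R, D) array, one D-dim feature per image region.
    word_embs:   (T, D) array, one D-dim embedding per caption word.
    Returns (T, D) context vectors and (T, R) attention weights.
    """
    # Similarity between every word and every region (dot product).
    scores = word_embs @ image_feats.T           # (T, R)
    weights = softmax(scores, axis=-1)           # normalize over regions
    context = weights @ image_feats              # (T, D) weighted features
    return context, weights

# Toy example: 4 image regions, 3 words, 5-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5))
words = rng.normal(size=(3, 5))
ctx, w = similarity_attention(feats, words)
```

In the paper's setting this similarity would feed a recurrent attention structure inside an end-to-end captioning model; the sketch only shows the similarity-and-weighting step itself.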
Main Authors: | Bin Wang, Cungang Wang, Qian Zhang, Ying Su, Yang Wang, Yanyan Xu |
Format: | Article |
Language: | English |
Published: | IEEE, 2020-01-01 |
Series: | IEEE Access |
Subjects: | Image caption generation; attention model; deep learning |
Online Access: | https://ieeexplore.ieee.org/document/9107263/ |
id |
doaj-f8efd5d04e72431eb4a28a56b6eae432 |
---|---|
record_format |
Article |
spelling |
IEEE Access, vol. 8, pp. 104543-104554, 2020-01-01. ISSN 2169-3536. DOI 10.1109/ACCESS.2020.2999568 (IEEE document 9107263).
Bin Wang (ORCID 0000-0002-5860-3440), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Cungang Wang (ORCID 0000-0002-7591-788X), School of Computer Science, Liaocheng University, Liaocheng, China
Qian Zhang (ORCID 0000-0003-0760-9241), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Ying Su (ORCID 0000-0001-7824-5954), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yang Wang (ORCID 0000-0001-8100-9194), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yanyan Xu (ORCID 0000-0001-5429-3177), Department of City and Regional Planning, University of California at Berkeley, Berkeley, CA, USA |
collection |
DOAJ |
description |
Generating image captions automatically is an interesting and challenging problem that has attracted increasing attention in the natural language processing and computer vision communities. In this paper, we propose an end-to-end deep learning approach for image caption generation. At each time step, we leverage image feature information at specific locations and generate the corresponding caption description through a semantic attention model. The end-to-end framework allows us to introduce an independent recurrent structure as an attention module, derived by calculating the similarity between the image feature sequence and the semantic word sequence. Additionally, our model is designed to transfer the knowledge representation obtained from the English portion to the Chinese portion to achieve cross-lingual image captioning. We evaluate the proposed model on the most popular benchmark datasets. We report an improvement of 3.9% over existing state-of-the-art approaches for cross-lingual image captioning on the Flickr8k CN dataset on the CIDEr metric. The experimental results demonstrate the effectiveness of our attention model. |
topic |
Image caption generation; attention model; deep learning |
url |
https://ieeexplore.ieee.org/document/9107263/ |