Cross-Lingual Image Caption Generation Based on Visual Attention Model
Generating image captions automatically is an interesting and challenging problem that has attracted increasing attention in the natural language processing and computer vision communities. In this paper, we propose an end-to-end deep learning approach for image caption generation. At each time step, we leverage image feature information at specific locations and generate the corresponding caption description through a semantic attention model. The end-to-end framework allows us to introduce an independent recurrent structure as an attention module, derived by calculating the similarity between the image feature sequence and the semantic word sequence. Additionally, our model is designed to transfer the knowledge representation obtained from the English portion to the Chinese portion to achieve cross-lingual image captioning. We evaluate the proposed model on the most popular benchmark datasets. We report an improvement of 3.9% over existing state-of-the-art approaches for cross-lingual image captioning on the Flickr8k CN dataset on the CIDEr metric. The experimental results demonstrate the effectiveness of our attention model.
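The abstract describes an attention module derived from the similarity between the image feature sequence and the semantic word sequence. As a rough illustration only (not the authors' implementation; the function name, shapes, and toy data below are invented for the sketch), a dot-product attention over region features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_attention(image_feats, word_embs):
    """Attend over image regions for each caption word.

    image_feats: (R, D) array, one D-dim feature per image region.
    word_embs:   (T, D) array, one D-dim embedding per caption word.
    Returns (T, D) context vectors and (T, R) attention weights.
    """
    # Similarity between every word and every region (dot product).
    scores = word_embs @ image_feats.T           # (T, R)
    weights = softmax(scores, axis=-1)           # normalize over regions
    context = weights @ image_feats              # (T, D) weighted features
    return context, weights

# Toy example: 4 image regions, 3 words, 5-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5))
words = rng.normal(size=(3, 5))
ctx, w = similarity_attention(feats, words)
```

In the paper's setting this similarity would feed a recurrent attention structure inside an end-to-end captioning model; the sketch only shows the similarity-and-weighting step itself.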
Main Authors: | Bin Wang, Cungang Wang, Qian Zhang, Ying Su, Yang Wang, Yanyan Xu |
Format: | Article |
Language: | English |
Published: | IEEE, 2020-01-01 |
Series: | IEEE Access |
Subjects: | Image caption generation; attention model; deep learning |
Online Access: | https://ieeexplore.ieee.org/document/9107263/ |
id |
doaj-f8efd5d04e72431eb4a28a56b6eae432 |
---|---|
record_format |
Article |
spelling |
IEEE Access, vol. 8, pp. 104543-104554, 2020-01-01. ISSN 2169-3536. DOI 10.1109/ACCESS.2020.2999568 (IEEE document 9107263).
Bin Wang (ORCID 0000-0002-5860-3440), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Cungang Wang (ORCID 0000-0002-7591-788X), School of Computer Science, Liaocheng University, Liaocheng, China
Qian Zhang (ORCID 0000-0003-0760-9241), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Ying Su (ORCID 0000-0001-7824-5954), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yang Wang (ORCID 0000-0001-8100-9194), College of Information, Mechanical, and Electrical Engineering, Shanghai Normal University, Shanghai, China
Yanyan Xu (ORCID 0000-0001-5429-3177), Department of City and Regional Planning, University of California at Berkeley, Berkeley, CA, USA |
collection |
DOAJ |
description |
Generating image captions automatically is an interesting and challenging problem that has attracted increasing attention in the natural language processing and computer vision communities. In this paper, we propose an end-to-end deep learning approach for image caption generation. At each time step, we leverage image feature information at specific locations and generate the corresponding caption description through a semantic attention model. The end-to-end framework allows us to introduce an independent recurrent structure as an attention module, derived by calculating the similarity between the image feature sequence and the semantic word sequence. Additionally, our model is designed to transfer the knowledge representation obtained from the English portion to the Chinese portion to achieve cross-lingual image captioning. We evaluate the proposed model on the most popular benchmark datasets. We report an improvement of 3.9% over existing state-of-the-art approaches for cross-lingual image captioning on the Flickr8k CN dataset on the CIDEr metric. The experimental results demonstrate the effectiveness of our attention model. |
topic |
Image caption generation; attention model; deep learning |
url |
https://ieeexplore.ieee.org/document/9107263/ |