Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer

Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information included in a video because they lack the ability to remember information changes over time. To address this problem, we propose a novel cont...

Bibliographic Details
Main Authors:	Jonghong Kim, Inchul Choi, Minho Lee
Format:	Article
Language:	English
Published:	MDPI AG 2020-07-01
Series:	Electronics
Subjects:	deep neural network; deep learning; context understanding; recurrent neural network; action recognition; memory
Online Access:	https://www.mdpi.com/2079-9292/9/7/1162

id	doaj-4b5e3f74ddaa46ef8e027bc3b762cc56
record_format	Article
spelling	doaj-4b5e3f74ddaa46ef8e027bc3b762cc56; 2020-11-25T03:16:33Z; eng; MDPI AG; Electronics; 2079-9292; 2020-07-01; 9; 1162; 1162; 10.3390/electronics9071162; Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer; Jonghong Kim (0), Inchul Choi (1), Minho Lee (2); (0, 1, 2) School of Electronics Engineering, College of IT Engineering, Kyungpook National University, 80 Daehakro, Bukgu, Daegu 41566, Korea; https://www.mdpi.com/2079-9292/9/7/1162; deep neural network; deep learning; context understanding; recurrent neural network; action recognition; memory
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Jonghong Kim, Inchul Choi, Minho Lee
author_facet	Jonghong Kim, Inchul Choi, Minho Lee
author_sort	Jonghong Kim
title	Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer
title_sort	context aware video caption generation with consecutive differentiable neural computer
publisher	MDPI AG
series	Electronics
issn	2079-9292
publishDate	2020-07-01
description	Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information included in a video because they lack the ability to remember information changes over time. To address this problem, we propose a novel context-aware video captioning model that generates natural language descriptions based on improved video context understanding. We introduce an external memory, the differentiable neural computer (DNC), to improve video context understanding. The DNC naturally learns to use its internal memory for context understanding and also provides the contents of its memory as an output for additional connections. By sequentially connecting DNC-based caption models (DNC-augmented LSTM) through this memory information, our consecutively connected DNC architecture can understand the context in a video without explicitly searching for event-wise correlation. Our consecutive DNC is sequentially trained with its language model (LSTM) for each video clip to generate context-aware captions with superior quality. In experiments, we demonstrate that our model provides more natural and coherent captions that reflect previous contextual information. Our model also shows superior quantitative performance on video captioning in terms of BLEU (BLEU@4 4.37), METEOR (9.57), and CIDEr-D (28.08).
topic	deep neural network; deep learning; context understanding; recurrent neural network; action recognition; memory
url	https://www.mdpi.com/2079-9292/9/7/1162

Similar Items