Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications

Bibliographic Details
Main Authors: Yu-Ting Hsu, 徐宇霆
Other Authors: 羅仁權
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/ksn8rp
id ndltd-TW-107NTU05442025
record_format oai_dc
spelling ndltd-TW-107NTU054420252019-11-16T05:27:54Z http://ndltd.ncl.edu.tw/handle/ksn8rp Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications 多模態知識圖像描述系統於服務型機器人之應用 Yu-Ting Hsu 徐宇霆 碩士 國立臺灣大學 電機工程學研究所 107 羅仁權 2019 學位論文 ; thesis 80 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description 碩士 === 國立臺灣大學 === 電機工程學研究所 === 107 === Service robots are a growing trend in robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence helped automate factories; as robots gradually enter daily life, equipping them with sufficient intelligence becomes a major challenge. Deep learning is an important technique for making this possible. It has become popular in recent years, including CNNs (Convolutional Neural Networks) for image processing and RNNs (Recurrent Neural Networks) for natural language processing. Image captioning, a more intelligent function, combines CNN and RNN techniques: given an image, it generates a sentence that describes the image as a person would. Although image captioning can be used for image retrieval, image indexing, and similar tasks, it cannot be applied directly to a service robot for two main reasons. First, the image caption models proposed in recent works are trained on well-known public datasets such as MSCOCO or Flickr. These datasets gather images from a broad variety of domains, including hand-drawn pictures, natural scenes, and paintings, that are not usually seen in daily life. Therefore, a robot equipped with such a model may sometimes generate these unusual sentences even when it does not see the corresponding images. Second, a service robot usually serves in a specific environment, so it should be equipped with specific knowledge about the objects and people in that environment; unfortunately, public, general-purpose datasets do not contain that knowledge. The purpose of this work is to ground image captioning on a real service robot, with a focus on a patrol robot: the robot should generate a caption about what it sees and read that sentence to the guard in the remote control room. For this purpose, we need to know what information the guard wants. For example, if there is a person in an image, the guard may want to know the person's identity and state, where the state includes emotion and behavior. In this work, the author proposes three methodologies for combining an image caption model with specific object recognition, so that the output sentence contains knowledge about the objects. This image caption model is then combined with a face recognition model and an emotion classification model, so that the robot can also report a person's identity and emotion. Furthermore, the robot is equipped with semantic localization to give the guard more comprehensive information about the scene it observes. From the experiments, we conclude that our informative image caption system outperforms the MSCOCO-pretrained image caption model, with higher object recognition accuracy. Our model also achieves higher face recognition and emotion recognition rates than the fine-tuned model.
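
The description above refers to an encoder-decoder image captioning pipeline in which a CNN encodes the image and an RNN generates the sentence. The record does not specify the actual architecture, so the following is only a minimal sketch of that general pattern, assuming PyTorch and torchvision; the backbone choice, layer sizes, and vocabulary size are hypothetical and not taken from the thesis.

```python
# Minimal CNN-encoder / RNN-decoder captioning sketch (illustrative only).
# Assumes PyTorch and torchvision; all sizes and names are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""

    def __init__(self, embed_size: int = 256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                      # keep the CNN backbone frozen in this sketch
            feats = self.backbone(images)          # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))           # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generate a caption word by word with an LSTM conditioned on the image feature."""

    def __init__(self, embed_size: int = 256, hidden_size: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: prepend the image feature as the first "token" of the sequence.
        embeddings = self.embed(captions)                           # (B, T, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                      # (B, T+1, vocab_size)
```

At inference time the decoder would be rolled out token by token (greedy or beam search) starting from the encoded image feature.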
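
The description also states that the caption is combined with face recognition, emotion classification, and semantic localization so that the reported sentence carries identity, emotion, and location. The three combination methodologies are not detailed in this record, so the snippet below only illustrates one plausible post-processing style of fusion; the function names and data fields are hypothetical and not taken from the thesis.

```python
# Hypothetical post-processing step that enriches a generic caption with recognized
# knowledge (identity, emotion, room). Names and fields are illustrative only; the
# three fusion methodologies proposed in the thesis are not reproduced here.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonKnowledge:
    identity: str   # e.g. output of a face recognition model
    emotion: str    # e.g. output of an emotion classification model


def enrich_caption(caption: str, person: Optional[PersonKnowledge], room: Optional[str]) -> str:
    """Rewrite the generic 'a person' mention with identity and emotion, then append the room."""
    if person is not None and "a person" in caption:
        caption = caption.replace("a person", f"{person.identity}, who looks {person.emotion},", 1)
    if room is not None:
        caption = f"{caption.rstrip('.')} in the {room}."
    return caption


if __name__ == "__main__":
    generic = "a person is standing next to a desk."
    print(enrich_caption(generic, PersonKnowledge("Alice", "happy"), "laboratory"))
    # Alice, who looks happy, is standing next to a desk in the laboratory.
```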
author2 羅仁權
author_facet 羅仁權
Yu-Ting Hsu
徐宇霆
author Yu-Ting Hsu
徐宇霆
spellingShingle Yu-Ting Hsu
徐宇霆
Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
author_sort Yu-Ting Hsu
title Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_short Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_full Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_fullStr Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_full_unstemmed Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_sort multi-modal knowledge image caption system for intelligent service robotics applications
publishDate 2019
url http://ndltd.ncl.edu.tw/handle/ksn8rp
work_keys_str_mv AT yutinghsu multimodalknowledgeimagecaptionsystemforintelligentserviceroboticsapplications
AT xúyǔtíng multimodalknowledgeimagecaptionsystemforintelligentserviceroboticsapplications
AT yutinghsu duōmótàizhīshítúxiàngmiáoshùxìtǒngyúfúwùxíngjīqìrénzhīyīngyòng
AT xúyǔtíng duōmótàizhīshítúxiàngmiáoshùxìtǒngyúfúwùxíngjīqìrénzhīyīngyòng
_version_ 1719291635417743360