Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications

Bibliographic Details
Main Authors: Yu-Ting Hsu, 徐宇霆
Other Authors: 羅仁權
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/ksn8rp
id ndltd-TW-107NTU05442025
record_format oai_dc
spelling ndltd-TW-107NTU054420252019-11-16T05:27:54Z http://ndltd.ncl.edu.tw/handle/ksn8rp Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications 多模態知識圖像描述系統於服務型機器人之應用 Yu-Ting Hsu 徐宇霆 碩士 國立臺灣大學 電機工程學研究所 107 羅仁權 2019 學位論文 ; thesis 80 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description 碩士 === 國立臺灣大學 === 電機工程學研究所 === 107 === Service robots are a growing trend in robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence helped automate factories; as robots gradually enter daily life, equipping them with sufficient intelligence becomes a major challenge. Deep learning is an important technique for making this possible. It has become popular in recent years, including CNNs (Convolutional Neural Networks) for image processing and RNNs (Recurrent Neural Networks) for natural language processing. Image captioning, a more intelligent function, combines CNN and RNN techniques: given an image, it generates a sentence that describes the image as a person would. Although image captioning can be used for image retrieval, image indexing, and similar tasks, it cannot be applied directly to a service robot for two main reasons. First, the image caption models proposed in recent works are trained on well-known public datasets such as MSCOCO or Flickr. These datasets gather images from a broad variety of domains, including hand-drawn pictures, natural scenes, and paintings, that are not usually seen in daily life. Therefore, a robot equipped with such a model may sometimes generate these unusual sentences even when it does not see the corresponding images. Second, a service robot usually serves in a specific environment, so it should be equipped with specific knowledge about the objects and people in that environment; unfortunately, public, general-purpose datasets do not contain that knowledge. The purpose of this work is to ground image captioning on a real service robot, with a focus on a patrol robot: the robot should generate a caption about what it sees and read that sentence to the guard in the remote control room. For this purpose, we need to know what information the guard wants. For example, if there is a person in an image, the guard may want to know the person's identity and state, where the state includes emotion and behavior. In this work, the author proposes three methodologies for combining an image caption model with specific object recognition, so that the output sentence contains knowledge about the objects. This image caption model is then combined with a face recognition model and an emotion classification model, so that the robot can also report a person's identity and emotion. Furthermore, the robot is equipped with semantic localization to give the guard more comprehensive information about the scene it observes. From the experiments, we conclude that our informative image caption system outperforms the MSCOCO-pretrained image caption model, with higher object recognition accuracy. Our model also achieves higher face recognition and emotion recognition rates than the fine-tuned model.
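
The description above refers to an encoder-decoder image captioning pipeline in which a CNN encodes the image and an RNN generates the sentence. The record does not specify the actual architecture, so the following is only a minimal sketch of that general pattern, assuming PyTorch and torchvision; the backbone choice, layer sizes, and vocabulary size are hypothetical and not taken from the thesis.

```python
# Minimal CNN-encoder / RNN-decoder captioning sketch (illustrative only).
# Assumes PyTorch and torchvision; all sizes and names are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""

    def __init__(self, embed_size: int = 256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                      # keep the CNN backbone frozen in this sketch
            feats = self.backbone(images)          # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))           # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generate a caption word by word with an LSTM conditioned on the image feature."""

    def __init__(self, embed_size: int = 256, hidden_size: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: prepend the image feature as the first "token" of the sequence.
        embeddings = self.embed(captions)                           # (B, T, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                      # (B, T+1, vocab_size)
```

At inference time the decoder would be rolled out token by token (greedy or beam search) starting from the encoded image feature.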
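
The description also states that the caption is combined with face recognition, emotion classification, and semantic localization so that the reported sentence carries identity, emotion, and location. The three combination methodologies are not detailed in this record, so the snippet below only illustrates one plausible post-processing style of fusion; the function names and data fields are hypothetical and not taken from the thesis.

```python
# Hypothetical post-processing step that enriches a generic caption with recognized
# knowledge (identity, emotion, room). Names and fields are illustrative only; the
# three fusion methodologies proposed in the thesis are not reproduced here.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonKnowledge:
    identity: str   # e.g. output of a face recognition model
    emotion: str    # e.g. output of an emotion classification model


def enrich_caption(caption: str, person: Optional[PersonKnowledge], room: Optional[str]) -> str:
    """Rewrite the generic 'a person' mention with identity and emotion, then append the room."""
    if person is not None and "a person" in caption:
        caption = caption.replace("a person", f"{person.identity}, who looks {person.emotion},", 1)
    if room is not None:
        caption = f"{caption.rstrip('.')} in the {room}."
    return caption


if __name__ == "__main__":
    generic = "a person is standing next to a desk."
    print(enrich_caption(generic, PersonKnowledge("Alice", "happy"), "laboratory"))
    # Alice, who looks happy, is standing next to a desk in the laboratory.
```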
author2 羅仁權
author_facet 羅仁權
Yu-Ting Hsu
徐宇霆
author Yu-Ting Hsu
徐宇霆
spellingShingle Yu-Ting Hsu
徐宇霆
Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
author_sort Yu-Ting Hsu
title Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_short Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_full Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_fullStr Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_full_unstemmed Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
title_sort multi-modal knowledge image caption system for intelligent service robotics applications
publishDate 2019
url http://ndltd.ncl.edu.tw/handle/ksn8rp
work_keys_str_mv AT yutinghsu multimodalknowledgeimagecaptionsystemforintelligentserviceroboticsapplications
AT xúyǔtíng multimodalknowledgeimagecaptionsystemforintelligentserviceroboticsapplications
AT yutinghsu duōmótàizhīshítúxiàngmiáoshùxìtǒngyúfúwùxíngjīqìrénzhīyīngyòng
AT xúyǔtíng duōmótàizhīshítúxiàngmiáoshùxìtǒngyúfúwùxíngjīqìrénzhīyīngyòng
_version_ 1719291635417743360