Summary: | Master's === National Taiwan University === Graduate Institute of Electrical Engineering === 107 === Service robots are a trend in the field of robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence helped automate factories. As robots gradually enter daily life, equipping them with sufficient intelligence is a big challenge, and deep learning is an important technique for making this possible. Deep learning has become popular in recent years, including CNNs (Convolutional Neural Networks) for image processing and RNNs (Recurrent Neural Networks) for natural language processing.
Image captioning, a more intelligent function, combines CNN and RNN techniques: given an image, a captioning model generates a sentence that describes it, as a person would. Although image captioning can be used in image retrieval, image indexing, etc., it cannot be applied directly to a service robot for two main reasons.
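The CNN-encoder / RNN-decoder structure behind image captioning can be sketched as follows. The vocabulary, weight matrices, pooling encoder, and greedy decoding loop here are toy placeholders chosen for illustration, not the actual model architecture used in this thesis.

```python
# Minimal sketch of an encoder-decoder captioning pipeline: a CNN-like
# encoder maps the image to a feature vector, and an RNN-like decoder
# greedily emits one word per step. All names and weights are illustrative.
import numpy as np

VOCAB = ["<start>", "a", "person", "walks", "<end>"]

def encode_image(image):
    """Stand-in for a CNN encoder: pool the image down to a feature vector."""
    return image.mean(axis=(0, 1))  # shape (channels,)

def decode_greedy(feature, W_h, W_o, E, max_len=10):
    """Toy RNN decoder: greedy word-by-word generation from image features."""
    h = np.tanh(W_h @ feature)          # initialize hidden state from image
    words = []
    for _ in range(max_len):
        idx = int(np.argmax(W_o @ h))   # project hidden state to vocabulary
        token = VOCAB[idx]
        if token == "<end>":
            break
        if token != "<start>":
            words.append(token)
        h = np.tanh(h + E[idx])         # fold the emitted word back into the state
    return " ".join(words)
```

In a real system the encoder would be a pretrained CNN (e.g. a ResNet feature extractor) and the decoder an LSTM trained on caption data; the sketch only shows how the two parts connect.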
First, the image caption models proposed in recent works are trained on well-known public datasets such as MSCOCO or Flickr. These datasets gather images from a broad variety of fields, including hand-drawn pictures, natural scenes, and paintings that are not usually seen in daily life. Therefore, a robot equipped with such a model may sometimes generate these unusual sentences even when it does not see the related images. Second, a service robot usually serves in a specific environment, so it should be equipped with specific knowledge about the objects and people in that environment. Unfortunately, public and general datasets do not contain that knowledge.
The purpose of this work is to ground image captioning on a real service robot, focusing on a service robot for patrol. In other words, the robot should generate a caption about what it sees and read that sentence to the guard in the remote control room. For this purpose, we need to know what information the guard wants. For example, if there is a person in an image, the guard may want to know the person's identity and state, where the state includes emotion and behavior. In this work, the author proposes three methodologies to combine an image caption model with specific object recognition, so that the output sentence contains knowledge about the objects. This image caption model is then combined with a face recognition model and an emotion classification model, so that the robot can also report a person's identity and emotion. Furthermore, the robot is equipped with semantic localization to give the guard more comprehensive information about the scene the robot sees.
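One simple way to picture how recognizer outputs can enrich a generic caption is a post-processing substitution like the sketch below. The thesis proposes three specific methodologies for combining the caption model with object, face, and emotion recognition; this function is only a simplified illustration of the general idea, not the author's actual method, and the names and phrasing are hypothetical.

```python
# Illustrative post-processing: splice recognized identity and emotion
# from specific recognizers into a generic generated caption.
# This is a sketch of the idea, not the methodology used in the thesis.

def enrich_caption(caption, identity=None, emotion=None):
    """Replace the first generic 'a person' with specific knowledge."""
    if identity and emotion:
        specific = f"{identity}, who looks {emotion},"
    elif identity:
        specific = identity
    else:
        return caption  # nothing recognized; keep the generic caption
    return caption.replace("a person", specific, 1)

print(enrich_caption("a person walks down the hallway",
                     identity="Alice", emotion="happy"))
# → Alice, who looks happy, walks down the hallway
```

A robot reporting to a guard would then read out the enriched sentence instead of the generic one, conveying identity and emotional state in a single utterance.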
The experiments show that our informative image caption system outperforms the MSCOCO-pretrained image caption model, with higher object recognition accuracy. Our model also achieves higher face recognition and emotion recognition rates than the fine-tuned model.