Summary: | Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 103 === Studies of image captioning have grown explosively over the past two years. Although many elegant approaches have been proposed for general-purpose image captioning, incorporating domain knowledge or a specific description structure for a targeted domain remains unexplored.
In this thesis, we concentrate on food image captioning, where a food image is better described not only by what food it is but also by how it was cooked. We propose neural networks that jointly consider multiple factors, namely food recognition, ingredient recognition, and cooking method recognition, and verify that recognition performance improves when these factors are taken into account together. With these three factors, food image captions composed of verb-noun pairs (usually a cooking method followed by ingredients) can be generated. We demonstrate the effectiveness of the proposed methods from various viewpoints, and believe this is a better way to describe food images than general-purpose image captioning.
|
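The verb-noun caption structure described in the summary can be illustrated with a minimal sketch. It assumes the recognizers have already produced per-label confidence scores; the function names, the 0.5 threshold, and the example labels are hypothetical and are not taken from the thesis, which does not specify how the recognition outputs are combined into a caption.

```python
def pick_label(scores):
    """Return the label with the highest score (hypothetical helper)."""
    return max(scores, key=scores.get)


def pick_ingredients(scores, threshold=0.5):
    """Return ingredient labels scoring at least `threshold`, highest first.

    The threshold value is an assumption made for this illustration.
    """
    kept = [(score, name) for name, score in scores.items() if score >= threshold]
    return [name for score, name in sorted(kept, key=lambda pair: -pair[0])]


def compose_caption(method, ingredients):
    """Build a verb-noun caption: cooking method followed by ingredients."""
    if not ingredients:
        return method
    if len(ingredients) == 1:
        noun = ingredients[0]
    else:
        noun = ", ".join(ingredients[:-1]) + " and " + ingredients[-1]
    return f"{method} {noun}"


# Made-up scores standing in for the outputs of the recognition networks.
method_scores = {"fried": 0.7, "steamed": 0.2, "grilled": 0.1}
ingredient_scores = {"chicken": 0.9, "garlic": 0.6, "rice": 0.1}

caption = compose_caption(
    pick_label(method_scores),
    pick_ingredients(ingredient_scores),
)
print(caption)  # fried chicken and garlic
```

Keeping the composition step separate from recognition means the caption template can change without touching the recognizers.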