Learning to recognise visual content from textual annotation

Bibliographic Details
Main Author: Marter, Matthew John
Other Authors: Bowden, Richard; Hadfield, Simon
Published: University of Surrey, 2019
Subjects: 621.3
DOI: 10.15126/thesis.00850052
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.767000 ; http://epubs.surrey.ac.uk/850052/
Full description

This thesis explores how machine learning can be applied to the task of learning to recognise visual content from different forms of textual annotation, bringing together computer vision and natural language processing. The data used in the thesis is taken from real-world sources, including broadcast television and photographs harvested from the internet. This places very few constraints on the data, so there can be large variations in lighting, facial expression, visual properties of objects and camera angles. These sources provide the volume of data required by modern machine learning approaches; however, annotation and ground truth are either unavailable or expensive to obtain. This work therefore employs weak textual annotation in the form of subtitles, scripts, captions and tags. The use of weak textual annotation also means that different techniques are required to handle the natural language used to describe the visual content.

Character identification requires a different approach because of the similarities shared between all faces. As with location recognition, the script is aligned with the video using the subtitles. Faces are detected and facial landmarks are regressed; these landmarks are used to create a descriptor for each face. Multiple techniques are then used to assign identities from the script to the detected faces. In the first, facial descriptors are clustered into one cluster per character, and the cluster sizes are matched to each character's screen time. In the second, a random forest is trained to differentiate between faces, and its splitting criteria are used to reduce the dimensionality of the facial features; the reduced dimensionality allows a distribution of facial features to be built per scene, and rules are then created to separate scenes and identify distributions for individual characters. In addition, data harvested from the internet is used to learn the appearance of the actors in the video, which is then matched to the characters. These techniques give a character-labelling accuracy of up to 82.75% with a SIFT-based descriptor and up to 96.82% with a state-of-the-art descriptor.
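As a concrete illustration of the first identification technique above, the following is a minimal, hypothetical Python sketch (not code from the thesis): it assumes facial descriptors have already been extracted and that per-character screen-time fractions have been derived from the aligned script, then clusters the descriptors into one cluster per character and matches cluster sizes to screen time.

    # Illustrative sketch only: cluster facial descriptors into one cluster per
    # character and match cluster sizes to screen-time fractions from the script.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def assign_identities(descriptors, screen_time):
        """descriptors: (n_faces, d) array of facial descriptors (assumed given).
        screen_time: dict mapping character name -> fraction of screen time."""
        names = list(screen_time)
        k = len(names)

        # One cluster per character in the script.
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(descriptors)

        # Fraction of detected faces that fall into each cluster.
        cluster_frac = np.bincount(labels, minlength=k) / len(labels)
        char_frac = np.array([screen_time[n] for n in names])

        # Match clusters to characters so that cluster size agrees with screen time.
        cost = np.abs(cluster_frac[:, None] - char_frac[None, :])
        rows, cols = linear_sum_assignment(cost)
        cluster_to_name = {r: names[c] for r, c in zip(rows, cols)}
        return [cluster_to_name[label] for label in labels]

For example, with descriptors of shape (n_faces, 128) and screen_time = {"Alice": 0.5, "Bob": 0.3, "Carol": 0.2}, the function returns one predicted character name per detected face; the thesis additionally exploits internet imagery of the actors and a random-forest variant, which this sketch does not cover.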
Automatic caption generation for images is a relatively new and complex topic, as it requires both an understanding of the visual content of the image and the formation of natural language. Deep learning is powerful for object recognition and provides excellent performance on image-recognition data sets. Pretrained convolutional neural networks (CNNs) are fine-tuned using the parts of speech (POS) extracted from the natural-language captions. A probabilistic language model can be created from the captions in the training data and used to generate new sentences for unseen images. To better model more complex language rules, a recurrent neural network (RNN) is used to generate sentences directly from features extracted from a CNN, and an RNN that uses attention to look at different parts of an image can also utilise the final layers of a CNN to provide context for the whole image. Location recognition, character identification and RNNs are combined to automatically generate descriptions for broadcast television using character and location names, creating a full pipeline for automatically labelling an unseen episode of a television series.

Compared with ground-truth input of locations and characters, only a small drop in performance occurs when labels predicted by the computer vision and machine learning techniques are used: a CIDEr score of 1.585 is achieved with ground truth, compared with 1.343 for the fully predicted system. Data providing emotional context for words and images allows the RNN to be used to manipulate the emotional context of image captions. Subjective testing shows that the output captions are more emotive than captions generated without emotional context 74.85% of the time, and are rated almost equal to human-written captions. Adjusting the emotional context is shown to generate captions whose content reflects the chosen emotion. The fusion of computer vision and natural language processing through machine learning represents an important step for both fields.
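To make the emotion-conditioned captioning idea concrete, here is a minimal, hypothetical PyTorch sketch; the class name, dimensions and emotion vocabulary are assumptions rather than the architecture used in the thesis. It shows one common way to condition an RNN caption decoder on both a CNN image feature and an emotion label, so that changing the emotion input changes the generated caption.

    # Illustrative sketch only: an RNN caption decoder whose initial hidden state
    # combines a CNN image feature with an emotion embedding.
    import torch
    import torch.nn as nn

    class EmotiveCaptioner(nn.Module):
        def __init__(self, vocab_size, feat_dim=2048, n_emotions=6,
                     embed_dim=256, hidden_dim=512):
            super().__init__()
            self.word_embed = nn.Embedding(vocab_size, embed_dim)
            self.emotion_embed = nn.Embedding(n_emotions, hidden_dim)
            self.init_h = nn.Linear(feat_dim, hidden_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_feats, emotion_ids, captions):
            # Condition the decoder on both the image and the desired emotion.
            h0 = torch.tanh(self.init_h(image_feats) + self.emotion_embed(emotion_ids))
            # captions: (batch, seq_len) word indices, used with teacher forcing.
            outputs, _ = self.rnn(self.word_embed(captions), h0.unsqueeze(0))
            return self.out(outputs)  # (batch, seq_len, vocab_size) word logits

Feeding the same image feature with different emotion_ids yields different word distributions at decoding time, which illustrates how the emotional context of a generated caption can be adjusted.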