Multimodal interactive structured prediction
Main Author: Alabau Gonzalvo, Vicente
Other Authors: Casacuberta Nolla, Francisco; Sanchis Navarro, José Alberto
Format: Doctoral Thesis
Language: English
Published: Universitat Politècnica de València, 2014
Subjects: Structured prediction; Handwritten text transcription; Automatic speech recognition; Machine translation; Human interaction; Interactive pattern recognition; Multimodal interaction; Interactive machine translation; Computer assisted transcription; Active interaction; Passive interaction; LENGUAJES Y SISTEMAS INFORMATICOS
Online Access: http://hdl.handle.net/10251/35135
Description:
This thesis presents scientific contributions to the field of multimodal interactive structured prediction (MISP). The aim of MISP is to reduce, in an efficient and ergonomic way, the human effort required to supervise an automatic output. Hence, this thesis focuses on the two aspects of MISP systems. The first aspect, the interactive part of MISP, is the study of strategies for efficient human-computer collaboration to produce error-free outputs. The second aspect, multimodality, deals with modalities of communication with the computer that are more ergonomic than the keyboard and mouse.
To begin with, in sequential interaction the user is assumed to supervise the output from left to right, so that errors are corrected in sequential order. We study the problem under the decision-theory framework and define an optimum decoding algorithm, which is compared to the usually applied, standard approach. Experimental results on several tasks suggest that the optimum algorithm is slightly better than the standard algorithm.
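As an illustration of this protocol, the following is a minimal sketch of a left-to-right interactive correction loop. The prefix-constrained decoder `decode_with_prefix` and the use of a reference output to simulate the user are assumptions for illustration; this is not the thesis's exact algorithm.

```python
def sequential_interaction(source, decode_with_prefix, reference):
    """Left-to-right interactive correction loop.

    `decode_with_prefix(source, prefix)` is a hypothetical decoder that
    returns the most probable output consistent with the user-validated
    prefix; `reference` (a list of words) simulates the user's intended,
    error-free output. Returns the final output and the correction count.
    """
    prefix, corrections = [], 0
    while prefix != reference:
        hyp = decode_with_prefix(source, prefix)
        if hyp == reference:
            return hyp, corrections  # error-free output reached
        # The user reads left to right and finds the first wrong word...
        pos = next((i for i, (h, r) in enumerate(zip(hyp, reference)) if h != r),
                   min(len(hyp), len(reference)))
        # ...corrects it, and the validated prefix grows by one word.
        prefix = reference[:pos + 1]
        corrections += 1
    return prefix, corrections
```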
In contrast to sequential interaction, in active interaction it is the system that decides what should be given to the user for supervision. On the one hand, user supervision can be reduced if the user is required to supervise only the outputs that the system expects to be erroneous. In this respect, we define a strategy that retrieves the outputs with the highest expected error first. Moreover, we prove that this strategy is optimum under certain conditions, which is validated by experimental results. On the other hand, if the goal is to reduce the number of corrections, active interaction works by selecting elements, one by one, e.g., words of a given output, to be supervised by the user. For this case, several strategies are compared. Unlike the previous case, the strategy that performs best is to choose the element with the highest confidence, which coincides with the findings of the optimum algorithm for sequential interaction. However, this also suggests that minimizing effort and minimizing supervision are contradictory goals.
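The two active-interaction strategies can be summarized in the sketch below. The scoring functions are hypothetical placeholders (e.g., posterior-probability estimates), not the thesis's exact models.

```python
def supervision_order(outputs, expected_error):
    """Output-level active interaction: present outputs to the user in
    decreasing order of expected error, so a limited supervision budget
    is spent where the system expects the most mistakes.
    `expected_error(out)` could be, e.g., 1 - posterior(out)."""
    return sorted(outputs, key=expected_error, reverse=True)

def next_element_to_supervise(words, confidence):
    """Element-level active interaction: to minimize the number of
    corrections, pick the word the model is *most* confident about,
    the strategy the experiments found to perform best."""
    return max(range(len(words)), key=lambda i: confidence(words[i]))
```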
With respect to the multimodality aspect, this thesis delves into techniques to make multimodal systems more robust. To achieve that, multimodal systems are improved by providing contextual information from the application at hand. First, we study how to integrate e-pen interaction into a machine translation task. We contribute to the state of the art by leveraging the information from the source sentence. Several strategies are compared, grouped into two approaches: one inspired by word-based translation models, and another based on n-grams generated from a phrase-based system. The experiments show that the former outperforms the latter for this task. Furthermore, the results present remarkable improvements over not using contextual information.
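As a rough illustration of how source-sentence context might be injected, the sketch below rescores the e-pen recognizer's candidate words with a context model. The log-linear interpolation, the weight, and both scoring functions are assumptions for illustration, not the thesis's exact formulation.

```python
import math

def rescore_with_source_context(candidates, epen_prob, context_prob, alpha=0.5):
    """Pick the handwriting candidate that best balances the e-pen
    recognizer's probability with a context score derived from the
    source sentence (e.g., a word-based translation model's
    p(word | source)). Both probability functions are hypothetical
    and assumed to return nonzero values."""
    def score(word):
        return (alpha * math.log(epen_prob(word))
                + (1.0 - alpha) * math.log(context_prob(word)))
    return max(candidates, key=score)
```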
Second, similar experiments are conducted on a speech-enabled interface for interactive machine translation. The improvements over the baseline are also noticeable. However, in this case, phrase-based models perform much better than word-based models. We attribute this to the fact that acoustic models are poorer estimations than morphologic models and, thus, benefit more from the language model. Finally, similar techniques are proposed for the dictation of handwritten documents. The results show that speech and handwriting recognition can be combined in an effective way.
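A minimal sketch of such a modality combination is given below, assuming each recognizer exposes a log-probability over a shared candidate list; the linear weighting scheme is an illustrative assumption.

```python
def fuse_speech_and_handwriting(candidates, speech_logprob, hand_logprob, w=0.5):
    """Score each candidate transcription with both the speech and the
    handwriting recognizer and pick the best weighted combination.
    `w` trades off the two modalities; 0.5 weighs them equally."""
    return max(candidates,
               key=lambda t: w * speech_logprob(t) + (1.0 - w) * hand_logprob(t))
```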
Finally, an evaluation with real users is carried out to compare an interactive machine translation prototype with a post-editing prototype. The results of the study reveal that users are very sensitive to the usability aspects of the user interface. Usability is therefore a crucial aspect to consider in a human evaluation, since usability problems can obscure the real benefits of the technology being evaluated. Fortunately, once the usability problems are fixed, the evaluation indicates that users prefer working with the interactive machine translation system over the post-editing system.

Citation: Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135