Multimodal interactive structured prediction


Bibliographic Details
Main Author: Alabau Gonzalvo, Vicente
Other Authors: Casacuberta Nolla, Francisco
Format: Doctoral Thesis
Language: English
Published: Universitat Politècnica de València 2014
Subjects:
Online Access: http://hdl.handle.net/10251/35135
id ndltd-upv.es-oai-riunet.upv.es-10251-35135
record_format oai_dc
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Structured prediction
Handwritten text transcription
Automatic speech recognition
Machine translation
Human interaction
Interactive pattern recognition
Multimodal interaction
Interactive machine translation
Computer assisted transcription
Active interaction
Passive interaction
LENGUAJES Y SISTEMAS INFORMATICOS
spellingShingle Structured prediction
Handwritten text transcription
Automatic speech recognition
Machine translation
Human interaction
Interactive pattern recognition
Multimodal interaction
Interactive machine translation
Computer assisted transcription
Active interaction
Passive interaction
LENGUAJES Y SISTEMAS INFORMATICOS
Alabau Gonzalvo, Vicente
Multimodal interactive structured prediction
description This thesis presents scientific contributions to the field of multimodal interactive structured prediction (MISP). The aim of MISP is to reduce the human effort required to supervise an automatic output in an efficient and ergonomic way. Hence, this thesis focuses on the two aspects of MISP systems. The first aspect, which refers to the interactive part of MISP, is the study of strategies for efficient human-computer collaboration to produce error-free outputs. Multimodality, the second aspect, deals with modalities of communication with the computer that are more ergonomic than keyboard and mouse. To begin with, in sequential interaction the user is assumed to supervise the output from left to right, so that errors are corrected in sequential order. We study the problem under the decision-theory framework and define an optimum decoding algorithm. The optimum algorithm is compared to the usually applied, standard approach. Experimental results on several tasks suggest that the optimum algorithm is slightly better than the standard algorithm. In contrast to sequential interaction, in active interaction it is the system that decides what should be given to the user for supervision. On the one hand, user supervision can be reduced if the user is required to supervise only the outputs that the system expects to be erroneous. In this respect, we define a strategy that retrieves the outputs with the highest expected error first. Moreover, we prove that this strategy is optimum under certain conditions, which is validated by experimental results. On the other hand, if the goal is to reduce the number of corrections, active interaction works by selecting elements one by one, e.g., words of a given output, to be supervised by the user. For this case, several strategies are compared.
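The highest-expected-error-first strategy described above can be illustrated with a minimal sketch. The function name and the confidence scores below are hypothetical, not taken from the thesis; the sketch only assumes that each system output carries a confidence score, so that supervising the lowest-confidence (highest expected error) outputs first reduces wasted supervision.

```python
def rank_for_supervision(outputs, confidences):
    """Order system outputs so that the ones most likely to contain
    errors (lowest confidence) are given to the user first."""
    ranked = sorted(zip(outputs, confidences), key=lambda pair: pair[1])
    return [output for output, _ in ranked]

# Toy example: three hypotheses with illustrative confidence scores.
hyps = ["hypothesis A", "hypothesis B", "hypothesis C"]
conf = [0.9, 0.4, 0.7]
print(rank_for_supervision(hyps, conf))  # "hypothesis B" comes first
```

Under a fixed supervision budget, the user would then review only a prefix of this ranked list.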
Unlike the previous case, the strategy that performs best is to choose the element with the highest confidence, which coincides with the findings of the optimum algorithm for sequential interaction. However, this also suggests that minimizing effort and minimizing supervision are contradictory goals. With respect to the multimodality aspect, this thesis delves into techniques to make multimodal systems more robust. To achieve that, multimodal systems are improved by providing contextual information from the application at hand. First, we study how to integrate e-pen interaction into a machine translation task. We contribute to the state of the art by leveraging the information from the source sentence. Several strategies are compared, grouped into two approaches: one inspired by word-based translation models, and one based on n-grams generated from a phrase-based system. The experiments show that the former outperforms the latter for this task. Furthermore, the results present remarkable improvements over not using contextual information. Second, similar experiments are conducted on a speech-enabled interface for interactive machine translation. The improvements over the baseline are also noticeable. However, in this case, phrase-based models perform much better than word-based models. We attribute this to the fact that acoustic models are poorer estimations than morphologic models and, thus, benefit more from the language model. Finally, similar techniques are proposed for the dictation of handwritten documents. The results show that speech and handwriting recognition can be combined in an effective way. Finally, an evaluation with real users is carried out to compare an interactive machine translation prototype with a post-editing prototype. The results of the study reveal that users are very sensitive to the usability aspects of the user interface.
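The idea of making a recognizer more robust with source-sentence context can be sketched as a log-linear rescoring of an n-best list. Everything below is illustrative: the function names, the toy context model, and the probabilities are assumptions for the sketch, not the thesis's actual models; the sketch only shows the general mechanism of mixing a recognition score with a context score in log space.

```python
import math

def rescore(nbest, context_score, weight=0.5):
    """Pick the best hypothesis after mixing each recognizer log-probability
    with a score from a source-context model (log-linear combination)."""
    rescored = []
    for hyp, rec_logprob in nbest:
        total = (1 - weight) * rec_logprob + weight * context_score(hyp)
        rescored.append((hyp, total))
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy context model: favours hypotheses containing a word predicted
# from a hypothetical source sentence.
def toy_context(hyp):
    return 0.0 if "house" in hyp else math.log(0.1)

nbest = [("the horse", math.log(0.5)), ("the house", math.log(0.4))]
print(rescore(nbest, toy_context))  # context flips the decision to "the house"
```

With `weight=0`, the combination degenerates to the recognizer alone; tuning the weight trades off the two knowledge sources.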
Therefore, usability is a crucial aspect to consider in a human evaluation, since it can hinder the real benefits of the technology being evaluated. Once the usability problems are fixed, however, the evaluation indicates that users are more favorable to working with the interactive machine translation system than with the post-editing system. === Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135 === TESIS === Premiado
author2 Casacuberta Nolla, Francisco
author_facet Casacuberta Nolla, Francisco
Alabau Gonzalvo, Vicente
author Alabau Gonzalvo, Vicente
author_sort Alabau Gonzalvo, Vicente
title Multimodal interactive structured prediction
title_short Multimodal interactive structured prediction
title_full Multimodal interactive structured prediction
title_fullStr Multimodal interactive structured prediction
title_full_unstemmed Multimodal interactive structured prediction
title_sort multimodal interactive structured prediction
publisher Universitat Politècnica de València
publishDate 2014
url http://hdl.handle.net/10251/35135
work_keys_str_mv AT alabaugonzalvovicente multimodalinteractivestructuredprediction
_version_ 1719367304460894208
spelling ndltd-upv.es-oai-riunet.upv.es-10251-351352020-12-02T20:21:49Z Multimodal interactive structured prediction Alabau Gonzalvo, Vicente Casacuberta Nolla, Francisco Sanchis Navarro, José Alberto Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació Structured prediction Handwritten text transcription Automatic speech recognition Machine translation Human interaction Interactive pattern recognition Multimodal interaction Interactive machine translation Computer assisted transcription Active interaction Passive interaction LENGUAJES Y SISTEMAS INFORMATICOS [Abstract as in the description field above.] Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135 TESIS Premiado 2014-01-27 info:eu-repo/semantics/doctoralThesis info:eu-repo/semantics/acceptedVersion http://hdl.handle.net/10251/35135 10.4995/Thesis/10251/35135 eng http://rightsstatements.org/vocab/InC/1.0/ info:eu-repo/semantics/openAccess Universitat Politècnica de València Riunet