Multimodal interactive structured prediction


Bibliographic Details
Main Author: Alabau Gonzalvo, Vicente
Other Authors: Casacuberta Nolla, Francisco
Format: Doctoral Thesis
Language: English
Published: Universitat Politècnica de València 2014
Subjects:
Online Access: http://hdl.handle.net/10251/35135
id ndltd-upv.es-oai-riunet.upv.es-10251-35135
record_format oai_dc
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Structured prediction
Handwritten text transcription
Automatic speech recognition
Machine translation
Human interaction
Interactive pattern recognition
Multimodal interaction
Interactive machine translation
Computer assisted transcription
Active interaction
Passive interaction
LENGUAJES Y SISTEMAS INFORMATICOS
spellingShingle Structured prediction
Handwritten text transcription
Automatic speech recognition
Machine translation
Human interaction
Interactive pattern recognition
Multimodal interaction
Interactive machine translation
Computer assisted transcription
Active interaction
Passive interaction
LENGUAJES Y SISTEMAS INFORMATICOS
Alabau Gonzalvo, Vicente
Multimodal interactive structured prediction
description This thesis presents scientific contributions to the field of multimodal interactive structured prediction (MISP). The aim of MISP is to reduce the human effort required to supervise an automatic output in an efficient and ergonomic way. Hence, this thesis focuses on the two aspects of MISP systems. The first aspect, which refers to the interactive part of MISP, is the study of strategies for efficient human-computer collaboration to produce error-free outputs. Multimodality, the second aspect, deals with modalities of communication with the computer that are more ergonomic than keyboard and mouse. To begin with, in sequential interaction the user is assumed to supervise the output from left to right, so that errors are corrected in sequential order. We study the problem under the decision-theory framework and define an optimum decoding algorithm. The optimum algorithm is compared to the usually applied, standard approach. Experimental results on several tasks suggest that the optimum algorithm is slightly better than the standard algorithm. In contrast to sequential interaction, in active interaction it is the system that decides what should be given to the user for supervision. On the one hand, user supervision can be reduced if the user is required to supervise only the outputs that the system expects to be erroneous. In this respect, we define a strategy that retrieves the outputs with the highest expected error first. Moreover, we prove that this strategy is optimum under certain conditions, which is validated by experimental results. On the other hand, if the goal is to reduce the number of corrections, active interaction works by selecting elements one by one, e.g., words of a given output, to be supervised by the user. For this case, several strategies are compared.
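The highest-expected-error-first strategy described above can be illustrated with a minimal sketch. The function name and the confidence scores below are hypothetical, not taken from the thesis; the sketch only assumes that each system output carries a confidence score, so that supervising the lowest-confidence (highest expected error) outputs first reduces wasted supervision.

```python
def rank_for_supervision(outputs, confidences):
    """Order system outputs so that the ones most likely to contain
    errors (lowest confidence) are given to the user first."""
    ranked = sorted(zip(outputs, confidences), key=lambda pair: pair[1])
    return [output for output, _ in ranked]

# Toy example: three hypotheses with illustrative confidence scores.
hyps = ["hypothesis A", "hypothesis B", "hypothesis C"]
conf = [0.9, 0.4, 0.7]
print(rank_for_supervision(hyps, conf))  # "hypothesis B" comes first
```

Under a fixed supervision budget, the user would then review only a prefix of this ranked list.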
Unlike the previous case, the strategy that performs best is to choose the element with the highest confidence, which coincides with the findings of the optimum algorithm for sequential interaction. However, this also suggests that minimizing effort and minimizing supervision are contradictory goals. With respect to the multimodality aspect, this thesis delves into techniques to make multimodal systems more robust. To achieve that, multimodal systems are improved by providing contextual information from the application at hand. First, we study how to integrate e-pen interaction into a machine translation task. We contribute to the state of the art by leveraging the information from the source sentence. Several strategies are compared, grouped into two approaches: one inspired by word-based translation models, and one based on n-grams generated from a phrase-based system. The experiments show that the former outperforms the latter for this task. Furthermore, the results present remarkable improvements over not using contextual information. Second, similar experiments are conducted on a speech-enabled interface for interactive machine translation. The improvements over the baseline are also noticeable. However, in this case, phrase-based models perform much better than word-based models. We attribute this to the fact that acoustic models are poorer estimations than morphologic models and, thus, benefit more from the language model. Finally, similar techniques are proposed for the dictation of handwritten documents. The results show that speech and handwriting recognition can be combined in an effective way. Finally, an evaluation with real users is carried out to compare an interactive machine translation prototype with a post-editing prototype. The results of the study reveal that users are very sensitive to the usability aspects of the user interface.
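The idea of making a recognizer more robust with source-sentence context can be sketched as a log-linear rescoring of an n-best list. Everything below is illustrative: the function names, the toy context model, and the probabilities are assumptions for the sketch, not the thesis's actual models; the sketch only shows the general mechanism of mixing a recognition score with a context score in log space.

```python
import math

def rescore(nbest, context_score, weight=0.5):
    """Pick the best hypothesis after mixing each recognizer log-probability
    with a score from a source-context model (log-linear combination)."""
    rescored = []
    for hyp, rec_logprob in nbest:
        total = (1 - weight) * rec_logprob + weight * context_score(hyp)
        rescored.append((hyp, total))
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy context model: favours hypotheses containing a word predicted
# from a hypothetical source sentence.
def toy_context(hyp):
    return 0.0 if "house" in hyp else math.log(0.1)

nbest = [("the horse", math.log(0.5)), ("the house", math.log(0.4))]
print(rescore(nbest, toy_context))  # context flips the decision to "the house"
```

With `weight=0`, the combination degenerates to the recognizer alone; tuning the weight trades off the two knowledge sources.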
Therefore, usability is a crucial aspect to consider in a human evaluation, since it can hinder the real benefits of the technology being evaluated. Once the usability problems are fixed, however, the evaluation indicates that users are more favorable to working with the interactive machine translation system than with the post-editing system. === Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135 === TESIS === Premiado
author2 Casacuberta Nolla, Francisco
author_facet Casacuberta Nolla, Francisco
Alabau Gonzalvo, Vicente
author Alabau Gonzalvo, Vicente
author_sort Alabau Gonzalvo, Vicente
title Multimodal interactive structured prediction
title_short Multimodal interactive structured prediction
title_full Multimodal interactive structured prediction
title_fullStr Multimodal interactive structured prediction
title_full_unstemmed Multimodal interactive structured prediction
title_sort multimodal interactive structured prediction
publisher Universitat Politècnica de València
publishDate 2014
url http://hdl.handle.net/10251/35135
work_keys_str_mv AT alabaugonzalvovicente multimodalinteractivestructuredprediction
_version_ 1719367304460894208
spelling ndltd-upv.es-oai-riunet.upv.es-10251-351352020-12-02T20:21:49Z Multimodal interactive structured prediction Alabau Gonzalvo, Vicente Casacuberta Nolla, Francisco Sanchis Navarro, José Alberto Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació Structured prediction Handwritten text transcription Automatic speech recognition Machine translation Human interaction Interactive pattern recognition Multimodal interaction Interactive machine translation Computer assisted transcription Active interaction Passive interaction LENGUAJES Y SISTEMAS INFORMATICOS [Abstract as in the description field above.] Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135 TESIS Premiado 2014-01-27 info:eu-repo/semantics/doctoralThesis info:eu-repo/semantics/acceptedVersion http://hdl.handle.net/10251/35135 10.4995/Thesis/10251/35135 eng http://rightsstatements.org/vocab/InC/1.0/ info:eu-repo/semantics/openAccess Universitat Politècnica de València Riunet