Refining Chinese Sentences by Removing Words and Choosing Concise Terms


Bibliographic Details
Main Authors: Sven Riemenschneider, 斯文
Other Authors: HSIN-HSI CHEN
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/3kgc6v
id ndltd-TW-106NTU05392042
record_format oai_dc
spelling ndltd-TW-106NTU05392042 2019-07-25T04:46:48Z http://ndltd.ncl.edu.tw/handle/3kgc6v Refining Chinese Sentences by Removing Words and Choosing Concise Terms 詞彙刪簡模型用於中文句子精練 Sven Riemenschneider 斯文 Master 國立臺灣大學 資訊工程學研究所 106 Writing in a professional or formal context requires conciseness. Starting from a colloquial draft, the text is gradually refined and wordiness is removed, resulting in a more formal style. In newspaper editing this is among the most frequent operations, yet it is still carried out manually. We have obtained a year of editing records and provide some insight into this phenomenon. In spoken Chinese, many words are composed of two or more characters; in writing, the same meaning can often be conveyed by a subsequence of those characters. This gives rise to subword deletion. We show this to be an open-class problem, with thousands of different word reduction pairs. Often several reduction or deletion possibilities exist for the same word, which contributes to the difficulty of achieving consistency across a variety of human annotators, given only a single reference and without explicitly formulated rules. We show that a model based on neural machine translation can usually judge with very high precision whether to delete a word, but it suffers from low recall, especially at the subword level. We combine sequence labeling at the word and character levels and attain the best performance for full-word and subword deletion in a single model. Considering the ambiguity inherent in the problem and given only a single reference, our model attains reasonable consistency, especially on grammatical function words with hundreds or even thousands of instances available for training. Open word classes are more difficult to handle, as in many cases only a few instances per word are available. We show that syntactic features are particularly helpful for these cases. HSIN-HSI CHEN 陳信希 2018 學位論文 ; thesis 85 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master === 國立臺灣大學 === 資訊工程學研究所 === 106 === Writing in a professional or formal context requires conciseness. Starting from a colloquial draft, the text is gradually refined and wordiness is removed, resulting in a more formal style. In newspaper editing this is among the most frequent operations, yet it is still carried out manually. We have obtained a year of editing records and provide some insight into this phenomenon. In spoken Chinese, many words are composed of two or more characters; in writing, the same meaning can often be conveyed by a subsequence of those characters. This gives rise to subword deletion. We show this to be an open-class problem, with thousands of different word reduction pairs. Often several reduction or deletion possibilities exist for the same word, which contributes to the difficulty of achieving consistency across a variety of human annotators, given only a single reference and without explicitly formulated rules. We show that a model based on neural machine translation can usually judge with very high precision whether to delete a word, but it suffers from low recall, especially at the subword level. We combine sequence labeling at the word and character levels and attain the best performance for full-word and subword deletion in a single model. Considering the ambiguity inherent in the problem and given only a single reference, our model attains reasonable consistency, especially on grammatical function words with hundreds or even thousands of instances available for training. Open word classes are more difficult to handle, as in many cases only a few instances per word are available. We show that syntactic features are particularly helpful for these cases.
author2 HSIN-HSI CHEN
author_facet HSIN-HSI CHEN
Sven Riemenschneider
斯文
author Sven Riemenschneider
斯文
spellingShingle Sven Riemenschneider
斯文
Refining Chinese Sentences by Removing Words and Choosing Concise Terms
author_sort Sven Riemenschneider
title Refining Chinese Sentences by Removing Words and Choosing Concise Terms
title_short Refining Chinese Sentences by Removing Words and Choosing Concise Terms
title_full Refining Chinese Sentences by Removing Words and Choosing Concise Terms
title_fullStr Refining Chinese Sentences by Removing Words and Choosing Concise Terms
title_full_unstemmed Refining Chinese Sentences by Removing Words and Choosing Concise Terms
title_sort refining chinese sentences by removing words and choosing concise terms
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/3kgc6v
work_keys_str_mv AT svenriemenschneider refiningchinesesentencesbyremovingwordsandchoosingconciseterms
AT sīwén refiningchinesesentencesbyremovingwordsandchoosingconciseterms
AT svenriemenschneider cíhuìshānjiǎnmóxíngyòngyúzhōngwénjùzijīngliàn
AT sīwén cíhuìshānjiǎnmóxíngyòngyúzhōngwénjùzijīngliàn
_version_ 1719229968301424640
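
The abstract above describes combining keep/delete sequence labeling at the word level (full-word deletion) with labeling at the character level (subword deletion). As a minimal, self-contained sketch of how such a pair of label sequences could be applied to a segmented sentence, the Python snippet below rebuilds a refined sentence from hypothetical tags; the function name, example sentence, segmentation, and labels are invented for illustration and are not taken from the thesis.

# Hypothetical sketch: apply word-level keep/delete tags (full-word deletion)
# and per-character keep flags (subword deletion) to a segmented sentence.
def apply_deletions(words, word_tags, char_tags):
    """words: segmented Chinese words; word_tags: 'keep'/'delete' per word;
    char_tags: per word, a list of 0/1 flags where 1 keeps that character."""
    refined = []
    for word, w_tag, c_tags in zip(words, word_tags, char_tags):
        if w_tag == "delete":                       # drop the whole word
            continue
        kept = "".join(ch for ch, keep in zip(word, c_tags) if keep)
        if kept:                                    # keep surviving characters
            refined.append(kept)
    return "".join(refined)

# Invented example: "但是他已經離開了" -> "但他已經離開"
# ("但是" reduced to its subsequence "但", the particle "了" deleted entirely).
words     = ["但是", "他", "已經", "離開", "了"]
word_tags = ["keep", "keep", "keep", "keep", "delete"]
char_tags = [[1, 0], [1], [1, 1], [1, 1], [1]]
print(apply_deletions(words, word_tags, char_tags))  # prints 但他已經離開

In the setting the abstract describes, such tags would be predicted by the word- and character-level sequence-labeling models rather than written by hand; the snippet only illustrates how the two tag levels compose into a refined sentence.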