Word Vectorization Methods: Comparisons and Applications

Master's === National Taiwan University === Graduate Institute of Information Management === 106 === Word vectorization (also known as word embedding or distributional word representation) is a family of approaches that convert a word into a fixed-length vector. It is widely used in text mining and natural language processing tasks. However, few studies have systematically compared the performance of these methods. This study investigated eight word vectorization methods built on different technical approaches, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. The intrinsic evaluations examined association, similarity, and analogy relationships between word vectors; the extrinsic evaluation was a named entity recognition (NER) task. In the intrinsic evaluations, the results suggest that neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that draw on document-wide information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our evaluation. In the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the largest improvements to NER performance, while LDA and CBOW brought the smallest. This implies that the ranking of methods under intrinsic evaluation may be inconsistent with their ranking under extrinsic evaluation. Future studies could therefore include more extrinsic tasks to clarify the relationship between intrinsic evaluation scores and downstream task performance.
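As a concrete illustration of the intrinsic evaluations described in the abstract, the sketch below trains a small Skip-gram model and queries similarity and analogy relationships between word vectors. This is a minimal example using the gensim library; the toy corpus, hyperparameters, and probe words are illustrative assumptions, not the thesis's actual experimental setup.

    # Minimal intrinsic-evaluation sketch (similarity and analogy queries).
    # Assumption: gensim is used here for convenience; the abstract does not
    # state the thesis's implementation, corpora, or hyperparameters.
    from gensim.models import Word2Vec

    corpus = [
        ["king", "rules", "the", "kingdom"],
        ["queen", "rules", "the", "kingdom"],
        ["man", "walks", "in", "the", "city"],
        ["woman", "walks", "in", "the", "city"],
    ]

    # sg=1 selects Skip-gram; sg=0 would select CBOW instead.
    model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

    # Similarity: cosine similarity between two word vectors.
    print(model.wv.similarity("king", "queen"))

    # Analogy: king - man + woman should rank "queen" highly (3CosAdd).
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

On a realistic corpus, analogy accuracy and the correlation of similarity scores with human-rated word pairs are the standard intrinsic measures; the toy corpus above is far too small for the analogy query to be reliable and only demonstrates the queries themselves.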

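HAL (Hyperspace Analogue to Language), which the abstract reports as one of the strongest methods on the NER task, builds word vectors directly from distance-weighted co-occurrence counts in a sliding window. The sketch below shows that counting step with numpy; the window size and the weight formula (window minus distance plus one) follow a common HAL convention and are assumptions, since the thesis's exact settings are not given in the abstract.

    # HAL-style distance-weighted co-occurrence counting (assumed settings).
    import numpy as np

    def hal_matrix(tokens, vocab, window=5):
        idx = {w: i for i, w in enumerate(vocab)}
        M = np.zeros((len(vocab), len(vocab)))
        for i, w in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d < len(tokens):
                    # Row = target word, column = a context word to its right;
                    # nearer context words receive larger weights.
                    M[idx[w], idx[tokens[i + d]]] += window - d + 1
        return M

    tokens = ["the", "king", "rules", "the", "kingdom"]
    vocab = sorted(set(tokens))
    print(hal_matrix(tokens, vocab, window=2))

In full HAL, each word's row (right contexts) and column (left contexts) are concatenated to form its vector, and the dimensionality is often reduced afterwards, which is consistent with the abstract grouping HAL with the matrix factorization methods.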

Bibliographic Details
Main Author: Qian-Hui Zeng (曾千蕙)
Other Authors: Hsin-Min Lu (盧信銘)
Chinese Title: 詞向量化方法之比較與應用
Format: Others (thesis, 44 pages)
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/shf36j