Word Vectorization Methods: Comparisons and Applications

Master's === National Taiwan University === Graduate Institute of Information Management === 106 === Word vectorization (also known as word embedding or distributional word representation) is a family of approaches that convert a word into a fixed-length vector. It is widely used in text mining and natural language processing tasks. However, few studies have systematically compared the performance of these methods. This study investigated eight word vectorization methods built on different technical approaches, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. The intrinsic evaluations examined association, similarity, and analogy relationships between word vectors; the extrinsic evaluation was a named entity recognition (NER) task. In the intrinsic evaluations, the results suggest that neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that draw on document-wide information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our evaluation. In the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the largest improvements to NER performance, while LDA and CBOW brought the smallest. This implies that the ranking of methods under intrinsic evaluation may be inconsistent with their ranking under extrinsic evaluation. Future studies could therefore include more extrinsic tasks to clarify the relationship between intrinsic evaluation scores and downstream task performance.
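As a concrete illustration of the intrinsic evaluations described in the abstract, the sketch below trains a small Skip-gram model and queries similarity and analogy relationships between word vectors. This is a minimal example using the gensim library; the toy corpus, hyperparameters, and probe words are illustrative assumptions, not the thesis's actual experimental setup.

    # Minimal intrinsic-evaluation sketch (similarity and analogy queries).
    # Assumption: gensim is used here for convenience; the abstract does not
    # state the thesis's implementation, corpora, or hyperparameters.
    from gensim.models import Word2Vec

    corpus = [
        ["king", "rules", "the", "kingdom"],
        ["queen", "rules", "the", "kingdom"],
        ["man", "walks", "in", "the", "city"],
        ["woman", "walks", "in", "the", "city"],
    ]

    # sg=1 selects Skip-gram; sg=0 would select CBOW instead.
    model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

    # Similarity: cosine similarity between two word vectors.
    print(model.wv.similarity("king", "queen"))

    # Analogy: king - man + woman should rank "queen" highly (3CosAdd).
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

On a realistic corpus, analogy accuracy and the correlation of similarity scores with human-rated word pairs are the standard intrinsic measures; the toy corpus above is far too small for the analogy query to be reliable and only demonstrates the queries themselves.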

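HAL (Hyperspace Analogue to Language), which the abstract reports as one of the strongest methods on the NER task, builds word vectors directly from distance-weighted co-occurrence counts in a sliding window. The sketch below shows that counting step with numpy; the window size and the weight formula (window minus distance plus one) follow a common HAL convention and are assumptions, since the thesis's exact settings are not given in the abstract.

    # HAL-style distance-weighted co-occurrence counting (assumed settings).
    import numpy as np

    def hal_matrix(tokens, vocab, window=5):
        idx = {w: i for i, w in enumerate(vocab)}
        M = np.zeros((len(vocab), len(vocab)))
        for i, w in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d < len(tokens):
                    # Row = target word, column = a context word to its right;
                    # nearer context words receive larger weights.
                    M[idx[w], idx[tokens[i + d]]] += window - d + 1
        return M

    tokens = ["the", "king", "rules", "the", "kingdom"]
    vocab = sorted(set(tokens))
    print(hal_matrix(tokens, vocab, window=2))

In full HAL, each word's row (right contexts) and column (left contexts) are concatenated to form its vector, and the dimensionality is often reduced afterwards, which is consistent with the abstract grouping HAL with the matrix factorization methods.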

Bibliographic Details
Main Author: Qian-Hui Zeng (曾千蕙)
Other Authors: Hsin-Min Lu (盧信銘)
Chinese Title: 詞向量化方法之比較與應用
Format: Others (thesis, 44 pages)
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/shf36j