Word Vectorization Methods: Comparisons and Applications
Chinese Title: 詞向量化方法之比較與應用
Main Author: Qian-Hui Zeng (曾千蕙)
Advisor: Hsin-Min Lu (盧信銘)
Degree: Master's (碩士), National Taiwan University (國立臺灣大學), Graduate Institute of Information Management (資訊管理學研究所), academic year 106
Format: Thesis, 44 pages
Language: English (en_US)
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/shf36j
Record ID: ndltd-TW-106NTU05396020
Abstract:
Word vectorization (also known as word embedding or distributional word representation) is a family of approaches that convert a word into a fixed-length vector. It is widely used in text mining and natural language processing tasks. However, few studies have systematically compared the performance of these methods. This study investigated eight word vectorization methods drawn from different technical approaches, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. Intrinsic evaluations examined association, similarity, and analogy relationships between word vectors; the extrinsic evaluation was a named entity recognition (NER) task. For the intrinsic evaluations, the results suggest that neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that adopt document-wide information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our evaluation. For the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the most improvement to NER performance, while LDA and CBOW brought the least. This result implies that the ranking of methods under intrinsic evaluation may be inconsistent with their ranking under extrinsic evaluation. Future studies could therefore include more extrinsic tasks to clarify the relationship between intrinsic evaluation scores and downstream task performance.
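The intrinsic evaluations named in the abstract (word similarity and analogy) reduce to simple vector arithmetic over the fixed-length vectors. The following is a minimal sketch in Python, assuming toy hand-picked 4-dimensional vectors in place of trained embeddings; the word list, values, and dimensionality are hypothetical and not from the thesis, where the vectors would come from a trained model such as CBOW, Skip-gram, GloVe, HAL, LSA, or LDA.

```python
import numpy as np

# Hypothetical fixed-length word vectors (real embeddings typically
# have hundreds of dimensions and are learned from a corpus).
vectors = {
    "king":  np.array([0.8, 0.6, 0.1, 0.9]),
    "queen": np.array([0.7, 0.9, 0.1, 0.8]),
    "man":   np.array([0.9, 0.2, 0.2, 0.7]),
    "woman": np.array([0.8, 0.5, 0.2, 0.6]),
}

def cosine(u, v):
    """Cosine similarity, the usual score for word-similarity tests."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vecs):
    """Solve 'a is to b as c is to ?' via vector arithmetic (b - a + c),
    returning the nearest remaining word by cosine similarity."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(cosine(vectors["king"], vectors["queen"]))  # similarity test
print(analogy("man", "king", "woman", vectors))   # analogy test -> "queen"
```

In an evaluation like the one described, the cosine scores would be correlated against human similarity judgments, and the analogy solver would be scored over a benchmark set of analogy questions rather than a single toy example.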