Summary: | Master's Thesis === National Chengchi University === Department of Statistics === 106 === Text is the carrier of human history. From official histories to personal diaries, it records the culture, thoughts, customs, and technological development of humankind. With the progress of computer technology, text is no longer confined to physical media such as kraft paper or bamboo slips; it can be recorded in a variety of digital forms. With the rising interest in quantitative text analysis, more and more scholars are devoting themselves to developing text-analysis techniques and applying them to explore the meaning of texts. Many believe that computer technology, such as machine learning and artificial intelligence, can ease the burden on human experts in uncovering the meaning underlying a text.
Topic analysis is an important research topic in text analysis: by identifying keywords and separating text attributes, it makes text parsing faster. This thesis proposes methods for extracting core vocabulary and screening tag features based on the widely used TF-IDF (term frequency-inverse document frequency) weighting and the lexical database WordNet. We apply them to explore the relationship between vocabulary and the instability caused by article length (Magnini and Cavaglia, 2000). We use the Taiwan Social Science Citation Index (TSSCI), U.S. patents, and the People's Daily as study materials. The results show that the classification accuracies for TSSCI and U.S. patent texts are nearly 80%. However, when the number of articles is too small, noise distorts the analysis and the semantic relations. We also found that writing style influences the accuracy of topic classification, which may explain why the classification accuracy for the People's Daily texts is poor.
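To make the core-vocabulary idea concrete, the following is a minimal sketch of TF-IDF keyword extraction using scikit-learn. It is illustrative only: the sample documents, the English stop-word list, and the top-k cutoff are assumptions for demonstration, not the corpus or parameters actually used in this thesis.

```python
# Minimal sketch of TF-IDF-based core-vocabulary extraction (illustrative only;
# the documents and thresholds below are assumptions, not the thesis's setup).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Patent claims describe a semiconductor fabrication process.",
    "The survey studies voting behavior in local elections.",
    "A news report on agricultural policy and grain prices.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)
terms = np.array(vectorizer.get_feature_names_out())

# For each document, keep the top-k terms by TF-IDF weight as its core vocabulary.
k = 3
for i in range(len(docs)):
    weights = tfidf[i].toarray().ravel()
    top_terms = terms[weights.argsort()[::-1][:k]]
    print(f"doc {i}: {', '.join(top_terms)}")
```

In practice the selected core vocabulary would then be filtered or expanded with WordNet relations (for example synonyms or hypernyms) before being used as classification features, as described above.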
|