A data mining approach for identifying improper review articles on the internet - Taking Cosmetics as an example

Bibliographic Details
Main Authors: Yin-Hsuan Hsieh, 謝尹瑄
Other Authors: Yung-Ho Leu
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/d4qmy9
Description
Summary: Master's === National Taiwan University of Science and Technology === Department of Information Management === 104 === With the popularity of the Internet, people readily share their experiences with products by posting review articles online. Review articles affect a customer's attitude toward purchasing a product. In the past, consumers might ask friends or relatives for their opinions before purchasing a product. Today, consumers usually browse review articles on a blog or forum before buying. Because review articles influence customers' purchasing behavior, they are regulated by law: a review article may exaggerate a product's effects to entice customers to buy it, so there are regulations on its contents. This thesis aims to automatically screen out improper review articles from review articles on the Internet, taking cosmetics as the subject of study. First, we built a thesaurus of illegal words by referencing the website of the Ministry of Health and Welfare of Taiwan. Next, we randomly selected 500 articles from 6000 review articles on Urcosme, a cosmetics forum in Taiwan, and classified them into two categories, proper and improper. A review article is improper if it contains words from the thesaurus; otherwise, it is proper. We then used the Naïve Bayes and Decision Tree algorithms of Weka to classify this training dataset. Under 10-fold cross-validation, with the improper category defined as the positive class, both algorithms achieved recall greater than 70 percent and specificity greater than 90 percent. These results indicate that the proposed method is an effective way to automatically identify improper review articles on the Internet.
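The labeling rule and the two reported metrics can be illustrated with a minimal Python sketch. It is not the thesis's Weka pipeline. The terms in `ILLEGAL_TERMS` are hypothetical placeholders rather than entries from the actual thesaurus, and the function names `label_article` and `recall_and_specificity` are assumptions introduced only for this example. Recall is TP / (TP + FN) and specificity is TN / (TN + FP), with "improper" as the positive class.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical placeholder terms; the thesis's real thesaurus is not reproduced here.
ILLEGAL_TERMS = {"cures acne", "removes wrinkles permanently"}


def label_article(text: str, thesaurus: Iterable[str] = ILLEGAL_TERMS) -> str:
    """Label an article 'improper' if it contains any thesaurus term, else 'proper'."""
    lowered = text.lower()
    return "improper" if any(term in lowered for term in thesaurus) else "proper"


def recall_and_specificity(
    actual: Sequence[str],
    predicted: Sequence[str],
    positive: str = "improper",
) -> Tuple[float, float]:
    """Compute (recall, specificity), treating `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # improper article correctly flagged
            else:
                fn += 1  # improper article missed
        else:
            if p == positive:
                fp += 1  # proper article wrongly flagged
            else:
                tn += 1  # proper article correctly passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity
```

For example, `label_article("This cream cures acne overnight")` returns `"improper"` under the placeholder thesaurus. High specificity matters in this setting because most review articles are expected to be proper, so a low false-positive rate keeps the number of wrongly flagged articles small.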