Classifying Chinese Children Web Pages using Text and Visual Information

Bibliographic Details
Main Authors: Chun-Hung Wang, 王竣弘
Other Authors: Yaw-Huei Chen
Format: Others
Language: zh-TW
Online Access: http://ndltd.ncl.edu.tw/handle/57073372464847175360
Description
Summary: Master's thesis === National Chiayi University === Graduate Institute of Computer Science and Information Engineering === Academic year 100 (ROC calendar) === The Internet contains rich resources, but children may have difficulty using them because of their limited reading comprehension. To provide appropriate online materials for elementary school students, web pages must first be classified by their content. This study therefore proposes a method for automatically identifying Chinese web pages for children. The classifiers are built with machine learning methods. The data are first preprocessed in four steps: parsing the web pages into DOM trees, performing word segmentation on the extracted main content, partitioning each page into blocks, and calculating the Chi-square values of terms from the Sinica corpus. Textual features are then selected with the Chi-square feature selection method and combined with a set of visual features. Finally, machine learning methods are applied to the resulting feature set in the training and testing phases. Experiments on manually collected web pages show that the proposed method produces satisfactory results.
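
The summary names Chi-square feature selection but does not give a formula, so the following is only a minimal sketch of the standard 2x2 contingency-table Chi-square score often used for selecting text features. It assumes the documents are already segmented into tokens and labeled by class; the function names `chi_square_scores` and `top_terms`, the labels `"child"` and `"other"`, and the toy English tokens are all hypothetical and are not taken from the thesis, which computes its term statistics from the Sinica corpus.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by the 2x2 Chi-square statistic against `target`.

    docs   : list of token lists (assumed already word-segmented)
    labels : class labels, parallel to docs
    target : the class of interest (hypothetical label, e.g. "child")
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target class.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y == target else neg_df).update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # target docs containing the term
        b = neg_df[term]   # other docs containing the term
        c = n_pos - a      # target docs without the term
        d = n_neg - b      # other docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_terms(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data standing in for segmented page content (illustrative only).
docs = [
    ["rabbit", "story"],
    ["story", "game"],
    ["stock", "report"],
    ["report", "story"],
]
labels = ["child", "child", "other", "other"]

scores = chi_square_scores(docs, labels, target="child")
print(top_terms(scores, 3))
```

In a pipeline like the one described, the highest-scoring terms would serve as the textual features, to be combined with visual features before training a classifier. How the thesis derives and thresholds its scores may differ from this sketch.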