Summary: | Master's === National Chiayi University === Graduate Institute of Computer Science and Information Engineering === 100 === The Internet offers rich resources, but children may have difficulty using them because of their limited reading comprehension. To provide appropriate online materials for elementary school students, web pages must first be classified by content. This study therefore proposes a method to automatically identify Chinese web pages for children. We use machine learning methods to build the classifiers. The data are first preprocessed in four steps: parsing the web pages into DOM trees, performing word segmentation on the extracted main content, partitioning each page into blocks, and calculating the Chi-square values of terms from the Sinica corpus. We then obtain textual features through Chi-square feature selection, together with several visual features. Finally, we apply machine learning methods to the resulting feature set in the training and testing phases. Experiments on manually collected web pages show that the proposed method produces satisfactory results.
|
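The Chi-square feature selection step named in the summary can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a small set of already segmented, labeled documents rather than the Sinica corpus, uses the standard 2x2 contingency form of the statistic, and the names `chi_square`, `chi_square_scores`, and the toy tokens and labels are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the target class that contain the term
    b: documents outside the target class that contain the term
    c: documents in the target class that lack the term
    d: documents outside the target class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels, target):
    """Score every term by its association with the target class.

    docs: list of token lists (word segmentation is assumed to be done)
    labels: list of class labels, one per document
    target: the label treated as the positive class
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        for term in set(tokens):
            if y == target:
                pos_df[term] += 1
            else:
                neg_df[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


# Toy data (hypothetical): two pages for children, two other pages.
docs = [["story", "small"], ["story", "school"],
        ["stock", "market"], ["stock", "small"]]
labels = ["child", "child", "other", "other"]
scores = chi_square_scores(docs, labels, target="child")
# Keep the highest-scoring terms as textual features.
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy data, a term that appears in both classes equally often ("small") scores zero, while terms confined to one class ("story", "stock") score highest, which is why high-scoring terms are the ones kept as textual features.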