Summary: | Master's === National Central University === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is largely limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web pages makes this a challenging task for researchers.
Although page-level approaches extract more complete web data than record-level approaches, very little research focuses on this task because of the difficulty and complexity of the problem. On the other hand, existing web data extraction systems require users with an IT background, because these systems do not provide a user-friendly GUI.
In this paper, we present a web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a user-friendly GUI to improve the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision increases by 33.08% and recall by 32.4%. In addition, the improved system achieves higher recall than the other systems. In terms of accuracy, our system outperforms TEX with the default threshold. If we adjust the threshold of the models, the overall accuracy improves from 94.5% to 98.8%, which is 27% higher than TEX.
|