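Summary: | Master's === National Central University === In-service Master's Program, Department of Computer Science and Information Engineering === 98 === A huge amount of information on the World Wide Web has a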
structured HTML form because the pages are generated dynamically from databases
and share the same template. This paper proposes a page-level web data
extraction system, FiVaTech2, that automatically extracts schema and templates from
these template-based web pages. The proposed system,
FiVaTech2, is an extension of our previous page-level web data
extraction system FiVaTech. FiVaTech2 uses a machine learning (ML)
based method that compares pairs of HTML tags to estimate how likely
they are to be present in the web pages. We use an ML technique, the
J48 decision tree classifier, and also use image comparison to assist
template detection. Each HTML tag in the web page has several features
that can be divided into three types: visual information, DOM tree
information, and HTML tag contents. Our experiments show
encouraging results on the test pages when combinations of the three
types of tag features are used. They also show that FiVaTech2
performs better and is more efficient than FiVaTech.
|