Summary: | Master's === Tamkang University === Master's Program, Department of Information Management === 99 === Many techniques have been proposed to extract important information from web tables. Many of them work well on simple tables. When applied to complex tables, however, they usually yield unsatisfactory accuracy because they compare table cells for similarity inadequately and collect too little information about the table. We design and implement an automatic web data table structure recognition system to tackle this problem. The system first classifies web data tables into nine table categories by analyzing TSF (Table Structure Feature) and CT (Cell Type) through heuristics. After the classification phase, each cell is identified as either a table attribute or a table value by analyzing the table structure of its category. For complex tables, we use heuristics and common attribute name recognition in 2x2 tables to recognize table structures. Furthermore, table attributes and table values are presented as relational tables to save memory space and to identify each record clearly. We evaluate the effectiveness of the system and analyze why some table structures are recognized incorrectly. We identify the causes and suggest future developments to handle these cases.
|
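The last two steps described in the abstract (typing cells, then presenting attributes and values as relational records) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `cell_type` and `to_records`, the regular expression, and the two orientations are assumptions made for illustration, and the table's orientation is assumed to be already known from the classification phase.

```python
import re

# Hypothetical pattern for numeric-looking cells such as "12", "1.20", "-3,5", "40%".
_NUMERIC = re.compile(r"^[+-]?\d+(?:[.,]\d+)*%?$")


def cell_type(cell: str) -> str:
    """Crude cell-type guess (an assumed stand-in for the CT feature)."""
    text = cell.strip()
    if not text:
        return "empty"
    return "numeric" if _NUMERIC.match(text) else "text"


def to_records(table, orientation="row"):
    """Present a simple table as relational records (attribute -> value).

    orientation="row":    attributes are in the first row.
    orientation="column": attributes are in the first column.
    The orientation is assumed to come from an earlier classification step.
    """
    if orientation == "column":
        # Transpose so that attributes end up in the first row.
        table = [list(col) for col in zip(*table)]
    elif orientation != "row":
        raise ValueError(f"unsupported orientation: {orientation!r}")
    attributes, *rows = table
    return [dict(zip(attributes, row)) for row in rows]


row_wise = [["Name", "Price"], ["Apple", "1.20"], ["Pear", "0.80"]]
column_wise = [["Name", "Apple", "Pear"], ["Price", "1.20", "0.80"]]

records = to_records(row_wise)
same_records = to_records(column_wise, orientation="column")
```

Both sample tables yield the same two records, one per fruit, which is the sense in which a relational presentation identifies each record clearly regardless of how the source table was laid out.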