Design and Implementation of Indexing Strategies for XML Documents


Bibliographic Details
Main Authors: Mao-Tong Lin, 林茂桐
Other Authors: Ye-In Chang
Format: Others
Language: en_US
Published: 2002
Online Access: http://ndltd.ncl.edu.tw/handle/40920145381288849329
Description
Summary: Master's thesis === National Sun Yat-sen University === Department of Computer Science and Engineering === 90 === In recent years, many people have used the World Wide Web and the Internet to find the information they want. HTML is a document markup language for publishing hypertext on the WWW, and it has been the target format for content developers around the world. HTML tags serve the primary purpose of describing how to display a data item. It is therefore difficult to find useful information in HTML documents, because their content is mixed with display tags. XML, on the other hand, is a data format for exchanging data between inter-enterprise applications on the Internet. To facilitate data exchange, industry groups define public Document Type Definitions (DTDs) that specify the format of the XML documents to be exchanged between their applications. Moreover, WWW/EDI and electronic commerce are very popular, and a lot of business data is exchanged as XML on the World Wide Web. XML tags describe the data itself, so the content (meaning) of an XML document is separated from its display format. This makes it easy to find and analyze meaningful information in XML documents. When a large volume of business data (XML documents) exists, one way to support the management of the XML documents is to use a relational database. For such an approach, the XML documents must be transformed into relational data. In this thesis, we design and implement indexing strategies to access XML documents efficiently. An XML document is fundamentally different from relational data: it is hierarchical and nested, and very similar to the semistructured data model. Semistructured data is characterized by the fact that it may not have a fixed schema and may be irregular or incomplete. Although the semistructured data model is flexible for data modeling, it requires a large search space in query processing, since no schema is fixed in advance.
Indexing is a way to improve query performance. However, due to the special properties of semistructured data, there are up to five types of queries: (1) complete single path, (2) specified leaf only, (3) specified intrapath, (4) specified attribute/element (value), and (5) multiple paths with the same level. In this thesis, we classify all possible queries into these five query types and create different indexes for the different query types. Moreover, we design and implement the query transformation from XML query statements to SQL statements, and we create a user-friendly interface for users to input XML query statements. The whole system is implemented in Java and SQL Server 2000. From our experience, our indexing strategies improve XML query processing performance markedly.
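
As a rough illustration of the first two query types, the following minimal Java sketch builds a path index over a small XML document and answers a complete-single-path query and a specified-leaf-only query. It is only a sketch under stated assumptions: the class name PathIndexSketch, its methods, and the sample document are hypothetical, and the index is an in-memory map rather than the relational tables, SQL Server indexes, or XML-to-SQL transformation that the thesis describes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: not the thesis's index structures or schema.
public class PathIndexSketch {
    // Maps each full root-to-leaf path to the text values stored at that path.
    private final Map<String, List<String>> byPath = new TreeMap<>();

    public PathIndexSketch(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            walk(root, "");
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Depth-first walk; only elements without child elements count as leaves.
    private void walk(Element e, String parentPath) {
        String path = parentPath + "/" + e.getTagName();
        boolean hasChildElement = false;
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n instanceof Element) {
                hasChildElement = true;
                walk((Element) n, path);
            }
        }
        if (!hasChildElement) {
            byPath.computeIfAbsent(path, k -> new ArrayList<>())
                  .add(e.getTextContent().trim());
        }
    }

    // Query type (1), complete single path: one direct lookup in the index.
    public List<String> completePath(String path) {
        return byPath.getOrDefault(path, Collections.emptyList());
    }

    // Query type (2), specified leaf only: match every indexed path ending in the leaf name.
    public List<String> leafOnly(String leaf) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> en : byPath.entrySet()) {
            if (en.getKey().endsWith("/" + leaf)) {
                out.addAll(en.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>XML Basics</title>"
                   + "<author><name>Lin</name></author></book>"
                   + "<journal><title>Data Notes</title></journal></library>";
        PathIndexSketch idx = new PathIndexSketch(xml);
        System.out.println(idx.completePath("/library/book/title")); // [XML Basics]
        System.out.println(idx.leafOnly("title"));                   // [XML Basics, Data Notes]
    }
}
```

The complete-path query is a single key lookup, whereas the leaf-only query has to scan every indexed path. That difference hints at why a separate index per query type, as proposed in the thesis, can pay off.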