Summary: | XML has emerged as a universal data exchange format for disseminating and sharing information, particularly on the World Wide Web. As more XML data are generated, stored, and exchanged, the need to index XML data efficiently for querying purposes is becoming increasingly important. Designing efficient indexing structures for XML data presents serious challenges because any indexing scheme must index the structural a well as the data components of XML documents and provide tight integrations of the two components.
This thesis studies XML indexing methods for tree-structured XML documents that can be queried by a subset of XPath expressions. More specifically, this thesis proposes a new main memory index structure, named M*Ctree, which is an enhanced Ctree. Unlike the Ctree which is constructed solely from the structural and data characteristics of the database, the M*Ctree includes query workload characteristics in order to speedup query evaluations on frequently used paths and nodes on the index structure.
The Ctree index cleverly uses arrays to preserve child-parent relationships among individual data node pairs in a summary tree structure in order to avoid expensive structural join costs. The M*Ctree combines the use of child-parent links with additional arrays which provide child-ancestor links along frequently used paths to accelerate query evaluations. These child-ancestor links can be pre-computed based on query workloads, added, and removed as needed to reflect changing workload characteristics. Combined with value indexes which are structure-and-content sensitive, the M*Ctree becomes a multi-resolution index structure optimized for frequently used paths and achieves better overall performance than those index structures which do not consider query workloads. The M*Ctree trades off extra memory costs to support additional arrays and achieves better execution times for queries along frequently used paths of the index structure.
Experiments conducted in this research show that the M*Ctree achieves better performance than the Ctree for both simple and branching path queries matching index paths with child-ancestor links. The M*Ctree achieves larger performance gain over the Ctree on index paths which do not contain regular groups.
|