Efficient XML Stream Processing and Searching

Bibliographic Details
Other Authors: Zhang, Wei (author)
Format: Others
Language: English
Published: Florida State University
Subjects:
Online Access: http://purl.flvc.org/fsu/fd/FSU_migr_etd-5297
Description
Summary: In this dissertation, I present a table-driven streaming XML (Extensible Markup Language) parsing and searching technique, called TDX, and investigate related techniques. TDX expedites XML parsing, validation and searching by pre-recording the states of an XML parser in tabular form and by using an efficient runtime streaming parsing engine based on a two-stack push-down automaton. The parsing tables are automatically produced from XML schemas or from WSDL (Web Services Description Language) service descriptions. Because the schema constraints and XPath expressions are pre-encoded in a parsing table, the approach effectively implements a schema-specific XML parser and/or query processor that combines parsing, validation and search into a single pass. Moreover, because the runtime parsing engine is independent of XML schemas and XPath query expressions, parsing tables can be populated on-the-fly to the runtime engine. TDX thus eliminates the recompilation and redeployment that schema-specific parsers otherwise require when a schema changes. Similarly, different XPath queries can be preprocessed at compile time and populated on-the-fly to the TDX searching engine without runtime overhead. To construct the parsing tables, we developed a set of mapping rules that translate XML schemas to augmented grammars. The augmented grammars support the full expressive power of the W3C XML Schema by introducing permutation phrase grammars and multi-occurrence phrase grammars, and they are suitable for constructing a predictive parsing table. The predictive parsing table constructed from the augmented grammars can be integrated into the parser at any time to maximize performance, or be populated on-the-fly at runtime to address schema changes efficiently.
Because parsing tables and searching tables are pre-processed at compile time, and table lookups at runtime are deterministic and take constant time, TDX efficiently implements a single-pass, predictive validating parser without backtracking or function-call overhead. Our experimental results show a significant performance improvement compared to widely used XML parsers, both validating and non-validating, and to XML query processors.
A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Spring Semester, 2012.
March 22, 2012.
Keywords: Parsing, Query, Table-Driven, Validation, XML, XPath
Includes bibliographical references.
Robert A. van Engelen, Professor Directing Dissertation; Gordon Erlebacher, University Representative; Xiuwen Liu, Committee Member; Xin Yuan, Committee Member; Zhenhai Duan, Committee Member.
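The table-driven idea described in the summary can be illustrated with a small sketch. This is not the TDX implementation: it uses a textbook single-stack LL(1) engine rather than the two-stack push-down automaton the dissertation describes, and the toy "book" schema, the table layout, and the names `TABLE`, `TABLE_V2`, and `validate` are all assumptions made up for this example. It only aims to show how a schema-independent engine can validate a tag stream in one pass using constant-time table lookups, and how swapping the table changes the accepted documents without touching the engine.

```python
# Hypothetical toy schema (not from the dissertation):
#   BOOK    -> <book> TITLE AUTHORS </book>
#   TITLE   -> <title> </title>
#   AUTHORS -> AUTHOR AUTHORS | (empty)
#   AUTHOR  -> <author> </author>
# The predictive table maps (nonterminal, lookahead tag) to a production.
TABLE = {
    ("BOOK", "<book>"): ("<book>", "TITLE", "AUTHORS", "</book>"),
    ("TITLE", "<title>"): ("<title>", "</title>"),
    ("AUTHORS", "<author>"): ("AUTHOR", "AUTHORS"),
    ("AUTHORS", "</book>"): (),  # empty production: no more authors
    ("AUTHOR", "<author>"): ("<author>", "</author>"),
}

# A second table for a changed schema: a book has a title and no authors.
TABLE_V2 = dict(TABLE)
TABLE_V2[("BOOK", "<book>")] = ("<book>", "TITLE", "</book>")

END = "$"  # end-of-input marker


def validate(tags, table, start):
    """Validate a stream of tags in a single pass, without backtracking.

    The engine knows nothing about the schema; all schema knowledge
    lives in `table`, so a new table can be supplied at runtime.
    """
    stack = [END, start]
    stream = iter(list(tags) + [END])
    look = next(stream)
    while stack:
        top = stack.pop()
        if top == look:
            # Terminal on the stack matches the lookahead tag.
            if top == END:
                return True
            look = next(stream)
        elif (top, look) in table:
            # Constant-time lookup; push the production right to left.
            stack.extend(reversed(table[(top, look)]))
        else:
            return False  # no table entry: the stream violates the schema
    return False


VALID = ["<book>", "<title>", "</title>",
         "<author>", "</author>", "</book>"]
MISSING_TITLE = ["<book>", "<author>", "</author>", "</book>"]

if __name__ == "__main__":
    print(validate(VALID, TABLE, "BOOK"))
    print(validate(MISSING_TITLE, TABLE, "BOOK"))
    print(validate(VALID, TABLE_V2, "BOOK"))
```

In this sketch the same `validate` function is reused with both tables, which mirrors, in a very simplified way, the summary's point that a schema change requires a new table rather than a recompiled parser.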