Summary: | The industry-wide adoption of graph databases has been hindered due to the lack of a standard query language. Hence projects such as ISO/IEC 39075 have been proposed to integrate features from existing graph query languages, including Cypher, PGQL and G-Core. Integrating existing query languages requires a systematic comparison so that exclusive characteristics can be identified. Comparisons can be conducted by using graph query language benchmarks which are built on theoretical language formalisms. Literature suggests that existing theoretical language formalisms for graph databases are not expressive enough in formulating different data retrieval queries. Existing benchmarks also utilize the topological information stored in the graph schema to formulate queries for comparing graph query languages. Contemporary graph databases including Neo4j, Oracle and, Apache Tinkerpop, are either schema-less or schema optional to support frequent changes in the structure of data found in domains requiring high flexibility. However, the absence of robust graph schema impacts data consistency, integrity, and analytics in graph databases. Lack of expressive theoretical language formalisms and robustly-defined graph schemas are the two open problems in current graph database research. This thesis contributes towards solving these problems. We propose novel formalisms of conjunctive and union of conjunctive queries extended with Tarski's algebra that are more expressive than existing theoretical language formalisms. The formalisms are then used to formulate benchmark queries for comparing the expressiveness of Cypher and PGQL on the two core features of graph pattern matching and graph navigation, revealing the standard and exclusive characteristics for these languages. We present a formal algebra FLASc that assists in formulating robust graph schemas. We consider three case studies related to domains such as cyber-physical systems, big data analytics and tourism. These case studies illustrate the use of FLASc for transforming and loading data-sets for heterogeneous sources into graph databases such as Neo4j, thereby ensuring data consistency and integrity. Findings from this research suggest that formally defined graph schemas help generate efficient benchmark queries, facilitating the comparison of existing graph query languages. Furthermore, graph schemas are vital for ensuring better data manageability and developing future graph query languages to support data definition and data retrieval mechanisms from graph databases. Overall, our study serves as a formal basis for generating robust graph schemas and developing benchmarks for comparing existing graph query languages. This study assists in moving towards query language integration and interoperability between available graph database technologies; therefore, it serves as a basis for upcoming standards such as ISO/IEC 39075.
|