A Formal, Declarative Approach to Data Format Description

The concept of data formats is central to information storage and exchange, as it coins the process of how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and...

Full description

Bibliographic Details
Main Author: Hartle, Michael
Format: Others
Language:English
en
Published: 2010
Online Access:https://tuprints.ulb.tu-darmstadt.de/2271/1/Dissertationsschrift_Michael_Hartle_%28ver%C3%B6ffentlicht%29_20100901.pdf
Hartle, Michael <http://tuprints.ulb.tu-darmstadt.de/view/person/Hartle=3AMichael=3A=3A.html> (2010): A Formal, Declarative Approach to Data Format Description.Darmstadt, Technische Universität, [Ph.D. Thesis]
id ndltd-tu-darmstadt.de-oai-tuprints.ulb.tu-darmstadt.de-2271
record_format oai_dc
collection NDLTD
language English
en
format Others
sources NDLTD
description The concept of data formats is central to information storage and exchange, as it coins the process of how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and of procedural definitions of format-compliant components, describing data format knowledge in a formal, declarative manner is necessary for making this knowledge machine-processible, enabling its flexible, automated application to format-compliant data. To that effect, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format consisting of such instances. In a survey of current State of the Art in Data Format Description, examined related work from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication show a lack suitable formalised models for universal applicability. As well, examined related work provides only a subset of the four necessary descriptive capabilities to describe data which may be primitive, structured, transcoded or fragmented. In the analysis, a formalisation is presented which is based on the research hypothesis that a data format defines a normative set of lossless information representations, where there exists a bijective mapping between interal representations of senders / receivers, and external representations that are exchanged as format-compliant data. The formalisation is universally applicable for arbitrary data formats, is suitable for both so-called lossless and lossy data formats, and leads to the notion of four elementary descriptive capabilities, which exactly match those used in the State of the Art survey. A valid Portable Network Graphics (PNG) raster image is given as ``litmus test'' for data format description, as its description exercises all four elementary descriptive capabilities. Based on the formalisation, it is shown that a universal approach to data format description is too powerful in computational terms as to be able to guarantee termination, that the tractability of bijective mapping functions and their inverses is neither given nor necessarily related, and that one-to-one correspondence of a bijective mapping function can be guaranteed using information-preserving, Turing-complete Reversible Turing Machines. Building on the formalisation given in the analysis, the thesis defines the Bitstream Segment Graph (BSG) model for describing arbitrary data format instances. For BSG instances, representations are defined both for visualisation as well as for storage and exchange through machine-processible, RDF-based representations. Incremental construction and modification of BSG instances is enabled through a closed set of operations, and the ``coverage'' of a BSG instance is defined as a measure of its completeness. Actual tool support for the construction, modification and exploration of BSG instances on arbitrary data is provided through the Apeiron BSG Editor. Applications of the BSG model are demonstrated through the description of the PNG image ``litmus test'' from the previous analysis, and for the description of an exploit in the context of IT Security. Building on the BSG model, the thesis defines the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred for a given bit sequence which is considered to be format-compliant data. The BSG Reasoning approach defines the representation of rulesets for storage and exchange. Applications of BSG Reasoning are demonstrated through the description of a PNG image file format subset and through an outlined approach for format-aware fuzzing of bitstreams in IT Security. The PNG image file format subset described through BSG Reasoning exercises all elementary descriptive capabilities previously identified in the analysis, and it is shown that the resulting set of logic rules, despite a low number of format-specific rules, already yields a high coverage of inferred BSG instances on a number of valid PNG images. The thesis closes with a retrospection, conclusions and an outlook on potential future research on the BSG model and the BSG Reasoning approach, focusing on aspects such as the computer-aided reverse-engineering of data format rules, or the use of reversible programming languages for the definition of lossless coding and transformation functions.
author Hartle, Michael
spellingShingle Hartle, Michael
A Formal, Declarative Approach to Data Format Description
author_facet Hartle, Michael
author_sort Hartle, Michael
title A Formal, Declarative Approach to Data Format Description
title_short A Formal, Declarative Approach to Data Format Description
title_full A Formal, Declarative Approach to Data Format Description
title_fullStr A Formal, Declarative Approach to Data Format Description
title_full_unstemmed A Formal, Declarative Approach to Data Format Description
title_sort formal, declarative approach to data format description
publishDate 2010
url https://tuprints.ulb.tu-darmstadt.de/2271/1/Dissertationsschrift_Michael_Hartle_%28ver%C3%B6ffentlicht%29_20100901.pdf
Hartle, Michael <http://tuprints.ulb.tu-darmstadt.de/view/person/Hartle=3AMichael=3A=3A.html> (2010): A Formal, Declarative Approach to Data Format Description.Darmstadt, Technische Universität, [Ph.D. Thesis]
work_keys_str_mv AT hartlemichael aformaldeclarativeapproachtodataformatdescription
AT hartlemichael formaldeclarativeapproachtodataformatdescription
_version_ 1719326836131889152
spelling ndltd-tu-darmstadt.de-oai-tuprints.ulb.tu-darmstadt.de-22712020-07-15T07:09:31Z http://tuprints.ulb.tu-darmstadt.de/2271/ A Formal, Declarative Approach to Data Format Description Hartle, Michael The concept of data formats is central to information storage and exchange, as it coins the process of how information is written to and read back from format-compliant data by senders and receivers. In contrast to the widespread use of natural-language descriptions intended for human engineers, and of procedural definitions of format-compliant components, describing data format knowledge in a formal, declarative manner is necessary for making this knowledge machine-processible, enabling its flexible, automated application to format-compliant data. To that effect, data format knowledge is considered both on the level of format-compliant data as a data format instance and on the level of a data format consisting of such instances. In a survey of current State of the Art in Data Format Description, examined related work from the data-centric research domains of Digital Preservation, Multimedia and Telecommunication show a lack suitable formalised models for universal applicability. As well, examined related work provides only a subset of the four necessary descriptive capabilities to describe data which may be primitive, structured, transcoded or fragmented. In the analysis, a formalisation is presented which is based on the research hypothesis that a data format defines a normative set of lossless information representations, where there exists a bijective mapping between interal representations of senders / receivers, and external representations that are exchanged as format-compliant data. The formalisation is universally applicable for arbitrary data formats, is suitable for both so-called lossless and lossy data formats, and leads to the notion of four elementary descriptive capabilities, which exactly match those used in the State of the Art survey. A valid Portable Network Graphics (PNG) raster image is given as ``litmus test'' for data format description, as its description exercises all four elementary descriptive capabilities. Based on the formalisation, it is shown that a universal approach to data format description is too powerful in computational terms as to be able to guarantee termination, that the tractability of bijective mapping functions and their inverses is neither given nor necessarily related, and that one-to-one correspondence of a bijective mapping function can be guaranteed using information-preserving, Turing-complete Reversible Turing Machines. Building on the formalisation given in the analysis, the thesis defines the Bitstream Segment Graph (BSG) model for describing arbitrary data format instances. For BSG instances, representations are defined both for visualisation as well as for storage and exchange through machine-processible, RDF-based representations. Incremental construction and modification of BSG instances is enabled through a closed set of operations, and the ``coverage'' of a BSG instance is defined as a measure of its completeness. Actual tool support for the construction, modification and exploration of BSG instances on arbitrary data is provided through the Apeiron BSG Editor. Applications of the BSG model are demonstrated through the description of the PNG image ``litmus test'' from the previous analysis, and for the description of an exploit in the context of IT Security. Building on the BSG model, the thesis defines the BSG Reasoning approach for describing arbitrary data formats as potentially infinite sets of data format instances. Using logic rules, a BSG instance can be inferred for a given bit sequence which is considered to be format-compliant data. The BSG Reasoning approach defines the representation of rulesets for storage and exchange. Applications of BSG Reasoning are demonstrated through the description of a PNG image file format subset and through an outlined approach for format-aware fuzzing of bitstreams in IT Security. The PNG image file format subset described through BSG Reasoning exercises all elementary descriptive capabilities previously identified in the analysis, and it is shown that the resulting set of logic rules, despite a low number of format-specific rules, already yields a high coverage of inferred BSG instances on a number of valid PNG images. The thesis closes with a retrospection, conclusions and an outlook on potential future research on the BSG model and the BSG Reasoning approach, focusing on aspects such as the computer-aided reverse-engineering of data format rules, or the use of reversible programming languages for the definition of lossless coding and transformation functions. 2010-08-31 Ph.D. Thesis PeerReviewed application/pdf eng CC-BY-NC-ND 2.5 de - Creative Commons, Attribution Non-commerical, No-derivatives https://tuprints.ulb.tu-darmstadt.de/2271/1/Dissertationsschrift_Michael_Hartle_%28ver%C3%B6ffentlicht%29_20100901.pdf Hartle, Michael <http://tuprints.ulb.tu-darmstadt.de/view/person/Hartle=3AMichael=3A=3A.html> (2010): A Formal, Declarative Approach to Data Format Description.Darmstadt, Technische Universität, [Ph.D. Thesis] en info:eu-repo/semantics/doctoralThesis info:eu-repo/semantics/openAccess