The notion of 'information content of data' for databases

Bibliographic Details
Main Author: Xu, Kaibo
Published: University of the West of Scotland, 2009
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.544372
Summary: This thesis is concerned with a fundamental notion of information in the context of databases. The problem of the information content of a conceptual data schema appears elusive. The conventional definition of information rests on the entropy-based quantitative theory proposed by Shannon (1948), which is widely used to measure the amount of information created and transmitted through a communication channel. However, such an approach seems to lack the capability to explain phenomena concerning the content aspect of information. Moreover, the question of how the information content of data in a database may be reasoned about does not appear to have been addressed adequately. We therefore believe that the notion of the information content of data should be fully investigated and formally defined.

To this end, the notion of the information content of a signal is redefined by modifying the known definition of information content given by Dretske (1981, p. 65). We then define what we call the information content inclusion relation (IIR), a partial order on random events. A set of inference rules is presented for reasoning about the information content of a random event, and we explore how these ideas and rules may be used in a database setting, including the derivation of otherwise hidden information by deriving new IIR from a given set of IIR.

Furthermore, we observe that the problem of whether the instances of one data schema may be recovered from those of another does not seem to have been well investigated, and this, we believe, is fundamental to the relationship between two schemata. The works in the literature closest to this question are based upon the notion of relevant information capacity, which is concerned with whether one schema may replace another without losing the system's capacity to store data. We also observe that the rationale of such an approach is overly intuitive (even though the techniques involved are sophisticated): a convincing answer should rest on whether one or more instances of a schema can tell us truly what an instance of another schema would be. This is a matter of one thing carrying information about another. To capture such a relationship, the notion of information carrying between states of affairs is introduced, through which we examine informational relationships at much more detailed levels than the conventional entropy-based approach, namely random events and particulars of random events.

The validity of these ideas is demonstrated by applying them to schema transformations that preserve information-bearing capability, including, among others, some aspects of normalization for relational databases and schema transformation with Miller et al.'s (1994) Schema Intension Graph (SIG) model. To verify our ideas on reasoning about the information content of data, a prototype called IIR-Reasoning is presented, which shows how they might be exploited in a real database setting, including how real-world events and database values are aligned.
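For orientation, the entropy-based quantitative measure the summary contrasts with the content-oriented view is Shannon's (1948) entropy of a discrete random variable. This is the standard textbook statement, not a formula reproduced from the thesis itself:

```latex
% Shannon entropy of a discrete random variable X with distribution p
H(X) = -\sum_{x} p(x) \log_2 p(x)
```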
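The definition that the thesis modifies is Dretske's (1981, p. 65) account of a signal's information content, quoted here in its usual form for context (the thesis's modified version is not reproduced in this summary): a signal r carries the information that s is F just in case, given the receiver's background knowledge k,

```latex
P(s \text{ is } F \mid r, k) = 1
  \quad \text{while} \quad
P(s \text{ is } F \mid k) < 1
```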
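The summary does not state the thesis's actual inference rules for IIR, but since IIR is described as a partial order on random events, a minimal sketch can show the flavour of deriving "hidden" information: new IIR pairs follow from a given set by reflexive-transitive closure. All names below are hypothetical illustrations, not the thesis's notation:

```python
# Hypothetical sketch: deriving new IIR pairs by reflexive-transitive closure.
# We assume only that IIR behaves as a partial order (reflexive, transitive),
# consistent with the summary's description; the thesis's rules may differ.

def derive_iir(events, given_iir):
    """Return the reflexive-transitive closure of a set of IIR pairs.

    events:    iterable of event identifiers (e.g. strings)
    given_iir: set of (a, b) pairs meaning "event a's information content
               includes that of event b"
    """
    closure = set(given_iir)
    closure |= {(e, e) for e in events}   # reflexivity
    changed = True
    while changed:                        # transitivity, iterated to fixpoint
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# If "order placed" includes "customer exists", and "customer exists" includes
# "customer id is valid", the derived (otherwise hidden) IIR tells us that
# "order placed" also includes "customer id is valid".
events = {"order_placed", "customer_exists", "customer_id_valid"}
given = {("order_placed", "customer_exists"),
         ("customer_exists", "customer_id_valid")}
assert ("order_placed", "customer_id_valid") in derive_iir(events, given)
```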
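The instance-recoverability question the thesis raises has a familiar special case in relational normalization: a lossless-join decomposition, where the instances of the decomposed schema determine the instance of the original exactly. The sketch below illustrates that standard property only, not the thesis's information-carrying formalism, and all relation and attribute names are invented for the example:

```python
# Illustration of instance recoverability via the standard lossless-join
# property: given the functional dependency dept -> manager, the relation
# Emp(emp, dept, manager) decomposes into Emp1(emp, dept) and
# Emp2(dept, manager), and a natural join on dept recovers the original.

emp = {("alice", "sales", "bob"),
       ("carol", "sales", "bob"),
       ("dave",  "hr",    "erin")}

emp1 = {(e, d) for (e, d, m) in emp}   # project onto (emp, dept)
emp2 = {(d, m) for (e, d, m) in emp}   # project onto (dept, manager)

# Natural join on dept: the decomposed instances carry enough information
# to tell us truly what the original instance was.
rejoined = {(e, d, m) for (e, d1) in emp1 for (d, m) in emp2 if d1 == d}
assert rejoined == emp
```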