Summary: | Big Data has received much attention in the multi-domain industry. In the digital and computing world, information is generated and collected at a rate that quickly exceeds the boundaries. The traditional data integration system interconnects the limited number of resources and is built with relatively stable and generally complex and time-consuming design activities. However, the rapid growth of these large data sets creates difficulties in learning heterogeneous data structures for integration and indexing. It also creates difficulty in information retrieval for the various data analysis requirements. In this paper, a probabilistic feature Patterns (PFP) approach using feature transformation and selection method is proposed for efficient data integration and utilizing the features latent semantic analysis (F-LSA) method for indexing the unsupervised multiple heterogeneous integrated cluster data sources. The PFP approach takes the advantage of the features transformation and selection mechanism to map and cluster the data for the integration, and an analysis of the data features context relation using LSA to provide the appropriate index for fast and accurate data extraction. A huge volume of BibText dataset from different publication sources are processed to evaluated to understand the effectiveness of the proposal. The analytical study and the outcome results show the improvisation in integration and indexing of the work.
|