Summary: | Real-time classification of data streams remains one of the most challenging aspects of Big Data. Because a data stream is an unending source of information, classification models and metrics must be created and adapted in real time as the data becomes available. This time-constrained learning is problematic: conventional data models require a training period to examine the data and produce models for evaluation. In data stream mining this training period does not exist; instead, the models are continuously updated in real time. As data streams become faster and larger, the quantity of data to be processed can overwhelm a single machine's learning capabilities. One method of reducing the workload on a data mining algorithm is to implement parallel solutions, which distribute the classification over one or more machines. Unfortunately, most parallel implementations of classification algorithms are not suitable for real-time processing, and most data stream mining algorithms are not suitable for parallelisation. This research develops real-time parallel classification of data instances for very large volumes of data. The proposed solution is highly scalable, as it incurs no additional communication costs when training. Moreover, it can accept data streams that contain multiple sources. The newly created algorithm, Parallel MC-NN, has been implemented and evaluated on open-source parallel technologies. The experimental results show a scalable solution, and the work has been peer reviewed through multiple publications.
|