Random forest och glesa datarespresentationer

In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rate this process has become an important part of modern drug development. There are various ways of representin...

Full description

Bibliographic Details
Main Authors:	Linusson, Henrik, Rudenwall, Robin, Olausson, Andreas
Format:	Others
Language:	English
Published:	Högskolan i Borås, Institutionen Handels- och IT-högskolan 2012
Subjects:	data mining machine learning regression classification in silico modeling random forest sparse data feature selection Engineering and Technology Teknik och teknologier
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-16672

Description
Summary:	In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rate this process has become an important part of modern drug development. There are various ways of representing molecules - the problem that motivated this paper derives from collecting substructures of the chemical into what is known as fractional representations. Assembling large sets of molecules represented in this way will result in sparse data, where a large portion of the set is null values. This consumes an excessive amount of computer memory which inhibits the size of data sets that can be used when constructing predictive models.In this study, we suggest a set of criteria for evaluation of random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy.A novel random forest system was implemented to meet the suggested criteria, and experiments were made to compare our implementation to existing machine learning algorithms to establish our implementation‟s correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse datasets, with lower memory usage overhead than implementations using a common matrix representation, and in less time than existing random forest implementations evaluated against. We highlight design choices made to accommodate for sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets. === Program: Systemarkitekturutbildningen

Random forest och glesa datarespresentationer

Similar Items