Summary: | This thesis concerns feature selection, with a particular emphasis on the computational biology domain and the possibility of non-linear interactions between features. To this end, it establishes a two-step approach in which feature selection is followed by the learning of a kernel machine in the reduced representation. Optimization of kernel target alignment is proposed as a model selection criterion, and its properties are established for a number of feature selection algorithms, including some novel variants of stability selection. The thesis further studies greedy and stochastic approaches for optimizing alignment, proposing a fast stochastic method with substantial probabilistic guarantees. The proposed stochastic method compares favorably to its deterministic counterparts in terms of computational complexity and resulting accuracy. Its computational complexity and applicability to multi-class problems make it invaluable to a deep learning architecture that we propose. Very encouraging results of this architecture on a recent challenge dataset further justify this approach, as do good results on a signal peptide cleavage prediction task. These proposals are evaluated in terms of generalization accuracy, interpretability and numerical stability of the models, and speed on a number of real datasets arising from infectious disease bioinformatics, with encouraging results.
|