Discriminative Pattern Mining in Microbiomic Data


Bibliographic Details
Main Authors: Nancy Huang, 黃安婷
Other Authors: 歐陽彥正
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/29121629577865058185
Description
Summary: Ph.D. === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 104 === Machine learning classifiers have long been used to solve biological problems by predicting the target class (e.g., disease state or bacterial taxonomy) of unseen samples. A valuable byproduct of certain classifiers is “interpretability” (also known as “comprehensibility”), which can be used to explain why and how a sample is assigned to its predicted class. Interpretable classifiers produce “discriminative patterns” that lead to different prediction results and provide insight into critical properties of the biological problem by capturing more of the underlying semantics than single features do. Discriminative patterns can be used directly by pattern-based classifiers to predict unseen samples through a majority voting or aggregation mechanism. In this setting, we are concerned not only with finding useful individual patterns but also with the effectiveness of the pattern set as a whole. It is therefore imperative to ensure the relevance and non-redundancy of the discriminative patterns. Few studies have evaluated pattern redundancy by examining the samples covered by the patterns, and those that do have focused mostly on the proportion of overlapping samples, which suggests that a great deal of information about non-overlapping samples has been overlooked. In addition, traditional pattern mining approaches often require generating a complete set of initial patterns and globally discretizing continuous attributes, both of which are impractical for high-dimensional, complex biological datasets. We address these issues by presenting a novel pattern selection algorithm that estimates pattern redundancy using not only the proportion of overlapping samples but also the resemblance of non-overlapping samples.
The proposed method was applied to three real microbiomic datasets, with the aim of providing new insights into the interactions between microbial factors and their effects on the host. Compared with other robust classifiers and feature selection heuristics, our pattern selection algorithm produced diverse and compact sets of final patterns with comparable or even superior predictive capability.
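
The summary does not give the formula behind the redundancy estimate, so the following is only an illustrative sketch of the general idea, not the dissertation's algorithm. It assumes that each pattern is represented by the set of sample IDs it covers, that "proportion of overlapping samples" can be approximated by the Jaccard index, and that "resemblance of non-overlapping samples" can be approximated by the cosine similarity between the mean feature vectors of the samples each pattern covers exclusively. The function names, the `weight` parameter, and the toy abundance values are all hypothetical.

```python
from math import sqrt


def jaccard_overlap(cover_a, cover_b):
    """Proportion of samples covered by both patterns (Jaccard index)."""
    union = cover_a | cover_b
    if not union:
        return 0.0
    return len(cover_a & cover_b) / len(union)


def mean_vector(sample_ids, features):
    """Component-wise mean of the feature vectors of the given samples."""
    rows = [features[i] for i in sample_ids]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def redundancy(cover_a, cover_b, features, weight=0.5):
    """Hypothetical redundancy score between two patterns.

    Mixes the overlap proportion with the resemblance of the samples that
    only one of the two patterns covers. When one pattern has no exclusive
    samples there is nothing to compare, so the overlap alone is returned.
    """
    overlap = jaccard_overlap(cover_a, cover_b)
    only_a = cover_a - cover_b
    only_b = cover_b - cover_a
    if not (only_a and only_b):
        return overlap
    resemblance = cosine(mean_vector(only_a, features),
                         mean_vector(only_b, features))
    return weight * overlap + (1 - weight) * resemblance


# Toy data: sample ID -> relative abundance of two made-up taxa.
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
pattern_a = {0, 1}
pattern_b = {0, 2}  # exclusive sample 2 looks like pattern_a's sample 1
pattern_c = {0, 3}  # exclusive sample 3 looks nothing like sample 1

print(redundancy(pattern_a, pattern_b, features))
print(redundancy(pattern_a, pattern_c, features))
```

In this toy example, patterns a and b and patterns a and c share the same proportion of samples, so an overlap-only measure could not tell the two pairs apart. Under the assumed score, the pair whose exclusive samples resemble each other is rated as more redundant.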