Summary: Object classification by learning from data is a vast area of statistics and machine learning. Within classification, unlabelled data may be plentiful, so that only a few objects can be chosen for labelling by an expert. Choosing these few objects systematically can maximise the improvement of the classifier: this is the problem of active learning (AL). Many heuristic methods coexist with theoretical approaches that make substantial assumptions, leaving a gulf between theory and practice, while a wide range of applications demands better algorithms and better understanding. Experimental studies give a very mixed picture of results, making AL performance rather mysterious; to explore this, a large-scale empirical study examines AL performance in detail.

One approach to active learning is to characterise the optimal selection behaviour. Defining optimality by classifier improvement produces a new characterisation of optimal AL behaviour, which yields both theoretical insights and practical algorithms for applications, unifying theory and practice. This approach is model retraining improvement (MRI), a novel statistical estimation framework for AL. MRI provides a new guarantee for AL: an unbiased MRI estimator should, on average, outperform random selection. New statistical AL algorithms are constructed to estimate the MRI optimum, revealing intricate estimation issues; one new algorithm in particular performs strongly against standard AL methods in a large-scale experimental study. MRI is entirely general in terms of problems, classifiers and loss functions.

AL shows that classification examples are not created equal; this diversity of example quality implies that both systematic selection and systematic modification can improve classifier performance. This idea is extended to the training labels, where the improbability of a label given its covariates provides a new definition of example quality. Handling improbable labels (HIL) defines actions that modify the training data, by pruning, relabelling and weighting, to reduce the impact of improbable labels. Two large experimental studies establish the effectiveness of HIL algorithms.
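To make the MRI idea concrete, the following is a minimal sketch, not code from the thesis: it assumes logistic regression as the classifier and validation log-loss as the loss function, and all helper names (`estimated_mri`, `mri_select`) are hypothetical. Each candidate's score is the predictive-probability-weighted average, over its possible labels, of the loss reduction that retraining with that labelled point would produce; the candidate with the largest estimated improvement is queried.

```python
# Minimal sketch of MRI-style selection (hypothetical helper names;
# not code from the thesis). Assumes logistic regression as the
# classifier and validation log-loss as the loss function.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def estimated_mri(model, X_train, y_train, X_val, y_val, x_candidate):
    """Estimate the expected loss reduction from labelling one candidate.

    The expectation over the unknown label uses the current model's
    predictive distribution, so the score averages, over the possible
    labels, the improvement each hypothetical label would produce.
    """
    base_loss = log_loss(y_val, model.predict_proba(X_val),
                         labels=model.classes_)
    label_probs = model.predict_proba(x_candidate.reshape(1, -1))[0]
    improvement = 0.0
    for label, prob in zip(model.classes_, label_probs):
        # Retrain with the candidate added under this hypothetical label
        # (assumes y_train already contains every class).
        X_plus = np.vstack([X_train, x_candidate])
        y_plus = np.append(y_train, label)
        retrained = LogisticRegression(max_iter=1000).fit(X_plus, y_plus)
        new_loss = log_loss(y_val, retrained.predict_proba(X_val),
                            labels=model.classes_)
        improvement += prob * (base_loss - new_loss)
    return improvement


def mri_select(model, X_train, y_train, X_val, y_val, X_pool):
    """Query the pool point with the largest estimated improvement."""
    scores = [estimated_mri(model, X_train, y_train, X_val, y_val, x)
              for x in X_pool]
    return int(np.argmax(scores))
```

Random selection corresponds to ignoring these scores and drawing a pool index uniformly, which is the baseline the MRI guarantee compares against.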
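The HIL actions can be sketched in the same spirit. The following is an illustrative sketch, not code from the thesis: the helper names and the threshold value are hypothetical, improbability of an observed label is taken as 1 - p(y | x) under a model fitted to the training data, and integer class labels are assumed. Pruning drops highly improbable examples, relabelling replaces their labels with the model's modal prediction, and weighting downweights them in proportion to their improbability.

```python
# Minimal sketch of HIL-style data modification (hypothetical names and
# threshold; not code from the thesis). Label improbability is taken as
# 1 - p(y | x) under a model fitted to the training data; integer class
# labels are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression


def label_improbability(X, y):
    """Improbability of each observed label given its covariates."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    proba = model.predict_proba(X)
    cols = np.searchsorted(model.classes_, y)  # column of each observed label
    return 1.0 - proba[np.arange(len(y)), cols], model


def hil_prune(X, y, threshold=0.9):
    """Drop examples whose labels are highly improbable."""
    improb, _ = label_improbability(X, y)
    keep = improb < threshold
    return X[keep], y[keep]


def hil_relabel(X, y, threshold=0.9):
    """Replace highly improbable labels with the model's modal prediction."""
    improb, model = label_improbability(X, y)
    y_new = y.copy()
    flip = improb >= threshold
    y_new[flip] = model.predict(X[flip])
    return X, y_new


def hil_weight(X, y):
    """Weight each example by its label probability, for refitting
    with these values passed as sample weights."""
    improb, _ = label_improbability(X, y)
    return 1.0 - improb
```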