Summary: | In many big data mining scenarios with large numbers of samples, heavy computation cost hinders the application of machine learning: training must iteratively pass over the whole dataset without considering the roles that different samples play in the computation. However, we argue that most of the samples that dominate computation resources contribute little to the gradient-based model update, particularly when the model is close to convergence. We define this observation as the Sample Contribution Pattern (SCP) in machine learning. This paper proposes two approaches that exploit SCP by detecting gradient characteristics and triggering the reuse of outdated gradients. In particular, this paper reports research results on (1) the definition and description of SCP, which reveals an intrinsic gradient contribution pattern across samples; (2) a novel SCP-based optimization algorithm (SCPOA) that outperforms the alternative algorithms tested in terms of computation overhead; (3) a variant of SCPOA that incorporates discarding-recovering mechanisms to carefully trade off model accuracy against computation cost; (4) the implementation and evaluation of the two algorithms on popular distributed big data mining platforms with typical sample sets; (5) an intuitive convergence proof of both algorithms. Our experimental results show that the proposed approaches significantly reduce computation cost while achieving competitive accuracy.
|