Summary: | Modeling and analyzing the performance of distributed file systems (DFSs) benefits the reliability and quality of data processing in data-intensive applications. The Hadoop Distributed File System (HDFS) is a representative DFS. Its internal heterogeneity and complexity, together with external disturbances, give HDFS built-in nonlinearity and randomness at the system level, which makes these features challenging to model. In particular, the randomness introduces uncertainty into any HDFS performance model. Because analytical models have complex mathematical structures and parameters that are hard to estimate, building an explicit and precise analytical model of this randomness is computationally intractable. A measurement-based methodology is a promising way to model HDFS performance under randomness, since it requires no knowledge of the system's internal behavior. In this paper, estimating HDFS performance models that account for randomness is transformed into an optimization problem: finding the truly best design of the performance-model structure within a large design space. The core ideas of ordinal optimization (OO) are introduced to solve this problem under a limited computing budget, and a piecewise linear (PL) model is applied to approximate the nonlinear characteristics and randomness of HDFS performance. Experimental results show that the proposed method is effective and practical for estimating the optimal design of the PL-based performance-model structure for HDFS: it provides a globally consistent evaluation of the design space, guarantees a good-enough solution with high probability, and improves the accuracy of system-model-based HDFS performance models.
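
As a rough illustration of the OO-based selection described above, the Python sketch below samples candidate PL model designs (breakpoint placements) from a design space, ranks them with a cheap, noisy "crude model" evaluation, and keeps a small selected set; OO's premise is that rank order is far more robust to evaluation noise than the raw scores, and that softening the goal to "good enough with high probability" makes a limited computing budget sufficient. All data, function names, and parameters here (crude_fit_error, the sample counts, the selected-set size) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def crude_fit_error(breakpoints, x, y, n_subsample=50):
    """Cheap, noisy evaluation: fit a PL model on a small random
    subsample and return its mean squared error (the 'crude model'
    in ordinal-optimization terms)."""
    idx = rng.choice(len(x), size=min(n_subsample, len(x)), replace=False)
    xs, ys = x[idx], y[idx]
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    err = 0.0
    # Segment boundaries: data range split at the candidate breakpoints.
    edges = np.concatenate(([xs[0]], breakpoints, [xs[-1]]))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (xs >= lo) & (xs <= hi)
        if mask.sum() >= 2:  # fit one linear segment per piece
            coeff = np.polyfit(xs[mask], ys[mask], 1)
            err += np.sum((np.polyval(coeff, xs[mask]) - ys[mask]) ** 2)
    return err / len(xs)

# Hypothetical noisy workload/latency measurements standing in for
# HDFS traces (true behavior: two linear regimes with a knee at 60).
x = rng.uniform(0, 100, 2000)
y = np.where(x < 60, 0.5 * x, 3.0 * x - 150) + rng.normal(0, 5, x.shape)

# Step 1: uniformly sample N candidate designs from the design space
# (here, placements of two breakpoints).
N = 1000
candidates = [np.sort(rng.uniform(5, 95, size=2)) for _ in range(N)]

# Step 2: order candidates by the crude evaluation.
scores = [crude_fit_error(bp, x, y) for bp in candidates]
ranked = np.argsort(scores)

# Step 3: goal softening -- keep a small selected set S; OO theory
# bounds the probability that S contains a good-enough design.
S = [candidates[i] for i in ranked[:20]]
print("top candidate breakpoints:", S[0])
```

In a full OO procedure, the designs retained in the selected set would then be re-evaluated with a more expensive, accurate model, which is how the approach trades exhaustive search for a probabilistic guarantee under a limited computing budget.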