Summary: | Structured overlay technology offers advantages in fault tolerance and in the discovery of computation resources (i.e., nodes) for distributed data storage and computation platforms; however, these strengths are guaranteed only in stable environments where node failures do not occur frequently. To deal with environments where failures are frequent, many advanced schemes based on the well-known node-failure information propagation scheme have been proposed; they stabilize the platform by handling node failures quickly. In the existing scheme, a computation node propagates node-failure information when it detects a node failure. However, the existing scheme requires stateful maintenance of the propagation targets; in other words, each node must maintain network connections to both the propagation target nodes and the nodes held in the general overlay. The nodes then exhaust their machine resources (e.g., CPU, memory, network bandwidth) on connection management and cannot concentrate on their own tasks, such as data analysis or storage applications. To resolve this problem, I propose a stateless node-failure information propagation scheme, which propagates a node failure at the speed of the existing scheme but without requiring maintenance of the propagation target connections. In the proposed scheme, each computation node can effectively use its machine resources for its own task. Instead of retaining the propagation targets, my scheme estimates them after detecting a node failure. I analyzed the estimation accuracy of a simple propagation model, which guarantees effective propagation. The accuracy was found to depend on the overlay distance between the failed node and the propagator node. Based on this observation, my scheme adjusts the keep-alive interval to bias detection toward failures of closer nodes.
In a simulation evaluation, the proposed stateless propagation scheme achieved a detection delay similar to that of the stateful propagation scheme while delivering a lower maintenance cost and better scalability.
|