Summary: | <p> Checkpointing has been widely adopted to support fault tolerance and job migration, both essential for large-scale networked multicore systems and cloud computing. This dissertation pursues an effective checkpointing mechanism that handles failures and unavailability events in such systems and thereby reduces the expected job turnaround time, the aggregated file size, and the monetary cost involved. To withstand unavailability or failures of local nodes in networked systems, multi-level checkpointing is indispensable, with checkpoint files kept not only locally but also at remote storage. As the number of nodes in such a system grows, I/O bandwidth to remote storage quickly becomes the bottleneck for multi-level checkpointing. </p><p> The first part of this work deals with an effective mechanism, dubbed adaptive incremental checkpointing (AIC), which considerably reduces the checkpoint file size to lower the associated overhead and thus shorten the expected job turnaround time. Given that production multicore systems are often observed to have unused cores available, we design AIC to use separate, otherwise idle cores to carry out delta compression adaptively at desirable points in time. AIC supports multi-level checkpointing effectively, with checkpoint files of execution nodes written to their partner nodes and to remote storage concurrently during job execution. On our implemented testbed, AIC is observed to substantially lower the normalized expected turnaround time (by up to 41%) and the aggregated file size (by up to 1,000×) when compared to its static counterpart and to a recent multi-level checkpointing scheme with fixed checkpoint intervals. </p><p> The second part presents the design and implementation of our enhanced adaptive incremental checkpointing (EAIC) for multithreaded applications on RaaS clouds under spot instance pricing.
The EAIC model takes into account spot instance revocation events, in addition to hardware failures, to predict quickly and accurately the desirable points in time for taking checkpoints, so as to markedly reduce the expected job turnaround time and the monetary cost. Experimental results from our testbed, driven by real spot instance price traces from Amazon EC2, show that EAIC markedly lowers both the application turnaround time and the monetary cost (by up to 58% and 59%, respectively) in comparison to its recent checkpointing counterpart.</p>
|