Summary: | Traditionally, distributed systems requiring high dependability were designed using custom hardware with massive amounts of redundancy. In most of these systems, both the nodes and the network were replicated. Recently, the need for cost reduction and access to the latest commercial technologies has prompted the use of commercial off-the-shelf (COTS) hardware and software products in the design of such systems. Reliance on COTS technology, however, brings new challenges in system reliability. This dissertation attempts to address these challenges by developing fault tolerance techniques for systems based on modern high-speed networks. Driven by the demand for greater network performance, emerging network technologies have complex network interfaces with a Network Interface Card (NIC) processor and large local memory. This increased complexity results in a larger set of failure points and a potential increase in the network failure rate, in addition to the system failures that can be caused by faults that strike the host system. In this dissertation, we propose to achieve higher dependability of distributed systems through host and NIC processor collaboration. The host processor detects a failed network interface and recovers it, and the symbiotic relationship in turn allows the NIC processor to aid in the recovery of a failed host system or application. More specifically, we present an effective, low-overhead, adaptive, and concurrent self-testing technique to protect programmable high-speed network interfaces, and low-overhead message logging protocols to achieve fast recovery from host application crashes.
|