Summary: | Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. === This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. === Cataloged from student-submitted PDF version of thesis. === Includes bibliographical references (pages 113-117). === In the recent years, the explosive growth of the data storage demand has made the storage cost a critically important factor in the design of distributed storage systems (DSS). At the same time, optimizing the storage cost is constrained by the reliability requirements. The goal of the thesis is to further study the fundamental limits of maintaining data fault tolerance in a DSS spread across a communication network. Particularly, we focus our attention on performing efficient storage node repair in a redundant erasure-coded storage with a low storage overhead. We consider two operating scenarios of the DSS. First, we consider a clustered scenario, where individual nodes are grouped into clusters representing data centers, storage clouds of different service providers, racks, etc. The network bandwidth within a cluster is assumed to be cheap with respect to the bandwidth between nodes in different clusters. We extend the regenerating codes framework by Dimakis et al. [1] to the clustered topologies, and introduce generalized regenerating codes (GRC), which perform node repair using the helper data both from the local cluster and from other clusters. We show the optimal trade-off between the storage overhead and the inter-cluster repair bandwidth, along with optimal code constructions. In addition, we find the minimal amount of the intra-cluster repair bandwidth required for achieving a given point on the trade-off. Second, we consider a scenario, where the underlying network features a highly varying topology. Such behavior is characteristic for peer-to-peer, content delivery, or ad-hoc mobile networks. Because of the limited and time-varying connectivity, the sources for node repair are scarce. We consider a stochastic model of failures in the storage, which also describes the random and opportunistic nature of selecting the sources for node repair. We show that, even though the repair opportunities are scarce, with a practically high probability, the data can be maintained for a large number of failures and repairs and for the time periods far exceeding a typical lifespan of the data. The thesis also analyzes a random linear network coded (RLNC) approach to operating in such variable networks and demonstrates its high achievable rates, outperforming that of regenerating codes, and robustness in a wide range of model and implementation assumptions and parameters such as code rate, field size, repair bandwidth, node distributions, etc. === by Vitaly Abdrashitov. === Ph. D.
|