Summary: | Large-scale distributed deep learning is important in many applications. In data-parallel distributed training systems, limited hardware resources (e.g., GPU memory and interconnect bandwidth) often become a performance bottleneck, so multiple resources must be fully utilized simultaneously, especially for extreme-scale deep neural networks. Although two types of strategies, based on memory management and on sparse communication, have been proposed to reduce resource usage, a naïve combination of the two is impractical because they cannot successfully coexist. We therefore pursue collaborative optimization of both system memory and bandwidth resources, and propose a layer-centric, memory-efficient distributed sparse communication mechanism called LaySA. First, to tackle the memory ballooning caused by sparse communication, the existing memory reuse strategy is refined, and the data objects subject to memory optimization are augmented and redefined. Second, a mirror weight update mechanism is proposed to resolve the conflict between memory management and sparse communication optimization for weight gradients. By deeply integrating and collaboratively executing these two types of strategies, our scheme fills the gap in multi-resource optimization for distributed GPU-based training systems. Experimental results show that the proposed collaborative optimization significantly alleviates memory pressure on the computing nodes and improves both the resource utilization and the throughput of distributed training systems. Compared with baseline systems that use only a single strategy, LaySA reduces system memory usage by up to 80.5% and reduces the overall training time of neural network models on a single GPU by about 12.25%.
Furthermore, LaySA allows the batch size to be scaled up by an extremely large factor during distributed training and increases the overall throughput by more than 150%, outperforming current systems that use memory or communication optimization mechanisms alone.
|