[1] Dean J. The deep learning revolution and its implications for computer architecture and chip design [J]. arXiv:1911.05289, 2019.
[2] Amdahl G M. Validity of the single processor approach to achieving large-scale computing capabilities [C]//AFIPS Spring Joint Computer Conference, 1967.
[3] Gustafson J L. Reevaluating Amdahl's law [J]. Communications of the ACM, 1988, 31(5): 532-533.
[4] Ristov S, Prodan R, Gusev M, et al. Superlinear speedup in HPC systems: why and when? [C]//2016 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2016.
[5] Shi Y. Reevaluating Amdahl's law and Gustafson's law [R]. Philadelphia: Computer Sciences Department, Temple University, 1996.
[6] Gusev M, Ristov S. A superlinear speedup region for matrix multiplication [J]. Concurrency and Computation: Practice and Experience, 2014, 26(11).
[7] Djinevski L, Ristov S, Gusev M. Superlinear speedup for matrix multiplication in GPU devices [J]. Advances in Intelligent Systems and Computing, 2013, 207: 285-294.
[8] Gusev M, Ristov S. Superlinear speedup in Windows Azure cloud [C]//1st IEEE International Conference on Cloud Networking (CLOUDNET), 2012.
[9] Augonnet C, Thibault S, Namyst R, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures [J]. Concurrency and Computation: Practice and Experience, 2011, 23(2): 187-198.
[10] Anchev N, Gusev M, Ristov S. Intel vs AMD: matrix multiplication performance [C]//36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 2013: 182-187.
[11] Liu T Y, Chen W, Wang T F, et al. Distributed Machine Learning: Algorithms, Theory, and Practice [M]. Beijing: China Machine Press, 2018.
[12] You Y, Zhang Z, Hsieh C J, et al. Fast deep neural network training on distributed systems and cloud TPUs [J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(11): 2449-2462.
[13] Goyal P, Dollár P, Girshick R, et al. Accurate, large minibatch SGD: training ImageNet in 1 hour [J]. arXiv:1706.02677, 2017.
[14] Akiba T, Suzuki S, Fukuda K. Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes [J]. arXiv:1711.04325, 2017.
[15] Jia X, Song S, He W, et al. Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes [J]. arXiv:1807.11205, 2018: 1-9.
[16] Mikami H, Suganuma H, et al. ImageNet/ResNet-50 training in 224 seconds [J]. arXiv:1811.05233v1, 2018.
[17] Yamazaki M, Kasagi A, Tabuchi A, et al. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds [J]. arXiv:1903.12650v1, 2019.
[18] Dekel O, Gilad-Bachrach R, Shamir O, et al. Optimal distributed online prediction using mini-batches [J]. Journal of Machine Learning Research, 2012, 13: 165-202.
[19] Yu H, Yang S, Zhu S. Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning [C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33: 5693-5700.
[20] Stich S U. Local SGD converges fast and communicates little [C]//International Conference on Learning Representations, 2019.
[21] Haddadpour F, Kamani M M, Mahdavi M, et al. Local SGD with periodic averaging: tighter analysis and adaptive synchronization [J]. arXiv:1910.13598, 2019: 1-24.
[22] Shen S, Xu L, Liu J, et al. Faster distributed deep net training: computation and communication decoupled stochastic gradient descent [C]//Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), 2019.
[23] Liu C, Wei J, Wang Y, et al. Optimizing deep learning frameworks incrementally to get linear speedup: a comparison between IPoIB and RDMA verbs [C]//2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2018.
[24] Ouyang S, Dong D, Xu Y, et al. Communication optimization strategies for distributed deep learning: a survey [J]. Journal of Parallel and Distributed Computing, 2021, 149: 52-65.
[25] Wang Y T. "New infrastructure" drives a comprehensive upgrade of artificial intelligence infrastructure [J]. Communications World (通信世界), 2020(7): 20-21.