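Academic Talk by Prof. Kai Chen, Hong Kong University of Science and Technology
Title: Domain-Specific Network Optimization for Distributed Deep Learning
Speaker: Kai Chen (Hong Kong University of Science and Technology)
Time: November 29, 2021, 10:30-11:30 a.m.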
Venue: Room 219, Building 2, College of Computer Science and Big Data
Abstract: Communication overhead poses a significant challenge to distributed DNN training. In this talk, I will give an overview of existing efforts to address this challenge, discuss their advantages and shortcomings, and then present a novel solution that exploits the domain-specific characteristics of deep learning to optimize the communication overhead of DNN training in a fine-grained manner. Our solution includes several key innovations beyond prior work, among them bounded-loss tolerant transmission, gradient-aware flow scheduling, and order-free per-packet load balancing, and delivers up to 74.5% training acceleration over the best existing solutions. Our proposal is by no means an ultimate answer to this question; instead, we hope it can inspire more critical thinking on networking for AI.
Bio: Kai Chen is an Associate Professor at HKUST and Director of the intelligent Networking Systems Lab (iSING Lab) and the HKUST-WeChat Joint Lab on Artificial Intelligence Technology (WHAT Lab). He received his BS and MS degrees from the University of Science and Technology of China in 2004 and 2007, respectively, and his PhD from Northwestern University in 2012. His research interests include data center networking, machine learning systems, and privacy-preserving computing. His work has been published in top venues such as SIGCOMM, NSDI, and TON, and includes a SIGCOMM best paper candidate. He is the Steering Committee Co-Chair of APNet, serves on the Program Committees of SIGCOMM, NSDI, and INFOCOM, among others, and sits on the Editorial Boards of IEEE/ACM Transactions on Networking, Big Data, and Cloud Computing.
Faculty and graduate students working in related areas are welcome to attend!