The surge of artificial intelligence, particularly large language models, has amplified the demand to train and serve these models efficiently. The sheer size of models and datasets necessitates distributed execution over hundreds to thousands of customized GPU/TPU-based platforms connected via high-speed network fabrics. Examples of such platforms today include Google’s Cloud TPU, NVIDIA’s HGX, Intel’s Habana, Cerebras’ Andromeda, Tesla Dojo, and many more. This, in turn, brings the communication overheads of exchanging gradients and activations into the critical path, making the design and optimization of the network fabric a crucial component of overall performance and efficiency.
Designing an optimized network fabric for AI platforms is an open and active challenge today – with co-design opportunities spanning technology (e.g., wafer-scale integration, photonics), hardware architecture (i.e., network topologies), and software scheduling (e.g., optimal collective algorithms). This talk will introduce our work on (i) modeling diverse distributed AI platforms to identify communication bottlenecks, (ii) designing scalable fabric topologies that leverage diverse technologies, and (iii) collective scheduling optimizations that improve network bandwidth efficiency.
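As background for the bandwidth-efficiency questions the talk touches on, the cost of a collective such as all-reduce is often reasoned about with the standard alpha-beta model. The sketch below is purely illustrative and not material from the talk; the function name and all parameter values are assumptions chosen for the example.

# Illustrative sketch (not from the talk): alpha-beta cost model for a ring
# all-reduce, a common way to estimate when gradient exchange becomes the
# critical path. All names and numbers here are assumed for illustration.

def ring_allreduce_time(num_gpus: int, msg_bytes: float,
                        alpha_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate ring all-reduce time: a reduce-scatter followed by an
    all-gather, each taking p-1 steps that move msg_bytes/p per step."""
    p = num_gpus
    beta = 1.0 / bandwidth_bytes_per_s          # seconds per byte
    steps = 2 * (p - 1)                          # reduce-scatter + all-gather
    return steps * alpha_s + steps * (msg_bytes / p) * beta

if __name__ == "__main__":
    # Example: 1 GB of gradients across 64 GPUs on a 100 GB/s fabric,
    # with 5 microseconds of latency per step.
    t = ring_allreduce_time(64, 1e9, 5e-6, 100e9)
    print(f"estimated all-reduce time: {t * 1e3:.2f} ms")

With these assumed numbers the bandwidth term dominates, which is one way to see why topology and collective-algorithm choices matter at scale.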
Tushar Krishna is an Associate Professor in the School of Electrical and Computer Engineering (ECE) at Georgia Institute of Technology, with a courtesy appointment in Computer Science. He has also been a visiting professor at MIT EECS + CSAIL and Harvard University CS, and a researcher in Intel’s VSSAD group. He holds a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech. in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007).
Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC), and AI/ML accelerator systems – with a focus on optimizing data movement in modern computing platforms. His papers have been cited over 19,000 times. He is part of the Halls of Fame for both HPCA and ISCA.
Contact: Prof. Jihong Kim (김지홍)