Young Jin Kim
Title: Principal Researcher
Microsoft Machine Translation Group
Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models whose compute cost grows sublinearly with the number of parameters. In contrast to dense models, the sparse architecture of MoE offers opportunities to drastically grow model size with significant accuracy gains while consuming a much lower compute budget. However, supporting large scale MoE training brings its own set of system and modeling challenges. This talk introduces training algorithms that stabilize MoE training in practice, along with highly efficient MoE implementations. In particular, MoE models often suffer from overfitting due to their large memory capacity by design. Several novel regularization techniques, including Stochastic Experts, Gating Dropout, and Random Token Selection, are introduced together with a multi-task training paradigm. Finally, the talk presents how these algorithms improve real-world large scale models.
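For readers unfamiliar with the sparse architecture the abstract describes, the sketch below shows a minimal top-1 gated MoE layer in PyTorch: each token is routed to a single expert, so only a fraction of the parameters are active per token, which is what makes compute sublinear in model size. This is an illustrative sketch, not the speaker's implementation; the class name SimpleMoELayer and the gate_dropout_p argument are hypothetical, and the random re-routing is only loosely inspired by the routing-regularization ideas (e.g., Gating Dropout) mentioned in the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int,
                 gate_dropout_p: float = 0.0):
        super().__init__()
        self.num_experts = num_experts
        self.gate_dropout_p = gate_dropout_p
        # Router that scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)
        if self.training and self.gate_dropout_p > 0.0:
            # Randomly perturb routing for a fraction of tokens during
            # training -- a crude stand-in for routing regularization.
            mask = torch.rand(x.size(0), device=x.device) < self.gate_dropout_p
            logits = torch.where(mask.unsqueeze(-1),
                                 torch.rand_like(logits), logits)
        probs = F.softmax(logits, dim=-1)
        top_prob, top_idx = probs.max(dim=-1)  # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e  # tokens routed to expert e
            if sel.any():
                # Only selected tokens run through this expert.
                out[sel] = top_prob[sel].unsqueeze(-1) * expert(x[sel])
        return out

if __name__ == "__main__":
    layer = SimpleMoELayer(d_model=16, d_hidden=32, num_experts=4,
                           gate_dropout_p=0.1)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])

In a real large-scale setting the experts are sharded across devices and routing involves all-to-all communication, which is where the system challenges and regularization techniques discussed in the talk come into play.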
Young Jin Kim, Ph.D., is a Principal Researcher at the Microsoft Machine Translation group, where he develops machine learning models with state-of-the-art techniques. His recent research focuses on designing efficient and effective algorithms and model architectures for large scale language models. Young received his Ph.D. from the Georgia Institute of Technology for his research in deep learning and high-performance computing.