The limited memory capacity of a single GPU constrains large language model (LLM) inference, necessitating either cost-prohibitive multi-GPU deployments or frequent, performance-limiting CPU-GPU transfers over slow PCIe. In this work, we first benchmark recent Intel CPUs with Advanced Matrix Extensions (AMX), including 4th generation (Sapphire Rapids) and 6th generation (Granite Rapids) Xeon Scalable Processors, demonstrating matrix multiplication throughput of 20 TFLOPS and 40 TFLOPS, respectively, which is comparable to that of some recent GPUs. These findings unlock more extensive computation offloading to CPUs, reducing CPU-GPU transfers and alleviating the throughput bottlenecks seen with prior-generation CPUs. Building on these insights, we design LIA, a single-GPU LLM inference acceleration framework that leverages cooperative AMX-enabled CPU-GPU computation and CXL offloading. LIA systematically offloads computation to CPUs, optimizing both latency and throughput. The framework also introduces a memory-offloading policy that integrates affordable CXL memory with DDR memory to enhance performance in throughput-driven tasks. On Sapphire Rapids (Granite Rapids) systems with a single H100 GPU, LIA achieves up to 5.1× (19×) lower latency and 3.7× (5.1×) higher throughput compared to the latest single-GPU offloading framework. Furthermore, LIA with CXL offloading yields an additional 1.5× throughput improvement over LIA using only DDR memory, with a 1.8× increase in maximum batch size (900→1.6K).
* Zoom link: https://snu-ac-kr.zoom.us/j/88378640707
Nam Sung Kim is the W.J. ‘Jerry’ Sanders III - Advanced Micro Devices, Inc. Endowed Chair Professor at the University of Illinois, Urbana-Champaign and a fellow of ACM, IEEE, and NAI. His research interests span system software and computer architecture. He has published more than 260 refereed articles in highly selective conferences and journals in the fields of digital circuits, processor architecture, and computer-aided design. The top three most frequently cited papers have more than 6000 citations, and the total number of citations of all his papers approaches 17000. He is a recipient of many awards, including the ACM/IEEE Most Influential International Symposium on Computer Architecture (ISCA) Paper Award, the IEEE/ACM International Symposium on Microarchitecture (MICRO) Test of Time Award, and the Intel Outstanding Researcher Award. He earned a PhD degree in Computer Science and Engineering from the University of Michigan, Ann Arbor, and Master's and Bachelor's degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology.
Inquiries: Prof. Jihong Kim (kjihong@snu.ac.kr)