권용기
SK hynix Inc.
Large language models (LLMs) are becoming increasingly popular for a wide range of AI services, such as chatbots and virtual assistants. Serving them, however, is challenging because of high operating costs and long service latency. The main bottleneck is memory bandwidth: during autoregressive inference, every generated token requires reading essentially all model parameters from memory, so the rate at which weights can be streamed, rather than compute throughput, limits inference speed. As LLM models continue to grow in size, this problem only gets worse.
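As a rough back-of-the-envelope illustration of this bottleneck (the parameter count and bandwidth figure below are illustrative assumptions, not measurements from this work), the time needed just to stream the weights once from memory already sets a floor on per-token latency:

```python
# Illustrative estimate of the memory-bandwidth floor on per-token latency
# during autoregressive LLM decoding. The numbers are assumptions for this
# sketch, not measurements of AiM or of any specific GPU system.

params = 175e9          # GPT-3-scale parameter count
bytes_per_param = 2     # FP16 weights
hbm_bandwidth = 2e12    # ~2 TB/s of HBM bandwidth on a high-end accelerator (assumed)

weight_bytes = params * bytes_per_param        # ~350 GB of weights
min_latency_s = weight_bytes / hbm_bandwidth   # time just to read every weight once

print(f"Weight footprint: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound floor per token: {min_latency_s * 1e3:.0f} ms")
```

Even with ample compute, single-batch decoding cannot go faster than this weight-streaming floor, which is why raising the effective memory bandwidth is the lever that matters.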
We propose a new solution to the memory bandwidth bottleneck in LLM serving. Our solution, AiM (Accelerator-in-Memory), is SK hynix's processing-in-memory (PIM) device specialized for serving LLMs. AiM exploits the abundant bandwidth available inside the memory device to accelerate general matrix-vector multiplication (GEMV), the operation that dominates the cost of LLM inference. We evaluated AiM on a variety of LLM models and tasks, and the results show that it significantly improves the performance and energy efficiency of LLM inference. For example, on the GPT-3 model, AiM achieves up to a 10x speedup over state-of-the-art GPU systems at lower cost and energy consumption.
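To see why GEMV is the bandwidth-limited core of the workload, consider a minimal sketch (not AiM's actual dataflow; the matrix shape below is an arbitrary example): a matrix-vector product over an M x N FP16 weight matrix performs about 2*M*N FLOPs while reading about 2*M*N bytes of weights, roughly 1 FLOP per byte, far below what a GPU's compute-to-bandwidth ratio can keep busy.

```python
import numpy as np

# Minimal sketch: arithmetic intensity of a GEMV, the operation that dominates
# single-batch LLM decoding. The dimensions are illustrative, not an AiM spec.
M, N = 4096, 4096                        # e.g., one projection layer of a mid-sized transformer
W = np.random.rand(M, N).astype(np.float16)
x = np.random.rand(N).astype(np.float16)

y = W @ x                                # the GEMV itself: y[i] = sum_j W[i, j] * x[j]

flops = 2 * M * N                        # one multiply + one add per weight element
bytes_read = W.nbytes                    # each FP16 weight is read once and never reused
intensity = flops / bytes_read           # ~1 FLOP per byte -> memory-bandwidth bound

print(f"Arithmetic intensity: {intensity:.2f} FLOP/byte")
```

Because each weight is used only once per token, caching cannot help; the only way to speed up GEMV is to raise the bandwidth at which weights are read, which is exactly what computing next to the DRAM banks provides.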
We believe AiM is a promising answer to the memory bandwidth bottleneck in LLM serving. By substantially improving the performance and energy efficiency of LLM inference, it makes deploying LLMs in real-world applications practical.