김한준 박사
직함: CTO
We introduce a novel tensor contraction processor (TCP) architecture that offers a paradigm shift from traditional architectures that rely on fixed-size matrix multiplications. TCP aims at exploiting the rich parallelism and data locality inherent in tensor contractions, thereby enhancing both efficiency and performance of AI workloads. TCP is composed of coarse-grained processing elements (PEs) to simplify software development. In order to efficiently process operations with diverse tensor shapes, the PEs are designed to be flexible enough to be utilized as a large-scale single unit or a set of small independent compute units. We aim at maximizing data reuse on both levels of inter and intra compute units. To do that, we propose a circuit switch-based fetch network to flexibly connect compute units to enable inter-compute unit data reuse. We also exploit input broadcast to multiple contraction engines and input buffer based reuse to further exploit reuse behavior in tensor contraction. Our compiler explores the design space of tensor contractions considering tensor shapes and the order of their associated loop operations as well as the underlying accelerator architecture. A TCP chip was designed and fabricated in 5nm technology as the second-generation product of Furiosa AI, offering 256/512/1024 TOPS (BF16/FP8 or INT8/INT4) with 256 MB SRAM and 1.5 TB/s 48 GB HBM3 under 150 W TDP. Commercialization will start in August 2024. We performed an extensive case study of running the LLaMA-2 7B model and evaluated its performance and power efficiency on various configurations of sequence length and batch size. For this model, TCP is 2.7× and 4.1× better than H100 and L40s, respectively, in terms of performance per watt.
.