Integrated Circuits and Systems

Enhancing LLM Inference Performance on ARM CPUs Through Software and Hardware Co-Optimization Strategies

XINGYU ZHU (Graduate Student Member, IEEE); YANG ZHAO (Member, IEEE); WEI MAO (Senior Member, IEEE)
Received: 2024-12-31
Revised: 2025-03-24
Accepted: 2025-04-14
Published online: 2025-10-22
Supported by:
National Key Research and Development Program of China under Grant 2023YFB2806000
Postdoctoral Fellowship Program of CPSF under Grant GZC20241305
Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2024JC004
Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model's memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to process model data efficiently. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements significantly reduce model memory usage and improve throughput during both the prefill and decode stages. The efficacy of the proposed approach is demonstrated by optimizing the Qwen-1.8B model on Armv9: accuracy decreases by only 0.66%, memory usage falls to 58.8% of the baseline, and inference performance increases by 4.09× in the prefill stage and 15.23× in the decode stage over the baseline.
CHENG ZHANG, XINGYU ZHU, LONGHAO CHEN, TINGJIE YANG, EVENS PAN, GUOSHENG YU, YANG ZHAO, XIGUANG WU, BO LI, WEI MAO, GENQUAN HAN. Enhancing LLM Inference Performance on ARM CPUs Through Software and Hardware Co-Optimization Strategies[J]. Integrated Circuits and Systems, 2025, 2(2): 49-57. DOI: 10.23919/ICS.2025.3568404