Co-Optimization for Large Language Models: Advances in Algorithm and Hardware

Enhancing LLM Inference Performance on ARM CPUs Through Software and Hardware Co-Optimization Strategies

  • CHENG ZHANG 1, 2 ,
  • XINGYU ZHU 1, 2 ,
  • LONGHAO CHEN 3 ,
  • TINGJIE YANG 1, 2 ,
  • EVENS PAN 4 ,
  • GUOSHENG YU 5 ,
  • YANG ZHAO 6 ,
  • XIGUANG WU 1, 2 ,
  • BO LI 1, 2 ,
  • WEI MAO 1, 2 ,
  • GENQUAN HAN 1, 2
  • 1 Hangzhou Institute of Technology, Xidian University, Hangzhou 311200, China
  • 2 School of Microelectronics, Xidian University, Xi’an 710071, China
  • 3 Zhuoyue Honors College, Hangzhou Dianzi University, Hangzhou 310018, China
  • 4 Arm Technology Company, Ltd., Shanghai 200233, China
  • 5 T-HEAD Semiconductor Company, Ltd., Shanghai 200120, China
  • 6 Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai 200240, China
WEI MAO (e-mail: );
BO LI (e-mail: ).

XINGYU ZHU (Graduate Student Member, IEEE);

YANG ZHAO (Member, IEEE);

WEI MAO (Senior Member, IEEE)

Received date: 2024-12-31

Revised date: 2025-03-24

Accepted date: 2025-04-14

Online published: 2025-10-22

Supported by

National Key Research and Development Program of China under Grant 2023YFB2806000

Postdoctoral Fellowship Program of CPSF under Grant GZC20241305

Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2024JC004

Abstract

Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address these challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed that preserves the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model’s memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to process model data efficiently. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements yield a significant reduction in model memory usage and improved throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated by optimizing the Qwen-1.8B model on an Armv9 platform: accuracy decreases by only 0.66%, memory usage is reduced to 58.8% of the baseline, and inference performance improves by 4.09× and 15.23× over the baseline for the prefill and decode stages, respectively.
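To make the combination of INT8 quantization, SIMD execution, and fused dequantization scaling concrete, the sketch below shows a generic INT8 dot product on an ARM CPU using the NEON SDOT intrinsic, with the weight and activation scaling factors folded into a single floating-point multiply applied after integer accumulation. This is a minimal illustration rather than the paper’s actual kernel; the function name dot_q8, the per-row/per-tensor scale arguments, and the reliance on the Armv8.2 dot-product extension are assumptions made for the example.

```cpp
// Minimal sketch: INT8 dot product with a fused dequantization scale,
// using the Armv8.2-A SDOT intrinsic (compile with e.g. -march=armv8.2-a+dotprod).
// Names and the quantization layout here are illustrative, not from the paper.
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Computes (sum_i w[i] * x[i]) * scale_w * scale_x for one output element,
// where w and x are INT8-quantized and the two scales are applied once,
// after integer accumulation, instead of per element.
float dot_q8(const int8_t* w, const int8_t* x, size_t n,
             float scale_w, float scale_x) {
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t vw = vld1q_s8(w + i);   // load 16 INT8 weights
        int8x16_t vx = vld1q_s8(x + i);   // load 16 INT8 activations
        acc = vdotq_s32(acc, vw, vx);     // SDOT: 16 INT8 MACs into 4 INT32 lanes
    }
    int32_t sum = vaddvq_s32(acc);        // horizontal reduction of the 4 lanes
    for (; i < n; ++i)                    // scalar tail for n not divisible by 16
        sum += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
    return static_cast<float>(sum) * (scale_w * scale_x);  // single fused dequant
}
```

Keeping the accumulation entirely in INT32 and deferring the scale to one multiply per output is the generic pattern behind folding dequantization into the scaling factor; a production kernel would additionally block over rows and reuse loaded activations across outputs.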

Cite this article

CHENG ZHANG, XINGYU ZHU, LONGHAO CHEN, TINGJIE YANG, EVENS PAN, GUOSHENG YU, YANG ZHAO, XIGUANG WU, BO LI, WEI MAO, GENQUAN HAN. Enhancing LLM Inference Performance on ARM CPUs Through Software and Hardware Co-Optimization Strategies[J]. Integrated Circuits and Systems, 2025, 2(2): 49-57. DOI: 10.23919/ICS.2025.3568404

