Co-Optimization for Large Language Models: Advances in Algorithm and Hardware

An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding

  • ZHE WEN,
  • LIANG XU,
  • MEIQI WANG
  • School of Integrated Circuits, Sun Yat-sen University, Shenzhen 518107, China
MEIQI WANG (e-mail: ).

Zhe Wen and Liang Xu contributed equally to this work.

Received date: 2024-12-31

Revised date: 2025-03-31

Accepted date: 2025-05-23

Online published: 2025-10-22

Supported by

National Natural Science Foundation of China under Grant 62404256

Jiangsu Provincial Science and Technology Major Special Project under Grant BG2024032

Key Project of Shenzhen Basic Research Program under Grant JCYJ20241206180301003

High-performance Computing Public Platform (Shenzhen Campus) of Sun Yat-sen University

Abstract

In recent years, the exponential growth in Large Language Model (LLM) parameter counts has significantly increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction of autoregressive decoding, which introduces substantial delays. Enhancing decoding efficiency while maintaining accuracy has therefore become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method, which leverages speculative decoding by employing a Small-Scale Model (SSM) for preliminary inference and validating its predictions with the original LLM. This approach effectively balances the high precision of LLMs with the efficiency of SSMs. Notably, the SSM requires no additional training; it is instead derived by simplifying the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses redundant transformer layers, significantly improving GPU utilization and inference speed without compromising performance. Furthermore, to address challenges such as window-size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value cache deletion techniques to further optimize SSM performance. Experimental results demonstrate that the proposed method achieves a 2.51× improvement in autoregressive decoding efficiency. Because the approach eliminates the need to train an additional SSM, it offers a significant competitive advantage in settings where model compression is costly.
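To make the draft-and-verify loop described in the abstract concrete, the following is a minimal, illustrative Python sketch of speculative decoding in which the draft model is the original LLM with some transformer layers skipped and the full LLM verifies all drafted tokens in a single parallel forward pass. This is not the authors' implementation: the skip_layers keyword, HuggingFace-style .logits output, greedy decoding, a batch size of one, and the absence of KV caching are all simplifying assumptions made for illustration.

```python
# Illustrative sketch only (assumptions: a decoder-only model whose forward
# accepts a hypothetical skip_layers argument, HuggingFace-style .logits
# output, greedy decoding, batch size 1, no KV caching).
import torch

@torch.no_grad()
def layer_skip_speculative_decode(model, input_ids, skip_layers,
                                  draft_len=4, max_new_tokens=64):
    ids = input_ids  # shape [1, L]
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft: generate draft_len tokens with selected layers bypassed.
        draft = ids
        for _ in range(draft_len):
            logits = model(draft, skip_layers=skip_layers).logits  # hypothetical kwarg
            draft = torch.cat([draft, logits[:, -1:].argmax(dim=-1)], dim=-1)

        # 2) Verify: one parallel forward pass of the full (unskipped) model.
        full_logits = model(draft).logits
        prompt_len = ids.shape[1]
        # Full-model greedy prediction at every drafted position.
        verified = full_logits[:, prompt_len - 1:-1].argmax(dim=-1)
        drafted = draft[:, prompt_len:]

        # 3) Accept the longest agreeing prefix, then append one token from the
        #    full model so every iteration makes progress.
        n_accept = int((verified == drafted).int().cumprod(dim=-1).sum())
        correction = full_logits[:, prompt_len - 1 + n_accept].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, drafted[:, :n_accept], correction], dim=-1)
        produced += n_accept + 1
    return ids
```

Under greedy decoding, the verification step is what preserves the full model's output: a drafted token is kept only if the full model would have produced it anyway, so the speedup comes purely from checking multiple draft tokens in one parallel forward pass instead of generating them one by one.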

Cite this article

ZHE WEN, LIANG XU, MEIQI WANG. An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding[J]. Integrated Circuits and Systems, 2025, 2(2): 58-66. DOI: 10.23919/ICS.2025.3575371

