Integrated Circuits and Systems
An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding
Zhe Wen and Liang Xu contributed equally to this work.
Received date: 2024-12-31
Revised date: 2025-03-31
Accepted date: 2025-05-23
Online published: 2025-10-22
Supported by:
National Natural Science Foundation of China under Grant 62404256
Jiangsu Provincial Science and Technology Major Special Project under Grant BG2024032
Key Project of Shenzhen Basic Research Program under Grant JCYJ20241206180301003
High-performance Computing Public Platform (Shenzhen Campus) of Sun Yat-sen University
In recent years, the exponential growth in Large Language Model (LLM) parameter counts has sharply increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction of autoregressive decoding, which incurs substantial delays. Enhancing decoding efficiency while maintaining accuracy has therefore become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method that employs a Small-Scale Model (SSM) for preliminary inference and validates its predictions with the original LLM, effectively balancing the high precision of LLMs with the efficiency of SSMs. Notably, the SSM requires no additional training; it is instead derived by simplifying the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses redundant Transformer layers, significantly improving GPU utilization and inference speed without compromising output quality. Furthermore, to address window-size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value (KV) cache deletion techniques that further optimize SSM performance. Experimental results demonstrate that the proposed method achieves a 2.51× efficiency improvement over autoregressive decoding. Because the approach eliminates the need to train an additional SSM, it offers a significant competitive advantage in settings where model compression is costly.
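To make the decoding scheme concrete, the following minimal Python sketch (not the paper's code) illustrates the general draft-then-verify pattern that APLS builds on: a layer-skipped draft model proposes k tokens, and the full model verifies them in a single parallel forward pass. All names (speculative_decode, target_logits, draft_logits) and the toy stand-in models are illustrative assumptions, not the authors' API.

    import torch

    def speculative_decode(target_logits, draft_logits, prompt_ids, max_new, k=4):
        """Greedy draft-then-verify loop (illustrative sketch).

        target_logits(ids) -> [len(ids), vocab] logits from the full model.
        draft_logits(ids)  -> [len(ids), vocab] logits from the layer-skipped model.
        """
        ids = list(prompt_ids)
        while len(ids) - len(prompt_ids) < max_new:
            # 1) Draft: the layer-skipped model proposes k tokens
            #    autoregressively; cheap because layers are bypassed.
            proposed = []
            for _ in range(k):
                logits = draft_logits(ids + proposed)
                proposed.append(int(logits[-1].argmax()))
            # 2) Verify: one parallel forward pass of the full model scores
            #    all k proposals at once (the source of the speedup).
            full = target_logits(ids + proposed)
            n_ctx = len(ids)
            accepted = 0
            for i, tok in enumerate(proposed):
                # Logits at position p predict the token at position p + 1,
                # so draft token i is checked against full[n_ctx + i - 1].
                if int(full[n_ctx + i - 1].argmax()) == tok:
                    accepted += 1
                else:
                    break
            ids += proposed[:accepted]
            if accepted < k:
                # First mismatch: substitute the full model's own greedy token,
                # so the output matches plain autoregressive decoding exactly.
                ids.append(int(full[n_ctx + accepted - 1].argmax()))
        return ids[:len(prompt_ids) + max_new]

    # Toy demo with random stand-in "models" (hypothetical, for illustration).
    torch.manual_seed(0)
    V = 50
    W = torch.randn(V, V)

    def target_logits(ids):
        x = torch.nn.functional.one_hot(torch.tensor(ids), V).float()
        return x @ W  # stand-in for the full forward pass

    def draft_logits(ids):
        # Stand-in for a layer-skipped forward: same weights, plus a small
        # perturbation mimicking the accuracy loss from bypassed layers.
        return target_logits(ids) + 0.5 * torch.randn(len(ids), V)

    print(speculative_decode(target_logits, draft_logits, [1, 2, 3], max_new=8))

Because every rejected draft token is replaced by the full model's own greedy choice, the generated text is identical to ordinary autoregressive decoding; the gain comes from verifying k draft tokens with one forward pass of the large model instead of k sequential passes.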
ZHE WEN, LIANG XU, MEIQI WANG. An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding[J]. Integrated Circuits and Systems, 2025, 2(2): 58-66. DOI: 10.23919/ICS.2025.3575371