SALTM: Accelerating Large Transformers in Multi-Device System With 2-D Model Partitioning Method
Received date: 2024-04-11
Revised date: 2024-06-09
Accepted date: 2024-08-20
Online published: 2025-01-09
Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB44000000
Recently, large Transformer models have achieved impressive results on various natural language processing tasks, but they require enormous numbers of parameters and intensive computation, necessitating deployment on multi-device systems. Current solutions introduce complicated topologies with dedicated high-bandwidth interconnects to reduce communication overhead. To reduce system-architecture complexity and the overhead of inter-device communication, this paper proposes SALTM, a multi-device system based on a unidirectional ring topology and a 2-D model partitioning method that accounts for quantization and pruning. First, a 1-D model partitioning method is proposed to reduce the communication volume. Then, the block assigned to each device is further partitioned along the orthogonal direction, enabling a task-level pipeline that overlaps communication with computation. To further explore SALTM's performance on a real large model such as GPT-3, we develop an analytical model that evaluates performance and communication overhead. Our simulation shows that a BERT model with 110 million parameters, implemented by SALTM on four FPGAs, achieves 9.65× and 1.12× speedups over a CPU and a GPU, respectively. The simulation also shows that the execution time of the 4-FPGA SALTM is 1.52× that of an ideal system with infinite inter-device bandwidth. For GPT-3 with 175 billion parameters, our analytical model predicts that SALTM comprising 16 VC1502 FPGAs and 16 A30 GPUs can achieve inference latencies of 287 ms and 164 ms, respectively.
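To make the 2-D partitioning idea concrete, the following Python/NumPy sketch is a minimal, hypothetical illustration and not the paper's implementation: it splits a feed-forward weight matrix column-wise across devices (the 1-D split), then splits each device's block row-wise into tiles (the orthogonal split), so that on real hardware the transfer of tile k+1's input could overlap the multiplication for tile k. The dimensions (768×3072, as in BERT-base) and the names n_devices and n_tiles are illustrative assumptions.

```python
# Hedged sketch of 2-D model partitioning over a ring of devices.
# Not SALTM's implementation; a functional illustration of the idea only.
import numpy as np

def partition_2d(W, n_devices, n_tiles):
    """1-D split: one column block per device; then orthogonal split:
    each device's block is cut row-wise into tiles."""
    col_blocks = np.array_split(W, n_devices, axis=1)
    return [np.array_split(blk, n_tiles, axis=0) for blk in col_blocks]

def ring_ffn(x, W, n_devices=4, n_tiles=4):
    """Computes y = x @ W tile by tile. On real hardware, receiving the
    input slice for tile k+1 over the ring could overlap the multiply
    for tile k (task-level pipeline); here the loop is sequential."""
    tiles = partition_2d(W, n_devices, n_tiles)
    x_tiles = np.array_split(x, n_tiles, axis=1)  # matches the row split of W
    outputs = []
    for dev in range(n_devices):
        acc = np.zeros((x.shape[0], tiles[dev][0].shape[1]))
        for k in range(n_tiles):
            # On hardware: overlap "stream in x_tiles[k+1]" with this multiply.
            acc += x_tiles[k] @ tiles[dev][k]
        outputs.append(acc)
    return np.concatenate(outputs, axis=1)  # gather partial outputs along the ring

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 768))      # illustrative activations
    W = rng.standard_normal((768, 3072))   # illustrative FFN weight
    assert np.allclose(ring_ffn(x, W), x @ W)
```

The per-device column split keeps each device's output independent (no reduction needed across the ring for this layer), while the orthogonal tile split is what creates the opportunity to hide inter-device transfers behind computation.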
Yifan Song, Yulong Meng, Binhan Chen, Chen Song, Kang Yi. SALTM: Accelerating Large Transformers in Multi-Device System With 2-D Model Partitioning Method[J]. Integrated Circuits and Systems, 2024, 1(3): 144-156. DOI: 10.23919/ICS.2024.3458897