Original article

SALTM: Accelerating Large Transformers in Multi-Device System With 2-D Model Partitioning Method

  • YIFAN SONG,
  • YULONG MENG,
  • BINHAN CHEN,
  • CHEN SONG,
  • KANG YI
  • University of Science and Technology of China, Hefei, Anhui 230022, China

Received date: 2024-04-11

Revised date: 2024-06-09

Accepted date: 2024-08-20

Online published: 2025-01-09

Supported by

the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB44000000

Abstract

Recently, large Transformer models have achieved impressive results on a variety of natural language processing tasks, but their enormous parameter counts and intensive computation necessitate deployment on multi-device systems. Current solutions introduce complicated topologies with dedicated high-bandwidth interconnects to reduce communication overhead. To reduce both the complexity of the system architecture and the overhead of inter-device communication, this paper proposes SALTM, a multi-device system built on a unidirectional ring topology and a 2-D model partitioning method that accounts for quantization and pruning. First, a 1-D model partitioning method is proposed to reduce the communication volume. Then, the block assigned to each device is further partitioned in the orthogonal direction, introducing a task-level pipeline that overlaps communication with computation. To further explore SALTM's performance on a real large model such as GPT-3, we develop an analytical model that evaluates performance and communication overhead. Our simulation shows that a BERT model with 110 million parameters, implemented by SALTM on four FPGAs, achieves 9.65× and 1.12× speedups over a CPU and a GPU, respectively. The simulation also shows that the execution time of the four-FPGA SALTM is 1.52× that of an ideal system with infinite inter-device bandwidth. For GPT-3 with 175 billion parameters, our analytical model predicts that SALTM configurations comprising 16 VC1502 FPGAs and 16 A30 GPUs can achieve inference latencies of 287 ms and 164 ms, respectively.
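The partitioning scheme summarized above can be illustrated with a short, hypothetical Python/NumPy sketch (not the authors' implementation): a weight matrix is first split across devices in one dimension, each device's block is then split again in the orthogonal direction, and the chunk-level loop stands in for the task-level pipeline in which computation on one chunk overlaps with ring communication of the previous partial result. The device and chunk counts are illustrative assumptions.

import numpy as np

NUM_DEVICES = 4   # e.g., four FPGAs in a unidirectional ring (assumed count)
NUM_CHUNKS = 8    # orthogonal, second-dimension split per device (assumed count)

def partition_2d(weight, num_devices=NUM_DEVICES, num_chunks=NUM_CHUNKS):
    # 1-D split: one column block per device; 2-D split: row chunks inside each block.
    device_blocks = np.array_split(weight, num_devices, axis=1)
    return [np.array_split(block, num_chunks, axis=0) for block in device_blocks]

def pipelined_matmul(x, blocks_per_device):
    # Emulates the task-level pipeline: while a device computes chunk k, the partial
    # result of chunk k-1 would already be in flight on the unidirectional ring.
    outputs = []
    for chunks in blocks_per_device:
        x_chunks = np.array_split(x, len(chunks), axis=1)  # matching input slices
        partial = sum(xc @ wc for xc, wc in zip(x_chunks, chunks))
        outputs.append(partial)  # in hardware: forwarded to the next ring neighbor
    return np.concatenate(outputs, axis=1)

# Toy check: the 2-D partitioned product matches the monolithic one.
x = np.random.randn(2, 64)
w = np.random.randn(64, 128)
assert np.allclose(pipelined_matmul(x, partition_2d(w)), x @ w)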

Cite this article

YIFAN SONG, YULONG MENG, BINHAN CHEN, CHEN SONG, KANG YI. SALTM: Accelerating Large Transformers in Multi-Device System With 2-D Model Partitioning Method[J]. Integrated Circuits and Systems, 2024, 1(3): 144-156. DOI: 10.23919/ICS.2024.3458897

