Original article

TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips

  • HUIZHENG WANG 1,
  • QIZE YANG 1,
  • TAIQUAN WEI 1,
  • XINGMAO YU 1,
  • CHENGRAN LI 1,
  • JIAHAO FANG 1,
  • GUANGYANG LU 1,
  • XU DAI 2,
  • LIANG LIU 2,
  • SHENFEI JIANG 2,
  • YANG HU 1,
  • SHOUYI YIN 1,3,
  • SHAOJUN WEI 1
  • 1 School of Integrated Circuits, Tsinghua University, Beijing 100084, China
  • 2 Shanghai AI laboratory, Shanghai 200003, China
  • 3 International Innovation Center of Tsinghua University, Shanghai 200003, China
CORRESPONDING AUTHOR: YANG HU (e-mail: ).

(Huizheng Wang and Qize Yang contributed equally to this work.)

Revised date: 2024-12-05

Accepted date: 2024-12-05

Online published: 2025-01-09

Abstract

Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect of WSCs and to alleviate their on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short in WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication-volume analysis of a mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full-recomputation method.
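To make the recomputation tradeoff named in the abstract concrete, the following minimal Python sketch illustrates the general technique of activation checkpointing (Chen et al. [50]); it is not TMAC's actual cost model, and the function name and parameters are invented for exposition. It estimates peak activation memory and extra forward FLOPs when only every k-th layer's activations are kept resident and the rest are recomputed during the backward pass:

    # Minimal sketch (hypothetical cost model, not TMAC's) of the
    # memory/compute tradeoff behind activation recomputation.
    def training_footprint(num_layers, act_bytes_per_layer, flops_per_layer, k):
        """Peak activation bytes and extra forward FLOPs when only every
        k-th activation is checkpointed (k = 1 means no recomputation)."""
        checkpoints = (num_layers + k - 1) // k   # activations kept resident
        live_segment = k - 1                      # one segment recomputed at a time
        peak_acts = (checkpoints + live_segment) * act_bytes_per_layer
        extra_flops = (num_layers - checkpoints) * flops_per_layer  # re-run forwards
        return peak_acts, extra_flops

    if __name__ == "__main__":
        for k in (1, 2, 4, 8):
            mem, flops = training_footprint(96, 2 << 30, 1e12, k)
            print(f"k={k}: peak activations {mem / 2**30:.0f} GiB, "
                  f"extra forward FLOPs {flops:.2e}")

Sweeping k jointly with tensor-partition and architecture parameters is exactly the kind of coupled memory/performance evaluation a co-exploration framework must perform, which is why folding recomputation into the mapping space enlarges the search problem.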

Cite this article

HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI. TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips[J]. Integrated Circuits and Systems, 2024, 1(4): 178-195. DOI: 10.23919/ICS.2024.3515003

Fund
This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115200, in part by the Frontier Technique Collaboration Project under Grant QYJS-2023-2801-B, in part by NSFC under Grants 62125403 and 92164301, in part by the Beijing S&T Project under Grant Z221100007722023, in part by the Shanghai Municipal Science and Technology Major Project, in part by the 2022 Special Project on Industrial Foundation Reconstruction and High-Quality Development of Manufacturing Industry under Grant CEIEC-2022-ZM02-0245, in part by the Beijing National Research Center for Information Science and Technology, and in part by the Beijing Advanced Innovation Center for Integrated Circuits.

[1].
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2018, vol. 1, pp. 2-18.

[2].
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” 2019, arXiv:1909.11942.

[3].
Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” 2019, arXiv:1907.11692.

[4].
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, pp. 9-23, 2019.

[5].
C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 5485-5551, 2020.

[6].
V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” 2019, arXiv:1910.01108.

[7].
K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, “AMMUS: A survey of transformer-based pretrained models in natural language processing,” 2021, arXiv:2108.05542.

[8].
Q. Zhang et al., “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7829-7833.

[9].
H. Wang et al., “SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling,” in Proc. 57th IEEE/ACM Int. Symp. Microarchit., 2024, pp. 1247-1263.

[10].
T. B. Brown et al., “Language models are few-shot learners,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 1877-1901, 2020.

[11].
A. T. Rosário, “Generative AI and generative pre-trained transformer applications: Challenges and opportunities,” Mak. Art With Generative AI Tools, vol. 1, pp. 45-71, 2024.

[12].
S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, “Generative AI,” Inf. Syst. Eng., vol. 66, pp. 111-126, 2024.

[13].
Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Adv. Neural Inf. Process. Syst., vol. 34, pp. 28092-28103, 2021.

[14].
H. Wang, Z. Zhang, X. You, and C. Zhang, “Low-complexity Winograd convolution architecture based on stochastic computing,” in Proc. IEEE 23rd Int. Conf. Digit. Signal Process., 2018, pp. 1-5.

[15].
L. Yuan et al., “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 558-567.

[16].
Z.-Y. Dou et al., “An empirical study of training end-to-end vision-and-language transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18166-18176.

[17].
A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.

[18].
C.-F. R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multiscale vision transformer for image classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 357-366.

[19].
H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, “Going deeper with image transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 32-42.

[20].
H. Wang, W. Xu, Z. Zhang, X. You, and C. Zhang, “An efficient stochastic convolution architecture based on fast FIR algorithm,” IEEE Trans. Circuits Syst. II: Exp. Briefs, vol. 69, no. 3, pp. 984-988, Mar. 2022.

[21].
K. S. Kalyan, “A survey of GPT-3 family large language models including ChatGPT and GPT-4,” Natural Lang. Process. J., Art. no. 100048, 2023.

[22].
D. Narayanan, A. Phanishayee, K. Shi, X. Chen, and M. Zaharia, “Memory-efficient pipeline-parallel DNN training,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 7937-7947.

[23].
D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 380-392, 2016.

[24].
S. Hall, R. Schreiber, and S. Lie, “Training giant neural networks using weight streaming on Cerebras wafer-scale systems,” Tech. Rep. 111521, 2021.

[25].
F. Zeng, W. Gan, Y. Wang, and S. Y. Philip, “Distributed training of large language models,” in Proc. IEEE 29th Int. Conf. Parallel Distrib. Syst., 2023, pp. 840-847.

[26].
Y. Yu et al., “Large language model as attributed training data generator: A tale of diversity and bias,” Adv. Neural Inf. Process. Syst., vol. 36, pp. 55734-55784, 2024.

[27].
D. Narayanan et al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Oper. Syst. Principles, 2019, pp. 1-15.

[28].
X. S. Huang, F. Perez, J. Ba, and M. Volkovs, “Improving transformer optimization through better initialization,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 4475-4483.

[29].
S. Shen, P. Walsh, K. Keutzer, J. Dodge, M. Peters, and I. Beltagy, “Staged training for transformer language models,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 19893-19908.

[30].
L. Zheng et al., “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” in Proc. 16th USENIX Symp. Operating Syst. Des. Implementation, 2022, pp. 559-578.

[31].
Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 103-112, 2019.

[32].
D. Narayanan et al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Oper. Syst. Principles, 2019, pp. 1-15.

[33].
S. Pal, D. Petrisko, A. A. Bajwa, P. Gupta, S. S. Iyer, and R. Kumar, “A case for packageless processors,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 466-479.

[34].
S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar, “Architecting waferscale processors—A GPU case study,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2019, pp. 250-263.

[35].
Y. Hu et al., “Wafer-scale computing: Advancements, challenges, and future perspectives [feature],” IEEE Circuits Syst. Mag., vol. 24, no. 1, pp. 52-81, Firstquarter 2024.

[36].
S. Hou et al., “Wafer-level integration of an advanced logic-memory system through the second-generation CoWoS technology,” IEEE Trans. Electron Devices, vol. 64, no. 10, pp. 4071-4077, Oct. 2017.

[37].
R. Mahajan et al., “Embedded multi-die interconnect bridge (EMIB)— A high density, high bandwidth packaging interconnect,” in Proc. IEEE 66th Electron. Compon. Technol. Conf., 2016, pp. 557-565.

[38].
S. Lie, “Cerebras architecture deep dive: First look inside the HW/SW co-design for deep learning: Cerebras systems,” in Proc. IEEE Hot Chips 34 Symp., 2022, pp. 1-34.

[39].
E. Talpes, D. Williams, and D. D. Sarma, “Dojo: The microarchitecture of Tesla’s exa-scale computer,” in Proc. IEEE Hot Chips 34 Symp., 2022, pp. 1-28.

[40].
D. Patel, “Tenstorrent Wormhole analysis—A scale out architecture for machine learning that could put Nvidia on their back foot,” 2021, [Online].

[41].
T. Morgan, “Inside Tesla’s innovative and homegrown ‘DOJO’ AI supercomputer,” 2022, [Online].

[42].
L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “HyPar: Towards hybrid parallelism for deep learning accelerator array,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2019, pp. 56-68.

[43].
L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “AccPar: Tensor partitioning for heterogeneous deep learning accelerators,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2020, pp. 342-355.

[44].
B. Li, Q. Du, D. Liu, J. Zhang, G. Chen, and H. You, “Placement for wafer-scale deep learning accelerator,” in Proc. 26th Asia South Pacific Des. Automat. Conf., 2021, pp. 665-670.

[45].
H. Peng, L. Guo, L. Sun, and X. Zhang, “Resource allocation for wafer-scale deep learning accelerator,” in Proc. IEEE 41st Int. Conf. Distrib. Comput. Syst., 2021, pp. 1114-1115.

[46].
B. Jiang et al., “CU.POKer: Placing DNNs on wafer-scale AI accelerator with optimal kernel sizing,” in Proc. 39th Int. Conf. Comput.-Aided Des., 2020, pp. 1-9.

[47].
J. Chen, S. Li, R. Gun, J. Yuan, and T. Hoefler, “AutoDDL: Automatic distributed deep learning with asymptotically optimal communication,” 2023, arXiv:2301.06813.

[48].
H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 754-768.

[49].
Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks,” 2021, arXiv:2105.14450.

[50].
T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” 2016, arXiv:1604.06174.

[51].
P. Jain et al., “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” Proc. Mach. Learn. Syst., vol. 2, pp. 497-511, 2020.

[52].
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” 2019, arXiv:1909.08053.

[53].
D. Narayanan et al., “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2021, pp. 1-15.

[54].
S. Smith et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” 2022, arXiv:2201.11990.

[55].
V. A. Korthikanti et al., “Reducing activation recomputation in large transformer models,” Proc. Mach. Learn. Syst., vol. 5, pp. 341-353, 2023.

[56].
H. Touvron et al., “LLaMA: Open and efficient foundation language models,” 2023, arXiv:2302.13971.

[57].
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023, arXiv:2307.09288.

[58].
J. Bai et al., “Qwen technical report,” 2023, arXiv:2309.16609.

[59].
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” in Proc. SC20: Int. Conf. High Perform. Comput., Netw., Storage Anal., 2020, pp. 1-16.

[60].
S. Pal et al., “Designing a 2048-chiplet, 14336-core waferscale processor,” in Proc. 58th ACM/IEEE Des. Automat. Conf., 2021, pp. 1183-1188.

[61].
J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner, “A wafer-scale neuromorphic hardware system for large-scale neural modeling,” in Proc. IEEE Int. Symp. Circuits Syst., 2010, pp. 1947-1950.

[62].
Cerebras, “Cerebras systems: Achieving industry best AI performance through a systems approach,” 2021, [Online].

[63].
Y. S. Shao et al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 14-27.

[64].
A. A. Bajwa et al., “Demonstration of a heterogeneously integrated system-on-wafer (SoW) assembly,” in Proc. IEEE 68th Electron. Compon. Technol. Conf., 2018, pp. 1926-1930.

[65].
M. Zhang, F. Qin, S. Chen, Y. Dai, P. Chen, and T. An, “Protrusion of through-silicon-via (TSV) copper with double annealing processes,” J. Electron. Mater., vol. 51, no. 5, pp. 2433-2449, 2022.

[66].
H. Guo, S. Cao, L. Li, and X. Zhang, “A review on the mainstream through-silicon via etching methods,” Mater. Sci. Semicond. Process., vol. 137, 2022, Art. no. 106182.

[67].
P. Huang et al., “Wafer level system integration of the fifth generation CoWoS-S with high performance Si interposer at 2500 mm²,” in Proc. IEEE 71st Electron. Compon. Technol. Conf., 2021, pp. 101-104.

[68].
Y.-C. Hu et al., “CoWoS architecture evolution for next generation HPC on 2.5D system in package,” in Proc. IEEE 73rd Electron. Compon. Technol. Conf., 2023, pp. 1022-1026.

[69].
J. H. Lau et al., “Hybrid substrate by fan-out RDL-first panel-level packaging,” IEEE Trans. Compon. Packag. Manuf. Technol., vol. 11, no. 8, pp. 1301-1309, Aug. 2021.

[70].
J. H. Lau et al., “Panel-level fan-out RDL-first packaging for heterogeneous integration,” IEEE Trans. Compon. Packag. Manuf. Technol., vol. 10, no. 7, pp. 1125-1137, Jul. 2020.

[71].
M. La and A. Chien, “Cerebras systems: Journey to the wafer-scale engine,” Univ. Chicago, Tech. Rep. CS24440, 2020.

[72].
Q. Zhou et al., “Mpress: Democratizing billion-scale model training on multi-GPU servers via memory-saving inter-operator parallelism,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit., 2023, pp. 556-569.

[74].
H. Wang, L. Wang, H. Xu, Y. Wang, Y. Li, and Y. Han, “PrimePar: Efficient spatial-temporal tensor partitioning for large transformer model training,” in Proc. 29th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2024, pp. 801-817.

[75].
S. Li et al., “Colossal-AI: A unified deep learning system for large-scale parallel training,” in Proc. 52nd Int. Conf. Parallel Process., 2023, pp. 766-775.

[76].
A. Li et al., “Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect,” IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 1, pp. 94-110, Jan. 2020.

[77].
M. Kusumoto, T. Inoue, G. Watanabe, T. Akiba, and M. Koyama, “A graph theoretic framework of recomputation algorithms for memory-efficient backpropagation,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 1163-1172, 2019.

[78].
B. Zheng, N. Vijaykumar, and G. Pekhimenko, “Echo: Compiler-based GPU memory footprint reduction for LSTM RNN training,” in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit., 2020, pp. 1089-1102.

[79].
A. Yazdanbakhsh, A. Moradifirouzabadi, Z. Li, and M. Kang, “Sparse attention acceleration with synergistic in-memory pruning and on-chip recomputation,” in Proc. 55th IEEE/ACM Int. Symp. Microarchit., 2022, pp. 744-762.

[80].
O. Beaumont, L. Eyraud-Dubois, and A. Shilova, “Efficient combination of rematerialization and offloading for training DNNs,” Adv. Neural Inf. Process. Syst., vol. 34, pp. 23844-23857, 2021.

[81].
J. Hu et al., “Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation,” in Proc. 47th Des. Automat. Conf., 2010, pp. 350-355.

[82].
J. Hu, C. J. Xue, W.-C. Tseng, Q. Zhuge, and E. H.-M. Sha, “Minimizing write activities to non-volatile memory via scheduling and recomputation,” in Proc. IEEE 8th Symp. Appl. Specific Processors, 2010, pp. 101-106.

[83].
J. Hu et al., “Write activity minimization for nonvolatile main memory via scheduling and recomputation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 4, pp. 584-592, Apr. 2011.

[84].
Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” in Proc. Mach. Learn. Syst., 2019, pp. 1-13.

[85].
H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” ACM SIGPLAN Notices, vol. 53, no. 2, pp. 461-475, 2018.

[86].
L. Lu et al., “TENET: A framework for modeling tensor dataflow based on relation-centric notation,” in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit., 2021, pp. 720-733.

[87].
V. Muttillo, P. Giammatteo, G. Fiorilli, and L. Pomante, “An OpenMP parallel genetic algorithm for design space exploration of heterogeneous multi-processor embedded systems,” in Proc. 11th Workshop Parallel Program. Run-Time Manage. Techn. Many-core Archit./9th Workshop Des. Tools Archit. Multicore Embedded Comput. Platforms, 2020, pp. 1-6.

[88].
H. Esbensen and E. S. Kuh, “Design space exploration using the genetic algorithm,” in Proc. IEEE Int. Symp. Circuits Syst., 1996, vol. 4, pp. 500-503.

[89].
M. Barbareschi, S. Barone, A. Bosio, J. Han, and M. Traiola, “A genetic-algorithm-based approach to the design of DCT hardware accelerators,” ACM J. Emerg. Technol. Comput. Syst., vol. 18, no. 3, pp. 1-25, 2022.

[90].
Y. Dou, C. Wang, H. Waris, R. Woods, and W. Liu, “FPAX: A fast prior knowledge-based framework for DSE in approximate configurations,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 43, no. 6, pp. 1650-1662, Jun. 2024.

[91].
N. Wu, Y. Xie, and C. Hao, “IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning,” in Proc. 2021 Great Lakes Symp. VLSI, 2021, pp. 39-44.

[92].
N. Wu, Y. Xie, and C. Hao, “IronMan-Pro: Multiobjective design space exploration in HLS via reinforcement learning and graph neural network-based modeling,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 42, no. 3, pp. 900-913, Mar. 2023.

[93].
K. Feng et al., “ERDSE: Efficient reinforcement learning based design space exploration method for CNN accelerator on resource limited platform,” Graph. Vis. Comput., vol. 4, 2021, Art. no. 200024.

[94].
S. Saeedi, A. Savino, and S. Di Carlo, “Design space exploration of approximate computing techniques with a reinforcement learning approach,” in Proc. 53rd Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. Workshops, 2023, pp. 167-170.

[95].
Y. Bengio, “Gradient-based optimization of hyperparameters,” Neural Comput., vol. 12, no. 8, pp. 1889-1900, 2000.

[96].
I. Ahmadianfar, O. Bozorg-Haddad, and X. Chu, “Gradient-based optimizer: A new metaheuristic optimization algorithm,” Inf. Sci., vol. 540, pp. 131-159, 2020.

[97].
S. Xydis, C. Skouroumounis, K. Pekmestzi, D. Soudris, and G. Economakos, “Efficient high level synthesis exploration methodology combining exhaustive and gradient-based pruned searching,” in Proc. 2010 IEEE Comput. Soc. Annu. Symp. VLSI, 2010, pp. 104-109.

[98].
V. Filipović, “Fine-grained tournament selection operator in genetic algorithms,” Comput. Informat., vol. 22, no. 2, pp. 143-161, 2003.

[99].
B. L. Miller et al., “Genetic algorithms, tournament selection, and the effects of noise,” Complex Syst., vol. 9, no. 3, pp. 193-212, 1995.

[100].
A. Shukla, H. M. Pandey, and D. Mehrotra, “Comparative review of selection techniques in genetic algorithm,” in Proc. 2015 IEEE Int. Conf. Futuristic Trends Comput. Anal. Knowl. Manage., 2015, pp. 515-519.

[101].
A. Lipowski and D. Lipowska, “Roulette-wheel selection via stochastic acceptance,” Physica A: Stat. Mechan. Appl., vol. 391, no. 6, pp. 2193-2196, 2012.

[102].
F. Yu, X. Fu, H. Li, and G. Dong, “Improved Roulette wheel selection-based genetic algorithm for TSP,” in Proc. 2016 IEEE Int. Conf. Netw. Inf. Syst. Comput., 2016, pp. 151-154.

[103].
W. Qian, J. Chai, Z. Xu, and Z. Zhang, “Differential evolution algorithm with multiple mutation strategies based on Roulette wheel selection,” Appl. Intell., vol. 48, pp. 3612-3629, 2018.

[104].
A. J. Umbarkar and P. D. Sheth, “Crossover operators in genetic algorithms: A review,” ICTACT J. Soft Comput., vol. 6, no. 1, pp. 1083-1092, 2015.

[105].
A. B. Bakker, M. Westman, and I. H. van Emmerik, “Advancements in crossover theory,” J. Managerial Psychol., vol. 24, no. 3, pp. 206-219, 2009.

[106].
R. W. Hockney, “The communication challenge for MPP: Intel Paragon and Meiko CS-2,” Parallel Comput., vol. 20, no. 3, pp. 389-398, 1994.

[107].
J. Fang et al., “Palm: An efficient performance simulator for tiled accelerators with large-scale model training,” 2024, arXiv:2406.03868.

[108].
S. Rashidi, W. Won, S. Srinivasan, P. Gupta, and T. Krishna, “FRED: Flexible reduction-distribution interconnect and communication implementation for wafer-scale distributed training of DNN models,” 2024, arXiv:2406.19580.

[109].
Z. Tan, Z. Zhu, and K. Ma, “Cocco: Hardware-mapping co-exploration towards memory capacity-communication optimization,” in Proc. 29th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2024, pp. 69-84.

[110].
J. Cai et al., “Gemini: Mapping and architecture co-exploration for large-scale DNN chiplet accelerators,” in Proc. 2024 IEEE Int. Symp. High-Perform. Comput. Archit., 2024, pp. 156-171.

[111].
J. Zhu, C. Xue, Y. Chen, Z. Wang, and G. Sun, “Theseus: Exploring efficient wafer-scale chip design for large language models,” 2024, arXiv:2407.02079.

[112].
R. Dale, “GPT-3: What’s it good for?,” Natural Lang. Eng., vol. 27, no. 1, pp. 113-118, 2023.

[113].
H. Luo et al., “Ramulator 2.0: A modern, modular, and extensible DRAM simulator,” IEEE Comput. Archit. Lett., vol. 23, no. 1, pp. 112-116, Jan.-Jun. 2024.

[114].
A. H. M. Rad and V. W. Wong, “Congestion-aware channel assignment for multi-channel wireless mesh networks,” Comput. Netw., vol. 53, no. 14, pp. 2502-2516, 2009.

[115].
A. A. Pirzada, R. Wishart, and M. Portmann, “Congestion aware routing in hybrid wireless mesh networks,” in Proc. 15th IEEE Int. Conf. Netw., 2007, pp. 513-518.

[116].
N. Taherkhani, R. Akbar, F. Safaei, and M. Moudi, “A congestion-aware routing algorithm for mesh-based platform networks-on-chip,” Microelectronics J., vol. 114, 2021, Art. no. 105145.

[117].
J. Yao et al., “Training ultra long context language model with fully pipelined distributed transformer,” 2024, arXiv:2408.16978.

[118].
P. Basu, L. Zhao, J. Fantl, S. Pal, A. Krishnamurthy, and J. Khoury, “Efficient all-to-all collective communication schedules for direct-connect topologies,” in Proc. 33rd Int. Symp. High-Perform. Parallel Distrib. Comput., 2024, pp. 28-41.

[119].
S. Li, K. Lu, Z. Lai, W. Liu, K. Ge, and D. Li, “A multidimensional communication scheduling method for hybrid parallel DNN training,” IEEE Trans. Parallel Distrib. Syst., vol. 35, no. 8, pp. 1415-1428, Aug. 2024.
