Special Issue Papers

Edge-Optimized AI Architecture: MRAM-Based Near Memory Computing Macro Balancing Between Memory Capacity and Computation

  • XI YANG 1,
  • YAMIN MAO 2,
  • LIANG CHANG 1,3,
  • HAOJIE WEI 1,
  • YUANBO WANG 1,
  • JINGKE WANG 1,
  • CHAO FAN 4,
  • ZHONGMOU WU 3,
  • SHOUZHONG PENG 5,
  • JUN ZHOU 1
  • 1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
  • 2 CETC Rongwei Electronic Technology Company, Ltd., Chengdu 610036, China
  • 3 Casicyber Industrial Technology Research Institute, Chengdu 610317, China
  • 4 Beijing Houmo AI Inc., Beijing 100176, China
  • 5 Anhui Province Key Laboratory of Spintronic Chip Research and Manufacturing, Hefei Innovation Research Institute, Beihang University, Hefei 230013, China
Corresponding author: LIANG CHANG.

XI YANG (Student Member, IEEE);

LIANG CHANG (Member, IEEE);

SHOUZHONG PENG (Member, IEEE);

JUN ZHOU (Senior Member, IEEE)

Received date: 2025-01-10

Revised date: 2025-03-11

Accepted date: 2025-03-16

Online published: 2025-10-22

Supported by

the Open Project Program of Anhui Province Key Laboratory of Spintronic Chip Research and Manufacturing under Grant WNKFKT-25-01, in part by the National Natural Science Foundation of China under Grant 62104025, and in part by the State Key Laboratory of Computer Architecture (ICT, CAS) under Grant CLQ202305.

Abstract

General-purpose edge neural networks need a lightweight architecture that effectively balances storage and computing resources. However, SRAM-based computing-in-memory (CIM) architectures face challenges in delivering adequate on-chip storage while fulfilling computing requirements. To overcome this, we introduce a new MRAM-based near-memory computing (NMC) architecture. It retains the cost-effective data-access benefits of CIM while separating storage and computing at the macro level, improving deployment adaptability. We refine the NMC macro by incorporating small temporary storage and adopting a layer-fusion approach to enhance data-transfer efficiency. By integrating a high-capacity MRAM into the macro, we attain a storage density of 0.532 μm²/bit. Moreover, we enhance the adder tree with a shift module, supporting multiply-and-accumulate (MAC) operations at five distinct depths (8, 9, 16, 32, and 64), which raises resource-utilization efficiency to 88.3%. Our architecture achieves an on-chip storage density of 1.49 Mb/mm² and an energy efficiency of 6.164 TOPS/W.
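The configurable-depth MAC idea in the abstract can be sketched in software. The snippet below is a hypothetical behavioral model, not the authors' RTL: it reduces products through a balanced binary adder tree and zero-pads non-power-of-two depths (such as 9) so every depth the macro supports (8, 9, 16, 32, 64) maps onto the same tree; the function name and padding strategy are assumptions for illustration.

```python
# Behavioral sketch (assumed, not the paper's hardware design) of a
# binary adder tree performing MAC at configurable input depths.

def adder_tree_mac(weights, activations):
    """Multiply-and-accumulate via pairwise tree reduction."""
    assert len(weights) == len(activations)
    products = [w * a for w, a in zip(weights, activations)]
    # Zero-pad to the next power of two so the tree stays balanced
    # (this stands in for the shift-module handling of depth 9).
    while len(products) & (len(products) - 1):
        products.append(0)
    # Reduce level by level: each level halves the number of partial sums.
    while len(products) > 1:
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]

# All five depths named in the abstract reduce to the correct dot product.
for depth in (8, 9, 16, 32, 64):
    w = list(range(depth))
    a = [1] * depth
    assert adder_tree_mac(w, a) == sum(w)
```

A tree reduction like this is the usual reason digital CIM/NMC macros fix their depth at a power of two; supporting an odd depth such as 9 is what motivates the extra shift/padding logic the paper describes.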

Cite this article

XI YANG, YAMIN MAO, LIANG CHANG, HAOJIE WEI, YUANBO WANG, JINGKE WANG, CHAO FAN, ZHONGMOU WU, SHOUZHONG PENG, JUN ZHOU. Edge-Optimized AI Architecture: MRAM-Based Near Memory Computing Macro Balancing Between Memory Capacity and Computation[J]. Integrated Circuits and Systems, 2025, 2(1): 4-12. DOI: 10.23919/ICS.2025.3553460

