Integrated Circuits and Systems

Original article

A Dynamic Comparator With Cross-Coupled Pre-Amplifier With < 160 ps Delay and 81 fJ.ns EDP

NIDHI SHARMA, VINAYAK HANDE, DEVARSHI MRINAL DAS

Integrated Circuits and Systems. 2024, 1(4): 206-213. https://doi.org/10.23919/ICS.2024.3505092

A low-power, high-speed dynamic comparator that adds a cross-coupled pair to the pre-amplifier stage, followed by a strong-arm latch, is presented. The proposed modification increases the pre-amplifier's differential and common-mode gains, improving the latch's differential and common-mode input voltages and resulting in faster regeneration, with a 22% speed improvement over a conventional comparator at small input differential voltages (V_i,id). The proposed technique boosts the comparator's speed and helps achieve a 21% lower energy-per-conversion delay product (EDP) compared to the literature. Analytical modeling of the delay, which proves the speed improvement of the proposed comparator, is also presented and verified against simulation results. The proposed comparator's delay is insensitive to the common-mode voltage (V_i,cm). The proposed comparator is fabricated in 180-nm CMOS technology, and measurements show a relative CLK-Q delay of less than 160 ps with 81 fJ·ns EDP and 0.8 mV input-referred rms noise at a 1.8 V supply. To demonstrate the scalability of the proposed technique to advanced technology nodes, the proposed design is also simulated in 65-nm CMOS technology with a 1.1 V supply at a 5 GHz frequency. For a V_i,cm of 0.3 V and V_i,id of 1 mV and 10 mV, the proposed comparator exhibits delays of 40.69 ps and 32.41 ps and EDPs of 3.74 fJ·ns and 2.78 fJ·ns, respectively.

Original article

Overview of Cryogenic CMOS Based Computing Systems

YUHAO SHU, BIN NING, YIFEI LI, ZHAODONG LYU, JINCHENG WANG, LINTAO LAN, YUXIN ZHOU, MENGRU ZHANG, HONGTU ZHANG, YAJUN HA

Integrated Circuits and Systems. 2024, 1(4): 167-177. https://doi.org/10.23919/ICS.2024.3505839

As integrated circuits advance into the post-Moore era, the improvement of computing performance encounters several challenges, making it difficult to meet the ever-growing computing demands. Cryogenic complementary metal oxide semiconductor (CMOS) based computing systems have emerged as a promising solution for overcoming the existing computing performance bottleneck. By cooling the circuitry to cryogenic temperatures, device leakage and wire resistance can be significantly reduced, leading to further improvements in energy efficiency and performance. Here, we conduct a comprehensive review of cryogenic CMOS based computing systems across multiple optimization layers, including the CMOS process, modeling, electronic design automation (EDA), circuits, and architecture. Moreover, this review identifies potential future work and applications.

Original article

TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips

HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI

Integrated Circuits and Systems. 2024, 1(4): 178-195. https://doi.org/10.23919/ICS.2024.3515003

Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect benefits of WSCs and to alleviate the on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short for WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron’s full recomputation method.

Original article

A Multi-Modal System Featuring Wireless Flexible Sensor Patches and a Depth-Sensing Imager for Home-Based Monitoring of Rehabilitation Exercises

RUNTIAN YANG, YUHAN HOU, SAVANNA BLADE, YINFEI LI, VANSH TYAGI, GLORIA-EDITH BOUDREAULT-MORALES, JOSÉ ZARIFFA, XILIN LIU

Integrated Circuits and Systems. 2024, 1(4): 227-238. https://doi.org/10.23919/ICS.2024.3512503

Monitoring rehabilitation progress at home over the long term following a spinal cord injury (SCI) is crucial for maximizing therapeutic outcomes and enhancing the quality of life of affected individuals. Comprehensive monitoring requires collecting a range of physiological data, including surface electromyography (sEMG) and exercise motion data. Currently, assessments typically take place in clinical settings, which can be both costly and inconvenient for patients. There is a lack of accessible, user-friendly systems that allow individuals with SCI to independently gather this data at home. Additionally, video recordings may be necessary to verify that patients are positioning the sensors correctly and performing the exercises accurately. To bridge this gap, we have developed a self-contained, multi-modal sensor system that captures sEMG and motion data, along with depth-sensing video to track patient exercises while ensuring privacy by minimizing identifiable details. The system includes a configurable number of wireless, multisensor wearable patches that are easy to attach and comfortable for extended use, along with a time-of-flight depth-sensing camera. The multi-modal data is streamed and synchronized in real-time on a Raspberry Pi, establishing an innovative platform to support SCI rehabilitation and adaptable for various clinical monitoring applications.

Special Issue on Selected Papers from ICTA2023

A 312.5 Mbps-32 Gbps JESD204C Wireline Transceiver Back-Compatible With JESD204B in 28 nm CMOS

SHIJIE LI, RUICHANG MA, MINGXING DENG, JIAMIN XUE, WEI DENG, BAOYONG CHI, HAIKUN JIA

Integrated Circuits and Systems. 2024, 1(2): 109-118. https://doi.org/10.23919/ICS.2024.3423852

This paper presents a 32 Gbps wireline transceiver that not only supports the JESD204C standard but also maintains back-compatibility with JESD204B with minimal additional circuitry. Additionally, a pattern-filtered phase detector (PFPD) is proposed to circumvent the side effect of an ambiguous sampling clock phase caused by the loop-unrolled 1st post-cursor tap equalization scheme in the decision-feedback equalization (DFE). A 16 GHz external half-rate clock is injected into an on-chip injection-locked ring oscillator to distribute the 16 GHz clock to both the receiver and the transmitter. Multiple on-chip adaptation engines and calibration loops are also added to help the whole system work properly, such as a tap-weight and desired-level adaptation engine integrated into the decision-feedback equalizer, duty cycle distortion correction, and IQ-mismatch correction. Fabricated in a 28 nm CMOS process, the proposed transceiver operates over a signaling range from 312.5 Mbps to 32 Gbps, achieving a BER below 10⁻¹² over a channel with 14.9 dB loss at the Nyquist frequency. It occupies an aggregated area of 1.4 mm² and consumes 203 mW at 32 Gbps, of which 50 mW is for the transmitter (TX) and 153 mW for the receiver (RX), achieving a power efficiency of 6.34 pJ/bit at 32 Gbps.
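
The quoted power efficiency follows from dividing total power by data rate. The short sketch below reproduces that arithmetic using only the figures stated in the abstract; the function name `energy_per_bit_pj` is illustrative and not taken from the paper.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per bit in pJ/bit.

    1 mW / 1 Gbps = 1e-3 W / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit,
    so the ratio of the two numbers is already in pJ/bit.
    """
    return power_mw / rate_gbps


# Figures as stated in the abstract above.
tx_mw, rx_mw = 50.0, 153.0
total_mw = tx_mw + rx_mw  # 203 mW
efficiency = energy_per_bit_pj(total_mw, 32.0)
print(f"{efficiency:.2f} pJ/bit")
```

The same ratio can be split per block, for example `energy_per_bit_pj(tx_mw, 32.0)` for the transmitter share alone.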

Original article

PODALA: Power-Efficient Object Detection Accelerator With Customized Layer Fusion Engine

TING YUE, LIANG CHANG, HAOBO XU, CHENGZHI WANG, SHUISHENG LIN, JUN ZHOU

Integrated Circuits and Systems. 2024, 1(4): 196-205. https://doi.org/10.23919/ICS.2024.3506511

Object detection algorithms based on convolutional neural networks (CNNs) significantly enhance accuracy by expanding network scale. As network parameters increase, large-scale networks demand substantial memory resources, making deployment on hardware challenging. Although most neural network accelerators utilize off-chip storage, frequent access to external memory restricts processing speed, hindering the ability to meet the frame rate requirements of embedded systems. This creates a trade-off in which the speed and accuracy of embedded target detection accelerators cannot be simultaneously optimized. In this paper, we propose PODALA, an energy-efficient accelerator developed through an algorithm-hardware co-design methodology. For the object detection algorithm, we develop an optimized algorithm that combines the inverse-residual structure with depthwise separable convolution, effectively reducing network parameters while preserving high detection accuracy. For the hardware accelerator, we develop a custom layer fusion technique for PODALA to minimize memory access requirements. The overall design employs a streaming hardware architecture that combines a computing array with a refined ping-pong output buffer to execute different layer fusion computing modes efficiently. Our approach substantially reduces memory usage through optimizations in both algorithmic and hardware design. Evaluated on the Xilinx ZCU102 FPGA platform, PODALA achieves 78 frames per second (FPS) and 79.73 GOPS/W energy efficiency, underscoring its superiority over state-of-the-art solutions.

Special Issue on Selected Papers from ICTA2023

SFTT: A 325 FPS Computational and Hardware Efficient Corner-Detection Accelerator Design for SLAM Applications

WEIYI ZHANG, CHAOYANG DING, XIAORUI MO, FEI SHAO, YIYANG WANG, YUSHI GUO, LITING NIU, CHENG NIAN, FASIH UD DIN FARRUKH, CHUN ZHANG

Integrated Circuits and Systems. 2024, 1(2): 66-79. https://doi.org/10.23919/ICS.2024.3449791

Simultaneous Localization and Mapping (SLAM) is the process by which a mobile robot builds a map of the surrounding environment and computes its own location. Feature point extraction is one of the key components of a SLAM system. The extraction accuracy and efficiency of corner detection directly affect the overall accuracy and throughput of the system. However, the complexity of corner detection algorithms makes it challenging to achieve real-time implementation and efficient, low-cost hardware design, especially for mobile robots. Harris-class corner detection algorithms, including Harris and GFTT (Good Feature to Track), have improved accuracy. However, those algorithms incur high resource consumption and latency when implemented on hardware platforms. GFTT achieves higher accuracy than Harris at the cost of higher computational complexity. To address the throughput problem, SFTT (Simple Feature to Track), a new Harris-class detection algorithm, is proposed, and the corresponding hardware accelerator is designed. The proposed SFTT significantly reduces the computational complexity compared with the Harris algorithm and GFTT. Experiments show that SFTT also achieves slightly higher accuracy than the two algorithms. Furthermore, the SFTT accelerator is designed to reach up to 325 fps at a frequency of 100 MHz. The proposed design improves throughput by 1.3× and power efficiency by 1.7× compared with the state-of-the-art design.
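
To illustrate why GFTT costs more than Harris, the sketch below computes the two textbook corner scores from a 2×2 structure tensor with entries `sxx`, `syy`, `sxy` (windowed sums of gradient products). This is the standard formulation of Harris and Shi-Tomasi (GFTT), not code from the paper; the abstract does not define the SFTT score, so it is not reproduced here. The constant `k = 0.04` is a commonly used default and an assumption.

```python
import math


def harris_response(sxx: float, syy: float, sxy: float, k: float = 0.04) -> float:
    """Harris score: det(M) - k * trace(M)^2. Needs only multiplies and adds."""
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace


def gftt_response(sxx: float, syy: float, sxy: float) -> float:
    """Shi-Tomasi (GFTT) score: the smaller eigenvalue of M.

    The closed form needs a square root, which is the costly step in hardware.
    """
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy * sxy)
    return half_trace - root


corner = (10.0, 10.0, 0.0)  # strong gradients in both directions
edge = (10.0, 0.0, 0.0)     # strong gradient in one direction only
print(harris_response(*corner), gftt_response(*corner))
print(harris_response(*edge), gftt_response(*edge))
```

Both scores rank the corner-like tensor above the edge-like one; the difference is in the arithmetic each one requires per pixel.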

BASER: Bit-Wise Approximate Compressor Configurable In-SRAM-Computing for Energy-Efficient Neural Network Acceleration With Data-Aware Weight Remapping Method

SHUNQIN CAI, LIUKAI XU, DENGFENG WANG, ZHI LI, WEIKANG QIAN, LIANG CHANG, YANAN SUN

Integrated Circuits and Systems. 2024, 1(2): 80-91. https://doi.org/10.23919/ICS.2024.3419630

SRAM-based computing-in-memory (SRAM-CIM) is expected to solve the “Memory Wall” problem. For digital-domain SRAM-CIM, full-precision digital logic has been utilized to achieve high computational accuracy. However, the energy and area efficiency advantages of CIM cannot be fully utilized under error-resilient neural networks (NNs) with a given quantization bit-width. Therefore, an all-digital Bit-wise Approximate compressor configurable In-SRAM-computing macro for Energy-efficient NN acceleration, with a data-aware weight Remapping method (BASER), is proposed in this paper. Leveraging the NN error resilience property, six energy-efficient bit-wise compressor configurations are presented under 4b/4b and 3b/3b NN quantization, respectively. Concurrently, a data-aware weight remapping approach is proposed to further enhance the NN accuracy without supplementary retraining. Evaluations of VGG-9 and ResNet-18 on the CIFAR-10 and CIFAR-100 datasets show that the proposed BASER achieves 1.35× and 1.29× improvements in energy efficiency, as well as limited accuracy loss and improved NN accuracy, compared to the previous full-precision and approximate SRAM-CIM designs, respectively.

Regular Papers

The Decomposition and Combination Paradigms of Chiplet-Based Integrated Chips

FUPING LI, YING WANG, MEIXUAN LU, YUTONG ZHU, HAORAN WANG, ZHUN ZHAO, JUNPEI HUANG, XIAOTONG WEI, XIHAO LIANG, YUJIE WANG, HAOBO XU, HUAWEI LI, XIAOWEI LI, QI LIU, MING LIU, NINGHUI SUN, YINHE HAN

Integrated Circuits and Systems. 2024, 1(1): 18-30. https://doi.org/10.23919/ICS.2024.3451428

Due to the waning of Moore’s Law, conventional monolithic chip architectural design is confronting hurdles such as increasing die size and skyrocketing cost. In this post-Moore era, the integrated chip has emerged as a pivotal technology, gaining substantial interest from both academia and industry. Compared with monolithic chips, chiplet-based integrated chips can significantly enhance system scalability, curtail costs, and accelerate design cycles. However, integrated chips introduce vast design spaces encompassing chiplets, inter-chiplet connections, and packaging parameters, thereby amplifying the complexity of the design process. This paper introduces the Optimal Decomposition-Combination Theory, a novel methodology to guide the decomposition and combination processes in integrated chip design. Furthermore, it offers a thorough examination of existing integrated chip design methodologies to showcase the application of this theory.

Special Issue on Selected Papers from ICTA2023

A 64-Gb/s 0.33-pJ/Bit PAM4 Receiver Analog Front-End With a Single-Stage Triple-Peaking CTLE Achieving 22.5-dB Boost in 40-nm CMOS Process

GUOQING WANG, ZHAO ZHANG

Integrated Circuits and Systems. 2024, 1(2): 103-108. https://doi.org/10.23919/ICS.2024.3456043

This work presents a PAM4 receiver analog front-end (AFE) operating up to 64 Gb/s. The electronic integrated circuit (EIC) is fabricated in 40-nm CMOS technology. The AFE is composed of a single-stage Continuous-Time Linear Equalizer (CTLE), a Variable Gain Amplifier (VGA), an input impedance matching network, a buffer stage, and an output buffer. The proposed single-stage triple-peaking CTLE employs a current-reuse technique and a multi-feedback structure, enabling the adjustment of peaking in the low, mid, and high-frequency bands. Thus, a single CTLE stage is sufficient to achieve an over-20-dB boost at the Nyquist frequency, which saves power. The VGA adopts an enhanced structure based on the Gilbert cell, where the gain is set by controlling the gate voltage of MOS transistors. The CTLE undergoes variations in its DC gain during the adjustment process to equalize channel losses. The role of the VGA is to stabilize the DC gain changes induced by the adjustment of the CTLE. The output buffer adopts two stages, aiming to ensure that the gain does not attenuate excessively while maintaining output impedance matching. The AFE consumes 21.1 mW with a supply voltage of 1.5/1 V. It can provide a maximum boost of 22.5 dB, and the data rate reaches up to 64 Gb/s. Additionally, it features peaking adjustment capabilities in the low, mid, and high-frequency bands. Finally, measurements demonstrate its ability to effectively equalize a channel with a 12-dB loss at the Nyquist frequency of 16 GHz.

Special Section on Selected Papers from ASICON2023

An AC Coupling Ultrasound Analog Front-End Architecture With a Three-Stage DCOC

XIANGCHEN WAN, SIQING WU, XINWEI YU, XINGTAO ZHU, FAN YE

Integrated Circuits and Systems. 2024, 1(1): 33-42. https://doi.org/10.23919/ICS.2024.3422708

This paper presents an AC coupling ultrasound analog front-end (AFE) architecture with a three-stage DC offset correction (DCOC) circuit. In ultrasound systems, the low noise amplifier (LNA), time gain control (TGC), and low pass filter (LPF) constitute the AFE, which provides low-noise amplification, time-varying gain compensation, and filtering for the received ultrasound signal. The inherent asymmetry in the LNA, layout asymmetry, and process variation introduce a DC offset, and the TGC turns it into low-frequency offset drift. The proposed DCOC circuit for the LNA is composed of a transconductance amplifier and an off-chip capacitor, while a fully differential operational amplifier and a pseudo resistor are used for the other amplification stages. The AC coupling scheme is also used to reduce the offset and drift. Simulation results show that when the DCOC and the AC coupling are adopted, the offset and drift are almost perfectly suppressed. The proposed AFE has been fabricated in a 28-nm CMOS process. It achieves an 85 dB gain range with a low input-referred noise of 2.43 nV/√Hz at 5 MHz, and it has a tunable bandwidth of 15/30 MHz and a switchable input impedance of 50/100 ohms.

Original article

A Multi-Channel 133-dB DR PPG/ECG SoC for Smart Wearable Devices

SHIHONG ZHOU, XIN WANG, YANXING SUO, XIAO HAN, GUOXING WANG, YANG ZHAO

Integrated Circuits and Systems. 2024, 1(4): 239-246. https://doi.org/10.23919/ICS.2024.3513261

Real-time monitoring of multimodal vital signs, including electrocardiography (ECG) and photoplethysmography (PPG), on wearable devices is attracting increasing interest. Motion artifacts, ambient light interference, and sensor-skin contact variability affect signal quality significantly, demanding a multi-channel sensor interface chip with high dynamic range yet low power. A PPG/ECG interface chip is proposed for robust signal optimization. Time-division multiplexing, ambient double sampling, and DC current compensation together enhance the dynamic range. Fabricated in a 0.18 μm process, the chip features a 5.63-pArms direct digitized input-referred noise for PPG readout and 365-nVrms for ECG. A cross-scale dynamic range of 133 dB is achieved, providing saturation-free usage for sport wearable devices such as smart watches and rings.

Regular Papers

Design-for-Test Solutions for 3-D Integrated Circuits

SHAO-CHUN HUNG, PARTHO BHOUMIK, ARJUN CHAUDHURI, SANMITRA BANERJEE, KRISHNENDU CHAKRABARTY

Integrated Circuits and Systems. 2024, 1(1): 3-17. https://doi.org/10.23919/ICS.2024.3419629

As Moore’s Law approaches its limits, 3-D integrated circuits (ICs) have emerged as promising alternatives to conventional scaling methodologies. However, the benefits of 3-D integration in terms of lower power consumption, higher performance, and reduced area are accompanied by testing challenges. The unique vertical stacking of components in 3-D ICs introduces concerns related to the robustness of bonding surfaces. Moreover, immature manufacturing processes during 3-D fabrication can lead to high defect rates in different tiers. Therefore, there is a need for design-for-test solutions to ensure the reliability and performance of 3-D-integrated architectures. In this paper, we provide a comprehensive survey of existing testing strategies for 3-D ICs. We describe recent advances, including research efforts and industry practice, that address concerns related to bonding defects, elevated power supply noise, fault diagnosis, and fault localization specific to the unique characteristics of 3-D ICs.

Special Issue on Selected Papers from ICTA2023

A High-Speed Hardware Architecture of an NTT Accelerator for CRYSTALS-Kyber

JUNYAN SUN, XUEFEI BAI

Integrated Circuits and Systems. 2024, 1(2): 92-102. https://doi.org/10.23919/ICS.2024.3419562

CRYSTALS-Kyber has emerged as a notable lattice-based post-quantum cryptography (PQC) scheme. As one of the four finalists in NIST’s PQC standardization round three, CRYSTALS-Kyber is the only encryption algorithm demonstrating superior performance compared to other algorithms. The number theoretic transform (NTT) is employed to optimize polynomial multiplication, which constitutes the most complex operation within CRYSTALS-Kyber. This study introduces a high-speed NTT accelerator architecture, featuring a novel butterfly unit and an efficient modular polynomial multiplier. The proposed accelerator utilizes a radix-4-based configurable NTT design, which is capable of executing both forward and inverse NTT operations on a unified architecture. When implemented on the Xilinx Virtex-7 FPGA platform, the proposed architecture achieves an acceleration of 1.02-2.30 times in terms of latency, a throughput improvement of 1.02-2.30 times, and an area throughput improvement of up to 3.30 times, relative to the prior works.
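
As a rough illustration of how an NTT turns polynomial multiplication into pointwise multiplication, the sketch below multiplies two short polynomials modulo Kyber's prime q = 3329. It is a toy under stated assumptions: a recursive radix-2 transform of length 8 computing a cyclic product, not the radix-4 architecture of the paper and not Kyber's actual negacyclic, incomplete NTT over 256 coefficients. The function names are illustrative; the constants 3329 and 17 (a primitive 256th root of unity mod 3329) come from the Kyber specification.

```python
Q = 3329    # Kyber's prime modulus
ZETA = 17   # primitive 256th root of unity modulo Q


def ntt(a, omega, q=Q):
    """Recursive radix-2 Cooley-Tukey NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q                    # butterfly: twiddle times odd half
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out


def cyclic_mul(a, b, q=Q):
    """Multiply a and b modulo (x^n - 1, q): forward NTT, pointwise product, inverse NTT."""
    n = len(a)
    omega = pow(ZETA, 256 // n, q)            # primitive n-th root of unity
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    prod = [x * y % q for x, y in zip(fa, fb)]
    inv = ntt(prod, pow(omega, -1, q), q)     # inverse transform uses omega^-1
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in inv]


def schoolbook_cyclic(a, b, q=Q):
    """O(n^2) reference used to check the NTT-based result."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c


a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 8, 0, 0, 0, 0]
print(cyclic_mul(a, b))
```

Note that the forward and inverse transforms share the same butterfly code and differ only in the root of unity and a final scaling, which is the property a unified forward/inverse architecture relies on. The modular inverse via `pow(x, -1, q)` requires Python 3.8 or later.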

Special Issue on Selected Papers from ICTA2023

Editorial: Special Issue on Selected Papers From ICTA2023

XIAOYAN GUI, LIN CHENG

Integrated Circuits and Systems. 2024, 1(2): 64-65. https://doi.org/10.23919/ICS.2024.3483732

Original article

Guest Editorial: Special Section on Selected Papers from BioCAS 2024

MILIN ZHANG

Integrated Circuits and Systems. 2024, 1(4): 214-214. https://doi.org/10.23919/ICS.2024.3513736

Original article

Wearable Sensors for Vital Signs: A Review of 20 Years and Future Directions

YANXING SUO, HAOXIANG GUO, RUIZHE YU, YANG ZHAO

Integrated Circuits and Systems. 2024, 1(4): 215-226. https://doi.org/10.23919/ICS.2024.3511474

This paper presents a comprehensive review of the advancements in wearables over 20 years and explores future directions in the field. The review focuses on the evolving needs, innovative solutions, and technological progress of these sensors. We categorize the existing literature into key themes, highlighting significant advancements, challenges, and future prospects in wearable sensor technology.

Special Section on Selected Papers from ASICON2023

An Efficient Multiplier-Less Processing Element on Power-of-2 Dictionary-Based Data Quantization

JIAXIANG LI, MASAO YANAGISAWA, YOUHUA SH

Integrated Circuits and Systems. 2024, 1(1): 53-62. https://doi.org/10.23919/ICS.2024.3423850

Large-scale neural networks have had an enormous impact on the world, changing people's lives and offering vast prospects. However, they also come with enormous demands for computational power and storage. The core of their computational requirements lies in the matrix multiplication units, which are dominated by multiplication operations. To address this issue, we propose an area- and power-efficient multiplier-less processing element (PE) design. Prior to implementing the proposed PE, we apply a power-of-2 dictionary-based quantization to the model, and the effectiveness of this quantization method in preserving the accuracy of the original model is confirmed. In hardware design, we present a standard architecture and one variant, the ‘bi-sign’ architecture, of the PE. Our evaluation results demonstrate that the systolic array that implements our standard multiplier-less PE achieves approximately 38% lower power-delay product and 13% smaller core area compared to a conventional multiplication-and-accumulation PE, and the bi-sign PE design can even save 37% core area and 38% computation energy. Furthermore, the applied quantization reduces the model size and operand bit-width, leading to decreased on-chip memory usage and energy consumption for memory accesses. Additionally, the hardware schematic facilitates expansion to support other sparsity-aware, energy-efficient techniques.
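
The general idea behind a multiplier-less PE can be sketched as follows: if a weight is restricted to a signed power of two, multiplying an integer activation by it reduces to a bit shift. This is a generic illustration, not the authors' dictionary-based scheme or the bi-sign variant, whose details the abstract does not give. The function names and the exponent range `[-7, 0]` are assumptions chosen for the example.

```python
import math


def quantize_pow2(w: float, min_exp: int = -7, max_exp: int = 0):
    """Approximate w by sign * 2**exp; returns (sign, exp), with (0, 0) for zero."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))            # nearest exponent in the log domain
    exp = max(min_exp, min(max_exp, exp))     # clamp to the assumed exponent range
    return sign, exp


def shift_mul(x: int, sign: int, exp: int) -> int:
    """Compute x * sign * 2**exp for an integer activation using shifts only."""
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return -shifted if sign < 0 else shifted


def shift_mac(xs, weights) -> int:
    """Multiply-accumulate in which every product is a shift and a conditional negate."""
    acc = 0
    for x, w in zip(xs, weights):
        acc += shift_mul(x, *quantize_pow2(w))
    return acc


print(quantize_pow2(0.25), quantize_pow2(-0.5))
print(shift_mac([64, 64, 32], [0.25, -0.5, 1.0]))
```

Right shifts discard low-order bits, so a real design would choose the activation scaling so that this truncation error stays acceptable.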

Special Section on Selected Papers from ASICON2023

Editorial: Special Section on Selected Papers from ASICON2023

FRANCOIS RIVET, LIANG QI

Integrated Circuits and Systems. 2024, 1(1): 31-32. https://doi.org/10.23919/ICS.2024.3484397

Special Section on Selected Papers from ASICON2023

Linearity Performance of Charge Domain In-Memory Computing: Analysis and Calibration

HENG ZHANG, YICHUAN BAI, JUNJIE SHEN, YUAN DU, LI DU

Integrated Circuits and Systems. 2024, 1(1): 43-53. https://doi.org/10.23919/ICS.2024.3422968

Deep learning has recently gained significant prominence in various real-world applications such as image recognition, natural language processing, and autonomous vehicles. While deep neural networks appear to have different architectures, the main operations within these models are matrix-vector multiplications (MVM). Compute-in-memory (CIM) architectures are promising solutions for accelerating the massive MVM operations by alleviating the frequent data movement issue in traditional processors. Analog CIM macros leverage current-accumulating or charge-sharing mechanisms to perform multiply-and-accumulate (MAC) computations. Even though they can achieve high throughput and efficiency, computing accuracy is sacrificed due to analog nonidealities. To ensure precise MAC calculations, it is crucial to analyze the sources of nonidealities and identify their impacts, along with corresponding solutions. In this paper, a comprehensive linearity analysis and dedicated calibration methods for charge domain static random-access memory (SRAM) based in-memory computing circuits are proposed. We analyze nonidealities from three areas based on the mechanism of charge domain computing: charge injection effect, temperature variations, and ADC reference voltage mismatch. By designing a 256 × 256 CIM macro and conducting investigations via post-layout simulation, we conclude that these nonidealities do not deteriorate the computing linearity but only cause scaling and bias drift. To mitigate the scaling and bias drift identified, we propose three calibration methods ranging from the circuit level to the algorithm level, all of which exhibit promising results. The comprehensive analysis and calibration methods can assist in designing CIM macros with more accurate MAC computations, thereby supporting more robust deep learning inference.

Editorial

Editorial: Message from the Editor-in-Chief

MING LIU, PETER LIAN

Integrated Circuits and Systems. 2024, 1(1): 2-2. https://doi.org/10.23919/ICS.2024.3483733
