Most Viewed

  • Original article
    YUHAO SHU, BIN NING, YIFEI LI, ZHAODONG LYU, JINCHENG WANG, LINTAO LAN, YUXIN ZHOU, MENGRU ZHANG, HONGTU ZHANG, YAJUN HA
    Integrated Circuits and Systems. 2024, 1(4): 167-177. https://doi.org/10.23919/ICS.2024.3505839

    As integrated circuits advance into the post-Moore era, the improvement of computing performance encounters several challenges, making it difficult to meet the ever-growing computing demands. Cryogenic complementary metal oxide semiconductor (CMOS) based computing systems have emerged as a promising solution for overcoming the existing computing performance bottleneck. By cooling the circuitry to cryogenic temperatures, device leakage and wire resistance can be significantly reduced, leading to further improvements in energy efficiency and performance. Here, we conduct a comprehensive review of the cryogenic CMOS based computing systems across multiple optimization layers, including the CMOS process, modeling, electronic design automation (EDA), circuits, and architecture. Moreover, this review identifies potential future works and applications.

  • Original article
    HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI
    Integrated Circuits and Systems. 2024, 1(4): 178-195. https://doi.org/10.23919/ICS.2024.3515003

    Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect benefits of WSCs and to alleviate the on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short for WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, Deepspeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and Deepspeed, respectively, and is comparable to Megatron’s full recomputation method.
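    As context for the recomputation trade-off that TMAC folds into its design space, the standard gradient-checkpointing memory model is sketched below (a generic textbook relation, not TMAC's cost model; L, A, and A_ckpt are assumed symbols for layer count and activation footprints):

```latex
% Generic activation-checkpointing memory model (illustrative, not TMAC's formulation).
% L layers, per-layer activation footprint A, checkpointed inputs of size A_ckpt kept every k layers:
\begin{aligned}
M_{\text{no-recompute}} &\approx L \cdot A, \\
M_{\text{checkpoint}}(k) &\approx \underbrace{\tfrac{L}{k}\,A_{\text{ckpt}}}_{\text{stored inputs}}
  \;+\; \underbrace{k\,A}_{\text{recomputed segment}},
\qquad k^{*} = \sqrt{\tfrac{L\,A_{\text{ckpt}}}{A}} .
\end{aligned}
```

    Recomputation thus trades extra forward passes for a large reduction in resident activation memory, which is the lever the abstract describes.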

  • Original article
    NIDHI SHARMA, VINAYAK HANDE, DEVARSHI MRINAL DAS
    Integrated Circuits and Systems. 2024, 1(4): 206-213. https://doi.org/10.23919/ICS.2024.3505092

    A low-power, high-speed dynamic comparator with the addition of a cross-coupled pair in the pre-amplifier stage, followed by a strong-arm latch, is presented. The proposed modification increases the pre-amplifier's differential and common-mode gains, improving the latch's differential and common-mode input voltage and resulting in faster regeneration, with a 22% speed improvement compared to a conventional comparator at small input differential voltages (Vi,id). The proposed technique boosts the comparator's speed and helps achieve a 21% lower energy per conversion delay product (EDP) compared to the literature. Analytical modeling of the delay, which proves the improvement in the speed of the proposed comparator, is also presented and verified with simulation results. The proposed comparator's delay is insensitive to the common-mode voltage (Vi,cm). The proposed comparator is fabricated in 180-nm CMOS technology, and measurements show less than 160 ps relative CLK-Q delay with 81 fJ·ns EDP and 0.8 mV input-referred rms noise with a 1.8 V supply. To demonstrate the scalability of the proposed technique to advanced technology nodes, the proposed design is also simulated in 65-nm CMOS technology with a 1.1 V supply at a 5 GHz frequency. For a Vi,cm of 0.3 V and Vi,id of 1 mV and 10 mV, the proposed comparator exhibits delays of 40.69 ps and 32.41 ps and EDPs of 3.74 fJ·ns and 2.78 fJ·ns, respectively.
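    For context, a textbook two-phase delay model for a pre-amplifier-plus-latch dynamic comparator is shown below; it is an illustrative approximation, not the authors' exact derivation, and C_L, g_m,latch, and A_pre are assumed symbols:

```latex
% Illustrative two-phase delay model for a dynamic comparator (assumed form).
t_{\text{delay}} \;\approx\; t_{\text{pre}} \;+\;
\frac{C_{L}}{g_{m,\text{latch}}}\,
\ln\!\left(\frac{V_{DD}/2}{A_{\text{pre}}\,V_{i,\text{id}}}\right)
```

    Increasing the pre-amplifier gain A_pre (here via the cross-coupled pair) shrinks the logarithmic regeneration term, which is consistent with the speed-up reported at small Vi,id.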

  • Special Issue on Selected Papers from ICTA2023
    GUOQING WANG, ZHAO ZHANG
    Integrated Circuits and Systems. 2024, 1(2): 103-108. https://doi.org/10.23919/ICS.2024.3456043

    This work presents a PAM4 receiver analog frontend (AFE) operating up to 64 Gb/s. The electronic integrated circuit (EIC) is fabricated in 40-nm CMOS technology. This AFE is composed of a single-stage Continuous-Time Linear Equalizer (CTLE), a Variable Gain Amplifier (VGA), an input impedance matching network, a buffer stage, and an output buffer. The proposed single-stage triple-peaking CTLE employs a current-reuse technique and a multi-feedback structure, enabling the adjustment of peaking in the low, mid, and high-frequency bands. Thus, a single CTLE stage is sufficient to achieve an over-20-dB boost at the Nyquist frequency, saving power. The VGA adopts an enhanced structure based on the Gilbert cell, where the gain is manipulated by controlling the gate voltage of MOS transistors. The CTLE undergoes variations in its DC gain during the adjustment process to equalize channel losses; the role of the VGA is to stabilize the DC gain against the changes induced by the adjustment of the CTLE. The output buffer adopts two stages to ensure that the gain does not attenuate excessively while maintaining output impedance matching. The AFE consumes 21.1 mW with a supply voltage of 1.5/1 V. It can provide a maximum boost of 22.5 dB, and the data rate reaches up to 64 Gb/s. Additionally, it features peaking adjustment capabilities in the low, mid, and high-frequency bands. Finally, the measurement demonstrates its ability to effectively equalize a channel with a 12-dB loss at the Nyquist frequency of 16 GHz.
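    As a generic reference for the peaking behavior described above, a standard single-stage CTLE model is given below; it is not the exact triple-peaking transfer function of this design, and the zero and pole frequencies are assumed symbols:

```latex
% Generic one-zero/two-pole CTLE model (illustrative only).
H(s) \;=\; A_{\text{DC}}\,
\frac{1 + s/\omega_{z}}{\left(1 + s/\omega_{p1}\right)\left(1 + s/\omega_{p2}\right)},
\qquad
\text{boost} \;\approx\; 20\log_{10}\!\frac{\omega_{p1}}{\omega_{z}}\ \text{dB}
\quad (\omega_{z} < \omega_{p1} \ll \omega_{p2}).
```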

  • Original article
    RUNTIAN YANG, YUHAN HOU, SAVANNA BLADE, YINFEI LI, VANSH TYAGI, GLORIA-EDITH BOUDREAULT-MORALES, JOSÉ ZARIFFA, XILIN LIU
    Integrated Circuits and Systems. 2024, 1(4): 227-238. https://doi.org/10.23919/ICS.2024.3512503

    Monitoring rehabilitation progress at home over the long term following a spinal cord injury (SCI) is crucial for maximizing therapeutic outcomes and enhancing the quality of life of affected individuals. Comprehensive monitoring requires collecting a range of physiological data, including surface electromyography (sEMG) and exercise motion data. Currently, assessments typically take place in clinical settings, which can be both costly and inconvenient for patients. There is a lack of accessible, user-friendly systems that allow individuals with SCI to independently gather this data at home. Additionally, video recordings may be necessary to verify that patients are positioning the sensors correctly and performing the exercises accurately. To bridge this gap, we have developed a self-contained, multi-modal sensor system that captures sEMG and motion data, along with depth-sensing video to track patient exercises while ensuring privacy by minimizing identifiable details. The system includes a configurable number of wireless, multisensor wearable patches that are easy to attach and comfortable for extended use, along with a time-of-flight depth-sensing camera. The multi-modal data is streamed and synchronized in real-time on a Raspberry Pi, establishing an innovative platform to support SCI rehabilitation and adaptable for various clinical monitoring applications.
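    A minimal sketch of how multi-rate streams might be aligned by timestamp on the host is shown below; the function and field names are hypothetical and this is not the authors' software, only an illustration of nearest-timestamp synchronization:

```python
import numpy as np

def align_to_frames(frame_ts, semg_ts, semg_vals):
    """For each depth-camera frame timestamp, pick the nearest sEMG sample.

    frame_ts:  (F,) array of frame timestamps in seconds
    semg_ts:   (N,) sorted array of sEMG sample timestamps in seconds
    semg_vals: (N, C) array of sEMG samples (C channels)
    """
    idx = np.searchsorted(semg_ts, frame_ts)           # insertion points
    idx = np.clip(idx, 1, len(semg_ts) - 1)
    left, right = semg_ts[idx - 1], semg_ts[idx]
    idx -= (frame_ts - left) < (right - frame_ts)       # choose the nearer neighbor
    return semg_vals[idx]

# Example: 30 fps depth frames vs. 1 kHz sEMG over one second.
frames = np.linspace(0.0, 1.0, 30, endpoint=False)
semg_t = np.linspace(0.0, 1.0, 1000, endpoint=False)
semg_v = np.random.randn(len(semg_t), 8)
aligned = align_to_frames(frames, semg_t, semg_v)       # shape (30, 8)
```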

  • Special Issue on Selected Papers from ICTA2023
    WEIYI ZHANG, CHAOYANG DING, XIAORUI MO, FEI SHAO, YIYANG WANG, YUSHI GUO, LITING NIU, CHENG NIAN, FASIH UD DIN FARRUKH, CHUN ZHANG
    Integrated Circuits and Systems. 2024, 1(2): 66-79. https://doi.org/10.23919/ICS.2024.3449791

    Simultaneous Localization and Mapping (SLAM) is the process by which a mobile robot builds a map of the surrounding environment and computes its own location. Feature point extraction is one of the key components of a SLAM system. The extraction accuracy and efficiency of corner detection directly affect the overall accuracy and throughput of the system. However, the complexity of corner detection algorithms makes it challenging to achieve real-time implementation and efficient, low-cost hardware design, especially for mobile robots. Harris-class corner detection algorithms, including Harris and GFTT (Good Features to Track), offer improved accuracy; however, these algorithms incur high resource consumption and latency when implemented on hardware platforms, and GFTT achieves higher accuracy than Harris at the cost of higher computational complexity. To address the throughput problem, SFTT (Simple Feature to Track), a new Harris-class detection algorithm, is proposed, and the corresponding hardware accelerator is designed. The proposed SFTT significantly reduces the computational complexity compared with the Harris algorithm and GFTT. Experiments show that SFTT also achieves slightly higher accuracy than the two algorithms. Furthermore, the GFTT accelerator is designed, which reaches up to 325 fps at a frequency of 100 MHz. The proposed design achieves a 1.3× improvement in throughput and a 1.7× improvement in power efficiency compared to the state-of-the-art design.
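    For readers unfamiliar with the Harris family, a minimal numpy sketch of the classic Harris response is shown below; this is the textbook form that SFTT simplifies, not the paper's implementation, and the window size and k value are generic defaults:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(img, k=0.04, win=3):
    """Classic Harris corner response R = det(M) - k*trace(M)^2 per pixel."""
    img = img.astype(np.float64)
    Iy, Ix = np.gradient(img)                 # image gradients (rows = y, cols = x)
    # Structure-tensor entries, averaged over a local window
    Sxx = uniform_filter(Ix * Ix, size=win)
    Syy = uniform_filter(Iy * Iy, size=win)
    Sxy = uniform_filter(Ix * Iy, size=win)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2           # large positive R => corner

# Example: the corners of a bright square score highest.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
R = harris_response(img)
print(np.unravel_index(np.argmax(R), R.shape))
```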

  • SHUNQIN CAI, LIUKAI XU, DENGFENG WANG, ZHI LI, WEIKANG QIAN, LIANG CHANG, YANAN SUN
    Integrated Circuits and Systems. 2024, 1(2): 80-91. https://doi.org/10.23919/ICS.2024.3419630

    SRAM-based computing-in-memory (SRAM-CIM) is expected to solve the “Memory Wall” problem. For digital-domain SRAM-CIM, full-precision digital logic has been utilized to achieve high computational accuracy. However, the energy and area efficiency advantages of CIM cannot be fully utilized under error-resilient neural networks (NNs) with a given quantization bit-width. Therefore, an all-digital Bit-wise Approximate compressor configurable In-SRAM-computing macro for Energy-efficient NN acceleration, with a data-aware weight Remapping method (BASER), is proposed in this paper. Leveraging the NN error-resilience property, six energy-efficient bit-wise compressor configurations are presented under 4b/4b and 3b/3b NN quantization, respectively. Concurrently, a data-aware weight remapping approach is proposed to further enhance the NN accuracy without supplementary retraining. Evaluations of VGG-9 and ResNet-18 on the CIFAR-10 and CIFAR-100 datasets show that the proposed BASER achieves 1.35× and 1.29× improvements in energy efficiency, with limited accuracy loss and improved NN accuracy, compared to previous full-precision and approximate SRAM-CIM designs, respectively.
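    For background, the sketch below shows the exact arithmetic a 4:2 compressor must satisfy, together with a deliberately naive approximate variant and its error over all input patterns; it is illustrative only and is not one of the paper's six compressor configurations:

```python
from itertools import product

def exact_42(x1, x2, x3, x4, cin):
    """Exact 4:2 compressor: x1+x2+x3+x4+cin == s + 2*(c + cout)."""
    total = x1 + x2 + x3 + x4 + cin
    s = total & 1                      # weight-1 output
    hi = total >> 1                    # 0, 1 or 2, split over two weight-2 outputs
    cout = 1 if hi == 2 else 0
    c = hi - cout
    return s, c, cout

def approx_42(x1, x2, x3, x4, cin):
    """A naive approximation: ignore cin and OR-merge the two weight-2 outputs."""
    total = x1 + x2 + x3 + x4
    return total & 1, ((total >> 1) & 1) | (total >> 2), 0

# Measure the arithmetic error of the naive variant over all 32 input patterns.
errs = []
for bits in product([0, 1], repeat=5):
    sa, ca, coa = approx_42(*bits)
    errs.append((sa + 2 * (ca + coa)) - sum(bits))
print("mean error:", sum(errs) / len(errs), " max |error|:", max(map(abs, errs)))
```

    Approximate designs accept small, bounded errors like these in exchange for fewer gates and shorter carry paths, which is the trade-off BASER exploits under error-resilient NNs.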

  • Regular Papers
    FUPING LI, YING WANG, MEIXUAN LU, YUTONG ZHU, HAORAN WANG, ZHUN ZHAO, JUNPEI HUANG, XIAOTONG WEI, XIHAO LIANG, YUJIE WANG, HAOBO XU, HUAWEI LI, XIAOWEI LI, QI LIU, MING LIU, NINGHUI SUN, YINHE HAN
    Integrated Circuits and Systems. 2024, 1(1): 18-30. https://doi.org/10.23919/ICS.2024.3451428

    Due to the waning of Moore’s Law, the conventional monolithic chip architectural design is confronting hurdles such as increasing die size and skyrocketing cost. In this post-Moore era, the integrated chip has emerged as a pivotal technology, gaining substantial interest from both academia and industry. Compared with monolithic chips, chiplet-based integrated chips can significantly enhance system scalability, curtail costs, and accelerate design cycles. However, integrated chips introduce vast design spaces encompassing chiplets, inter-chiplet connections, and packaging parameters, thereby amplifying the complexity of the design process. This paper introduces the Optimal Decomposition-Combination Theory, a novel methodology to guide the decomposition and combination processes in integrated chip design. Furthermore, it offers a thorough examination of existing integrated chip design methodologies to showcase the application of this theory.

  • Original article
    SHIHONG ZHOU, XIN WANG, YANXING SUO, XIAO HAN, GUOXING WANG, YANG ZHAO
    Integrated Circuits and Systems. 2024, 1(4): 239-246. https://doi.org/10.23919/ICS.2024.3513261

    Real-time monitoring of multimodal vital signs, including electrocardiography (ECG) and photoplethysmography (PPG), on wearable devices is attracting increasing interest. Motion artifacts, ambient light interference, and sensor-skin contact variability significantly affect signal quality, demanding a multi-channel sensor interface chip with high dynamic range yet low power. A PPG/ECG interface chip is proposed for robust signal optimization. Time-division multiplexing, ambient double sampling, and DC current compensation together enhance the dynamic range. Fabricated in a 0.18 μm process, the chip features a 5.63-pArms directly digitized input-referred noise for PPG readout and 365-nVrms for ECG. A cross-scale dynamic range of 133 dB is achieved, providing saturation-free usage for sport wearable devices such as smart watches and rings.
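    As a back-of-envelope check of the headline number, the relation below assumes dynamic range is defined as full-scale input current over input-referred noise; this definition is our assumption, not a statement from the paper:

```latex
% Hedged back-of-envelope: full-scale photocurrent implied by a 133-dB range
% given the reported 5.63-pA_rms noise floor (assumed DR definition).
\mathrm{DR} = 20\log_{10}\!\frac{I_{\text{fs}}}{I_{\text{noise}}}
\;\Rightarrow\;
I_{\text{fs}} \approx 5.63\,\mathrm{pA} \times 10^{133/20} \approx 25\,\mu\mathrm{A}.
```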

  • Special Issue on Selected Papers from ICTA2023
    SHIJIE LI, RUICHANG MA, MINGXING DENG, JIAMIN XUE, WEI DENG, BAOYONG CHI, HAIKUN JIA
    Integrated Circuits and Systems. 2024, 1(2): 109-118. https://doi.org/10.23919/ICS.2024.3423852

    This paper presents a 32 Gbps wireline transceiver that not only supports the JESD204C standard but also maintains backward compatibility with JESD204B with minimal additional circuitry. Additionally, a pattern-filtered phase detector (PFPD) is proposed to circumvent the side effect of an ambiguous sampling clock phase caused by the loop-unrolled first post-cursor tap equalization scheme in the decision-feedback equalizer (DFE). A 16 GHz external half-rate clock is injected into an on-chip injection-locked ring oscillator to distribute the 16 GHz clock for both the receiver and the transmitter. Multiple on-chip adaptation engines and calibration loops are also added to ensure the whole system works properly, such as the tap-weight and desired-level adaptation engines integrated into the decision-feedback equalizer, duty-cycle distortion correction, and IQ-mismatch correction. Fabricated in a 28 nm CMOS process, the proposed transceiver operates over a signaling range from 312.5 Mbps to 32 Gbps, achieving a BER below 10⁻¹² over a channel with 14.9 dB loss at the Nyquist frequency. It occupies an aggregate area of 1.4 mm² and consumes 203 mW at 32 Gbps, of which 50 mW is consumed by the transmitter (TX) and 153 mW by the receiver (RX), achieving a power efficiency of 6.34 pJ/bit at 32 Gbps.
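    For reference, the first-post-cursor loop unrolling the PFPD works around can be written in the standard speculative-DFE form below; this is a generic illustration, not the paper's exact notation:

```latex
% Standard 1-tap loop-unrolled (speculative) DFE decision, shown for context.
\hat{d}[n] \;=\;
\begin{cases}
\operatorname{sgn}\!\big(y[n] - h_{1}\big), & \hat{d}[n-1] = +1,\\[2pt]
\operatorname{sgn}\!\big(y[n] + h_{1}\big), & \hat{d}[n-1] = -1,
\end{cases}
```

    Both candidate decisions are formed every unit interval and the previous bit selects between them, taking the h1 feedback off the critical path at the cost of the data-dependent threshold shift whose phase-detection ambiguity the proposed PFPD is designed to filter.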

  • Special Issue on Selected Papers from ICTA2023
    XIAOYAN GUI, LIN CHENG
    Integrated Circuits and Systems. 2024, 1(2): 64-65. https://doi.org/10.23919/ICS.2024.3483732
  • Special Issue on Selected Papers from ICTA2023
    JUNYAN SUN, XUEFEI BAI
    Integrated Circuits and Systems. 2024, 1(2): 92-102. https://doi.org/10.23919/ICS.2024.3419562

    CRYSTALS-Kyber has emerged as a notable lattice-based post-quantum cryptography (PQC) scheme. As one of the four finalists in NIST’s PQC standardization round three, CRYSTALS-Kyber is the only encryption algorithm demonstrating superior performance compared to other algorithms. The number theoretic transform (NTT) is employed to optimize polynomial multiplication, which constitutes the most complex operation within CRYSTALS-Kyber. This study introduces a high-speed NTT accelerator architecture, featuring a novel butterfly unit and an efficient modular polynomial multiplier. The proposed accelerator utilizes a radix-4-based configurable NTT design, which is capable of executing both forward and inverse NTT operations on a unified architecture. When implemented on the Xilinx Virtex-7 FPGA platform, the proposed architecture achieves an acceleration of 1.02-2.30 times in terms of latency, a throughput improvement of 1.02-2.30 times, and an area throughput improvement of up to 3.30 times, relative to the prior works.
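    To make the NTT building block concrete, the sketch below shows a minimal radix-2 Cooley-Tukey butterfly over Kyber's modulus q = 3329; the paper's design is radix-4 and hardware-oriented, so this is only the textbook software form for illustration:

```python
Q = 3329  # CRYSTALS-Kyber prime modulus

def ct_butterfly(a, b, zeta):
    """Radix-2 Cooley-Tukey (forward NTT) butterfly: (a, b) -> (a + zeta*b, a - zeta*b) mod Q."""
    t = (zeta * b) % Q
    return (a + t) % Q, (a - t) % Q

def gs_butterfly(a, b, zeta):
    """Gentleman-Sande (inverse NTT) butterfly: (a, b) -> (a + b, zeta*(a - b)) mod Q."""
    return (a + b) % Q, (zeta * (a - b)) % Q

# Round-trip check on one coefficient pair: GS undoes CT up to a factor of 2.
zeta = 17                                   # Kyber's 256-th primitive root of unity mod Q
a, b = 123, 456
x, y = ct_butterfly(a, b, zeta)
inv_zeta = pow(zeta, -1, Q)                 # modular inverse (Python 3.8+)
u, v = gs_butterfly(x, y, inv_zeta)
assert (u % Q, v % Q) == ((2 * a) % Q, (2 * b) % Q)
```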

  • Original article
    JUNLU ZHOU, LIANG CHANG, HAODONG FAN, HAORAN LI, YANCHENG CHEN, SHUISHENG LIN, JUN ZHOU
    Integrated Circuits and Systems. 2024, 1(3): 157-165. https://doi.org/10.23919/ICS.2024.3496614

    Deep learning typically requires large amounts of labeled data and often struggles with generalization, posing challenges for intelligent systems. In the real world, most electrocardiogram (ECG) signals are unlabeled, which limits the use of smart devices in ECG-related applications. Unsupervised learning methods, such as contrastive learning, have emerged as a solution to this constraint. However, most contrastive learning encoders rely on deep neural networks with many parameters, making them unsuitable for hardware implementation. This article introduces a hardware-friendly universal ECG encoder with around 1k parameters based on contrastive learning and a fine-tuning framework for ECG-related tasks. We apply the encoder to a dual-task system for ECG-based arrhythmia classification and authentication, achieving 98.2% and 99.7% accuracy on the MIT-BIH dataset, respectively, with FAR of 0.274 and FRR of 0.707 for authentication. We propose a dynamic averaging template concatenation technique to improve neural network generalization significantly. We also develop an energy-efficient hardware architecture optimized for the entire system, successfully implementing it on an FPGA.
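    The sketch below shows the kind of contrastive objective such an encoder is typically trained with, namely the generic NT-Xent/InfoNCE loss; the paper does not specify that this exact formulation is used, and the embeddings here are random placeholders:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """Generic NT-Xent contrastive loss between two augmented views z1, z2 of shape (B, D)."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)              # cosine-similarity space
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                                # never match a sample with itself
    B = len(z1)
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(0, B)])  # index of each sample's positive
    logits = sim - sim.max(axis=1, keepdims=True)                 # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * B), pos].mean()

# Example with random stand-ins for embeddings of two augmented ECG segments.
rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((16, 32)), rng.standard_normal((16, 32))
print(nt_xent(z1, z2))
```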

  • Original article
    TING YUE, LIANG CHANG, HAOBO XU, CHENGZHI WANG, SHUISHENG LIN, JUN ZHOU
    Integrated Circuits and Systems. 2024, 1(4): 196-205. https://doi.org/10.23919/ICS.2024.3506511

    The object detection algorithm based on convolutional neural networks (CNNs) significantly enhances accuracy by expanding network scale. As network parameters increase, large-scale networks demand substantial memory resources, making deployment on hardware challenging. Although most neural network accelerators utilize off-chip storage, frequent access to external memory restricts processing speed, hindering the ability to meet the frame-rate requirements of embedded systems. This creates a trade-off in which the speed and accuracy of embedded target detection accelerators cannot be simultaneously optimized. In this paper, we propose PODALA, an energy-efficient accelerator developed through an algorithm-hardware co-design methodology. For the object detection algorithm, we develop an optimized algorithm combining the inverted-residual structure and depthwise separable convolution, effectively reducing network parameters while preserving high detection accuracy. For the hardware accelerator, we develop a custom layer-fusion technique for PODALA to minimize memory access requirements. The overall design employs a streaming hardware architecture that combines a computing array with a refined ping-pong output buffer to execute different layer-fusion computing modes efficiently. Our approach substantially reduces memory usage through optimizations in both algorithmic and hardware design. Evaluated on the Xilinx ZCU102 FPGA platform, PODALA achieves 78 frames per second (FPS) and 79.73 GOPS/W energy efficiency, underscoring its superiority over state-of-the-art solutions.
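    The parameter saving behind the depthwise separable choice can be summarized with the standard count below; this is a textbook comparison, not a PODALA-specific figure:

```latex
% Standard parameter-count comparison for a K x K convolution with C_in inputs and C_out outputs.
\frac{P_{\text{dw-sep}}}{P_{\text{std}}}
= \frac{K^{2}C_{\text{in}} + C_{\text{in}}C_{\text{out}}}{K^{2}C_{\text{in}}C_{\text{out}}}
= \frac{1}{C_{\text{out}}} + \frac{1}{K^{2}}
\;\approx\; \frac{1}{9} \quad (K = 3,\ C_{\text{out}} \gg 9).
```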

  • Original article
    MILIN ZHANG
    Integrated Circuits and Systems. 2024, 1(4): 214-214. https://doi.org/10.23919/ICS.2024.3513736
  • Regular Papers
    SHAO-CHUN HUNG, PARTHO BHOUMIK, ARJUN CHAUDHURI, SANMITRA BANERJEE, KRISHNENDU CHAKRABARTY
    Integrated Circuits and Systems. 2024, 1(1): 3-17. https://doi.org/10.23919/ICS.2024.3419629

    As Moore’s Law approaches its limits, 3-D integrated circuits (ICs) have emerged as promising alternatives to conventional scaling methodologies. However, the benefits of 3-D integration in terms of lower power consumption, higher performance, and reduced area are accompanied by testing challenges. The unique vertical stacking of components in 3-D ICs introduces concerns related to the robustness of bonding surfaces. Moreover, immature manufacturing processes during 3-D fabrication can lead to high defect rates in different tiers. Therefore, there is a need for design-for-test solutions to ensure the reliability and performance of 3-D-integrated architectures. In this paper, we provide a comprehensive survey of existing testing strategies for 3-D ICs. We describe recent advances, including research efforts and industry practice, that address concerns related to bonding defects, elevated power supply noise, fault diagnosis, and fault localization specific to the unique characteristics of 3-D ICs.

  • Special Section on Selected Papers from ASICON2023
    XIANGCHEN WAN, SIQING WU, XINWEI YU, XINGTAO ZHU, FAN YE
    Integrated Circuits and Systems. 2024, 1(1): 33-42. https://doi.org/10.23919/ICS.2024.3422708

    This paper presents an AC-coupled ultrasound analog front-end (AFE) architecture with a three-stage DC offset correction (DCOC) circuit. In ultrasound systems, the low noise amplifier (LNA), time gain control (TGC), and low pass filter (LPF) constitute the AFE, which achieves low noise, time-varying gain compensation, and filtering for the received ultrasound signal. The inherent asymmetry in the LNA, layout asymmetry, and process variation introduce a DC offset, which the TGC converts into low-frequency offset drift. The proposed DCOC circuit for the LNA is composed of a transconductance amplifier and an off-chip capacitor, while a fully differential operational amplifier and a pseudo resistor are used for the other amplification stages. The AC coupling scheme is also used to reduce the offset and drift. Simulation results show that when the DCOC and AC coupling are adopted, the offset and drift are almost completely suppressed. The proposed AFE has been fabricated in a 28-nm CMOS process; it achieves an 85 dB gain range with a low input-referred noise of 2.43 nV/ $\sqrt{Hz}$ at 5 MHz, a tunable bandwidth of 15/30 MHz, and a switchable input impedance of 50/100 Ω.
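    One common way to reason about such a DC-offset-cancellation loop is sketched below, under our assumption of a Gm-C integrator placed in feedback around a gain-A stage; the paper's exact topology may differ:

```latex
% Assumed model: DCOC integrator G_m/(sC) in feedback around an amplifier of gain A.
\frac{v_{\text{out}}}{v_{\text{in}}}(s) \;=\; \frac{A\,s}{s + A\,G_m/C}
\;\Rightarrow\;
f_{\text{HP}} \;\approx\; \frac{A\,G_m}{2\pi C},
```

    so a small Gm and a large (here off-chip) capacitor push the high-pass corner well below the ultrasound band while still removing DC offset and slow drift.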

  • Special Section on Selected Papers from ASICON2023
    JIAXIANG LI, MASAO YANAGISAWA, YOUHUA SHI
    Integrated Circuits and Systems. 2024, 1(1): 53-62. https://doi.org/10.23919/ICS.2024.3423850

    Large-scale neural networks have brought incredible changes to the world, transforming people's lives and offering vast prospects. However, they also come with enormous demands for computational power and storage; the core of their computational requirements lies in the matrix multiplication units dominated by multiplication operations. To address this issue, we propose an area- and power-efficient multiplier-less processing element (PE) design. Prior to implementing the proposed PE, we apply a power-of-2 dictionary-based quantization to the model and confirm the effectiveness of this quantization method in preserving the accuracy of the original model. In hardware design, we present a standard architecture and one ‘bi-sign’ variant of the PE. Our evaluation results demonstrate that the systolic array that implements our standard multiplier-less PE achieves approximately 38% lower power-delay product and a 13% smaller core area compared to a conventional multiply-and-accumulate PE, while the bi-sign PE design saves 37% core area and 38% computation energy. Furthermore, the applied quantization reduces the model size and operand bit-width, leading to decreased on-chip memory usage and energy consumption for memory accesses. Additionally, the hardware schematic facilitates expansion to support other sparsity-aware, energy-efficient techniques.
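    The multiplier-less idea rests on power-of-2 weights turning multiplications into shifts; a minimal sketch is shown below. It is illustrative only: the dictionary construction and the bi-sign variant in the paper are more involved, and the example values are arbitrary:

```python
import numpy as np

def quantize_pow2(w):
    """Snap each weight to a nearby signed power of two (nearest in the log domain)."""
    sign = np.sign(w)
    k = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
    return sign, k

def shift_mac(x, sign, k):
    """Multiplier-less MAC: x * (sign * 2**k); 2.0**k stands in for the hardware barrel shift."""
    return np.sum(sign * x * (2.0 ** k))

w = np.array([0.24, -0.52, 1.1, -0.13])
x = np.array([3.0, -1.0, 2.0, 4.0])
sign, k = quantize_pow2(w)
print("exact MAC:", float(x @ w), " shift-based MAC:", float(shift_mac(x, sign, k)))
```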

  • Original article
    YANXING SUO, HAOXIANG GUO, RUIZHE YU, YANG ZHAO
    Integrated Circuits and Systems. 2024, 1(4): 215-226. https://doi.org/10.23919/ICS.2024.3511474

    This paper presents a comprehensive review of the advancements of wearables over 20 years and explores future directions in the field. The review focuses on the evolving needs, innovation solutions, and technological progress of these sensors. We categorize the existing literature into key themes, highlighting significant advancements, challenges and future prospects in wearable sensor technology.

  • Original article
    CHAO WANG, YUXIN JI, JIAJIE HUANG, LIANG QI, YONGFU LI
    Integrated Circuits and Systems. 2024, 1(3): 127-136. https://doi.org/10.23919/ICS.2024.3497916

    This brief presents an ultra-low voltage single-ended level shifter (LS) with a stacked current mirror and an improved split-controlled inverter as an output driver to enable wide-range voltage conversion. At the ultra-low input supply voltages, VDDL, the differential LS circuit will gradually be dysfunctional as the inverter produces limited voltage swings at the output. Some prior works have replaced the inverter with a pass transistor, whose gate is connected to the lower supply voltage, VDDL, to ensure the proper operation of the current mirror in its pull-up network (PUN). This requires the use of the “tie-high” standard cell to prevent gate breakdown in the pass transistor but it is unable to function properly at ultra-supply voltage. To solve this problem, we proposed to connect the pass transistor gate to the input transistor’s drain. The proposed LS circuit and prior single-ended LS circuit works have been fabricated in 55nm CMOS technology and a total of 10 chips for each circuit have been measured. The proposed LS circuit operates with a single input signal with a supply voltage of 100mV at a frequency of 1MHz. With a VDDL of 200mV and VDDH of 1.2V, the measured propagation delay is 182.1ns and the energy per transition (EPT) is around 4.35∼5.44 pJ. It has achieved a 1.08∼ 2.25× improvement in the Figure of Merit (FoM) than prior multi-supply works and a maximum improvement of 1134× compared to prior single-supply work. The FoM is based on the ratio between propagation delay and level conversion differences, which enables us to understand the circuit’s

    ability to operate efficiently under wide signal-level conversion.

  • Original article
    ZESONG JIANG, MUHAN ZHANG, QINGYUN LIU, RUNZE LIU
    Integrated Circuits and Systems. 2024, 1(3): 137-143. https://doi.org/10.23919/ICS.2024.3499944

    This study explores the potential of Field-Programmable Gate Arrays (FPGAs) within the realm of cryogenic computing, which promises enhanced performance and power efficiency by reducing leakage power and wire resistance at low temperatures. Prior research has mainly adapted commercial FPGAs for cryogenic temperatures without fully exploiting the technology’s benefits, necessitating significant design efforts for each application scenario. By characterizing FPGA performance in cryogenic conditions and examining the influence of architectural parameters, we propose a Bayesian optimization-based framework for systematic FPGA architecture exploration to identify FPGA architectures that are optimally suited for cryogenic applications. The architectures we developed, aimed at operating efficiently at 77K, significantly outperform conventional FPGAs designed for room-temperature conditions in performance and power consumption.
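    The sketch below shows what a Bayesian-optimization loop over FPGA architecture parameters might look like, using scikit-optimize; the knobs, their ranges, and the placeholder cost are hypothetical and do not reproduce the authors' framework or their 77 K characterization data:

```python
from skopt import gp_minimize
from skopt.space import Integer

# Hypothetical architecture knobs (names and ranges assumed for illustration).
space = [
    Integer(4, 10, name="lut_size"),        # K: LUT input count
    Integer(4, 16, name="cluster_size"),    # N: BLEs per logic cluster
    Integer(1, 4,  name="channel_mult"),    # routing channel-width multiplier
]

def objective(params):
    lut_size, cluster_size, channel_mult = params
    # Placeholder cost standing in for a cryogenic delay-power figure
    # obtained from architecture-evaluation tooling.
    return (lut_size - 6) ** 2 + (cluster_size - 8) ** 2 + 3 * (channel_mult - 2) ** 2

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best architecture:", result.x, " best cost:", result.fun)
```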

  • Original article
    FENG YANG, DUY-HIEU BUI, YANG ZHAO, LIANG QI, JINGHUA ZHANG, XUAN-TU TRAN, YONGFU LI
    Integrated Circuits and Systems. 2024, 1(3): 120-126. https://doi.org/10.23919/ICS.2024.3482310

    The rapid growth of Internet-of-Things (IoT) applications necessitates the development of cost-effective solutions and accelerated design cycles. While digital circuit design has witnessed significant automation, analog design still heavily relies on experienced engineers. To bridge this gap, synthesizable solutions that integrate digital and analog design automation are crucial for efficient IoT development. This paper explores the historical context and current challenges in circuit automation and electronic design automation (EDA) tools, specifically focusing on analog and mixed-signal circuits. As successive approximation register (SAR) ADCs play a critical role in IoT applications, we critically examine state-of-the-art synthesizable SAR ADCs, discussing the strengths and limitations of different design techniques for various circuit blocks. These findings provide valuable insights for researchers and industry practitioners, informing future research directions in the field of synthesizable SAR ADC design automation.
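    For readers outside the ADC community, the successive-approximation search itself is a simple binary search; a minimal behavioral sketch is given below (idealized and single-ended, with no DAC or comparator non-idealities):

```python
def sar_adc(vin, vref=1.0, bits=10):
    """Idealized SAR ADC: binary-search the DAC code whose voltage best matches vin."""
    code = 0
    for i in reversed(range(bits)):
        trial = code | (1 << i)                 # tentatively set the next bit
        vdac = vref * trial / (1 << bits)       # ideal DAC output for that code
        if vin >= vdac:                         # comparator decision
            code = trial                        # keep the bit
    return code

# Example: digitize 0.3 V against a 1-V reference at 10-bit resolution (~0.3 * 1024 = 307).
print(sar_adc(0.3))
```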

  • Original article
    YIFAN SONG, YULONG MENG, BINHAN CHEN, CHEN SONG, KANG YI
    Integrated Circuits and Systems. 2024, 1(3): 144-156. https://doi.org/10.23919/ICS.2024.3458897

    Recently, large Transformer models have achieved impressive results in various natural language processing tasks, but they require enormous numbers of parameters and intensive computations, necessitating deployment on multi-device systems. Current solutions introduce complicated topologies with dedicated high-bandwidth interconnects to reduce communication overhead. To deal with the complexity problem in system architecture and reduce the overhead of inter-device communication, this paper proposes SALTM, a multi-device system based on a unidirectional ring topology and a 2-D model partitioning method that considers quantization and pruning. First, a 1-D model partitioning method is proposed to reduce the amount of communication. Then, the block distributed on each device is further partitioned in the orthogonal direction, introducing a task-level pipeline to overlap communication and computation. To further explore SALTM’s performance on a real large model like GPT-3, we develop an analytical model to evaluate the performance and communication overhead. Our simulation shows that a BERT model with 110 million parameters, implemented by SALTM on four FPGAs, can achieve 9.65× and 1.12× speedups compared to a CPU and a GPU, respectively. The simulation also shows that the execution time of 4-FPGA SALTM is 1.52× that of an ideal system with infinite inter-device bandwidth. For GPT-3 with 175 billion parameters, our analytical model predicts that SALTM comprising 16 VC1502 FPGAs and 16 A30 GPUs can achieve inference latencies of 287 ms and 164 ms, respectively.
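    The benefit of overlapping communication with computation via the task-level pipeline can be summarized in the first-order way below; this is a generic model, not SALTM's analytical model, and T_comp, T_comm, and T_fill are assumed symbols:

```latex
% Generic first-order per-iteration latency model, illustrative only.
T_{\text{serial}} \;\approx\; T_{\text{comp}} + T_{\text{comm}},
\qquad
T_{\text{pipelined}} \;\approx\; \max\!\left(T_{\text{comp}},\, T_{\text{comm}}\right) + T_{\text{fill}},
```

    where T_fill is the one-time pipeline-fill cost; the reported 1.52× gap to an infinite-bandwidth system suggests the communication term is largely, but not completely, hidden.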

  • Special Section on Selected Papers from ASICON2023
    HENG ZHANG, YICHUAN BAI, JUNJIE SHEN, YUAN DU, LI DU
    Integrated Circuits and Systems. 2024, 1(1): 43-53. https://doi.org/10.23919/ICS.2024.3422968

    Deep learning has recently gained significant prominence in various real-world applications such as image recognition, natural language processing, and autonomous vehicles. While deep neural networks appear to have different architectures, the main operations within these models are matrix-vector multiplications (MVM). Compute-in-memory (CIM) architectures are promising solutions for accelerating the massive MVM operations by alleviating the frequent data movement issue in traditional processors. Analog CIM macros leverage current-accumulating or charge-sharing mechanisms to perform multiply-and-add (MAC) computations. Even though they can achieve high throughput and efficiency, computing accuracy is sacrificed due to analog nonidealities. To ensure precise MAC calculations, it is crucial to analyze the sources of nonidealities and identify their impacts, along with corresponding solutions. In this paper, a comprehensive linearity analysis and dedicated calibration methods for charge-domain static random-access memory (SRAM) based in-memory computing circuits are proposed. We analyze nonidealities in three areas based on the mechanism of charge-domain computing: the charge injection effect, temperature variations, and ADC reference voltage mismatch. By designing a 256 × 256 CIM macro and conducting investigations via post-layout simulation, we conclude that these nonidealities do not degrade the computing linearity but only cause scaling and bias drift. To mitigate the scaling and bias drift identified, we propose three calibration methods ranging from the circuit level to the algorithm level, all of which exhibit promising results. The comprehensive analysis and calibration methods can assist in designing CIM macros with more accurate MAC computations, thereby supporting more robust deep learning inference.
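    The conclusion that the nonidealities produce only scaling and bias drift maps naturally onto a linear correction; a minimal sketch of such a fit is shown below, using synthetic data rather than the macro's measurements and not the paper's specific circuit- or algorithm-level methods:

```python
import numpy as np

rng = np.random.default_rng(1)
ideal = rng.integers(-128, 128, size=500).astype(float)      # ideal MAC results
measured = 0.93 * ideal + 4.7 + rng.normal(0, 0.5, 500)       # scaling + bias drift + noise

# Least-squares fit of measured = a*ideal + b, then invert it to calibrate.
a, b = np.polyfit(ideal, measured, deg=1)
calibrated = (measured - b) / a

print(f"estimated scale {a:.3f}, bias {b:.2f}, "
      f"residual rms {np.sqrt(np.mean((calibrated - ideal)**2)):.3f}")
```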

  • Special Section on Selected Papers from ASICON2023
    FRANCOIS RIVET, LIANG QI
    Integrated Circuits and Systems. 2024, 1(1): 31-32. https://doi.org/10.23919/ICS.2024.3484397
  • Editorial
    MING LIU, PETER LIAN
    Integrated Circuits and Systems. 2024, 1(1): 2-2. https://doi.org/10.23919/ICS.2024.3483733