Most Viewed

  • Co-Optimization for Large Language Models: Advances in Algorithm and Hardware
    SHAOBO LUO, ALBERT YU, ZHIYUAN XIE, HONG HUANG, MINGQIANG HUANG, KAI LI, YUK KAN PUN, ZHIRU GUO, SHUWEI LI, YIMING ZHU, CHANGHAI MAN, HUIYUAN SUN, TUNG-HAN CHANG, ZIYI GUAN, QIYUAN ZHANG, TINGTING WANG, GUANQI PENG, WENJUN CHEN, YAN SUN, GENGXIN CHEN, MEI YAN, HAO YU
    Integrated Circuits and Systems. 2025, 2(2): 67-80. https://doi.org/10.23919/ICS.2025.3552542

    Precision medicine is revolutionizing global healthcare by enabling personalized diagnostics, disease prediction, and tailored treatment strategies. While the integration of genomics and data science holds immense potential to optimize precision therapeutic outcomes, a critical challenge lies in translating gene sequencing data into actionable insights for in vitro diagnostics. This bottleneck is largely attributed to the limitations of edge-side intelligent processing and automation. Despite advancements in gene sequencing technologies and bioinformatics tools, the workflow from sample collection to diagnostic report generation remains fragmented, inefficient, and lacking in intelligence. To address these challenges, we introduce an embodied LLM NGS sequencer on the edge for real-time, on-site smart genetic diagnostics. This instrument integrates a streamlined and comprehensive pipeline with deep learning networks for primary data analysis, machine learning for secondary data processing, and a large language model (LLM) optimized for tertiary data interpretation. The LLM is enhanced through quantization and compression, facilitating deployment on FPGA/GPU to accelerate diagnostic workflows. Experimental results show superior performance: a 13.72% increase in throughput, a Q30 of 99.50%, and smart diagnostics on the edge at up to 75 tokens/s. This work enables immediate, on-site DNA analysis, dramatically improving precision medicine’s accessibility and efficiency and significantly advancing diagnostic accuracy and automation. It establishes a robust platform for AI-driven personalized medicine and sets a new benchmark for the future of healthcare delivery.

  • Regular Papers
    FEI LIU, LANGYUAN WANG, SHUYU ZHANG, HANLU ZHANG, NA YAN
    Integrated Circuits and Systems. 2025, 2(3): 110-121. https://doi.org/10.23919/ICS.2025.3582894

    This paper presents a single-inductor-multiple-output (SIMO) buck/boost/buck-boost converter for wearable electronic devices. Aiming at high light-load efficiency and low ripple, the converter applies fully asynchronous burst mode control. The circuit enters sleep mode intermittently during light loads, significantly reducing static power consumption. The peak inductor current is fixed, effectively limiting the maximum output ripple. The converter features three conversion modes: buck, boost, and auto-gain buck-boost. DC analysis is conducted to derive expressions for output ripple and maximum load in relation to the peak inductor current. AC stability analysis is performed with small signal perturbation and linearization methods, proving the stability of all three modes. Measured results indicate that the converter achieves a peak efficiency of 91.0% at an output power of 77.5 mW. The maximum output ripple is 27.0 mV, and no overshoot or undershoot is observed during load transients. Compared with existing converters, it exhibits higher efficiency and lower ripple, along with a fast load transient response, offering a highly efficient power management solution for wearable devices.

  • Regular Papers
    SHENGNAN ZHANG, YIFAN ZHAO, XINGLONG YU, JUN HAN
    Integrated Circuits and Systems. 2025, 2(3): 149-157. https://doi.org/10.23919/ICS.2025.3579338

    SPHINCS+ is a hash-based digital signature scheme that was selected in 2022 for post-quantum cryptography (PQC) standardization by the U.S. National Institute of Standards and Technology (NIST). Although SPHINCS+ offers significant security against quantum attacks, its relatively slow computation times present a major obstacle to its practical deployment. To address this challenge, improving the computational efficiency of SPHINCS+ becomes a critical task. The cryptographic operations in SPHINCS+ rely on tweakable hash functions, with various hash algorithms available for selection. Among these, SHA-3 stands out as a widely adopted and NIST-standardized hash function, making it a preferred choice for implementation in SPHINCS+. In this work, we propose a dedicated coprocessor that integrates a SHA-3 accelerator along with its associated peripheral structure. This coprocessor is designed to extend the RISC-V instruction set by incorporating seven custom instructions, enabling efficient software-hardware co-acceleration. Furthermore, we investigate the parallelizable components within SPHINCS+, specifically the FORS and WOTS+ algorithms, to identify means for optimization. By leveraging thread-level parallelism through multi-core programming, we achieve significant improvements in performance. To validate the design, synthesis is performed using TSMC 28-nm CMOS technology at 800 MHz. Compared to the benchmark results from the ARM Cortex-M4 processor, our approach achieves a 23.1× speedup in the overall single-core performance of SPHINCS+, with an additional 3.4× speedup for the verification process by utilizing multi-core acceleration.

  • Regular Papers
    JUNZHAN LIU, JINYAO MI, YANG LIU, LIANG ZHANG, HE ZHANG, WANG KANG
    Integrated Circuits and Systems. 2025, 2(3): 102-109. https://doi.org/10.23919/ICS.2025.3567939

    Computing-in-memory (CIM) offers a promising solution to the memory wall issue. Magnetoresistive random-access memory (MRAM) is a favored medium for CIM due to its non-volatility, high speed, low power, and technology maturity. However, MRAM has continuously encountered the challenge of an insufficient high-resistance state (HRS) to low-resistance state (LRS) ratio, which affects the result accuracy of CIM. In this paper, based on SOT devices, we propose a 5T2M bit-cell structure that increases the high-to-low current ratio by modulating the sub-threshold operation region. Besides, by jointly using high-resistance devices (M_ level), the power consumption of the bit-cell array can be significantly reduced. Simultaneously, we have designed a compatible multi-bit implementation and macro architecture to support AI edge inference acceleration. This work was simulated under a 40-nm foundry process and a physically verified SOT-MTJ model. The results show that under the same high-to-low resistance ratio, a 52.6× high-to-low current ratio can be achieved, along with a 38.6%-98% bit-cell array power reduction.

  • Regular Papers
    DAYAN ZHOU, YUGUO XIANG, FAN YE
    Integrated Circuits and Systems. 2025, 2(3): 131-138. https://doi.org/10.23919/ICS.2025.3571821

    This paper presents a fully digital foreground calibration method for pipeline-SAR analog-to-digital converters (ADCs) using sine-fit based on the Extended Kalman Filter (EKF). The sine-fit technique provides a reference output, while an adaptive Least Mean Square (LMS) algorithm iteratively adjusts the reconstruction weights to correct mismatches and nonlinearities. The EKF significantly reduces hardware complexity by enabling real-time estimation without requiring extensive data storage. A modeled 12-bit pipeline-SAR ADC is used to evaluate the method’s effectiveness. Simulation results demonstrate that the proposed calibration scheme improves the spurious-free dynamic range (SFDR) and signal-to-noise-and-distortion ratio (SNDR) by 33.6 dB and 18.8 dB, respectively.
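
The LMS weight-adjustment step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reference output is assumed to be already available (the abstract obtains it from an EKF-based sine fit, which is not modeled here), and the name `lms_calibrate`, the step size `mu`, and the MSB-first bit ordering are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def lms_calibrate(codes: Sequence[Sequence[int]],
                  reference: Sequence[float],
                  n_bits: int,
                  mu: float = 0.1,
                  epochs: int = 200) -> List[float]:
    """Estimate ADC reconstruction weights with a plain LMS loop.

    codes     : raw bit decisions per sample, MSB first (assumed layout)
    reference : reference output per sample (assumed given; hypothetical input)
    """
    # Start from ideal binary weights 1/2, 1/4, ... and let LMS correct them.
    weights = [2.0 ** -(k + 1) for k in range(n_bits)]
    for _ in range(epochs):
        for bits, ref in zip(codes, reference):
            # Reconstructed output using the current weight estimates.
            estimate = sum(w * b for w, b in zip(weights, bits))
            error = ref - estimate
            # LMS update: move each active bit weight along the error.
            weights = [w + mu * error * b for w, b in zip(weights, bits)]
    return weights
```

With noise-free reference values and codes that exercise every bit, the estimates approach the weights that generated the reference. A real calibration would also have to handle noise and the quality of the sine-fit reference.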

  • Special Issue Papers
    YAN ZHOU, JUN WANG
    Integrated Circuits and Systems. 2025, 2(1): 28-35. https://doi.org/10.23919/ICS.2025.3547674

    This study presents a comprehensive thermal stress analysis of critical components in an embedded multi-die interconnect bridge (EMIB) within a chiplet package using finite element analysis (FEA). We systematically evaluated key design parameters—including bump diameter-to-pitch ratios, bump distribution patterns, EMIB thickness, number of EMIBs, and aspect ratios—to assess their impact on stresses. An ABAQUS-based FEA model was used to simulate thermal loading with a 165 °C temperature increase. The results indicate that a bump diameter-to-pitch ratio of 0.3 optimizes stress distribution, while a peripheral bump arrangement is superior in stress reduction compared to other patterns. Thinner EMIBs linearly reduce maximum principal stress, whereas multiple EMIBs and aspect ratio variations have minimal effects. These findings offer practical guidelines for optimizing EMIB design in chiplet packages, emphasizing the importance of bump geometry, distribution patterns, and EMIB thickness for improved reliability.

  • Regular Papers
    SHUYI XIANG, KA’NAN WANG, RENJIE TANG, YUKUN HE, ZHENGYANG YE, XI’AN CHEN, XIAOYAN GUI
    Integrated Circuits and Systems. 2025, 2(2): 93-98. https://doi.org/10.23919/ICS.2025.3564576

    This paper presents a 130 GBaud four-to-one analog multiplexer (AMUX) with four-level pulse-amplitude modulation (PAM-4) in a 130-nm SiGe BiCMOS process. The architecture comprises two stages of the two-to-one AMUX. The four quarter-rate signals are fed into the first-stage AMUX circuit after equalization by continuous-time linear equalizers (CTLE) to produce two-way half-rate signals through time interleaving. The AMUX core circuit of the second stage is based on the Gilbert cell. Compared to the conventional sampling method where the clock signal is centered within 1 UI of the data signal, the second-stage AMUX in this design aligns the rising edge of the clock signal with the transition edge of the data signal during sampling. This approach avoids the idle dummy branches in the conventional design, thereby significantly improving the energy efficiency. The AMUX generates two full-rate data signals spaced by 1 UI for subsequent feed-forward equalization (FFE). A two-tap FFE is designed with the transconductance (Gm) cell to compensate for the channel loss. As for the clock chain, the half-rate clock is provided by an external high-speed clock source. It passes through a voltage-controlled delay line (VCDL) to regulate the timing relationship between the clock and data signals in the second stage. The two-way quarter-rate clocks in quadrature phases must be generated from the half-rate clock for the two AMUXs in the first stage. Finally, a 130 GBaud PAM-4 signal is generated with a power consumption of 1 W.

  • Special Issue Papers
    HANRU YANG, TIANRUI LYU, XIAO HUANG, JIANPING GUO
    Integrated Circuits and Systems. 2025, 2(1): 13-21. https://doi.org/10.23919/ICS.2025.3553458

    This work presents a bandgap voltage reference (BGR) with source-sink dual current compensation achieving a low temperature coefficient (TC) over the automotive temperature range from −40 to 125 °C. The two compensation currents are the inverted-V current (IinvV) and the high-low temperature linear current (IHLT), which appear in the form of sourcing and sinking currents, respectively. This design introduces an inverted-V current to mitigate the degradation of the compensation effect caused by temperature range drifts. By exploiting the characteristics of IinvV and IHLT exhibiting the same drift trend, the dual current compensation achieves the compensation performance over the entire automotive temperature range while mitigating the impact of temperature range drifts, thereby optimizing the overall compensation effect. The measured results show that it achieves the best TC of 2.0 ppm/°C and an average consumption current of 44 μA at room temperature. Moreover, the linear sensitivity (LS) is 0.04%/V and power supply rejection (PSR) is −60 dB at 1 Hz at room temperature.

  • Regular Papers
    LINA WANG, JIANZHENG LI, WEIMIN HU, YAJIE QIN
    Integrated Circuits and Systems. 2025, 2(3): 122-130. https://doi.org/10.23919/ICS.2025.3569486

    This paper presents a highly integrated wearable electrochemical sensor chip for sweat monitoring, incorporating both a current readout circuit and a programmable excitation waveform generator circuit. The chip is fabricated using a 0.11 μm standard CMOS process. The design utilizes a high-resolution and wide dynamic range current readout circuit for multimodality electrochemical sensing. A bidirectional current sensing potentiostat, based on a cascode current mirror, is presented. The circuit achieves bidirectional current sensing while isolating the sensing electrode from the subsequent circuitry, enhancing its versatility for various electrochemical measurement techniques. Additionally, the implementation of a current feedback loop, in conjunction with an automatic amplitude control method and a current-mode digital-to-analog converter, not only extends the dynamic range of the input current but also effectively eliminates the background currents. This design achieves 101 dB current dynamic range and 123 pA current resolution in the detection current range of ±15 μA with an R² linearity of 0.9999. It also attains a nonlinearity of 0.07%, ensuring minimal distortion. The current readout circuit consumes 12 μA of static current from a 1.5 V supply.

  • Co-Optimization for Large Language Models: Advances in Algorithm and Hardware
    CHENG ZHANG, XINGYU ZHU, LONGHAO CHEN, TINGJIE YANG, EVENS PAN, GUOSHENG YU, YANG ZHAO, XIGUANG WU, BO LI, WEI MAO, GENQUAN HAN
    Integrated Circuits and Systems. 2025, 2(2): 49-57. https://doi.org/10.23919/ICS.2025.3568404

    Large language models (LLMs) have exhibited remarkable performance across a broad spectrum of tasks, yet their extensive computational and memory requirements present substantial challenges for deployment in resource-constrained scenarios. To address the challenges, this work introduces software and hardware co-optimization strategies aimed at enhancing the inference performance of LLMs on ARM CPU-based platforms. A mixed-precision quantization technique is employed, preserving the precision of critical weights to maintain model accuracy while quantizing non-essential weights to INT8, thereby reducing the model’s memory footprint. This work also capitalizes on the SIMD instruction set of ARM CPUs to efficiently process model data. Furthermore, the inference framework is optimized by fusing components of the attention computation and streamlining the dequantization process through modifications to the scaling factor. These enhancements result in a significant reduction in model memory usage and improved throughput during the prefill and decode stages. The efficacy of the proposed approach is demonstrated through the optimization of the Qwen-1.8B model on Armv9, with only a 0.66% decrease in accuracy and a reduction in memory usage to 58.8% of the baseline, while achieving a 4.09× and 15.23× increase in inference performance for the prefill and decode stages over the baseline, respectively.
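
The mixed-precision idea in this abstract (keep critical weights at full precision, quantize the rest to INT8) can be sketched as follows. The abstract does not say how critical weights are chosen, so this sketch assumes a largest-magnitude criterion. The names `quantize_mixed`, `dequantize_mixed`, and `keep_fraction` are hypothetical, and a single symmetric per-tensor scale is an assumed simplification.

```python
from typing import Dict, List, Tuple


def quantize_mixed(weights: List[float],
                   keep_fraction: float = 0.1
                   ) -> Tuple[List[int], float, Dict[int, float]]:
    """Quantize weights to INT8, keeping a few 'critical' ones in float.

    Assumption: 'critical' means largest magnitude (not stated in the source).
    """
    n_keep = max(1, int(len(weights) * keep_fraction))
    # Indices sorted by descending magnitude; the first n_keep stay in float.
    order = sorted(range(len(weights)),
                   key=lambda i: abs(weights[i]), reverse=True)
    kept = {i: weights[i] for i in order[:n_keep]}
    # Symmetric scale derived only from the weights that will be quantized.
    rest = [abs(w) for i, w in enumerate(weights) if i not in kept]
    max_abs = max(rest) if rest else 0.0
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [
        0 if i in kept else max(-127, min(127, round(w / scale)))
        for i, w in enumerate(weights)
    ]
    return quantized, scale, kept


def dequantize_mixed(quantized: List[int], scale: float,
                     kept: Dict[int, float]) -> List[float]:
    """Rebuild float weights: exact for kept entries, scaled INT8 otherwise."""
    return [kept[i] if i in kept else q * scale
            for i, q in enumerate(quantized)]
```

Excluding the outliers from the scale computation keeps the quantization step small for the remaining weights, which is one common motivation for treating a few weights separately.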

  • Regular Papers
    TAO ZHONG, YUEKANG GUO, JING JIN, JIANJUN ZHOU
    Integrated Circuits and Systems. 2025, 2(3): 139-148. https://doi.org/10.23919/ICS.2025.3563318

    The evolution of 5G and beyond wireless networks has intensified the demand for millimeter-wave technology to support high-throughput applications. This paper introduces a novel energy-efficient digital beamforming receiver architecture that integrates multi-stage noise-shaping (MASH) delta-sigma modulators (DSMs) with bit-stream processing (BSP), effectively addressing the significant propagation losses and dynamic electromagnetic interference associated with millimeter-wave (mm-wave) systems. The novel architecture achieves enhanced dynamic range without increasing signal bit-width, thereby ensuring low power consumption and a compact design. Unlike traditional analog and hybrid beamforming methods, the proposed approach utilizes digital-domain processing for precise beamforming, simplified local oscillator networks, and improved integration. System-level simulations with a 9-antenna beamforming receiver array demonstrate the architecture’s capability for accurate beamforming across angles from 30° to 150° and effective dual-target detection. Furthermore, the P2S-BSP architecture reduces digital circuitry area by 50% compared to previous implementations while maintaining energy efficiency. These advancements highlight the proposed architecture as a scalable solution for future mm-wave applications, including intelligent transportation systems, radar, and high-density mobile networks.

  • Regular Papers
    YUYANG LIU, RUNYE DING, YUJIE CHEN, PUJIN XIE, YAO LIU, ZHIYI YU
    Integrated Circuits and Systems. 2025, 2(3): 158-166. https://doi.org/10.23919/ICS.2025.3550116

    Since the discovery of speculative execution attacks based on side channels, there has been a long history of research on their attack mechanisms and defense principles. To explore TLB side channels, we constructed a System-on-Chip (SoC) centered around the XuanTie C910 processor on a Virtex UltraScale+ HBM VCU128 FPGA and ran the Linux operating system on this platform. We successfully implemented the Spectre-v1 attack targeting the multi-level TLB structure of the XuanTie C910 processor, identifying the second-level TLB as the primary target of the attack. In addition, we proposed a defense mechanism called TLBshield-v1, which employs a 50-percent block rate policy on the write-back channel from the Page Table Walker to the second-level TLB, thereby mitigating all attacks based on the second-level TLB. We tested a 50-percent block rate policy, which reduced the success rate of the Spectre-v1 attack from 100 percent to 55.7 percent, with a performance overhead of only 1.77 percent. Furthermore, we designed TLBshield-v2, with different block rates of second-level TLB, tested their corresponding performance overheads and security implications, and introduced a normalized evaluation metric, Security-Versus-Performance to determine the optimal design strategy that balances performance overhead and security under varying security requirements.

  • Regular Papers
    FENGSHUO TIAN, KAIXUAN WANG, JUN HAN
    Integrated Circuits and Systems. 2025, 2(3): 167-173. https://doi.org/10.23919/ICS.2025.3583689

    Control flow integrity (CFI) plays an important role in defending against code reuse attacks (CRA). It protects the program’s control flow from being hijacked by restricting control flow transfers during execution. Specifically, backward-edge CFI safeguards return addresses to mitigate Return-Oriented Programming (ROP) attacks. In this work, we implement a backward-edge CFI mechanism that employs the Advanced Encryption Standard (AES) for cryptographic protection of return addresses. We utilize the gem5 simulator for architectural modeling and evaluation. Additionally, we design a dedicated AES hardware accelerator and integrate it into the system through gem5+RTL co-simulation. The AES accelerator is synthesized under TSMC 28 nm technology, which can work at 1 GHz, with an area of 10045 μm² and a power consumption of 1.31 mW. Experimental results indicate that the performance overhead of the backward-edge CFI scheme is less than 0.1%.

  • Special Issue Papers
    XI YANG, YAMIN MAO, LIANG CHANG, HAOJIE WEI, YUANBO WANG, JINGKE WANG, CHAO FAN, ZHONGMOU WU, SHOUZHONG PENG, JUN ZHOU
    Integrated Circuits and Systems. 2025, 2(1): 4-12. https://doi.org/10.23919/ICS.2025.3553460

    General-purpose edge neural networks need a lightweight architecture that effectively balances storage and computing resources. However, SRAM-based computing-in-memory (CIM) architectures face challenges in delivering adequate on-chip storage while fulfilling computing requirements. To overcome this, we introduce a new MRAM-based near-memory computing (NMC) architecture. It retains the cost-effective data access benefits of CIM while separating storage and computing at the macro-level, improving deployment adaptability. We refine the NMC macro by incorporating small temporary storage and adopting a layer-fusion approach to enhance data-transfer efficiency. By integrating a high-capacity MRAM into the macro, we attain a storage density of 0.532 μm²/bit. Moreover, we enhance the adder tree with a shift module, supporting multiply-and-accumulate (MAC) operations at five distinct depths (8, 9, 16, 32, and 64), which raises resource utilization efficiency to 88.3%. Our architecture achieves an on-chip storage density of 1.49 Mb/mm² and an energy efficiency of 6.164 TOPS/W.

  • Regular Papers
    SHUAI XIAO, FUYI LI, TING HAO, LANXIANG XIAO, MANLIN XIAO, WEI MAO, GENQUAN HAN
    Integrated Circuits and Systems. 2025, 2(2): 81-92. https://doi.org/10.23919/ICS.2025.3571019

    Computing-in-Memory (CIM) architectures have emerged as a pivotal technology for next-generation artificial intelligence (AI) and edge computing applications. By enabling computations directly within memory cells, CIM architectures effectively minimize data movement and significantly enhance energy efficiency. In the CIM system, the analog-to-digital converter (ADC) bridges the gap between efficient analog computation and general digital processing, while influencing the overall accuracy, speed and energy efficiency of the system. This review presents theoretical analyses and practical case studies on the performance requirements of ADCs and their optimization methods in CIM systems, aiming to provide ideas and references for the design and optimization of CIM systems. The review comprehensively explores the relationship between the design of CIM architectures and ADC optimization, and raises the issue of design trade-offs between low power consumption, high speed operation and compact integration design. On this basis, novel customized ADC optimization methods are discussed in depth, and a large number of current CIM systems and their ADC optimization examples are reviewed, with optimization methods summarized and classified in terms of power consumption, speed, and area. In the final part, this review analyzes energy efficiency, ENOB, and frequency scaling trends, demonstrating how advanced processes enable ADCs to balance speed, power, and area trade-offs, guiding ADC optimization for next-gen CIM systems.

  • Co-Optimization for Large Language Models: Advances in Algorithm and Hardware
    ZHE WEN, LIANG XU, MEIQI WANG
    Integrated Circuits and Systems. 2025, 2(2): 58-66. https://doi.org/10.23919/ICS.2025.3575371

    In recent years, the exponential growth in Large Language Model (LLM) parameter sizes has significantly increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction process during autoregressive decoding, resulting in substantial delays. Therefore, enhancing decoding efficiency while maintaining accuracy has become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method, which leverages speculative decoding techniques by employing a Small-Scale Model (SSM) for preliminary inference and validating the predictions using the original LLM. This approach effectively balances the high precision of LLMs with the efficiency of SSMs. Notably, our SSM does not require additional training but is instead derived through a simplification of the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses certain redundant transformation layers, significantly improving GPU utilization and inference speed without compromising performance. Furthermore, to address challenges such as window size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value cache deletion techniques to further optimize the performance of SSMs. Experimental results demonstrate that the proposed method achieves a 2.51× improvement in efficiency during autoregressive decoding. As this approach eliminates the need for additional training of SSM, it offers a significant competitive advantage in high-cost model compression environments.
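
The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a generic greedy variant, not the APLS method itself: layer skipping, parallelization, and key-value cache handling are not modeled. The two models are stand-in callables, and all names (`speculative_decode`, `toy_target`, `toy_draft`, `k`) are hypothetical.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to the next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1) The small draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) The target checks the proposal. A real system would score all
        #    k positions in one batched pass; this toy calls it per position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all matched: bonus token
        seq.extend(accepted)
    return seq[:goal], rounds


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the large model (deterministic toy rule)."""
    return (7 * ctx[-1] + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Stand-in for the small model: wrong whenever len(ctx) % 4 == 0."""
    tok = toy_target(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50
```

Because every accepted token is one the target would have produced itself, the output matches plain greedy decoding with the target. The gain comes from needing fewer verification rounds than generated tokens.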

  • Co-Optimization for Large Language Models: Advances in Algorithm and Hardware
    ZHONGFENG WANG
    Integrated Circuits and Systems. 2025, 2(2): 47-48. https://doi.org/10.23919/ICS.2025.3577274

  • Special Issue Papers
    YANG ZHANG, LONGYUAN KANG, XIANGRUI WANG, ENYI YAO
    Integrated Circuits and Systems. 2025, 2(1): 22-27. https://doi.org/10.23919/ICS.2025.3553464

    Ising machines are a potential solver for combinatorial optimization problems (COPs), but their convergence speed and accuracy can still be improved at the level of algorithm and architecture design. In this paper, a novel parallel stochastic cellular automata tempering (PSCAT) algorithm is proposed, which combines the high parallel efficiency of stochastic cellular automata annealing (SCA) with the more balanced Monte Carlo sampling of parallel tempering (PT), to enhance the performance of fully-connected Ising machines. To achieve an area-efficient hardware design, a modified temperature exchange probability is applied to reduce the number of replicas and the utilization of the spin update module is improved by reducing the flip decision block. Additionally, the proposed coefficient access pattern effectively reduces memory overhead by sharing the weight matrix. The design prototype with 2,048 spins and 8 replicas is validated on FPGA. Using the K2000 max-cut problem as a benchmark, our design achieves a solution accuracy of 98.94% within 0.5 ms, which is higher than two state-of-the-art works.
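
For orientation, the textbook replica-exchange rule of parallel tempering and the usual max-cut-to-Ising mapping can be sketched as below. The abstract uses a modified exchange probability whose form is not given, so this shows only the standard Metropolis rule. The function names and the energy convention (minimizing the sum of w_ij * s_i * s_j maximizes the cut) are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (node i, node j, weight w_ij)


def ising_energy(edges: Sequence[Edge], spins: List[int]) -> float:
    """H = sum of w_ij * s_i * s_j over edges, with spins in {-1, +1}."""
    return sum(w * spins[i] * spins[j] for i, j, w in edges)


def cut_value(edges: Sequence[Edge], spins: List[int]) -> float:
    """Total weight of edges whose endpoints carry opposite spins."""
    total = sum(w for _, _, w in edges)
    return (total - ising_energy(edges, spins)) / 2.0


def swap_probability(beta_i: float, energy_i: float,
                     beta_j: float, energy_j: float) -> float:
    """Standard parallel-tempering acceptance for swapping two replicas.

    beta is the inverse temperature. The swap is always accepted when it
    moves the lower-energy state to the colder (higher-beta) replica.
    """
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)
```

For example, on a unit-weight triangle the spin assignment `[1, -1, 1]` has energy -1 and cuts two of the three edges, which is the maximum cut for that graph.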

  • Editorial
    ZHIYI YU, MINGYU WANG
    Integrated Circuits and Systems. 2025, 2(3): 100-101. https://doi.org/10.23919/ICS.2025.3586794

  • Special Issue Papers
    KUN HE, HUANPENG WANG, YUE TANG, JIE LIU, MUSHENG LIANG, YUEHANG XU
    Integrated Circuits and Systems. 2025, 2(1): 36-45. https://doi.org/10.23919/ICS.2025.3547670

    Current thermoelectric models under electrical stress often neglect the critical impact of crack formation, limiting their predictive accuracy for solder ball reliability. To study the impact of cracks under electrical stress conditions, this study designed an electrical stress-induced failure experiment, applying stepwise current loading to the devices under test (DUTs) to obtain resistance-time curves, with computed tomography (CT) revealing cracks in the solder balls. Based on these experimental results, a thermoelectric coupling model was developed to predict the temperature-resistance relationship of heterogeneous interconnect structures, incorporating crack factors observed during the experiment. The thermoelectric coupling model demonstrated high accuracy, achieving a maximum error of less than 2.5%. By incorporating the effects of crack formation under high electrical stress, the model provides precise predictions of solder ball resistance evolution.

  • Editorial
    LIN CHENG, BO ZHAO
    Integrated Circuits and Systems. 2025, 2(1): 2-3. https://doi.org/10.23919/ICS.2025.3555024