09 September 2024 Volume 1 Issue 4
  
  • Original article
    YUHAO SHU, BIN NING, YIFEI LI, ZHAODONG LYU, JINCHENG WANG, LINTAO LAN, YUXIN ZHOU, MENGRU ZHANG, HONGTU ZHANG, YAJUN HA

    As integrated circuits advance into the post-Moore era, improving computing performance faces several challenges, making it difficult to meet ever-growing computing demands. Computing systems based on cryogenic complementary metal-oxide-semiconductor (CMOS) technology have emerged as a promising way to overcome the existing performance bottleneck. Cooling the circuitry to cryogenic temperatures significantly reduces device leakage and wire resistance, which further improves energy efficiency and performance. Here, we conduct a comprehensive review of cryogenic CMOS-based computing systems across multiple optimization layers, including the CMOS process, modeling, electronic design automation (EDA), circuits, and architecture. Moreover, this review identifies potential future work and applications.

  • Original article
    HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI

    Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on the computational power and bandwidth of hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect of WSCs and to alleviate the on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short for WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on mapping-based communication volume analysis, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared with AccPar, Deepspeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and Deepspeed, respectively, and is comparable to Megatron’s full recomputation method.

  • Original article
    TING YUE, LIANG CHANG, HAOBO XU, CHENGZHI WANG, SHUISHENG LIN, JUN ZHOU

    Object detection algorithms based on convolutional neural networks (CNNs) significantly enhance accuracy by expanding network scale. As network parameters increase, large-scale networks demand substantial memory resources, making deployment on hardware challenging. Although most neural network accelerators utilize off-chip storage, frequent access to external memory restricts processing speed, hindering the ability to meet the frame rate requirements of embedded systems. This creates a trade-off in which the speed and accuracy of embedded object detection accelerators cannot be simultaneously optimized. In this paper, we propose PODALA, an energy-efficient accelerator developed through an algorithm-hardware co-design methodology. For the object detection algorithm, we develop an optimized algorithm that combines the inverse-residual structure with depthwise separable convolution, effectively reducing network parameters while preserving high detection accuracy. For the hardware accelerator, we develop a custom layer fusion technique for PODALA to minimize memory access requirements. The overall design employs a streaming hardware architecture that combines a computing array with a refined ping-pong output buffer to execute different layer fusion computing modes efficiently. Our approach substantially reduces memory usage through optimizations in both algorithmic and hardware design. Evaluated on the Xilinx ZCU102 FPGA platform, PODALA achieves 78 frames per second (FPS) and an energy efficiency of 79.73 GOPS/W, underscoring its superiority over state-of-the-art solutions.
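
    As a rough illustration of why depthwise separable convolution reduces network parameters, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The layer shape (3 × 3 kernel, 64 input channels, 128 output channels) and the function names are hypothetical, chosen only for illustration; they are not taken from PODALA.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, for illustration only (not from the paper).
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
print(standard, separable, round(reduction, 2))
```

    For this assumed layer, the separable form needs roughly one eighth of the weights; the actual savings in any network depend on its layer shapes.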

  • Original article
    NIDHI SHARMA, VINAYAK HANDE, DEVARSHI MRINAL DAS

    A low-power, high-speed dynamic comparator that adds a cross-coupled pair to the pre-amplifier stage, followed by a strong-arm latch, is presented. The proposed modification increases the pre-amplifier's differential and common-mode gains, improving the latch's differential and common-mode input voltage and resulting in faster regeneration, with a 22% speed improvement over a conventional comparator at small input differential voltages (Vi,id). The proposed technique boosts the comparator's speed and helps achieve a 21% lower energy-per-conversion delay product (EDP) compared with the literature. Analytical modeling of the delay, which demonstrates the speed improvement of the proposed comparator, is also presented and verified against simulation results. The proposed comparator's delay is insensitive to the common-mode voltage (Vi,cm). The proposed comparator is fabricated in 180-nm CMOS technology, and measurements show less than 160 ps relative CLK-Q delay with 81 fJ·ns EDP and 0.8 mV input-referred rms noise with a 1.8 V supply. To demonstrate the scalability of the proposed technique to advanced technology nodes, the proposed design is also simulated in 65-nm CMOS technology with a 1.1 V supply at a frequency of 5 GHz. For Vi,cm of 0.3 V and Vi,id of 1 mV and 10 mV, the proposed comparator exhibits delays of 40.69 ps and 32.41 ps and EDPs of 3.74 fJ·ns and 2.78 fJ·ns, respectively.

  • Original article
    MILIN ZHANG
  • Original article
    YANXING SUO, HAOXIANG GUO, RUIZHE YU, YANG ZHAO

    This paper presents a comprehensive review of the advancements in wearables over the past 20 years and explores future directions in the field. The review focuses on the evolving needs, innovative solutions, and technological progress of wearable sensors. We categorize the existing literature into key themes, highlighting significant advancements, challenges, and future prospects in wearable sensor technology.

  • Original article
    RUNTIAN YANG, YUHAN HOU, SAVANNA BLADE, YINFEI LI, VANSH TYAGI, GLORIA-EDITH BOUDREAULT-MORALES, JOSÉ ZARIFFA, XILIN LIU

    Monitoring rehabilitation progress at home over the long term following a spinal cord injury (SCI) is crucial for maximizing therapeutic outcomes and enhancing the quality of life of affected individuals. Comprehensive monitoring requires collecting a range of physiological data, including surface electromyography (sEMG) and exercise motion data. Currently, assessments typically take place in clinical settings, which can be both costly and inconvenient for patients. There is a lack of accessible, user-friendly systems that allow individuals with SCI to gather these data independently at home. Additionally, video recordings may be necessary to verify that patients are positioning the sensors correctly and performing the exercises accurately. To bridge this gap, we have developed a self-contained, multi-modal sensor system that captures sEMG and motion data, along with depth-sensing video to track patient exercises while preserving privacy by minimizing identifiable details. The system includes a configurable number of wireless, multisensor wearable patches that are easy to attach and comfortable for extended use, along with a time-of-flight depth-sensing camera. The multi-modal data are streamed and synchronized in real time on a Raspberry Pi, establishing an innovative platform that supports SCI rehabilitation and can be adapted to various clinical monitoring applications.
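
    The abstract does not describe how the streams are synchronized. As one possible illustration only, the sketch below aligns two timestamped streams by nearest-timestamp matching; the function name `align_nearest`, the millisecond timestamps, and the skew threshold are all hypothetical and are not drawn from the authors' system.

```python
from bisect import bisect_left


def align_nearest(ref_times, other_times, other_values, max_skew):
    """For each reference timestamp, pick the other-stream sample closest in time.

    Returns one entry per reference timestamp: the matched value, or None
    when no sample lies within max_skew. other_times must be sorted ascending.
    """
    aligned = []
    for t in ref_times:
        i = bisect_left(other_times, t)
        # Candidates: the sample just before and the sample at or after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_times)]
        if not candidates:
            aligned.append(None)
            continue
        best = min(candidates, key=lambda j: abs(other_times[j] - t))
        if abs(other_times[best] - t) <= max_skew:
            aligned.append(other_values[best])
        else:
            aligned.append(None)
    return aligned


# Hypothetical timestamps in milliseconds, for illustration only.
frame_ms = [0, 100, 200]            # depth-camera frames (reference stream)
emg_ms = [10, 40, 90, 120, 350]     # sEMG sample timestamps
emg_values = [1, 2, 3, 4, 5]
matched = align_nearest(frame_ms, emg_ms, emg_values, max_skew=50)
print(matched)
```

    Here the third frame gets no match because the nearest sEMG sample is more than 50 ms away; a real system would also need to handle clock offset and drift between wireless patches and the camera.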

  • Original article
    SHIHONG ZHOU, XIN WANG, YANXING SUO, XIAO HAN, GUOXING WANG, YANG ZHAO

    Real-time monitoring of multimodal vital signs, including electrocardiography (ECG) and photoplethysmography (PPG), on wearable devices is attracting increasing interest. Motion artifacts, ambient light interference, and sensor-skin contact variability significantly affect signal quality, demanding a multi-channel sensor interface chip with high dynamic range yet low power. A PPG/ECG interface chip is proposed for robust signal optimization. Time-division multiplexing, ambient double sampling, and DC current compensation together enhance the dynamic range. Fabricated in a 0.18 μm process, the chip features a directly digitized input-referred noise of 5.63 pArms for PPG readout and 365 nVrms for ECG. A cross-scale dynamic range of 133 dB is achieved, providing saturation-free usage for sport wearable devices such as smart watches and rings.
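
    The idea behind ambient double sampling can be sketched in the digital domain: each LED-on reading is paired with an LED-off reading, and the difference cancels the ambient-light component. This is a simplified illustration that assumes ambient light does not change between the paired samples; the chip presumably performs the equivalent operation in its analog front end, and the function name and photocurrent values below are hypothetical.

```python
def ambient_double_sample(led_on, led_off):
    """Subtract the LED-off (ambient-only) reading from each LED-on reading."""
    if len(led_on) != len(led_off):
        raise ValueError("each LED-on sample needs a paired LED-off sample")
    return [on - off for on, off in zip(led_on, led_off)]


# Hypothetical photocurrent readings in nanoamps, for illustration only.
ambient = [500, 520, 480, 510]   # slowly varying ambient light
pulse = [100, 104, 98, 101]      # LED-induced PPG component
led_on = [a + p for a, p in zip(ambient, pulse)]  # what the sensor sees with the LED on
recovered = ambient_double_sample(led_on, ambient)
print(recovered)
```

    In this toy case the LED-induced component is recovered exactly even though the ambient level is several times larger than the signal.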