Special Section on Selected Papers from ASICON2023

An Efficient Multiplier-Less Processing Element on Power-of-2 Dictionary-Based Data Quantization

  • JIAXIANG LI,
  • MASAO YANAGISAWA,
  • YOUHUA SHI
  • Department of Electronic and Physical Systems, Graduate School of Fundamental Science and Engineering, Waseda University, Tokyo 169-8555, Japan
YOUHUA SHI (e-mail: ).

Jiaxiang Li (Student Member, IEEE) received the B.S. degree in microelectronic science and engineering from Sichuan University, Chengdu, China, in 2019, and the M.S. degree in electronic engineering (circuits and systems) from the University of California at Irvine, Irvine, CA, USA, in 2021. He is currently working toward the Dr.Eng. degree with Waseda University, Tokyo, Japan. His research interests include energy-efficient digital circuits and neural network hardware accelerator design.

Masao Yanagisawa (Member, IEEE) received the B.Eng., M.Eng., and Dr. Eng. degrees in electrical engineering from Waseda University, Tokyo, Japan, in 1981, 1983, and 1986, respectively. From 1986 to 1987, he was with the University of California at Berkeley, Berkeley, CA, USA. He joined Takushoku University, in 1987. He joined Waseda University, in 1991, where he is currently a Professor with the Faculty of Science and Engineering. His research interests include combinatorics and graph theory, computational geometry, LSI design and verification, and bioinformatics.

Youhua Shi (Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Southeast University, Nanjing, China, in 1999 and 2002, respectively, and the Dr.Eng. degree in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 2005. He is currently a Professor with the Faculty of Science and Engineering, Waseda University. His research interests include various aspects of integrated system design, such as design-for-reliability, energy harvesting, and intelligent system design.

Received date: 2024-02-28

  Revised date: 2024-05-14

  Accepted date: 2024-06-27

  Online published: 2024-11-27

Supported by

Waseda University Open Innovation Ecosystem Program for Pioneering Research (W-SPRING) under Grant Number JPMJSP2128

Abstract

Large-scale neural networks have astonished the world, changing people's lives and offering vast prospects. However, they also bring enormous demands for computational power and storage; the core of their computational cost lies in matrix multiplication units dominated by multiplication operations. To address this issue, we propose an area- and power-efficient multiplier-less processing element (PE) design. Before implementing the proposed PE, we apply a power-of-2 dictionary-based quantization to the model and confirm that this quantization method preserves the accuracy of the original model. In hardware design, we present a standard architecture of the PE and a 'bi-sign' variant. Our evaluation results demonstrate that a systolic array built from our standard multiplier-less PE achieves approximately 38% lower power-delay product and a 13% smaller core area than a conventional multiply-and-accumulate PE, and the bi-sign PE design saves 37% core area and 38% computation energy. Furthermore, the applied quantization reduces the model size and operand bit-width, decreasing on-chip memory usage and the energy consumed by memory accesses. Additionally, the hardware schematic facilitates extension to support other sparsity-aware, energy-efficient techniques.
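The core idea described in the abstract — mapping each weight to the nearest signed power of two from a small dictionary so that every multiplication collapses into a bit shift — can be illustrated with a minimal sketch. This is an illustrative model only, not the paper's exact algorithm or hardware: the dictionary bounds, the nearest-entry rounding rule, and the helper names (`build_dictionary`, `quantize`, `shift_multiply`) are assumptions for demonstration.

```python
import math

def build_dictionary(min_exp, max_exp):
    """Allowed magnitudes: {2^min_exp, ..., 2^max_exp} (assumed dictionary form)."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1)]

def quantize(w, dictionary):
    """Map weight w to the nearest signed dictionary entry (0 stays 0)."""
    if w == 0.0:
        return 0.0
    mag = min(dictionary, key=lambda d: abs(abs(w) - d))
    return math.copysign(mag, w)

def shift_multiply(x, w_q):
    """Multiplier-less product of integer x and quantized weight (+/-)2^e.

    A positive exponent is a left shift, a negative one a truncating
    right shift -- the operation a multiplier-less PE performs in place
    of a full multiplication.
    """
    if w_q == 0.0:
        return 0
    e = int(math.log2(abs(w_q)))
    y = (x << e) if e >= 0 else (x >> -e)
    return -y if w_q < 0.0 else y

d = build_dictionary(-2, 3)
w_q = quantize(0.9, d)          # rounds to 1.0 (= 2^0)
print(shift_multiply(5, w_q))   # shift-based product, no multiplier
```

Because each quantized weight is fully described by a sign bit and a small dictionary index, the operand bit-width shrinks as well, which is the source of the on-chip memory savings the abstract mentions.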

Cite this article

JIAXIANG LI, MASAO YANAGISAWA, YOUHUA SHI. An Efficient Multiplier-Less Processing Element on Power-of-2 Dictionary-Based Data Quantization[J]. Integrated Circuits and Systems, 2024, 1(1): 53-62. DOI: 10.23919/ICS.2024.3423850
