Original article

PODALA: Power-Efficient Object Detection Accelerator With Customized Layer Fusion Engine

  • TING YUE 1,
  • LIANG CHANG 1,
  • HAOBO XU 2,
  • CHENGZHI WANG 3,
  • SHUISHENG LIN 1,
  • JUN ZHOU 1
  • 1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
  • 2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • 3 National Innovation Institute of Defense Technology, Academy of Management Studies, Beijing 100071, China
CORRESPONDING AUTHORS: LIANG CHANG; JUN ZHOU

Received: 2024-10-19

Revised: 2024-11-08

Accepted: 2024-11-15

Online published: 2025-01-09

Supported by the National Natural Science Foundation of China (62104025, 62104229, 62104259).

Abstract

The accuracy of convolutional neural network (CNN)-based object detection algorithms is typically improved by scaling up the network. As the parameter count grows, large-scale networks demand substantial memory, which makes hardware deployment challenging. Although most neural network accelerators rely on off-chip storage, frequent external memory access limits processing speed and prevents embedded systems from meeting their frame-rate requirements. The result is a trade-off in which the speed and accuracy of embedded object detection accelerators cannot be optimized simultaneously. In this paper, we propose PODALA, an energy-efficient accelerator developed through algorithm-hardware co-design. On the algorithm side, we develop an optimized detection network that combines the inverted-residual structure with depthwise separable convolution, effectively reducing network parameters while preserving high detection accuracy. On the hardware side, we develop a customized layer fusion technique for PODALA that minimizes memory access. The overall design employs a streaming hardware architecture that combines a computing array with a refined ping-pong output buffer to execute the different layer fusion computing modes efficiently. Through these joint algorithmic and hardware optimizations, our approach substantially reduces memory usage. Evaluated on the Xilinx ZCU102 FPGA platform, PODALA achieves 78 frames per second (FPS) and an energy efficiency of 79.73 GOPS/W, outperforming state-of-the-art solutions.
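As a rough illustration of the algorithmic building block the abstract describes, the Python (PyTorch) sketch below implements a generic inverted-residual block built around a depthwise separable convolution. The expansion ratio, activation function, and channel counts here are assumptions chosen for illustration; the abstract does not specify PODALA's exact layer configuration.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Generic inverted-residual block with a depthwise separable convolution.

    Illustrative sketch only: expansion ratio, activation, and channel
    counts are assumptions, not the PODALA configuration.
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution: groups == channels, so each channel
            # is filtered independently -- the source of the parameter saving
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 pointwise projection back to out_ch (linear, no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_residual else y

# A depthwise separable 3x3 conv costs in*3*3 + in*out weights versus
# in*out*3*3 for a standard conv; for 64->64: 64*9 + 64*64 = 4672 vs 36864.
x = torch.randn(1, 64, 56, 56)
print(InvertedResidual(64, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Similarly, the layer fusion idea, producing a downstream layer band by band so the intermediate feature map never travels to external memory, can be sketched in plain NumPy. The band size, single-channel 3x3 layers, and zero padding are simplifying assumptions; PODALA's actual fusion modes and ping-pong buffering scheme are not reproduced here.

```python
import numpy as np

def conv_band(x, w, a, b):
    """Rows [a, b) of a zero-padded 'same' 3x3 convolution of x with w."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros((b - a, W))
    for i in range(a, b):
        for j in range(W):
            out[i - a, j] = np.sum(xp[i:i+3, j:j+3] * w)
    return out

def fused_two_layers(x, w1, w2, band=4):
    """Produce layer-2 output band by band, computing only the layer-1
    rows each band needs, so the full intermediate map is never stored."""
    H, W = x.shape
    out = np.zeros_like(x)
    for r0 in range(0, H, band):
        r1 = min(r0 + band, H)
        # A 3x3 kernel needs a one-row halo of layer-1 output on each side.
        a, b = max(r0 - 1, 0), min(r1 + 1, H)
        mid = conv_band(x, w1, a, b)  # the only live piece of the intermediate
        # For demo simplicity the band is placed in a zeroed full-size canvas;
        # a real accelerator would keep just this band in an on-chip buffer.
        canvas = np.zeros((H, W))
        canvas[a:b] = mid
        out[r0:r1] = conv_band(canvas, w2, r0, r1)
    return out

# The fused schedule matches running the two layers back to back.
x = np.random.rand(16, 16)
w1, w2 = np.random.rand(3, 3), np.random.rand(3, 3)
ref = conv_band(conv_band(x, w1, 0, 16), w2, 0, 16)
assert np.allclose(fused_two_layers(x, w1, w2), ref)
```

In this fused schedule only a few rows of the intermediate feature map are alive at any time, which is the property a layer fusion engine exploits to cut off-chip traffic.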

Cite this article

TING YUE, LIANG CHANG, HAOBO XU, CHENGZHI WANG, SHUISHENG LIN, JUN ZHOU. PODALA: Power-Efficient Object Detection Accelerator With Customized Layer Fusion Engine[J]. Integrated Circuits and Systems, 2024, 1(4): 196-205. DOI: 10.23919/ICS.2024.3506511
