eViTBins: Edge-Enhanced Vision-Transformer Bins for Monocular Depth Estimation on Edge Devices

Yutong She, Peng Li, Mingqiang Wei, Dong Liang, Yiping Chen, Haoran Xie, Fu Lee Wang

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Monocular depth estimation (MDE) remains a fundamental yet not fully solved problem in computer vision. Existing MDE methods often produce blurred or even indistinct depth boundaries, degrading the quality of vision-based intelligent transportation systems. This paper presents an edge-enhanced vision transformer bins network for monocular depth estimation, termed eViTBins. eViTBins has three core modules that together predict monocular depth maps with exceptional smoothness, accuracy, and fidelity to scene structures and object edges. First, a multi-scale feature fusion module is proposed to avoid losing depth information at different feature levels during depth regression. Second, an image-guided edge-enhancement module is proposed to accurately infer depth values around image boundaries. Third, a vision transformer-based depth discretization module is introduced to capture the global depth distribution. Moreover, unlike most MDE models, which rely on high-performance GPUs, eViTBins is optimized for seamless deployment on edge devices such as the NVIDIA Jetson Nano and Google Coral SBC, making it well suited to real-time intelligent transportation system applications. Extensive experimental evaluations corroborate the superiority of eViTBins over competing methods, notably in preserving depth edges and global depth representations.
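
The abstract does not spell out how the depth discretization module turns bins into a depth value. As a rough illustration only, the sketch below shows one common formulation of adaptive depth bins: predicted bin widths are normalized and converted to bin-center depths, and each pixel's depth is the probability-weighted sum of those centers. Whether eViTBins uses exactly this formulation is an assumption, and all names here (bin_centers, pixel_depth, d_min, d_max) are hypothetical.

```python
import math

def bin_centers(widths, d_min, d_max):
    """Convert non-negative bin widths into bin-center depths in [d_min, d_max]."""
    total = sum(widths)
    centers, left = [], 0.0
    for w in widths:
        w_norm = w / total  # normalize so that all widths sum to 1
        centers.append(d_min + (d_max - d_min) * (left + w_norm / 2))
        left += w_norm      # left edge of the next bin
    return centers

def softmax(logits):
    """Numerically stable softmax over a list of per-bin scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixel_depth(logits, centers):
    """Depth of one pixel as the probability-weighted sum of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))

# Assumed toy values: three bins over an 8 m range, one pixel with equal scores.
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
depth = pixel_depth([0.0, 0.0, 0.0], centers)
```

In this toy setup the wider third bin pushes its center further out, so the same per-pixel scores yield a different depth when the bin widths change. That is the sense in which the bins adapt to the depth distribution of a scene.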

Original language: English
Pages (from-to): 20320-20334
Number of pages: 15
Journal: IEEE Transactions on Intelligent Transportation Systems
Volume: 25
Issue number: 12
Publication status: Published - 2024

Keywords

  • adaptive depth bins
  • edge AI
  • edge-enhanced vision transformer
  • monocular depth estimation
  • traffic monitoring
  • unmanned aerial vehicle
