An End-to-End Transformer Model for 3D Object Detection

We propose 3DETR, an end-to-end Transformer-based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Moreover, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baseline on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.


Ishan Misra

Rohit Girdhar

Armand Joulin


I. Misra, R. Girdhar and A. Joulin
An End-to-End Transformer Model for 3D Object Detection
IEEE/CVF International Conference on Computer Vision (ICCV), 2021 (Oral Presentation)
[arXiv] [code/models] [BibTex]


3DETR achieves performance comparable to or better than these improved baselines despite having fewer hand-coded 3D- or detection-specific design decisions.

Method      ScanNetV2         SUN RGB-D
            AP25    AP50      AP25    AP50
BoxNet      49.0    21.1      52.4    25.1
3DETR       62.7    37.5      58.0    30.3
VoteNet     60.4    37.5      58.3    33.4
3DETR-m     65.0    47.0      59.1    32.7
H3DNet      67.2    48.1      60.1    39.0

Detection results for scenes from the val set of the SUN RGB-D dataset. 3DETR does not use color information (shown only for visualization) and predicts boxes directly from point clouds. 3DETR can detect objects even from single-view depth scans and predicts amodal boxes, e.g., the full extent of the bed (top left), including objects missing from the ground truth (top right).


We thank Zaiwei Zhang for helpful discussions and Laurens van der Maaten for feedback on the paper.