An End-to-End Transformer Model for 3D Object Detection
We propose 3DETR, an end-to-end Transformer-based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.
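To make the two ingredients named above concrete, here is a minimal PyTorch sketch of Fourier positional embeddings and of non-parametric queries obtained by farthest point sampling. Function names, shapes, and hyperparameters are illustrative assumptions and do not reproduce the released 3DETR implementation.

```python
import math
import torch

def fourier_positional_embedding(xyz, num_freqs=64, scale=1.0):
    """Map 3D coordinates (N, 3) to 2*num_freqs-dim Fourier features.

    Hypothetical sketch of random Fourier features: the frequency
    matrix B is drawn once at random and kept fixed, not learned.
    """
    B = torch.randn(3, num_freqs) * scale           # random frequency matrix
    proj = 2.0 * math.pi * xyz @ B                  # (N, num_freqs)
    return torch.cat([proj.sin(), proj.cos()], -1)  # (N, 2*num_freqs)

def farthest_point_sample(xyz, num_queries):
    """Greedily pick num_queries well-spread points from the cloud (FPS)."""
    n = xyz.shape[0]
    idx = torch.zeros(num_queries, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(n, (1,)).item()          # random seed point
    for i in range(1, num_queries):
        # distance of every point to its nearest already-chosen point
        dist = torch.minimum(dist, ((xyz - xyz[idx[i - 1]]) ** 2).sum(-1))
        idx[i] = dist.argmax()                      # farthest remaining point
    return idx

# Non-parametric queries: embed sampled scene points rather than
# learning a fixed set of query vectors, then feed them to a vanilla
# Transformer decoder.
points = torch.rand(2048, 3)                        # toy point cloud
query_xyz = points[farthest_point_sample(points, 128)]
queries = fourier_positional_embedding(query_xyz)   # (128, 128)
```

Because the queries are tied to sampled scene coordinates, no query parameters need to be learned, which is part of what keeps the model close to a vanilla Transformer.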
People
Ishan Misra
Rohit Girdhar
Armand Joulin
Paper
I. Misra, R. Girdhar, and A. Joulin. An End-to-End Transformer Model for 3D Object Detection. IEEE/CVF International Conference on Computer Vision (ICCV), 2021 (Oral Presentation).
Results
3DETR achieves performance comparable to or better than these well-established and highly optimized baselines, despite having fewer hand-coded 3D- or detection-specific design decisions. AP25 and AP50 denote average precision at 3D box IoU thresholds of 0.25 and 0.5, respectively (a minimal IoU sketch follows the table).
| Method | ScanNetV2 AP25 | ScanNetV2 AP50 | SUN RGB-D AP25 | SUN RGB-D AP50 |
|---|---|---|---|---|
| BoxNet | 49.0 | 21.1 | 52.4 | 25.1 |
| 3DETR | 62.7 | 37.5 | 58.0 | 30.3 |
| VoteNet | 60.4 | 37.5 | 58.3 | 33.4 |
| 3DETR-m | 65.0 | 47.0 | 59.1 | 32.7 |
| H3DNet | 67.2 | 48.1 | 60.1 | 39.0 |
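The AP25 and AP50 columns count a predicted box as correct when its 3D intersection-over-union with a ground-truth box clears the threshold. Below is a minimal sketch of that IoU test for axis-aligned boxes; the official benchmarks evaluate oriented boxes, so this is an illustrative assumption, not the evaluation code.

```python
import numpy as np

def box3d_iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])             # intersection lower corner
    hi = np.minimum(box_a[3:], box_b[3:])             # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))      # zero if boxes are disjoint
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)

# A prediction is a true positive at AP50 if IoU >= 0.5 (0.25 for AP25).
pred = np.array([0.0, 0.0, 0.0, 2.0, 1.0, 1.0])
gt   = np.array([0.5, 0.0, 0.0, 2.0, 1.0, 1.0])
print(box3d_iou_axis_aligned(pred, gt))               # 0.75: counts at both thresholds
```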
Detection results for scenes from the val set of the SUN RGB-D dataset. 3DETR predicts boxes from point clouds alone and does not use color information (shown only for visualization). It can detect objects even in single-view depth scans and predicts amodal boxes, e.g., the full extent of the bed (top left), including objects missing in the ground truth (top right).
Acknowledgements
We thank Zaiwei Zhang for helpful discussions and Laurens van der Maaten for feedback on the paper.