ShapeR

Robust Conditional 3D Shape Generation from Casual Captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel

Meta Reality Labs Research     Simon Fraser University

ShapeR Demo

Metric Generative Shape Reconstruction

From an input image sequence, ShapeR preprocesses per-object multimodal data (SLAM points, images, captions). A rectified flow transformer then conditions on these inputs to generate meshes object-centrically, producing a full metric scene reconstruction.

Conditioned on off-the-shelf preprocessed inputs—SLAM points, 3D instances, and text—ShapeR infers per-object meshes to reconstruct the entire scene.

While monolithic methods fuse the scene into one block, ShapeR reconstructs individual objects. This allows you to interact with and manipulate specific objects in the scene.

How It Works

ShapeR performs generative, object-centric 3D reconstruction from image sequences by leveraging multimodal inputs and robust training strategies. First, off-the-shelf SLAM and 3D instance detection are used to compute 3D points and object instances. For each object, sparse points, relevant images, 2D projections, and VLM captions are extracted to condition a rectified flow model, which denoises a latent VecSet to produce the 3D shape. The use of multimodal conditioning, along with heavy on-the-fly compositional augmentations and curriculum training, ensures the robustness of ShapeR in real-world scenarios.
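As a rough illustration of the generation step, the sketch below integrates a rectified flow model from noise to a clean latent VecSet under multimodal conditioning. The names and shapes (velocity_model, cond_tokens, num_latents) are illustrative assumptions, not the released ShapeR implementation.

```python
# Minimal sketch of object-centric rectified-flow sampling (illustrative only).
import torch

@torch.no_grad()
def sample_object_latents(velocity_model, cond_tokens, num_latents=512,
                          latent_dim=64, num_steps=50, device="cuda"):
    """Integrate the learned velocity field from noise to a clean latent VecSet.

    velocity_model : transformer predicting the flow velocity v(x_t, t, cond)
    cond_tokens    : fused multimodal condition tokens for one object
                     (image, SLAM-point, 2D-projection, and caption features)
    """
    # Start from Gaussian noise; each object is a set of shape latents (VecSet).
    x = torch.randn(1, num_latents, latent_dim, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt, device=device)
        v = velocity_model(x, t, cond_tokens)  # predicted velocity at time t
        x = x + dt * v                         # forward Euler step along the flow
    return x  # decoded into a mesh by a (not shown) VecSet shape decoder
```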

Multimodal Conditioning


ShapeR conditions on a range of modalities, including the object's posed multiview images, SLAM points, text descriptions, and 2D point projections.
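The sketch below shows one way such heterogeneous inputs could be fused into a single condition sequence for cross-attention; the encoder outputs, shapes, and point MLP are assumptions for illustration, not the actual ShapeR architecture.

```python
# Illustrative fusion of per-object multimodal inputs into one condition sequence.
import torch

def build_condition_tokens(image_feats, slam_points, proj_feats, caption_feats,
                           point_mlp):
    """
    image_feats   : (V, N_img, D) tokens from V posed multiview images
    slam_points   : (N_pts, 3) sparse metric SLAM points of the object
    proj_feats    : (N_proj, D) tokens encoding 2D point projections
    caption_feats : (N_txt, D) tokens from the VLM caption
    point_mlp     : small MLP lifting xyz coordinates to dimension D
    """
    img_tokens = image_feats.flatten(0, 1)  # merge all views into one sequence
    pts_tokens = point_mlp(slam_points)     # (N_pts, D)
    cond = torch.cat([img_tokens, pts_tokens, proj_feats, caption_feats], dim=0)
    return cond.unsqueeze(0)                # (1, N_cond, D), ready for cross-attention
```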

Heavy Compositional Augmentation


ShapeR leverages single-object pretraining with extensive augmentations, simulating realistic backgrounds, occlusions, and noise across images and SLAM inputs.
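For intuition, the snippet below sketches the kind of on-the-fly SLAM-point corruption this implies: jitter, dropout, and injected background clutter. The specific noise levels and ratios are placeholders, not the values used to train ShapeR.

```python
# Rough sketch of compositional SLAM-point augmentation (placeholder parameters).
import numpy as np

def augment_slam_points(points, noise_std=0.01, drop_ratio=0.3,
                        clutter_ratio=0.2, rng=None):
    """points: (N, 3) metric SLAM points belonging to a single object."""
    if rng is None:
        rng = np.random.default_rng()
    # 1) jitter: simulate SLAM triangulation noise
    pts = points + rng.normal(0.0, noise_std, points.shape)
    # 2) dropout: simulate sparse / partial coverage
    pts = pts[rng.random(len(pts)) > drop_ratio]
    # 3) clutter: inject background points around an enlarged object bounding box
    lo, hi = points.min(axis=0) - 0.2, points.max(axis=0) + 0.2
    n_clutter = int(clutter_ratio * len(pts))
    clutter = rng.uniform(lo, hi, size=(n_clutter, 3))
    return np.concatenate([pts, clutter], axis=0)
```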

Two Stage Curriculum Training


ShapeR is fine-tuned on object-centric crops from Aria Synthetic Environments scenes, which feature realistic image occlusions, SLAM point-cloud noise, and inter-object interactions.
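Schematically, the curriculum amounts to two consecutive training runs, as in the sketch below; the step counts, learning rates, and helper names are illustrative placeholders, not the actual training recipe.

```python
# Schematic two-stage curriculum (hyperparameters are placeholders).
def train_shaper(model, single_object_loader, ase_crop_loader, run_training):
    # Stage 1: single-object pretraining with heavy compositional augmentation
    run_training(model, single_object_loader, num_steps=500_000, lr=1e-4)
    # Stage 2: fine-tuning on object-centric crops from Aria Synthetic Environments,
    # which add realistic occlusions, SLAM point-cloud noise, and inter-object interaction
    run_training(model, ase_crop_loader, num_steps=100_000, lr=2e-5)
    return model
```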

Watch the full presentation for more details.

For even more detail, refer to the paper.

ShapeR Evaluation Dataset

ShapeR comes with a new evaluation dataset of in-the-wild sequences with paired posed multi-view images, SLAM point clouds, and individually complete 3D shape annotations for 178 objects across 7 diverse scenes. In contrast to existing real-world 3D reconstruction datasets, which are either captured in controlled setups or provide only merged object-and-background geometries or incomplete shapes, this dataset is designed around real-world challenges such as occlusions, clutter, and variable resolution and viewpoints, enabling realistic, in-the-wild evaluation.
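For illustration, a single annotated object could be represented roughly as follows; the field names and shapes are assumptions, not the released annotation schema.

```python
# One possible representation of an annotated object in the evaluation set
# (illustrative only; consult the dataset release for the actual schema).
from dataclasses import dataclass
import numpy as np

@dataclass
class ShapeRObjectAnnotation:
    object_id: str
    category: str
    image_paths: list[str]      # posed multi-view frames observing the object
    camera_poses: np.ndarray    # (V, 4, 4) world-from-camera transforms
    intrinsics: np.ndarray      # (V, 3, 3) pinhole intrinsics per frame
    slam_points: np.ndarray     # (N, 3) metric SLAM points for this object
    mesh_path: str              # individually complete ground-truth mesh
    mesh_pose: np.ndarray       # (4, 4) metric alignment of the mesh into the scene
```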

ShapeR Dataset Overview
Category distribution of 178 objects across 7 sequences along with examples showing ground-truth mesh, representative frame, aligned mesh, and 2D projection.
Data Annotation Process
To obtain pseudo-ground truth, we capture the object in isolation (left) and generate geometry via image-to-3D modeling (mid). The mesh is then manually aligned to the original sequence and verified against 2D projections and point clouds (right).
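The 2D-projection check amounts to projecting the aligned mesh into a verification frame with the recovered camera parameters and overlaying the result. A minimal pinhole-projection sketch is shown below, assuming a standard z-forward camera convention; variable names are illustrative.

```python
# Minimal sketch of the 2D-projection check for a manually aligned mesh.
import numpy as np

def project_mesh_vertices(vertices, mesh_pose, world_from_cam, K):
    """
    vertices       : (N, 3) mesh vertices in object coordinates
    mesh_pose      : (4, 4) object-to-world alignment of the mesh
    world_from_cam : (4, 4) camera pose of the frame being checked
    K              : (3, 3) pinhole intrinsics
    Returns (M, 2) pixel coordinates of the vertices in front of the camera.
    """
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])      # homogeneous
    v_world = (mesh_pose @ v_h.T).T                               # into the scene
    v_cam = (np.linalg.inv(world_from_cam) @ v_world.T).T[:, :3]  # into the camera
    in_front = v_cam[:, 2] > 1e-6                                 # keep visible points
    uv = (K @ v_cam[in_front].T).T
    return uv[:, :2] / uv[:, 2:3]  # overlay these pixels on the frame to verify
```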

How Is It Different from SAM 3D Objects?

SAM 3D Objects marks a significant improvement in shape generation, but it lacks metric accuracy and requires user interaction. Since it can only exploit a single view, it can fail to preserve correct aspect ratios, relative scales, and object layouts in complex scenes, as shown in the example here.

ShapeR vs. SAM 3D Comparison

ShapeR solves this by leveraging image sequences and multimodal data (such as SLAM points). By integrating multiple posed views, ShapeR automatically produces metrically accurate and consistent reconstructions. Unlike interactive single-image methods, ShapeR robustly handles casually captured real-world scenes, generating high-quality metric shapes and arrangements without requiring user interaction.

Notably, ShapeR achieves this while trained entirely on synthetic data, whereas SAM 3D exploits large-scale labeled real image-to-3D data. This highlights two different axes of progress: where SAM 3D uses large-scale real data for robust single-view inference, ShapeR utilizes multi-view geometric constraints to achieve robust, metric scene reconstruction.

The two approaches can be combined. By conditioning the second stage of SAM 3D on the output of ShapeR, we can merge the best of both worlds: the metric accuracy and robust layout of ShapeR, and the textures and strong real-world priors of SAM 3D.

Performance on Non-Aria Data

Although trained on simulated data with visual-inertial SLAM points, ShapeR generalizes to other data sources without fine-tuning. For instance, it can reconstruct complete objects in ScanNet++ scenes. Furthermore, by leveraging tools like MapAnything to generate metric points, ShapeR can even produce metric 3D shapes from monocular images without retraining.

ShapeR on ScanNet++
ShapeR results on ScanNet++, showing complete shape predictions that extend even beyond the ground-truth scanned geometry.
ShapeR on iPhone captures
Reconstruction from images captured with an iPhone, with metric depth maps and poses acquired via MapAnything; ShapeR then runs on top to produce the scene reconstruction.
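As a sketch of the monocular pipeline shown above, the glue code below strings together a metric-geometry estimator, an instance grouper, and the ShapeR generator. All three callables are hypothetical placeholders standing in for MapAnything inference, an off-the-shelf 3D instance detector, and the ShapeR pipeline; none of them is a real API.

```python
# Hypothetical glue code for monocular captures (all callables are placeholders).
def reconstruct_from_monocular(image_paths,
                               estimate_metric_geometry,  # e.g. a MapAnything wrapper
                               detect_3d_instances,       # off-the-shelf 3D instance detector
                               run_shaper):               # ShapeR conditional generation
    # 1) metric points, camera poses, and intrinsics from casual monocular images
    points, poses, intrinsics = estimate_metric_geometry(image_paths)
    # 2) group points and observing frames per detected object instance
    instances = detect_3d_instances(points, image_paths, poses, intrinsics)
    # 3) per-object conditional mesh generation, placed back in metric scene space
    return [run_shaper(instance) for instance in instances]
```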

Citation

If you find this research helpful, please consider citing our paper:

# TODO: add bibtex

Resources