Robust Conditional 3D Shape Generation from Casual Captures
From an input image sequence, ShapeR preprocesses per-object multimodal data (SLAM points, images, captions). A rectified flow transformer then conditions on these inputs to generate meshes object-centrically, producing a full metric scene reconstruction.
Conditioned on off-the-shelf preprocessed inputs—SLAM points, 3D instances, and text—ShapeR infers per-object meshes to reconstruct the entire scene.
While monolithic methods fuse the entire scene into a single block of geometry, ShapeR reconstructs each object individually. This lets you interact with and manipulate specific objects in the scene.
ShapeR performs generative, object-centric 3D reconstruction from image sequences by leveraging multimodal inputs and robust training strategies. First, off-the-shelf SLAM and 3D instance detection are used to compute 3D points and object instances. For each object, sparse points, relevant images, 2D projections, and VLM captions are extracted to condition a rectified flow model, which denoises a latent VecSet to produce the 3D shape. The use of multimodal conditioning, along with heavy on-the-fly compositional augmentations and curriculum training, ensures the robustness of ShapeR in real-world scenarios.
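To make this flow concrete, the sketch below outlines the per-object pipeline in plain Python. All names here (run_slam, detect_instances, build_conditioning, denoise_shape, ObjectConditioning) are hypothetical placeholders for illustration and do not correspond to the actual ShapeR code; the key point is the object-centric loop, where every detected instance gets its own conditioning bundle and its own denoising pass.

```python
# Hypothetical sketch of the per-object flow described above. None of these
# names come from the ShapeR codebase; they only illustrate the order of
# operations: preprocess -> per-object conditioning -> flow denoising.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np


@dataclass
class ObjectConditioning:
    images: List[np.ndarray]              # posed views that observe the object
    sparse_points: np.ndarray             # (N, 3) SLAM points inside the instance
    point_projections: List[np.ndarray]   # per-view 2D projections of those points
    caption: str                          # VLM-generated text description


def reconstruct_scene(
    frames: List[np.ndarray],
    run_slam: Callable,            # off-the-shelf (visual-inertial) SLAM: frames -> (points, poses)
    detect_instances: Callable,    # off-the-shelf 3D instance detection
    build_conditioning: Callable[..., ObjectConditioning],
    denoise_shape: Callable[[ObjectConditioning], np.ndarray],  # rectified-flow sampler
) -> List[np.ndarray]:
    """Return one reconstructed shape (placeholder array) per detected object."""
    points, poses = run_slam(frames)              # metric 3D points and camera poses
    instances = detect_instances(points, frames)  # per-object 3D instances
    shapes = []
    for instance in instances:
        cond = build_conditioning(instance, points, frames, poses)
        shapes.append(denoise_shape(cond))        # denoise a latent VecSet into the 3D shape
    return shapes
```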
ShapeR conditions on a range of modalities, including the object's posed multiview images, SLAM points, text descriptions, and 2D point projections.
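As one concrete example of these signals, the 2D point projections can be obtained by projecting the object's SLAM points into each posed view with a standard pinhole model. The NumPy sketch below only illustrates that step (assuming intrinsics K and world-to-camera extrinsics R, t); it is not ShapeR's actual preprocessing code.

```python
# Minimal sketch: project an object's SLAM points into a posed view to obtain
# the 2D point-projection signal. Assumes a pinhole camera with intrinsics K
# and world-to-camera extrinsics (R, t); this is not ShapeR's actual code.
import numpy as np


def project_points(points_w: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """points_w: (N, 3) world-space points -> (M, 2) pixel coordinates."""
    points_c = points_w @ R.T + t               # world frame -> camera frame
    points_c = points_c[points_c[:, 2] > 1e-6]  # keep points in front of the camera
    uvw = points_c @ K.T                        # apply intrinsics (homogeneous pixels)
    return uvw[:, :2] / uvw[:, 2:3]             # perspective divide
```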
ShapeR leverages single-object pretraining with extensive augmentations, simulating realistic backgrounds, occlusions, and noise across images and SLAM inputs.
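The sketch below gives a rough flavour of such point-level augmentations (Gaussian jitter, random dropout, injected background clutter). The actual augmentations in ShapeR are compositional and more involved, so treat this purely as an illustration.

```python
# Illustrative point-cloud augmentation (not ShapeR's actual implementation):
# Gaussian jitter to mimic SLAM measurement noise, random dropout to mimic
# sparsity and occlusion, and injected clutter to mimic nearby background.
import numpy as np


def augment_points(points: np.ndarray, rng: np.random.Generator,
                   jitter_std: float = 0.01, keep_prob: float = 0.7,
                   n_clutter: int = 128) -> np.ndarray:
    """points: (N, 3) object points in metres -> augmented point cloud."""
    noisy = points + rng.normal(scale=jitter_std, size=points.shape)
    kept = noisy[rng.random(len(noisy)) < keep_prob]
    lo, hi = points.min(axis=0), points.max(axis=0)
    clutter = rng.uniform(lo - 0.2, hi + 0.2, size=(n_clutter, 3))
    return np.concatenate([kept, clutter], axis=0)


# Example: augment_points(object_points, np.random.default_rng(0))
```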
ShapeR is fine-tuned on object-centric crops from Aria Synthetic Environment scenes, which feature realistic image occlusions, SLAM point cloud noise, and inter-object interaction.
For even more detail, refer to the paper.
ShapeR comes with a new evaluation dataset of in-the-wild sequences with paired posed multi-view images, SLAM point clouds, and individually complete 3D shape annotations for 178 objects across 7 diverse scenes. Unlike existing real-world 3D reconstruction datasets, which are either captured in controlled setups or provide only merged object-and-background geometry or incomplete shapes, this dataset is designed to capture real-world challenges such as occlusion, clutter, and variable resolution and viewpoints, enabling realistic, in-the-wild evaluation.
SAM 3D Objects marks a significant improvement in shape generation, but it lacks metric accuracy and requires user interaction. Since it can only exploit a single view, it can sometimes fail to preserve correct aspect ratios, relative scales, and object layouts in complex scenes, as shown in the example here.
ShapeR solves this by leveraging image sequences and multimodal data (such as SLAM points). By integrating multiple posed views, ShapeR automatically produces metrically accurate and consistent reconstructions. Unlike interactive single-image methods, ShapeR robustly handles casually captured real-world scenes, generating high-quality metric shapes and arrangements without requiring user interaction.
Notably, ShapeR achieves this while being trained entirely on synthetic data, whereas SAM 3D exploits large-scale labeled real image-to-3D data. This highlights two complementary axes of progress: SAM 3D leverages large-scale real data for robust single-view inference, while ShapeR exploits multi-view geometric constraints to achieve robust, metric scene reconstruction.
The two approaches can be combined. By conditioning the second stage of SAM 3D on the output of ShapeR, we can merge the best of both worlds: the metric accuracy and reliable layout of ShapeR, and the textures and strong real-world priors of SAM 3D.
Although trained on simulated data with visual-inertial SLAM points, ShapeR generalizes to other data sources without finetuning. For instance, it can reconstruct complete objects in ScanNet++ scenes. Furthermore, by leveraging tools like MapAnything to generate metric points, ShapeR can even produce metric 3D shapes from monocular images without retraining.
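To illustrate this usage pattern, the hypothetical glue code below wires a monocular metric-point predictor into a shape generator. Both predict_metric_points (e.g. a thin wrapper around MapAnything) and generate_shape (a stand-in for ShapeR inference) are assumed placeholders, not real APIs.

```python
# Hypothetical glue code: predict metric 3D points from a single image, then
# condition the shape generator on them. `predict_metric_points` (e.g. a thin
# wrapper around MapAnything) and `generate_shape` (a stand-in for ShapeR
# inference) are assumed placeholders, not real APIs.
from typing import Callable
import numpy as np


def monocular_to_shape(
    image: np.ndarray,
    caption: str,
    predict_metric_points: Callable[[np.ndarray], np.ndarray],  # image -> (N, 3) metric points
    generate_shape: Callable[..., np.ndarray],                  # conditioning -> shape
) -> np.ndarray:
    points = predict_metric_points(image)   # metric points take the place of SLAM input
    return generate_shape(images=[image], points=points, caption=caption)
```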
If you find this research helpful, please consider citing our paper:
# TODO: add bibtex