¹University of Utah, ²Carnegie Mellon University, ³NVIDIA
Abstract: Constructing 3D representations of object geometry is critical for many downstream robotics tasks, particularly tabletop manipulation problems. These representations must be built from potentially noisy partial observations. In this work, we focus on the problem of reconstructing a multi-object scene from a single RGBD image, typically from a fixed camera in the scene. Traditional scene representation methods generally cannot infer the geometry of unobserved regions of the objects from the image. Attempts have been made to leverage deep learning to train on a dataset of observed objects and representations, and then generalize to new observations. However, these learned approaches can be brittle to noisy real-world observations and to objects not contained in the dataset, and they cannot reason about their own confidence. We propose BRRP, a reconstruction method that leverages preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. To make our method more efficient, we introduce the concept of a retrieval-augmented prior, where we retrieve relevant components of our prior distribution during inference. The prior is used to estimate the geometry of occluded portions of the in-scene objects. Our method produces a distribution over object shape that can be used for reconstruction or for measuring uncertainty. We evaluate our method in both simulated scenes and in the real world. We demonstrate that our method is more robust than deep learning-only approaches while being more accurate than a method without an informative prior.
A robot's ability to construct internal representations of its operating environment is key for autonomy. These representations need to be particularly fine-grained for robotic manipulation, which often requires closely interacting with and avoiding objects. These interactions make it necessary for robots to develop an understanding of the geometry within their vicinity. Explicit 3D representations of the geometry of the scene are often required for the robust usage of downstream grasping and motion planning algorithms. These representations must be built from observations that are both noisy and, due to occlusion, only contain partial information about the scene. In our case, we focus on the problem of building a 3D representation of multi-object scenes from a single RGBD camera image.
In this work, we introduce a novel Bayesian approach for robustly reconstructing multi-object tabletop scenes by leveraging object-level shape priors. We present Bayesian Reconstruction with Retrieval-augmented Priors (BRRP). BRRP is resilient to many of the pitfalls of learning-based methods while still being able to leverage an informative prior to more accurately reconstruct known objects.
To motivate retrieval-augmented priors, consider the problem of Bayesian inference with a mixture model acting as the prior distribution. Given some data D, we would like to infer a posterior distribution over hypotheses H. If we have a mixture model with C components as a prior distribution, then:

$$P(H | D) \propto P(D | H) \sum_{c = 1}^C P(H | c).$$

If our prior distribution has many components, it may be inefficient to evaluate in full. This could be a serious problem for algorithms like Stein variational gradient descent (SVGD), which require iteratively computing the gradient of both the likelihood and the prior. Inspired by retrieval-augmented generation, the insight behind retrieval-augmented priors is to determine which subset of the prior distribution components to retrieve and use, given some detection result R. Conditioning on this detection result, we have a new posterior distribution, $P(H | D, R)$. Making an independence assumption,

$$P(H | D, R) \propto P(D | H) \cdot \mathbb E_{c \sim P(c | R)} [P(H | c)].$$

Compared to the first equation, the expectation now replaces the true prior. We can then use a top-k approximation for the expectation:

$$P(H | D, R) \propto P(D | H) \sum_{c \in \text{topk}} P(H | c) P(c | R).$$

This means that we only need to evaluate the k retrieved components of the prior distribution rather than all C.
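As a rough illustration of the top-k approximation, the sketch below evaluates the log of the retrieved prior term, the sum over the top-k components of P(H | c) P(c | R), for a toy mixture of 1-D Gaussians. The Gaussian components, the function name `topk_log_prior`, and all parameter names are assumptions made for this example; they are not taken from the BRRP implementation, where the components would be object shape priors and the retrieval probabilities would come from a detector.

```python
import numpy as np

def topk_log_prior(h, means, stds, retrieval_probs, k):
    """Log of sum_{c in topk} P(H=h | c) * P(c | R) for 1-D Gaussian components.

    All names here are hypothetical; this is a toy stand-in for a shape prior.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    retrieval_probs = np.asarray(retrieval_probs, dtype=float)

    # Indices of the k components with the highest retrieval probability P(c | R).
    topk = np.argsort(retrieval_probs)[-k:]

    # Gaussian log-density of h under each retrieved component only.
    log_pdf = (-0.5 * ((h - means[topk]) / stds[topk]) ** 2
               - np.log(stds[topk]) - 0.5 * np.log(2.0 * np.pi))

    # Weight by P(c | R) and sum in log space (log-sum-exp) for stability.
    log_terms = log_pdf + np.log(retrieval_probs[topk])
    m = np.max(log_terms)
    return m + np.log(np.sum(np.exp(log_terms - m)))

# Example: three components, but only the two most likely ones are evaluated.
value = topk_log_prior(0.3, means=[0.0, 5.0, -5.0], stds=[1.0, 1.0, 1.0],
                       retrieval_probs=[0.7, 0.2, 0.1], k=2)
```

With k equal to the number of components, this reduces to the full expectation over P(c | R); smaller k trades accuracy for fewer component evaluations per gradient step.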
Overview of the BRRP method. We begin with a segmented RGBD image and (a) feed cropped images of each segment into CLIP to get object probabilities. Then, we retrieve and (b) register the top-k objects in the prior. This gives us a set of registered prior samples. We also (c) compute negative samples based on the observed segmented point cloud. Finally, (d) we run SVGD optimization to recover a posterior distribution over Hilbert map weights. We can use this distribution both to reconstruct the scene and to measure uncertainty.
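Step (d) relies on Stein variational gradient descent (SVGD), which moves a set of particles toward a target posterior using only the gradient of its log-density. The sketch below shows one generic SVGD update with an RBF kernel. It is a minimal illustration under stated assumptions, not the authors' optimizer: the fixed `bandwidth`, the `step_size`, and the `grad_log_post` callback (which in BRRP would combine the likelihood and retrieved prior gradients) are all hypothetical choices for this example.

```python
import numpy as np

def svgd_step(particles, grad_log_post, step_size=0.1, bandwidth=1.0):
    """One SVGD update with an RBF kernel.

    particles: array of shape (n, d), one row per particle.
    grad_log_post: hypothetical callback mapping (n, d) particles to the
        (n, d) gradients of the log-posterior at each particle.
    """
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[i, j] = x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)                 # (n, n)
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))    # symmetric k(x_i, x_j)

    grads = grad_log_post(particles)                        # (n, d)

    # Attractive term: sum_j k(x_j, x_i) * grad log p(x_j).
    attract = kernel @ grads
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = sum_j k_ij * (x_i - x_j) / h^2.
    repulse = np.sum(kernel[:, :, None] * diffs, axis=1) / bandwidth ** 2

    return particles + step_size * (attract + repulse) / n

# Example: push particles toward a standard normal target, where grad log p(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(loc=3.0, scale=1.0, size=(20, 1))
for _ in range(300):
    particles = svgd_step(particles, lambda x: -x)
```

The repulsive term keeps particles from collapsing onto a single mode, which is what lets the final particle set act as an approximate posterior rather than a point estimate.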
Above is an example of qualitative reconstructions on real-world scenes. We compare against an occupancy version of PointSDF as well as the V-PRISM method. Our method (BRRP) is more robust, avoiding many of the pitfalls that affect these baselines.
Above is a plot of Chamfer distance on procedurally generated scenes (lower is better). We use the same baselines as in the real-world scenes, but compare the methods quantitatively in simulated scenes. Our method outperforms the baselines.
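For reference, the Chamfer distance between two point sets can be computed as in the sketch below. Conventions vary (squared versus unsquared distances, sums versus means), and the text does not state which variant is used for the plot, so this is one common form chosen as an assumption: the mean nearest-neighbor Euclidean distance in each direction, summed.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3).

    Uses unsquared Euclidean distances and per-direction means; other
    conventions exist and give different absolute values.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise distances: d[i, j] = ||a_i - b_j||.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean distance from each point in a to its nearest neighbor in b, plus the reverse.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

This brute-force version builds the full n-by-m distance matrix, which is fine for a few thousand points; larger clouds would typically use a spatial index instead.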
Here, we visualize the uncertainty of our method. Because our method is probabilistic, we can recover principled uncertainty about object shape.
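One way such uncertainty could be read off a particle-based posterior is sketched below. It assumes the common Hilbert map form, in which occupancy is a logistic regression over RBF features centered at a set of hinge points, and it summarizes uncertainty as the standard deviation of predicted occupancy across weight particles. The feature definition, the `gamma` value, and the function names are assumptions for this example and may differ from what BRRP actually uses.

```python
import numpy as np

def rbf_features(points, hinge_points, gamma=10.0):
    """RBF features of query points (q, 3) around hinge points (h, 3), plus a bias column."""
    sq = np.sum((points[:, None, :] - hinge_points[None, :, :]) ** 2, axis=-1)
    feats = np.exp(-gamma * sq)                                   # (q, h)
    return np.concatenate([feats, np.ones((points.shape[0], 1))], axis=1)

def occupancy_stats(weight_particles, points, hinge_points, gamma=10.0):
    """Mean and standard deviation of occupancy over weight particles.

    weight_particles: (p, h + 1) array, one weight vector per posterior particle.
    Returns two arrays of shape (q,): expected occupancy and its spread.
    """
    feats = rbf_features(points, hinge_points, gamma)             # (q, h + 1)
    logits = feats @ weight_particles.T                           # (q, p)
    probs = 1.0 / (1.0 + np.exp(-logits))                         # sigmoid per particle
    return probs.mean(axis=1), probs.std(axis=1)
```

Query points where the particles disagree, typically in occluded regions far from any observation, show a high standard deviation, while well-observed regions show a low one.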