Perception, Planning, and Control
© Russ Tedrake, 2020-2024
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Fall 2024 semester.
In the last chapter, we developed an initial solution to moving objects around, but we made one major assumption that would prevent us from using it on a real robot: we assumed that we knew the initial pose of the object. This chapter is going to be our first pass at removing that assumption, by developing tools to estimate that pose using the information obtained from the robot's depth cameras. The tools we develop here will be most useful when you are trying to manipulate known objects (e.g. you have a mesh file of their geometry) and are in a relatively uncluttered environment. But they will also form a baseline for the more sophisticated methods we will study.
The approach that we'll take in this chapter is based on geometry. Not all perception systems these days use geometry explicitly; many deep learning approaches learn more direct maps from pixels. But many deep learning approaches do build heavily on geometry, and indeed the last few years have also brought a revolution in geometry processing (dense reconstruction, text to 3D, etc.). Some of this was fueled initially by applications in autonomous driving and virtual/augmented reality in addition to manipulation, which brought advanced sensors and fantastic new algorithms.
So we will start with geometry because it can provide a solid foundation for our explorations in perception. I must state one caveat -- we will use "pose estimation" as our goal for this chapter because it is concrete and immediately useful. But I would say that I try not to use explicit pose estimation as a concept in our most advanced manipulation systems; in addition to assuming known objects, accurately estimating the pose of an object can be difficult (due to partial views and/or occlusions) and it is often more than we actually need to achieve our manipulation goals.
Just as we had many choices when selecting the robot arm and hand, we have many choices for instrumenting our robot/environment with sensors. Even more so than our robot arms, the last few years have seen incredible improvements in the quality and reductions in cost and size for these sensors. This is largely thanks to the cell phone industry, but the race for autonomous cars has been fueling high-end sensors as well.
These changes in hardware quality have caused sometimes dramatic changes in our algorithmic approaches. For example, estimation can be much easier when the resolution and frame rate of our sensors is high enough that not much can change in the world between two images; this undoubtedly contributed to the revolutions in the field of "simultaneous localization and mapping" (SLAM) we have seen over the last decade or so.
One might think that the most important sensors for manipulation are the touch sensors (you might even be right!). But in practice today, most of the emphasis is on camera-based and/or range sensing. At the very least, we should consider this first, since our touch sensors won't do us much good if we don't know where in the world we need to touch.
Traditional cameras, which we think of as a sensor that outputs a color image at some framerate, play an important role. But robotics makes heavy use of sensors that make an explicit measurement of the distance (between the camera and the world) or depth; sometimes in addition to color and sometimes in lieu of color.
Primarily RGB-D (ToF vs projected texture stereo vs ...) cameras and Lidar
The cameras we are using in this course are the Intel RealSense D415.
Monocular depth.
Learned stereo
There are a number of levels of fidelity at which one can simulate a
camera like the D415. We'll start our discussion here using an "ideal"
RGB-D camera simulation -- the pixels returned in the depth image
represent the true geometric depth in the direction of each pixel
coordinate. In Drake, this is provided by the RgbdSensor system, which can be wired up directly to the SceneGraph.
The signals and systems abstraction here is encapsulating a lot of complexity. Inside the implementation of that system is a complete rendering engine, like one that you will find in high-end computer games. Drake actually supports multiple rendering engines; for the purposes of this class we will primarily use an OpenGL-based renderer (built on VTK) that is suitable for real-time simulation. Drake can also connect to high-end physically-based rendering (PBR) engines which use ray tracing, such as the Cycles renderer provided by Blender (see the Drake-Blender project). These are potentially useful for rendering higher quality images e.g. to train a deep perception system or make a beautiful presentation.
As a simple example of depth cameras in drake, I've constructed a
scene with a single object (the mustard bottle from the YCB
dataset), and added an RgbdSensor
to the diagram. Once
this is wired up, we can simply evaluate the output ports in order to
obtain the color and depth images:
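As a minimal sketch of what that evaluation looks like in code (this assumes a `diagram` that contains an RgbdSensor subsystem named "camera"; the subsystem name is an assumption here, so adapt it to whatever the notebook uses):

```python
import numpy as np

# Minimal sketch: assumes `diagram` already contains an RgbdSensor named
# "camera" (the name is illustrative).
context = diagram.CreateDefaultContext()
camera = diagram.GetSubsystemByName("camera")
camera_context = camera.GetMyContextFromRoot(context)

color_image = camera.color_image_output_port().Eval(camera_context)
depth_image = camera.depth_image_32F_output_port().Eval(camera_context)

# Both are Drake Image objects; `.data` exposes the pixels as numpy arrays.
rgb = np.array(color_image.data)             # (height, width, 4) uint8, RGBA
depth = np.array(depth_image.data)[:, :, 0]  # (height, width) float32, meters
```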
Please make sure you spend a minute with the MeshCat visualization (available here). You'll see a camera and the camera frame, and you'll see the mustard bottle as always... but it might look a little funny. That's because I'm displaying both the ground truth mustard bottle model and the point cloud rendered from the cameras. You can use the MeshCat GUI to uncheck the ground truth visualization (image right), and you'll be able to see just the point cloud.
Remember that, in addition to looking at the source code if you like, you can always inspect the block diagram to understand what is happening at the "systems level". Here is the diagram used in this example.
In the HardwareStation
workflow, if cameras are declared
in the scenario data (e.g. YAML description), then the
RgbdSensors
are added automatically inside the
HardwareStation
diagram, and their output ports exposed as
outputs (in order to match the true interface that we have with the
hardware).
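As a rough sketch of that workflow (the scenario schema and the exported port names, e.g. "camera0.rgb_image", are assumptions here based on the course's `manipulation` package; check the current notebooks for the exact spelling):

```python
# Minimal sketch using the course's `manipulation` package; normally the
# scenario would also include model directives for the robot and objects.
from manipulation.station import LoadScenario, MakeHardwareStation

scenario_yaml = """
cameras:
  camera0:
    name: camera0
    depth: True
    X_PB:
      translation: [0.5, 0.0, 0.5]
      rotation: !Rpy { deg: [-120, 0, 90] }
"""
scenario = LoadScenario(data=scenario_yaml)
station = MakeHardwareStation(scenario)
context = station.CreateDefaultContext()

# Port names are assumptions; they mirror the camera name given above.
rgb = station.GetOutputPort("camera0.rgb_image").Eval(context)
depth = station.GetOutputPort("camera0.depth_image").Eval(context)
```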
Real depth sensors, of course, are far from ideal -- and errors in depth returns are not simple Gaussian noise, but rather are dependent on the lighting conditions, the surface normals of the object, and the visual material properties of the object, among other things. The color channels are an approximation, too. Even our high-end renderers can only do well if we add sufficiently accurate geometries, materials and lighting sources to our scene, and it is very hard to capture all of the nuances of a real environment. We will examine real sensor data, and the gaps between modeled sensors and real sensors, after we start understanding some of the basic operations.
The view of the mustard bottle in the example above makes one primary challenge of working with cameras very clear. Cameras don't get to see everything! They only get line of sight. If you want to get points from the back of the mustard bottle, you'll need to move the camera. And if the mustard bottle is sitting on the table, you'll have to pick it up if you ever want to see the bottom. This situation gets even worse in cluttered scenes, where our views of the bottle may be occluded by other objects. Even when the scene is not cluttered, a head-mounted or table-mounted camera is often blocked by the robot's hand when it goes to do manipulation -- the moments when you would like the sensors to be at their best are often exactly when they give no information!
It is quite common to mount a depth camera on the robot's wrist in order to help with the partial views. But it doesn't completely resolve the problem. All of our depth sensors have both a maximum range and a minimum range. The RealSense D415 returns matches from about 0.3m to 1.5m. That means for the last 30cm of the approach to the object -- again, where we might like our cameras to be at their best -- the object effectively disappears!
There are many different representations for 3D geometry, each of which
can be more or less suitable for different computations. Often we can
convert between these representations; though sometimes that conversion can
be lossy. Examples include triangulated surface meshes and tetrahedral
volumetric meshes, voxel/occupancy grids, and implicit representations of
geometry like the signed distance functions and Neural Radiance Fields
(NeRFs), which have become very popular again recently as representations
in deep learning for 3D vision.
The data returned by a depth camera takes the form of an image, where
each pixel value is a single number that represents the distance between the
camera and the nearest object in the environment along the pixel direction.
If we combine this with the basic information about the camera's intrinsic
parameters (e.g. lens parameters, stored in the CameraInfo
class in Drake) then we can transform this depth image representation
into a collection of 3D points, $s_i$. I use $s$ here because they are
commonly referred to as the "scene points" in the algorithms we will present
below. These points have a pose and (optionally) a color value or other
information attached; this is the point cloud representation.
By default, their pose is described in the camera frame, $^CX^{s_i}$, but
the DepthImageToPointCloud
system in Drake also accepts $^PX^C$
to make it easy to transform them into another frame (such as the world
frame).
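To make the back-projection concrete, here is a minimal numpy sketch of the math (essentially what DepthImageToPointCloud does for you; the function name and array conventions here are just for illustration):

```python
import numpy as np
from pydrake.systems.sensors import CameraInfo

def depth_image_to_points(depth, intrinsics: CameraInfo):
    """Back-project an (H, W) depth image [meters] into a 3 x N array of
    scene points expressed in the camera frame C (pinhole model)."""
    height, width = depth.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    Z = depth
    X = (u - intrinsics.center_x()) * Z / intrinsics.focal_x()
    Y = (v - intrinsics.center_y()) * Z / intrinsics.focal_y()
    p_C = np.stack([X.ravel(), Y.ravel(), Z.ravel()])
    # Drop invalid returns (infinite or non-positive depth).
    return p_C[:, np.isfinite(p_C[2, :]) & (p_C[2, :] > 0)]
```

Expressing the result in another frame is then just a rigid-transform multiply, e.g. `X_PC.multiply(p_C)` with a Drake RigidTransform, which is exactly the $^PX^C$ input mentioned above.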
As depth sensors have become so pervasive, the field has built up libraries of tools for performing basic geometric operations on point clouds, and for converting back and forth between representations. We implement many of the basic operations directly in Drake. There are also open-source tools like the Open3D library that are available if you need more. Many older projects used the Point Cloud Library (PCL), which is now defunct but still has some very useful documentation.
It's important to realize that the conversion of a depth image into a point cloud does lose some information -- specifically the information about the ray that was cast from the camera to arrive at that point. In addition to declaring "there is geometry at this point", the depth image combined with the camera pose also implies that "there is no geometry in the straight line path between camera and this point". We will make use of this information in some of our algorithms, so don't discard the depth images completely! More generally, we will find that each of the different geometry representations have strengths and weaknesses -- it is very common to keep multiple representations around and to convert back and forth between them.
Let us begin to address the primary objective of the chapter -- we have a known object somewhere in the robot's workspace, we've obtained a point cloud from our depth cameras. How do we estimate the pose of the object, $X^O$?
One very natural and well-studied formulation of this problem comes from the literature on point cloud registration (also known as point set registration or scan matching). Given two point clouds, point cloud registration aims to find a rigid transform that optimally aligns the two point clouds. For our purposes, that suggests that our "model" for the object should also take the form of a point cloud (at least for now). We'll describe that object with a list of model points, $m_i$, with their pose described in the object frame: $^OX^{m_i}.$
Our second point cloud will be the scene points, $s_i$, that we obtain from our depth camera, transformed (via the camera intrinsics) into the camera coordinates, $^CX^{s_i}$. I'll use $N_m$ and $N_s$ for the number of model and scene points, respectively.
Let us assume, for now, that we also know $X^C.$ Is this reasonable? In the case of a wrist-mounted camera, this would amount to solving the forward kinematics of the arm. In an environment-mounted camera, this is about knowing the pose of the cameras in the world. Small errors in estimating those poses, though, can lead to large artifacts in the estimated point clouds. We therefore take great care to perform camera extrinsics calibration; I've linked to our calibration code in the appendix. Note that if you have a mobile manipulation platform (an arm mounted on a moving base), then all of these discussions still apply, but you likely will perform all of your registration in the robot's frame instead of the world frame.
The second major assumption that we will make in this section is "known correspondences". When I say "correspondence" here, I mean that for each point in the scene point cloud, we can pair it with a specific point in the model point cloud; for instance, this might be the case if each point had a unique color that we could perceive reliably through the camera. This is not a reasonable assumption in practice, but it helps us get started. To represent this mapping in our equations, I'll use a correspondence vector $c \in [1,N_m]^{N_s}$, where $c_i = j$ denotes that scene point $s_i$ corresponds with model point $m_j$. Note that we do not assume that these correspondences are one-to-one. Every scene point corresponds to some model point, but the converse is not true, and multiple scene points may correspond to the same model point.
As a result, the point-cloud registration problem is simply an (inverse) kinematics problem. We can write the model points and the scene points in a common frame (here the world frame), $${}^W p^{m_{c_i}} = {}^W X^O \: {}^Op^{m_{c_i}} = {}^W X^C \: {}^Cp^{s_i},$$ leaving a single unknown transform, $X^O$, for which we must solve. In the previous chapter I argued for using differential kinematics instead of inverse kinematics; why is my story different here? Differential kinematics can still be useful for perception, for example if we want to track the pose of an object after we acquire its initial pose. But unlike the robot case, where we can read the joint sensors to get our current state, in perception we need to solve this harder problem at least once.
What does a solution for $X^O$ look like? Each point correspondence gives us three constraints (when $p^{s_i} \in \Re^3$). The exact number of unknowns depends on the particular representation we choose for the pose, but almost always this problem is dramatically over-constrained. Treating each point as a hard constraint on the relative pose would also be very susceptible to measurement noise. As a result, it will serve us better to try to find a pose that describes the data, e.g., in a least-squares sense: $$\min_{X \in \mathrm{SE}(3)} \sum_{i=1}^{N_s} \| X \: {}^Op^{m_{c_i}} - X^C \: {}^Cp^{s_i}\|^2.$$ Here I have used the notation $\mathrm{SE}(3)$ for the "special Euclidean group," which is the standard notation saying that $X$ must be a valid rigid transform.
To proceed, let's pick a particular representation for the pose to work with. I will use 3x3 rotation matrices here; the approach I'll describe below also has an equivalent in quaternion form. Writing the pose as a translation $p$ and a rotation matrix $R$, the optimization becomes \begin{align} \min_{p\in\Re^3,R\in\Re^{3\times3}} \quad & \sum_{i=1}^{N_s} \| p + R \: {}^Op^{m_{c_i}} - X^C \: {}^Cp^{s_i}\|^2, \\ \subjto \quad & R^T R = I, \quad \det(R) = +1.\end{align}
Let's think about the type of optimization problem we've formulated so far. Given our decision to use rotation matrices, the term $p + R{}^Op^{m_{c_i}} - X^C {^Cp^{s_i}}$ is linear in the decision variables (think $Ax \approx b$), making the objective a convex quadratic. How about the constraints? The first constraint is a quadratic equality constraint (to see it, rewrite it as $RR^T=I$ which gives 9 constraints, each quadratic in the elements of $R$). The determinant constraint is cubic in the elements of $R$.
I would like you to have a graphical sense for this optimization problem, but it's hard to plot the objective function with 9+3 decision variables. To whittle it down to the essentials that I can plot, let's consider the case where the scene points are known to differ from the model points by a rotation only (this is famously known as the Wahba problem). To further simplify, let's consider the problem in 2D instead of 3D.
In 2D, we can actually linearly parameterize a rotation matrix with just two variables: $$R = \begin{bmatrix} a & -b \\ b & a\end{bmatrix}.$$ With this parameterization, the constraints $RR^T=I$ and the determinant constraints are identical; they both require that $a^2 + b^2 = 1.$
Is that what you expected? I generated this plot using just two model/scene points, but adding more will only change the shape of the quadratic, not its minimum. And on the interactive version of the plot, I've added a slider so that you control the parameter, $\theta$, that represents the ground truth rotation described by $R$. Try it out!
In the case with no noise in the measurements (e.g. the scene points are exactly the model points modified by a rotation matrix), then the minimum of the objective function already satisfies the constraint. But if you add just a little noise to the points, then this is no longer true, and the constraint starts to play an important role.
The geometry should be roughly the same in 3D, though clearly in much higher dimensions. But I hope that the plot makes it perhaps a little less surprising that this problem has an elegant numerical solution based on the singular value decomposition (SVD).
What about translations? There is a super important insight that allows us to decouple the optimization of rotations from the optimization of translations. The insight is this: the relative position between points is affected by rotation, but not by translation. Therefore, if we write the points relative to some canonical point on the object, we can solve for rotations alone. Once we have the rotation, then we can back out the translations easily. For our least squares objective, there is even a "correct" canonical point to use -- the central point (like the center of mass if all points have equal mass) under the Euclidean metric.
Therefore, to obtain the solution to the problem, \begin{align} \min_{p\in\Re^3,R\in\Re^{3\times3}} \quad & \sum_{i=1}^{N_s} \| p + R \: {}^Op^{m_{c_i}} - p^{s_i}\|^2, \\ \subjto \quad & RR^T = I,\end{align} first we define the central model and scene points, $\bar{m}$ and $\bar{s}$, with positions given by $${}^Op^\bar{m} = \frac{1}{N_s} \sum_{i=1}^{N_s} {}^Op^{m_{c_i}}, \qquad p^\bar{s} = \frac{1}{N_s} \sum_{i=1}^{N_s} p^{s_i}.$$ If you are curious, these come from taking the gradient of the Lagrangian with respect to $p$, and setting it equal to zero: $$\sum_{i=1}^{N_s} 2 (p + R \: {}^Op^{m_{c_i}} - p^{s_i}) = 0 \Rightarrow p^* = \frac{1}{N_s} \sum_{i=1}^{N_s} p^{s_i} - R^*\left(\frac{1}{N_s} \sum_{i=1}^{N_s} {}^Op^{m_{c_i}}\right),$$ but don't forget the geometric interpretation above. This is just another way to see that we can substitute the right-hand side in for $p$ in our objective and solve for $R$ by itself.
To find $R$, compose the data matrix $$W = \sum_{i=1}^{N_s} (p^{s_i} -
p^\bar{s}) ({}^Op^{m_{c_i}} - {}^Op^\bar{m})^T, $$ and use SVD to compute $W
= U \Sigma V^T$. The optimal solution is \begin{gather*} R^* = U D V^T,\\
p^* = p^\bar{s} - R^* {}^Op^\bar{m},\end{gather*} where $D$ is the diagonal
matrix with entries $[1, 1, \det(UV^T)]$.
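In code, the whole procedure is only a few lines of numpy. This sketch assumes my own conventions: corresponded points stacked one per column, with model points in the object frame and scene points already expressed in the world frame:

```python
import numpy as np

def pose_from_correspondences(p_Om, p_Ws):
    """Least-squares X^O given corresponded points: p_Om is 3 x N (model
    points, object frame), p_Ws is 3 x N (scene points, world frame), with
    column i of p_Om corresponding to column i of p_Ws."""
    m_bar = p_Om.mean(axis=1, keepdims=True)   # central model point
    s_bar = p_Ws.mean(axis=1, keepdims=True)   # central scene point
    W = (p_Ws - s_bar) @ (p_Om - m_bar).T      # data matrix
    U, _, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt                             # optimal rotation
    p = s_bar - R @ m_bar                      # optimal translation
    return R, p
```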
In the example code, I've made a random object (based on a random set of points in the $x$-$y$ plane), and perturbed it by a random 2D transform. The blue points are the model points in the model coordinates, and the red points are the scene points. The green dashed lines represent the (perfect) correspondences. On the right, I've plotted both points again, but this time using the estimated pose to put them both in the world frame. As expected, the algorithm perfectly recovers the ground truth transformation.
What is important to understand is that once the correspondences are given we have an efficient and robust numerical solution to estimating the pose.
So how do we get the correspondences? In fact, if we were given the pose of the object, then figuring out the correspondences is actually easy: the corresponding point on the model is just the nearest neighbor / closest point to the scene point when they are transformed into a common frame.
This observation suggests a natural iterative scheme, where we start with some initial guess of the object pose and compute the correspondences via closest points, then use those correspondences to update the estimated pose. This is one of the famous and often used (despite its well-known limitations) algorithms for point cloud registration: the iterative closest point (ICP) algorithm.
To be precise, let's use $\hat{X}^O$ to denote our estimate of the object
pose, and $\hat{c}$ to denote our estimated correspondences. The
"closest-point" step is given by $$\forall i, \hat{c}_i = \argmin_{j \in
N_m}\| \hat{X}^O \: {}^Op^{m_j} - p^{s_i} \|^2.$$ In words, we want to
find the point in the model that is the closest in Euclidean distance to the
transformed scene points. This is the famous "nearest neighbor" problem,
and we have good numerical solutions (using optimized data structures) for
it. For instance, Open3D uses FLANN
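Here is a minimal sketch of that correspondence step, using scipy's KD-tree in place of FLANN (the array conventions follow the registration sketch above):

```python
import numpy as np
from scipy.spatial import KDTree

def closest_point_correspondences(p_Om, p_Ws, R_hat, p_hat):
    """For each scene point (column of p_Ws, world frame), return the index
    of the nearest model point after transforming the model by the current
    pose estimate (R_hat, p_hat)."""
    model_in_world = R_hat @ p_Om + p_hat   # p_hat is 3 x 1; broadcasts
    _, c_hat = KDTree(model_in_world.T).query(p_Ws.T)
    return c_hat                            # c_hat[i] = j
```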
Although solving for the pose and the correspondences jointly is very difficult, ICP leverages the idea that if we solve for them independently, then both parts have good solutions. Iterative algorithms like this are a common approach taken in optimization for e.g. bilinear optimization or expectation maximization. It is important to understand that this is a local solution to a non-convex optimization problem. So it is subject to getting stuck in local minima.
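Putting the two steps together gives a bare-bones ICP loop. This sketch reuses the two functions defined above and uses a crude convergence test:

```python
import numpy as np

def icp(p_Om, p_Ws, R_init=np.eye(3), p_init=np.zeros((3, 1)),
        max_iterations=100, tolerance=1e-9):
    """Bare-bones ICP: alternate nearest-neighbor correspondences with the
    SVD-based pose update until the cost stops improving."""
    R, p = R_init, p_init
    previous_cost = np.inf
    for _ in range(max_iterations):
        c_hat = closest_point_correspondences(p_Om, p_Ws, R, p)
        R, p = pose_from_correspondences(p_Om[:, c_hat], p_Ws)
        cost = np.sum((R @ p_Om[:, c_hat] + p - p_Ws) ** 2)
        if previous_cost - cost < tolerance:
            break
        previous_cost = cost
    return R, p
```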
Here is ICP running on the random 2D objects. Blue are the model points, red are the scene points, green dashed lines are the correspondences. I encourage you to run the code yourself.
I've included one of the animations where it happens to converge to the true optimal solution. But it gets stuck in a local minimum more often than not! I hope that stepping through will help your intuition. Remember that once the correspondences are correct, the pose estimation will recover the exact solution. So every one of those steps before convergence has at least a few bad correspondences. Look closely!
Intuition about these local minima has motivated a number of ICP variants, including point-to-plane ICP, normal ICP, ICP variants that use color information, feature-based ICP, etc. A particular variant based on "point-pair features" has been highly effective in a (nearly) annual object pose estimation challenge.
The example above is sufficient to demonstrate the problems that ICP can have with local minima. But we have so far been dealing with unreasonably perfect point clouds. Point cloud registration is an important topic in many fields, and not all of the approaches you will find in the literature are well suited to dealing with the messy point clouds we have to deal with in robotics.
I've added a number of parameters for you to play with in the notebook to add outliers (scene points that do not actually correspond to any model point), to add noise, and/or to restrict the points to a partial view. Please try them by themselves and in combination.
Once we have a reasonable estimate, $X^O$, then detecting outliers is straightforward. Each scene point contributes a term to the objective function according to the (squared) distance between the scene point and its corresponding model point. Scene points that have a large distance can be labeled as outliers and removed from the point cloud, allowing us to refine the pose estimate without these distractors. You will find that many ICP implementations, such as this one from Open3D, include a "maximum correspondence distance" parameter for this purpose.
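A sketch of that pruning step, written against the ICP sketch above (the 1 cm threshold is just an illustrative value):

```python
import numpy as np

def inlier_mask(p_Om, p_Ws, R_hat, p_hat, c_hat,
                max_correspondence_distance=0.01):
    """Boolean mask over scene points: True if the distance to the
    corresponding (transformed) model point is within the threshold [m]."""
    residuals = R_hat @ p_Om[:, c_hat] + p_hat - p_Ws
    return np.linalg.norm(residuals, axis=0) <= max_correspondence_distance
```

Scene points where the mask is False are dropped (or down-weighted) before the next pose update.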
This leaves us with a "chicken or the egg" problem -- we need a reasonable estimate of the pose to detect outliers, but for very messy point clouds it may be hard to get a reasonable estimate in the presence of those outliers. So where do we begin? One common approach that was traditionally used with ICP is an algorithm called RANdom SAmple Consensus (RANSAC). The idea behind RANSAC is to fit the model many times, each time to a small random subset of the data, and then to keep the fit that explains the largest number of points (the "inliers"); with enough random subsets, at least one of them is likely to be outlier-free.
There is another important and clever idea to know. Remember the
important observation that we can decouple the solution for rotation and
translation by exploiting the fact that the relative position
between points depends on the rotation but is invariant to translation.
We can take that idea one step further and note that the relative
distance between points is actually invariant to both rotation
and translation. This has traditionally been used to separate the
estimation of scale from the estimation of translation and rotation (I've
decided to ignore scaling throughout this chapter, since it doesn't make
immediate sense for our proposed problem). The same invariance can also be used to reject outlier correspondences: if two correspondences are both correct, then the distance between the two model points must match the distance between the two scene points, so mutually consistent correspondences form a clique in a suitably constructed graph, and we can search for the maximum clique.
Here is a simple example of the maximum-clique idea.
The model (on the left) is a simple triangle with vertex points labeled [$A$, $B$, $C$]. The scene (on the right) has the same triangle, with points labeled [$a$, $b$, $c$], which will have exactly three pairwise distances. But it also has a pyramid that will create a total of six pairwise distance matches (three of length 3 and three of length 4). Will the spurious correspondences on the pyramid create a larger clique than the ground-truth triangle correspondences [$A-a$, $B-b$, $C-c$]?
Let's construct the pairwise distance graph using all-to-all correspondences as the initial guess. We'll get a node in that graph for every possible correspondence, e.g. I've used the label $A-e$ to denote a possible correspondence of model point $A$ with scene point $e$. Then we make an edge in the graph between $(A-e)$ and $(B-f)$ iff the distance between $A$ and $B$ equals the distance between $e$ and $f$.
Now the value of checking for a clique becomes clear. The largest clique involving correspondences with the pyramid is only size 2, but the largest clique for the scene triangle is size 3 (I've colored it red). So for this example, the maximum clique does, in fact, recover the ground-truth correspondences.
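Here is a rough sketch of that construction, using networkx to find the maximum clique; `tol` sets how closely the pairwise distances must match, and the point arrays follow the earlier column-per-point convention (this all-pairs construction is only practical for small examples):

```python
import itertools
import numpy as np
import networkx as nx

def correspondence_graph(p_Om, p_Ws, tol=1e-3):
    """One node per candidate correspondence (model j, scene i); an edge
    between two correspondences iff they preserve the pairwise distance."""
    G = nx.Graph()
    nodes = list(itertools.product(range(p_Om.shape[1]), range(p_Ws.shape[1])))
    G.add_nodes_from(nodes)
    for (j1, i1), (j2, i2) in itertools.combinations(nodes, 2):
        if j1 == j2 or i1 == i2:
            continue  # a single point cannot appear twice in one clique
        d_model = np.linalg.norm(p_Om[:, j1] - p_Om[:, j2])
        d_scene = np.linalg.norm(p_Ws[:, i1] - p_Ws[:, i2])
        if abs(d_model - d_scene) < tol:
            G.add_edge((j1, i1), (j2, i2))
    return G

# The largest set of mutually consistent correspondences:
# clique, _ = nx.max_weight_clique(correspondence_graph(p_Om, p_Ws), weight=None)
```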
These are two examples from a zoo of algorithms and heuristics for removing outliers; the robust loss functions that we will discuss below, in the context of nonlinear optimization, provide yet another route.
Not all outliers are due to bad returns from the depth camera; very often they are due to other objects in the scene. Even in the most basic
case, our object of interest will be resting on a table or in a bin. If
we know the geometry of the table/bin a priori, then we can subtract away
points from that region. Alternatively, we can Crop
the point cloud to a region of interest. This can be sufficient for very
simple or uncluttered scenes, and will be just enough for us to accomplish
our goal for this chapter.
More generally, if we have multiple objects in the scene, we may still want to find the object of interest (mustard bottle?). If we had truly robust point cloud registration algorithms, then one could imagine that the optimal registration would allow us to correspond the model points with just a subset of the scene points; point cloud registration could solve the object detection problem! Unfortunately, our algorithms aren't strong enough.
As a result, we almost always use some other algorithm to "segment" the
scene in a number of possible objects, and run registration independently
on each segment. There are numerous geometric approaches to
segmentation, though these days the segmentation step is most often handled by deep learning; more on that in a later chapter.
The problem of outliers or multiple objects in the scene challenges our current approach to correspondences. So far, we've used the notation $c_i = j$ to represent the correspondence of every scene point to a model point. But note the asymmetry in our algorithm so far (scene $\Rightarrow$ model). I chose this direction because of the motivation of partial views from the mustard bottle example from the chapter opening. Once we start thinking about tables and other objects in the scene, though, maybe we should have done model $\Rightarrow$ scene? You should verify for yourself that this would be a minor change to the ICP algorithm. But what if we have multiple objects and partial views?
It would be simple enough to allow $c_i$ to take a special value for "no correspondence", and indeed that helps. In that case, we would hope that, at the optimum, the scene points corresponding to the model of interest would be mapped to their respective model points, and the scene points from outliers and other objects get labelled as "no correspondence". The model points that are from the occluded parts of the object would simply not have any correspondences associated with them.
There is an alternative notation that we can use which is slightly more general. Let's denote the correspondence matrix $C\in \{0,1\}^{N_s \times N_m}$, with $C_{ij} = 1$ iff scene point $i$ corresponds with model point $j$. Using this notation, we can write our estimation step (given known correspondences) as: $$\min_{X\in \mathrm{SE}(3)} \sum_{i=1}^{N_s} \sum_{j=1}^{N_m} C_{ij} \|X \: {}^Op^{m_j} - p^{s_i}\|^2.$$ This subsumes our previous approach: if $c_i = j$ then set $C_{ij} = 1$ and the rest of the row equal to zero. If the scene point $i$ is an outlier, then set the entire row to zero. The previous notation was a more compact representation for the true asymmetry in the "closest point" computation that we did above, and made it clear that we only needed to compute $N_s$ distances. But this more general notation will take us to the next round.
What if we relax our strict binary notion of correspondences, and think of them as correspondence weights $C_{ij} \in [0,1]$ instead of $C_{ij} \in \{0, 1\}$? This is the main idea behind the "coherent point drift" (CPD) algorithm. CPD gives the weights a probabilistic interpretation: the scene points are modeled as samples from a Gaussian mixture whose means are the transformed model points, and $C_{ij}$ is the probability that scene point $i$ was generated by the Gaussian centered on model point $j$.
The CPD algorithm is now very similar to ICP, alternating between assigning the correspondence weights and updating the pose estimate. The pose estimation step is almost identical to the procedure above, with the "central" points now $${^Op^{\bar{m}}} = \frac{1}{N_C} \sum_{i,j} C_{ij} \: {}^Op^{m_j}, \quad p^{\bar{s}} = \frac{1}{N_C} \sum_{i,j} C_{ij} p^{s_i}, \quad N_C = \sum_{i,j} C_{ij},$$ and the data matrix now: $$W = \sum_{i,j} C_{ij} \left(p^{s_i} - p^{\bar{s}}\right) \left({}^Op^{m_j} - {}^Op^{\bar{m}}\right)^T.$$ The rest of the updates, for extracting $R$ using SVD and solving for $p$ given $R$, are the same. You won't see the sums explicitly in the code, because each of those steps has a particularly nice matrix form if we collect the points into a matrix with one point per column.
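That matrix form might look roughly like this (a sketch, with $C$ stored as an $N_s \times N_m$ numpy array and the same column-per-point conventions as before):

```python
import numpy as np

def weighted_pose_update(p_Om, p_Ws, C):
    """Pose update for soft correspondences: C[i, j] is the weight between
    scene point i and model point j (an Ns x Nm array)."""
    Nc = C.sum()
    m_bar = (p_Om @ C.sum(axis=0) / Nc).reshape(3, 1)  # weighted central model point
    s_bar = (p_Ws @ C.sum(axis=1) / Nc).reshape(3, 1)  # weighted central scene point
    W = (p_Ws - s_bar) @ C @ (p_Om - m_bar).T          # weighted data matrix
    U, _, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt
    p = s_bar - R @ m_bar
    return R, p
```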
The probabilistic interpretation also gives us a strategy for
determining the covariance of the Gaussians on each iteration.
The word on the street is that CPD is considerably more robust than ICP with its hard correspondences and quadratic objective; the Gaussian model mitigates the effect of outliers by setting their correspondence weight to nearly zero. But it is also more expensive to compute all of the pairwise distances for large point clouds, though there are variants designed to scale better.
All of the algorithms we've discussed so far have exploited the SVD
solution to the pose estimate given correspondences, and alternate between
estimating the correspondences and estimating the pose. There is another
important class of algorithms that attempt to solve for both
simultaneously. This makes the optimization problem nonconvex, which
suggests they will still suffer from local minima like we saw in the
iterative algorithms. But many authors argue that the solution times
using nonlinear solvers can be on par with the iterative algorithms.
Another major advantage of using nonlinear optimization is that these solvers can accommodate a wide variety of objective functions to achieve the same sort of robustness to outliers that we saw from CPD. I've plotted a few of the popular objective functions above. The primary characteristic of these functions is that they taper, so that larger distances eventually do not result in higher cost. Importantly, like in CPD, this means that we can consider all pairwise distances in our objective, because if outliers add a constant value (e.g. 1) to the cost, then they have no gradient with respect to the decision variables and no impact on the optimal solution. We can therefore write, for our favorite loss function, $l(x)$, an objective of the form, e.g. $$\min \sum_{i,j} \left[ l\left(\| X^O \: {}^Op^{m_j} - p^{s_i} \|\right)\right].$$ And what are the decision variables? We can also exploit the additional flexibility of the solvers to use minimal parameterizations -- e.g. $\theta$ for rotations in 2D, or Euler angles for rotations in 3D. The objective function is more complicated but we can get rid of the constraints.
To understand precisely what we are giving up, let's consider a warm-up problem where we use nonlinear optimization on the minimal parameterizations for the pose estimation problem. As long as we don't add any constraints, we can still separate the solution for rotation from the solution for translation. So consider the problem: $$\min_\theta \sum_{i} \left\| \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \left({}^Op^{m_{c_i}} - {}^Op^{\bar{m}}\right) - \left( p^{s_i} - p^{\bar{s}}\right)\right\|^2.$$ We now have a complicated, nonlinear objective function. Have we introduced local minima into the problem?
As a thought experiment, consider the problem where all of our model points lie on a perfect circle. For any one scene point $i$, in the rotation-only optimization, the worst case is when our estimate $\theta$ is 180 degrees ($\pi$ radians) away from the optimal solution. The cost for that point would be $4r^2$, where $r$ is the radius of the circle. In fact, using the law of cosines, we can actually write the squared distance for the point for any error, $\theta_{err} = \theta - \theta^*,$ as $$\text{distance}^2 = r^2 + r^2 - 2 r^2 \cos\theta_{err}.$$ And in the case of the circle, every other model point contributes the same cost. This is not a convex function, but every minimum is a globally optimal solution (just wrapped around by $2\pi$).
In fact, even if we have a more complicated, non-circular geometry, this same argument holds. Every point will incur an error as it rotates around the circle, but may have a different radius. The error for all of the model points will decrease as $\cos\theta_{err}$ goes to one. So every minimum is a globally optimal solution. We haven't actually introduced local minima (yet)!
It is the correspondences, and other constraints that we might add to the problem, that really introduce local minima.
There is an exceptionally nice trick for speeding up the computation in these nonlinear optimizations, but it does require us to go back to thinking about the minimum distance from a scene point to the model, as opposed to the pairwise distance between points. This, I think, is not a big sacrifice now that we have stopped enumerating correspondences explicitly. Let's slightly modify our objective above to be of the form $$\min \sum_{i} \left[ l\left( \min_j \| X^O \: {}^Op^{m_j} - p^{s_i} \|\right)\right].$$ In words, we can apply an arbitrary robust loss function, but we will only apply it to the minimum distance between the scene point and the model.
The nested $\min$ functions look a little intimidating from a computational perspective. But this form, coupled with the fact that in our application the model points are fixed, actually enables us to do some pre-computation to make the online step fast. First, let's move our optimization into the model frame by changing the inner term to $$\min_j \|{}^Op^{m_j} - {}^OX^W p^{s_i}\|.$$ Now realize that for any 3D point $x$, we can precompute the minimum distance from $x$ to any point on our model, call it $\phi(x).$ The term above is just $\phi({^OX^Wp^{s_i}}).$ This function over 3D space, sometimes called a distance field, and the closely related signed distance function (SDF) and level sets are common representations in geometric modeling. And they are precisely what we need to quickly compute the cost from many scene points.
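Here is a sketch of that objective, using a KD-tree over the model points (`p_Om`, as in the earlier sketches) as a stand-in for a precomputed distance field $\phi$, and a truncated least-squares loss as the (tapering) robust loss; both choices are just examples:

```python
import numpy as np
from scipy.spatial import KDTree

model_tree = KDTree(p_Om.T)  # precompute once; plays the role of phi(x)

def robust_registration_cost(R, p, p_Ws, delta=0.01):
    """Sum over scene points of a truncated least-squares loss applied to
    the minimum distance (computed in the object frame) to the model."""
    p_Os = R.T @ (p_Ws - p)            # apply ^OX^W to the scene points
    phi, _ = model_tree.query(p_Os.T)  # minimum distance to the model
    return np.sum(np.minimum(phi**2, delta**2))
```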
As the figure I've included above suggests, we can use the
precomputed distance functions to represent the minimum distance to a
mesh model instead of just a point cloud. And with some care, we can
define alternative distance-like functions in 3D space that have smooth
subgradients which can work better with nonlinear optimization. There
is a rich literature on this topic.
Is there any hope of exploiting the beautiful structure that we found in the pose estimation with known correspondences with an algorithm that searches for correspondences and pose simultaneously?
There are some algorithms that claim global optimality for the ICP
objective, typically using a branch-and-bound strategy.
More recent work, such as the TEASER algorithm, provides fast and certifiable solutions to the robust registration problem.
We've explored two main lineages of algorithms for the pose estimation -- one based on the beautiful SVD solutions and the other based on nonlinear optimization. As we will see, non-penetration and free-space constraints are, in most cases, non-convex constraints, so are a better match for the nonlinear optimization approach. But there are some examples of convex non-penetration constraints (e.g., when points must not penetrate a half-plane) and it is possible to include these in our convex optimization approach. I'll illustrate both versions here with a simple example.
For simplicity, I will restrict this example to 2D, where we can parameterize the rotation matrices: $$R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix},$$ and will solve the problem with known correspondences. Please do remember, though, that we can also include the correspondence search in the nonlinear optimization framework.
Let's solve the following optimization: \begin{align*} \min_{p,\theta}
\quad & \sum_i \| p + R(\theta) \: {}^Op^{m_{c_i}} - p^{s_i} \|^2, \\
\subjto \quad & {}^W p^{m_i} \ge 0, \quad \forall i \in [1, N_m].
\end{align*} In order to add the nonlinear costs and constraints, we pass a Python function, together with the decision variables to be bound to that function, to the MathematicalProgram AddCost and AddConstraint methods. I've provided an implementation in this notebook:
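A rough sketch of what that implementation might look like is below. The variable names are mine; `p_Om` and `p_Ws` are assumed to be corresponded 2 x N arrays (for simplicity, every model point is assumed to be corresponded one-to-one, so the elementwise half-plane constraint ${}^Wp^{m_i}\ge 0$ is applied to the same columns used in the cost):

```python
import numpy as np
from pydrake.solvers import MathematicalProgram, Solve

prog = MathematicalProgram()
p = prog.NewContinuousVariables(2, "p")
theta = prog.NewContinuousVariables(1, "theta")
decision_vars = np.concatenate([p, theta])

def R_of(th):
    return np.array([[np.cos(th), -np.sin(th)],
                     [np.sin(th),  np.cos(th)]])

def pose_cost(v):
    p_, th = v[:2], v[2]
    err = p_.reshape(2, 1) + R_of(th) @ p_Om - p_Ws
    return np.sum(err**2)

prog.AddCost(pose_cost, vars=decision_vars)

# Non-penetration: every transformed model point must satisfy ^W p^{m_i} >= 0
# elementwise (i.e., it stays in the first quadrant).
for i in range(p_Om.shape[1]):
    prog.AddConstraint(
        lambda v, i=i: v[:2] + R_of(v[2]) @ p_Om[:, i],
        lb=[0.0, 0.0], ub=[np.inf, np.inf], vars=decision_vars)

result = Solve(prog, initial_guess=np.zeros(3))
p_hat = result.GetSolution(p)
theta_hat = result.GetSolution(theta)[0]
```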
For a convex parameterization of the rotation matrices in this 2D example, we can leverage our convenient form of the rotation matrices: $$R = \begin{bmatrix} a & -b \\ b & a \end{bmatrix}.$$ Recall that the rotation matrix constraints for this form reduce to $a^2+b^2=1$. These are non-convex constraints, but we can relax them to $a^2 + b^2 \le 1.$ This is almost exactly the problem we visualized in this earlier example, except that we will add translations here. The relaxation is exactly changing the non-convex unit circle constraint into the convex unit disk constraint. Based on that visualization, we have reason to be optimistic that the unit disk relaxation could be tight.
Let's solve the following optimization: \begin{align*} \min_{p,a,b} \quad & \sum_i \| p + R \: {}^Op^{m_{c_i}} - p^{s_i} \|^2, \\ \subjto \quad & a^2 + b^2 \le 1, \\ & {}^W p^{m_i} \ge 0, \quad \forall i \in [1, N_m], \end{align*} where $R$ depends on $a, b$ as above. Just as the addition of constraints forced us to move from the pseudo-inverse solution to a numerical optimization-based solution in the last chapter, the solution to this problem is no longer given simply by SVD. As formulated, this optimization falls under the category of a Second-Order Cone Program (SOCP), though we need to use a slack variable to put it into the standard form with a linear objective.
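For contrast, here is a sketch of the relaxed version, written with Drake's symbolic expressions (again with `p_Om` and `p_Ws` as corresponded 2 x N arrays; the Lorentz-cone constraint encodes the unit-disk relaxation):

```python
import numpy as np
from pydrake.solvers import MathematicalProgram, Solve
from pydrake.symbolic import Expression

prog = MathematicalProgram()
p = prog.NewContinuousVariables(2, "p")
a = prog.NewContinuousVariables(1, "a")[0]
b = prog.NewContinuousVariables(1, "b")[0]
R = np.array([[a, -b], [b, a]])

# Convex quadratic objective over the corresponded points.
cost = Expression(0)
for i in range(p_Ws.shape[1]):
    err = p + R @ p_Om[:, i] - p_Ws[:, i]
    cost += err.dot(err)
prog.AddQuadraticCost(cost)

# Relaxed rotation constraint: a^2 + b^2 <= 1 (unit disk, not unit circle).
prog.AddLorentzConeConstraint(
    np.array([Expression(1), Expression(a), Expression(b)]))

# Half-plane non-penetration: ^W p^{m_i} >= 0 elementwise (linear in p, a, b).
for i in range(p_Om.shape[1]):
    p_Wm = p + R @ p_Om[:, i]
    prog.AddLinearConstraint(p_Wm[0] >= 0)
    prog.AddLinearConstraint(p_Wm[1] >= 0)

result = Solve(prog)
a_hat, b_hat = result.GetSolution(a), result.GetSolution(b)
p_hat = result.GetSolution(p)
```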
Please be careful. Now that we have a constraint that depends on $p$, our original approach to solving for $R$ independently is no longer valid (at least not without modification). I've provided a simple implementation in the notebook.
This is a slightly exaggerated case, where the scene points are really pulling the box into the constraints. We can see that the relaxation is not tight in this situation; the box is being shrunk slightly to explain the data.
Constrained optimizations like this can be made relatively efficient, and are suitable for use in an ICP loop for modest-sized point clouds. But the major limitation is that we can only take this approach for convex non-penetration constraints (or convex approximations of the constraints), like the "half-plane" constraints we used here. It is probably not very suitable for non-penetration between two objects in the scene.
For the record, the fact that I was able to easily add this example here is actually pretty special. I don't know of another toolbox that brings together the advanced optimization algorithms with the geometry / physics / dynamical systems in the way that Drake does.
There is another beautiful idea to mention here: "free-space" constraints. The depth image tells us not only that there is geometry at each returned point, but also that there is no geometry along the ray between the camera and that point, and we can penalize or constrain poses that would place the model inside that observed free space.
Ultimately, as much as I prefer the convex formulations with their potential for deeper understanding (by us) and for stronger guarantees (for the algorithms), the ability to add non-penetration constraints and free-space constraints is simply too valuable to ignore. Deep learning methods can now often provide a good initial guess, which can mitigate some of the concerns about local minima. I hope that, if you come back to these notes in a year or two, I will be able to report that we have strong results for these more complex formulations. But for now, in most applications, I will steer you towards the nonlinear optimization approaches and taking advantage of these constraints.
Now we can put all of the pieces together. In the notebook, I've created an example with the mustard bottle in one bin. First, I use ICP to localize its pose. Then, I plan a simple trajectory like we did in the last chapter to pick it up and place it in the second bin. Finally, I use differential inverse kinematics to execute it. Very satisfying!
Although I very much appreciate the ability to think rigorously about these geometric approaches, we need to think critically about whether we are solving for the right objectives.
Even if we stay in the realm of pure geometry, it is not clear that a least-squares objective with equal weights on the points is correct. Imagine a tall skinny book laying flat on the table -- we might only get a very small number of returns from the edges of the book, but those returns contain proportionally much more information than the slew of returns we might get from the book cover. It is no problem to include relative weights in the estimation objectives we have formulated here, but we don't yet have very successful geometry-based algorithms for deciding what those weights should be in any general way. (There has been a lot of research in this direction, but it's a hard problem.)
Please realize, though, that as beautiful as geometry is, we are so far
all but ignoring the most important information that we have: the color
values! While it is possible to put color and other features into an
ICP-style pipeline, it is very hard to write a simple distance metric in
color space (the raw color values for a single object might be very
different in different lighting conditions, and the color values of
different objects can look very similar). Advances in computer vision,
especially those based on deep learning over the last few years, have
brought new life to this problem. When I asked my students recently "If
you had to give up one of the channels, either depth or color, which would
you give up?" the answer was a resounding "I'd give up depth; don't take
away my color!" That's a big change from just a few years ago. As recently
as 2019, in the Benchmark for 6D Object Pose Estimation (a nearly annual
competition), geometric pose estimation was still outperforming
deep-learning based approaches.
But deep learning and geometry can (should?) work nicely together. The
winners in the 2020 version of the object pose estimation challenge used
deep learning to make the initial guess at the pose, but still used
geometric perception (a variant of ICP) and the depth channel for refining the estimate.
Consider the problem of point cloud registration with known correspondences. Points can be directly mapped to corresponding model points to specify the position and orientation of an object in the scene. In most cases, we have far more points than we have decision variables on the object's pose. Therefore, treating each point correspondence as an equality constraint would make the problem over-constrained. This raises a natural question:
We have a beautiful formulation for the point cloud registration problem with known correspondences -- it has a quadratic objective and (with the determinant constraint relaxed) a quadratic equality constraint. Sometimes we have objects that are rotationally symmetric -- think of a box that is a perfect cube. How can the quadratic objective capture a problem where there should be equally good solutions at rotations that are 90 degrees apart?
If you run the ICP example code which has randomized object geometry
and ground truth object pose, you can't help but notice that the algorithm
does a quite reasonable job of estimating translation, but (at least for
the geometries I've generated here) does a fairly lousy job with rotation.
One mitigation is to run ICP multiple times with different initial
estimates of the rotation. This is a reasonable strategy for any
nonconvex optimization problem with local minima, but is particularly
useful here since even when the point clouds are quite complex, the
dimensionality of the search space is low. In particular, we can generate
a reasonable coverage of 2D or even 3D rotations with a modest number of
samples (for 3D, consider using UniformlyRandomRotationMatrix). Furthermore, running ICP from multiple initial conditions can be done in parallel.
Try implementing ICP from multiple initial estimates $\hat{X}^O$, sampling only in rotation. If you only keep the best one (lowest estimation error), then how much does it improve the performance over a single ICP run?
Consider the case of point registration where the rotation component of $X^O$ is known, but not the translation. (In other words, the opposite of Example 4.2.)
Specifically, say your scene points ${}^WX^C{}^Cp^{s_i}={}^Wp^{s_i}$ are defined as follows:
$$ \begin{aligned} {}^Wp^{s_0}&=(1,5) \\ {}^Wp^{s_1}&=(3,10) \\ {}^Wp^{s_2}&=(5,10) \end{aligned} $$ And your model points are defined as follows:
$$ \begin{aligned} {}^Op^{m_0}&=(-2,-5) \\ {}^Op^{m_1}&=(0,0) \\ {}^Op^{m_2}&=(2,0) \end{aligned} $$ As you can see, both triangles are in the same orientation, so $R^O= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$. However, we still need to solve for the translation component of $X^O$.
(In this example, we can easily separate translation and rotation because the orientation of the model already matches the scene, but this decoupling actually works in the general case as well. See the explanation following Example 4.2 in the textbook.)
We saw that ICP has many local minima. But what are these local minima, and can we say something about how wrong our initial guess has to be before we arrive at one of them? Although the analysis can get complicated for general geometry, let's start by analyzing a simple example of a planar two-point ICP.
The model points and the scene points in their corresponding frame are given as:
$${}^Op^{m_1}=\begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad {}^Op^{m_2}=\begin{pmatrix} -1 \\ 0 \end{pmatrix} \qquad {}^Wp^{s_1}=\begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad {}^Wp^{s_2}=\begin{pmatrix} -1 \\ 0 \end{pmatrix}.$$ The true correspondence is given by their numbering, but note that we don't know the true correspondence -- ICP simply determines it based on nearest neighbors. Given our vanilla-ICP cost (sum of pairwise distances squared), we can parametrize the 2D pose $X^O$ with $p_x,p_y,a,b$ and write down the resulting optimization as:
$$\begin{aligned} \min_{p_x,p_y,a,b} \quad & \sum_i \bigg\| \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} a & -b \\ b & a \end{pmatrix} {}^Op^{m_{c_i}} - {}^Wp^{s_i} \bigg\|^2 \\ \text{s.t.} \quad & a^2 + b^2 = 1. \end{aligned}$$
For this exercise, you will implement the ICP algorithm to match pointclouds of two Stanford bunnies. You will work exclusively in the provided notebook. You will be asked to complete the following steps:
For this exercise, you will remove the environmental background from the Stanford bunny pointcloud using the RANSAC algorithm. You will work exclusively in the provided notebook. You will be asked to complete the following steps:
In this question you'll explore a technique for handling outliers in ICP and think through how the distance metric impacts the robustness of ICP to outliers. There are two parts to this question and each part has three subquestions.
Let's go back to the example from Chapter 3, but relax the assumption that we have the pose of the red foam brick. You will be given a simulated raw pointcloud of our previous setup, captured from a depth camera. Your task is to perform segmentation on this raw pointcloud data, and then perform ICP to estimate the pose of the brick. You will work exclusively in the provided notebook. You will be asked to complete the following steps:
In this problem we will explore signed distance functions and their composition! As a reminder, the signed distance function of a set gives the minimum distance of a point $p$ to the boundary of that set, where the sign of the value is based on whether the point $p$ is inside or outside the boundary. Points that are on the interior of the boundary are negative, and points that are on the exterior of the boundary are positive.
For a 2D example, consider the green box visualized below. The center of the box, at the origin, has a signed distance value of -1, since it is distance one from any of the edges of the box and it lies on the interior of the box. Points along the perimeter of the box have a signed distance value of 0. Any points outside the box have a signed distance value of $+d$ where $d$ is the smallest distance between that point and the perimeter of the box.
In the figure below we draw two objects: a shaded red object and a shaded blue object (the purple region is where the two objects overlap, and the objects should be treated as infinite half planes). Let $\phi_r(p)$ be the signed distance value of a point $p$ to the boundary of the red object and $\phi_b(p)$ be the signed distance value of point $p$ to the boundary of the blue object. As an example $\phi_b(p_0) = -3, \phi_b(p_1) = +2$.
One incorrect method to compute the SDF value for the union of two objects is to simply take the minimum of the SDF value for each of the objects. (It is worth taking a minute to convince yourself why this method is incorrect!). While this method can produce the incorrect distance, it will produce a value with the correct sign (again, convince yourself this is true!).
Therefore, this "incorrect" method can tell us if we are in penetration or out of penetration for the union of our set of objects. This makes it useful for computing a non-penetration constraint! We'll refer to the SDF generated by this "incorrect-but-useful" method as Penetration-Constraint-SDF.
Let's see how we can use this!