Perception, Planning, and Control
© Russ Tedrake, 2020-2024
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Fall 2024 semester.
In the previous chapter, we discussed deep-learning approaches to object detection and (instance-level) segmentation; these are general-purpose tasks for processing RGB images that are used broadly in computer vision. Detection and segmentation can already be combined with our geometric-perception tools: for instance, we can estimate the pose of a known object using only the segmented point cloud instead of the entire scene, or run our point-cloud grasp selection algorithm on just the points belonging to the object we want to pick up.
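As a concrete (if simplified) illustration of that last point, here is a minimal sketch of turning a depth image plus an instance mask into an object-only point cloud via standard pinhole back-projection. The function name and argument conventions are my own for illustration, not from any particular library:

```python
import numpy as np

def segmented_point_cloud(depth, mask, K):
    # depth: (H, W) depth image in meters; mask: (H, W) boolean instance mask
    # produced by the segmentation network; K: 3x3 camera intrinsics matrix.
    v, u = np.nonzero(mask & (depth > 0))   # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]         # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)      # (N, 3) points in the camera frame
```

The resulting points can then be used in place of the full-scene cloud for ICP-style pose estimation or antipodal grasp sampling.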
One of the most amazing features of deep learning for perception is that we can pre-train on a different dataset (like ImageNet or COCO) or even a different task and then fine-tune on our domain-specific dataset or task. But what are the right perception tasks for manipulation? Object detection and segmentation are a great start, but often we want to know more about the scene/objects to manipulate them. That is the topic of this chapter.
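To make the pre-train/fine-tune recipe concrete, here is a minimal PyTorch-style sketch; the backbone, the frozen-feature strategy, and the number of outputs are placeholders for illustration (and the exact weights API depends on your torchvision version), not a recommendation from these notes:

```python
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained backbone and attach a new task-specific head
# that will be trained on our (much smaller) manipulation dataset.
num_outputs = 7   # placeholder: e.g. a quaternion + translation for pose regression
model = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False               # freeze the pretrained features...
model.fc = nn.Linear(model.fc.in_features, num_outputs)  # ...and train only this head
```

In practice one often unfreezes the backbone after a few epochs, or fine-tunes it with a much smaller learning rate.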
There is a potential answer to this question that we will defer to a later chapter: learning end-to-end "visuomotor" policies, sometimes affectionately referred to as "pixels to torques". Here I want us to think first about how we can combine a deep-learning-based perception system with the powerful existing (model-based) tools that we have been building up for planning and control.
I'll start with the deep-learning version of a perception task we've already considered: object pose estimation.
We discussed pose estimation at some length in the geometric perception chapter, and came away with a few lessons. Most importantly, the geometric approaches have only a very limited ability to make use of RGB values, yet these values are incredibly valuable for resolving a pose; geometry alone doesn't tell the full story. Another, subtler lesson was that the ICP loss, although conceptually very clean, does not capture richer concepts like non-penetration and free-space constraints. As the core problems in 2D computer vision started to feel "solved," we've seen a surge of interest and activity from the computer vision community in 3D perception, which is great for robotics!
The conceptually simplest version of this problem is to estimate the pose of a known object from a single RGB image. How should we train a deep network that takes an RGB image as input and outputs a pose estimate for a specific object (our mustard bottle, say)? Of course, if we can do this, we can also apply the idea to, e.g., images cropped from the bounding-box output of an object recognition / instance segmentation system.
Once again, we must confront the question of how best to represent a pose. Many initial architectures discretized the pose space into bins and formulated pose estimation as a classification problem, but the trend eventually shifted towards pose regression, in which the network directly outputs the parameters of a chosen rotation (and translation) representation.
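As a sketch of what "pose regression" means in practice, here is a hypothetical single-object pose regressor that maps an RGB crop to a unit quaternion and a translation; the architecture is illustrative only, not taken from any particular paper:

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PoseRegressionNet(nn.Module):
    """Hypothetical regressor: RGB crop -> (unit quaternion, translation)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()        # expose the 512-d feature vector
        self.backbone = backbone
        self.head = nn.Linear(512, 7)      # 4 quaternion + 3 translation numbers

    def forward(self, rgb):                # rgb: (batch, 3, H, W)
        out = self.head(self.backbone(rgb))
        quat = F.normalize(out[:, :4], dim=-1)   # project onto the unit quaternions
        xyz = out[:, 4:]
        return quat, xyz
```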
Perhaps more substantially, many works have pointed out that outputting a single "correct" pose is simply not sufficient: when an object is symmetric or partially occluded, several quite different poses can explain the image equally well, which motivates outputting multiple hypotheses or even an entire distribution over poses.
Whichever pose representation is used for the network output and the ground-truth labels, one must still choose an appropriate loss function. Quaternion-based loss functions can be used to compute the geodesic distance between two orientations, and are certainly more appropriate than e.g. a least-squares metric on Euler angles. More expensive, but potentially more suitable, is to write the loss function in terms of a reconstruction error, so that the network is not artificially penalized for e.g. symmetries which it could not possibly resolve.
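To make the comparison concrete, here is a minimal sketch of both kinds of loss: a quaternion geodesic distance, and reconstruction-style losses in the spirit of the widely used ADD / ADD-S metrics, where the symmetric variant matches each predicted model point to its nearest ground-truth point so that symmetry-equivalent poses are not penalized. Function names and conventions are mine, not from the course code:

```python
import torch

def quaternion_geodesic_loss(q_pred, q_true):
    # Unit quaternions, shape (batch, 4). The absolute value handles the double
    # cover (q and -q represent the same rotation).
    dot = torch.sum(q_pred * q_true, dim=-1).abs().clamp(max=1.0)
    return (2.0 * torch.acos(dot)).mean()        # mean geodesic angle in radians

def add_loss(R_pred, t_pred, R_true, t_true, model_points):
    # Reconstruction (ADD-style) loss: average distance between model points
    # transformed by the predicted pose and by the ground-truth pose.
    p_pred = model_points @ R_pred.T + t_pred    # model_points: (N, 3)
    p_true = model_points @ R_true.T + t_true
    return (p_pred - p_true).norm(dim=-1).mean()

def add_s_loss(R_pred, t_pred, R_true, t_true, model_points):
    # Symmetric (ADD-S-style) variant: match each predicted point to the
    # *closest* ground-truth point, so a symmetric object is not penalized for
    # poses that look identical.
    p_pred = model_points @ R_pred.T + t_pred
    p_true = model_points @ R_true.T + t_true
    dists = torch.cdist(p_pred, p_true)          # (N, N) pairwise distances
    return dists.min(dim=1).values.mean()
```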
Training a network to output an entire distribution over poses brings up additional interesting questions about the choice of loss function. While it is possible to train the distribution based only on the statistics of the data labeled with ground-truth poses (again choosing, e.g., a maximum-likelihood loss vs. a mean-squared error), it is also possible to use our understanding of the object's symmetries to provide more direct supervision.
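Purely as an illustration of that last idea (not a description of any particular paper), here is a minimal sketch in which the network predicts a categorical distribution over a fixed grid of candidate rotations, and the loss maximizes the probability mass assigned to every grid rotation that is equivalent to the label under the object's known symmetry group:

```python
import torch
import torch.nn.functional as F

def symmetry_aware_nll(logits, R_true, R_grid, R_sym, angle_tol=0.2):
    # logits: (B, K) scores over a fixed grid of K candidate rotations R_grid (K, 3, 3).
    # R_true: (B, 3, 3) ground-truth rotations; R_sym: (S, 3, 3) known symmetry
    # rotations of the object (include the identity). angle_tol (radians) should be
    # at least the grid spacing so that the "correct" set is never empty.
    log_p = F.log_softmax(logits, dim=-1)                    # (B, K)
    R_equiv = torch.einsum('bij,sjk->bsik', R_true, R_sym)   # all equivalents of the label
    traces = torch.einsum('kij,bsij->bsk', R_grid, R_equiv)  # tr(R_grid^T R_equiv)
    angles = torch.acos(torch.clamp((traces - 1.0) / 2.0, -1.0, 1.0))
    correct = angles.min(dim=1).values < angle_tol           # (B, K) mask of acceptable bins
    masked = log_p.masked_fill(~correct, float('-inf'))
    return -torch.logsumexp(masked, dim=-1).mean()           # prob. mass on the correct set
```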
Although pose estimation is a natural task, and it is straightforward to plug an estimated pose into many of our robotics pipelines, I feel pretty strongly that this is often not the right choice for connecting perception to planning and control. Although some attempts have been made to generalize pose to categories of objects, a single rigid-body pose is not even well defined for deformable objects, piles of objects, or novel instances within a category.
A more useful interface between perception and planning/control is often a learned representation that is tailored to the task, rather than an explicit pose of a particular object instance. Two of the earliest and most successful examples of this were Transporter nets and kPAM (KeyPoint Affordances for Category-Level Robotic Manipulation). kPAM, for instance, replaces the object pose with a small set of semantic 3D keypoints detected by the network; the manipulation goal is then written directly as costs and constraints on those keypoints, which generalizes naturally across object instances within a category.
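To give a flavor of why keypoints make such a convenient interface between perception and planning (this is a drastic simplification of kPAM's actual optimization, which handles more general costs and constraints), here is a sketch that turns a handful of detected 3D keypoints and their desired goal locations into a rigid motion via least-squares (Kabsch / orthogonal Procrustes) alignment; all names are hypothetical:

```python
import numpy as np

def keypoint_goal_transform(p_detected, p_goal):
    # p_detected: (N, 3) semantic keypoints detected on the object (world frame).
    # p_goal:     (N, 3) where the task says those keypoints should end up.
    # Returns R, t minimizing sum_i || R @ p_detected[i] + t - p_goal[i] ||^2,
    # i.e. the rigid motion to command for the object.
    ca, cb = p_detected.mean(axis=0), p_goal.mean(axis=0)
    H = (p_detected - ca).T @ (p_goal - cb)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t
```

Because the keypoints are defined at the category level (e.g. "the handle," "the top rim"), the same goal specification can apply to object instances the network has never seen.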
My coverage above is necessarily incomplete and the field is moving fast. Here is a quick "shout out" to a few other very relevant ideas.
More coming soon...
In this problem you will further explore Dense Object Nets, which were introduced in lecture. Dense Object Nets quickly learn consistent pixel-level descriptors that are useful for visual understanding and manipulation. They are powerful because the representations they predict apply to both rigid and non-rigid objects, generalize to new objects in the same class, and can be trained with self-supervised learning. For this problem you will work in the accompanying notebook to first implement the loss function used to train Dense Object Nets, and then use a trained Dense Object Net to predict correspondences between images.
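For orientation before you start: the loss described in the Dense Object Nets paper is a pixelwise contrastive loss, in which descriptor distances for matching pixels across two images are pulled toward zero while non-matching pixels are pushed apart by at least a margin. Here is a minimal sketch; the tensor shapes, indexing conventions, and normalization in the course notebook may differ:

```python
import torch

def pixelwise_contrastive_loss(des_a, des_b, matches_a, matches_b,
                               nonmatches_a, nonmatches_b, margin=0.5):
    # des_a, des_b: dense descriptor images of shape (D, H, W), one per RGB image.
    # matches_* / nonmatches_*: long tensors of flattened pixel indices (v * W + u)
    # for corresponding (and non-corresponding) pixels in image a and image b.
    D, H, W = des_a.shape
    flat_a = des_a.reshape(D, H * W)
    flat_b = des_b.reshape(D, H * W)

    # Matching pixels: pull their descriptors together.
    d_match = (flat_a[:, matches_a] - flat_b[:, matches_b]).norm(dim=0)
    match_loss = (d_match ** 2).mean()

    # Non-matching pixels: push their descriptors at least `margin` apart (hinge).
    d_nonmatch = (flat_a[:, nonmatches_a] - flat_b[:, nonmatches_b]).norm(dim=0)
    nonmatch_loss = (torch.clamp(margin - d_nonmatch, min=0) ** 2).mean()

    return match_loss + nonmatch_loss
```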