Modern Perception in Autonomous Vehicles

Cedric Warny
14 min read · Apr 18, 2024

In this post, I describe modern perception techniques and how they can be fruitfully used in autonomous vehicles.

Modular vs end-to-end self-driving

Perception is a crucial capability of autonomous vehicles. It’s responsible for perceiving infrastructure and road agents, upstream of issuing control commands to the self-driving vehicle. Perception is typically split into two modules: detection and tracking. Detection is responsible for locating agents in a single “video frame” and representing each of them, typically as a box. Tracking is responsible for stitching together the boxes belonging to the same agent into a history. Once you have a history for all surrounding agents, you pass that information to the modules downstream of perception, namely prediction and planning.

While common, this approach of detecting and tracking an agent as a sequence of boxes has significant downsides. It introduces a representational bottleneck for downstream modules in the form of a poor representation of the agents in a scene. Indeed, a box boils down to 7 numbers: position (x and y), heading (the angle the agent is pointing in), velocity (vx, vy), and shape (width and length). (The Z axis typically doesn’t matter for these states, as agents are usually represented in a “bird’s eye view”.) Modules downstream of detection have to carry out the complex tasks of tracking, prediction, and planning based on this bare-bones representation of agents.
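To make this concrete, here is a minimal sketch of that 7-number representation as a plain data structure. The field names and units are illustrative, not taken from any particular stack.

```python
from dataclasses import dataclass

@dataclass
class AgentBox:
    """The "poor" bird's-eye-view representation of one agent: 7 numbers."""
    x: float        # position along the ego-forward axis (m)
    y: float        # position along the ego-left axis (m)
    heading: float  # yaw angle (rad)
    vx: float       # velocity along x (m/s)
    vy: float       # velocity along y (m/s)
    width: float    # box width (m)
    length: float   # box length (m)
```

Everything downstream of detection has to reason about the scene through lists of these seven numbers per agent.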

This approach can also be described as modular, with clear contracts between the modules. The detection module has a contract with the tracking module to provide a set of boxes at each frame. The tracking module in turn has a contract with the prediction module to provide a set of tracks going back a few seconds in the past. The prediction module has a contract with the planning module to provide the state of each agent up to several seconds into the future. Notice how human-readable those contracts are. That makes sense since they were created by humans. But what’s convenient for humans may not be convenient for a machine.
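For illustration, the contracts described above could be written down as function signatures like the following. This is a hypothetical sketch (the names are mine, and it reuses the AgentBox structure from the earlier snippet), but it shows how human-readable, and how lossy, the interfaces are.

```python
from typing import List

Track = List["AgentBox"]       # a few seconds of past boxes for one agent
Trajectory = List["AgentBox"]  # several seconds of predicted future boxes

def detect(frame) -> List["AgentBox"]:
    """Detection -> tracking contract: a set of boxes per frame."""
    ...

def update_tracks(detections: List["AgentBox"], tracks: List[Track]) -> List[Track]:
    """Tracking -> prediction contract: per-agent histories."""
    ...

def predict(tracks: List[Track]) -> List[Trajectory]:
    """Prediction -> planning contract: per-agent future states."""
    ...

def plan(trajectories: List[Trajectory]):
    """Planning: turn predicted futures into control commands."""
    ...
```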

This is a common pattern. Humans break down a problem into symbols they can manipulate and pass around between modules with human-defined APIs. Those symbols represent elements of a solution as a human sees them, and they usually correspond to heavily simplified representations. Representing agents as boxes seems totally natural to a human, but isn’t obvious for a machine. This representational choice, though, unwittingly influences the way we modularize our perception stack. If you represent detections as boxes, it’s only natural to represent tracks as sequences of such boxes, which in turn defines the tracking problem as the problem of linking box-like detections from one frame to the next. It almost feels like the only correct definition of the tracking problem: spit out an N-by-M matrix linking N tracks to M detections. Early representational choices constrain your solution space, often without you even noticing. In reality, there are many more ways of defining the tracking problem.

The counterintuitive lesson of the modern deep learning era is to resist making representational choices and to let the neural network “grow” its own rich representations organically. In autonomous driving, this is usually associated with “going end-to-end”: having a single neural network take raw sensor data on one end and spit out control commands on the other, without human-readable representations passed between modules. In addition, the more end-to-end the self-driving stack becomes, the more we can literally, and not just metaphorically, backpropagate mistakes in our actual goal all the way to our inputs. Our actual goal isn’t to draw boxes around surrounding agents. Our actual goal is to press the accelerator, hit the brakes, and turn the wheel appropriately. Going end-to-end allows us to backpropagate errors in those tasks rather than errors in box drawing.

Partial end-to-end

True end-to-end self-driving that maintains a rich representation of a scene throughout the autonomous vehicle stack is a tall order. It can be de-risked by first maintaining a rich representation between detection and tracking only, decoding agents into a poor representation after tracking instead of after detection, and leaving the prediction and planning modules unaffected. Over the past few years, the scientific literature has shifted its focus from so-called tracking-by-detection approaches (where detection and tracking are done separately) to joint detection-and-tracking approaches (where detection and tracking are done simultaneously). An alternative terminology would be explicit tracking vs implicit tracking.

In explicit tracking, objects are first identified in each frame and represented by various features. These features range from basic ones, such as the high-level object type, position, speed, heading, and rough shape, to fancier ones, such as appearance or a more fine-grained shape representation (e.g., a convex polygon rather than a simple box). In any case, those are explicit, hand-crafted features that characterize an object in a way that is cut off from the scene itself. Here you have a car of roughly 2 by 4 meters, going 13 miles per hour in this or that direction. This description is no longer grounded in a specific scene. We lose many clues from the scene that matter for the tracking, prediction, and planning tasks. A separate process then tries to explicitly associate those ungrounded detections from one frame to the next. In the limit, when the detections are so ungrounded as to be represented by nothing more than boxes, the association step merely computes overlaps between the boxes in one frame and the boxes in the next, and links up the most overlapping ones. This isn’t how humans track objects.
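As a concrete illustration of that limiting case, here is a minimal sketch of box-overlap association: compute an IoU cost between the previous frame’s boxes and the current frame’s boxes, then solve the assignment with the Hungarian algorithm. Real explicit trackers add motion models, gating, and track management; this only shows the core linking step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes, min_iou=0.3):
    """Link the previous frame's boxes to the current frame's boxes.

    Builds the N-by-M cost matrix mentioned earlier and returns the list of
    (prev_index, curr_index) pairs with sufficient overlap. Unmatched
    detections would spawn new tracks; unmatched tracks would eventually die.
    """
    cost = np.zeros((len(prev_boxes), len(curr_boxes)))
    for i, p in enumerate(prev_boxes):
        for j, c in enumerate(curr_boxes):
            cost[i, j] = -iou(p, c)  # the Hungarian solver minimizes, so negate IoU
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= min_iou]
```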

In contrast, implicit tracking does not attempt to unground objects from a scene and then associate them temporally. Instead, each agent is abstractly represented by the same vector over time; that vector attends to the entire scene and updates itself frame after frame. There is no longer an explicit association step between poor per-frame representations of objects. The agents’ representation remains grounded in the scene in all its richness.

The scientific literature has accordingly seen a noticeable shift away from explicit tracking toward implicit tracking in recent years. The long-term promise of implicit tracking is to serve as a stepping stone toward end-to-end perception. Poor intermediate representations (e.g., boxes) require a lot of engineering ingenuity to ensure performance downstream. Such poor representations will only become more problematic as the industry moves from deploying basic self-driving capabilities toward solving the long tail of edge cases. Edge cases typically involve situations that resist explicit representation. Ultimately, the neural network’s internals should be responsible for how objects are represented. Investing engineering resources to cleverly fit such ambiguous scene elements into our poor representations hurts us in the long term. Implicit tracking is a step in the right direction.

Implicit tracking

I’ll focus my discussion on vision-based detection and tracking, but keep in mind that a self-driving car can use all sorts of other sensors such as radar and lidar. I’ll focus first on the evolution of the object detection literature, before moving to the tracking literature. Indeed, there first needed to be a shift in the object detection literature in order to enable implicit tracking. While the literature on implicit tracking is vast, there are only a few ground-breaking papers that I’m interested in.

In traditional object detection, the image is peppered with a dense set of pre-defined boxes of various positions, scales, and aspect ratios (called “anchor boxes”), and the neural network learns to “move” and resize those boxes so that they overlap with objects in the image. Since there are so many anchor boxes, each object ends up captured by a pile of overlapping boxes, which then need to be algorithmically suppressed (a somewhat expensive post-processing step known as non-maximum suppression). The important thing to note is that the neural net just spits out a ton of boxes and then filters most of them out. At no point is each object richly represented. At best, the neural net has a ton of mutually exclusive poor representations for each object.

Figure 1: Traditional object detection.
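For reference, here is a minimal sketch of that suppression step (greedy non-maximum suppression) over axis-aligned boxes. Production detectors use heavily optimized implementations; this just shows what the post-processing has to do.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.

    Keep the highest-scoring box, drop every remaining box that overlaps it
    too much, and repeat. Returns the indices of the surviving boxes.
    """
    order = np.argsort(scores)[::-1]                 # boxes sorted by confidence
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        overlap = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][overlap <= iou_thresh]     # discard near-duplicates
    return keep
```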

We are, however, interested in a single rich representation for each object in an image. This is where the 2020 “End-to-End Object Detection with Transformers” paper (also known as the DETR paper) comes in and constitutes a key departure from the traditional object detection approach. At the highest level, DETR uses a Transformer encoder to encode an image via self-attention amongst all the pixels, then uses a Transformer decoder where vectors representing objects attend to the encoded image to update themselves so that they capture the essence of each object in the image.

Figure 2: DETR architecture.

More precisely, the image first goes through a convolutional neural net (CNN), so that the “pixels” going into the Transformer encoder are actually the post-convolution lower-resolution image features. This image feature extractor is often referred to as a “backbone” neural net. A backbone is often a general-purpose input encoder and can be easily swapped for other architectures. CNNs are a popular backbone for image processing.

Since the Transformer expects a sequence as input, the pixels of the image feature map are flattened before being fed to the encoder, thereby losing the spatial information inherent to an image. To retain that important information, we feed the Transformer the sequence of pixels alongside their positional encodings. In the encoder, every pixel attends to every other pixel, as shown in Figure 3. Interestingly, after training, we find that a pixel belonging to an object, while attending to all other pixels in the image, mostly pays attention to other pixels belonging to that same object rather than to unrelated pixels.

Figure 3: We take four pixels and look at those pixels’ attention maps. This highlights that a pixel belonging to an object mostly attends to other pixels belonging to that object rather than unrelated pixels. This illustrates that the DETR encoder has learned to distinguish entities.
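To make the encoder’s input concrete, here is a minimal PyTorch-style sketch of the flattening step. DETR itself adds fixed sinusoidal positional encodings inside its attention layers; a learned embedding per spatial location, as below, is a simplification that illustrates the same idea, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

class FlattenWithPositions(nn.Module):
    """Turn a (B, C, H, W) CNN feature map into a sequence of 'pixel' tokens."""
    def __init__(self, channels=256, height=32, width=32):
        super().__init__()
        # One learned position vector per spatial location (a simplification;
        # DETR uses fixed sinusoidal encodings).
        self.pos = nn.Parameter(torch.randn(height * width, channels))

    def forward(self, feats):                       # feats: (B, C, H, W)
        tokens = feats.flatten(2).permute(0, 2, 1)  # (B, H*W, C): one token per pixel
        return tokens + self.pos                    # re-inject spatial information

# Usage sketch:
#   feats = cnn_backbone(image)               # e.g. (B, 256, 32, 32)
#   tokens = FlattenWithPositions()(feats)    # (B, 1024, 256), fed to the encoder
```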

Most people who deal with Transformers are used to the decoder part being auto-regressive. That is typically the case in natural language processing systems. In DETR, though, the decoder is not auto-regressive. We feed N vectors (known as “object queries”) into the Transformer decoder without any masking. The object queries go through both self-attention amongst themselves and cross-attention with the encoded pixels in a single pass, without the recursion of auto-regressive decoders. Importantly, the object queries are learnable vectors: the loss is backpropagated all the way to them, so their initial values change over the course of training. Finally, the output queries are decoded into class scores and boxes via a simple feedforward network and compared to the ground truth.
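Here is a minimal PyTorch-style sketch of that decoding step, using the built-in Transformer decoder for brevity. Layer counts, dimensions, and the number of queries are illustrative, and DETR’s actual implementation differs in details (auxiliary losses, positional encodings inside attention, a multi-layer box head, and so on).

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """N learned object queries attend to the encoded image in a single,
    non-autoregressive pass; small heads then turn each query into a class
    distribution (including a "no object" class) and a normalized box."""
    def __init__(self, num_queries=100, d_model=256, num_classes=91):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))  # learnable
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # center x, center y, w, h

    def forward(self, encoded_pixels):                   # (B, H*W, d_model) from the encoder
        b = encoded_pixels.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # same initial queries per image
        out = self.decoder(q, encoded_pixels)            # self- and cross-attention, no mask
        return self.class_head(out), self.box_head(out).sigmoid()
```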

The main innovation in the DETR paper is that the bounding boxes of the detected objects are predicted directly from the image. This is in contrast to the traditional method of pruning a dense set of pre-defined, redundant predicted boxes, which requires costly post-processing to remove duplicate detections. In the traditional approach, the predicted boxes in a sense pre-exist the image input, and the network simply learns to move and resize them; the loss is therefore a classic regression loss against those anchors. DETR doesn’t use anchor boxes and eschews complex post-processing because self-attention amongst the queries avoids duplicate predictions: the queries pay attention to each other and quickly “realize” that the loss will drop further if they each zero in on a different object in the image. To compute the network’s loss, the predicted objects are matched to ground-truth objects on the fly via a bipartite matching algorithm that does not care in which order the objects are predicted (which is why this loss is known as a “set loss”).
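A minimal sketch of that order-free matching, assuming numpy arrays of per-query class probabilities and boxes: the cost mixes how unlikely each prediction finds each ground-truth class with how far apart the boxes are. DETR’s actual matching cost also includes a generalized-IoU term, and the final loss is then computed only on the matched pairs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Return (prediction_index, ground_truth_index) pairs.

    pred_probs: (N, num_classes) softmax scores, pred_boxes: (N, 4),
    gt_labels: (M,) integer class ids, gt_boxes: (M, 4).
    """
    class_cost = -pred_probs[:, gt_labels]                                    # (N, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1
    cost = class_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm, order-free
    return list(zip(pred_idx, gt_idx))
```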

The direct (or “end-to-end”) object detection literature has since added a lot of refinement on top of the DETR breakthrough. Many of those refinements are clever tricks rather than major advances. A common practice in subsequent models is to attend to the image features more sparsely: while in DETR the object queries attend to all the pixels, in many descendants of the DETR paper each object query only attends to a few pixels. How those pixels are chosen can be more or less sophisticated and is the subject of many papers in this literature. This general move toward sparse attention has been shown to make training more efficient, but it doesn’t fundamentally change the DETR approach.
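As a rough illustration of the idea (loosely in the spirit of deformable-attention DETR variants, not a faithful reimplementation), each query can predict a handful of sampling locations around a reference point and aggregate only those features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    """Each query samples the feature map at a few predicted points instead of
    attending to every pixel. A loose sketch, not a faithful reimplementation."""
    def __init__(self, d_model=256, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(d_model, num_points * 2)  # where to look
        self.weights = nn.Linear(d_model, num_points)      # how much to trust each point

    def forward(self, queries, feat_map, ref_points):
        # queries: (B, N, C), feat_map: (B, C, H, W), ref_points: (B, N, 2) in [-1, 1]
        b, n, c = queries.shape
        p = self.offsets(queries).view(b, n, -1, 2)              # (B, N, P, 2) offsets
        locs = (ref_points.unsqueeze(2) + 0.1 * p).clamp(-1, 1)  # sampling locations
        sampled = F.grid_sample(feat_map, locs, align_corners=False)  # (B, C, N, P)
        w = self.weights(queries).softmax(-1)                    # (B, N, P) point weights
        return (sampled.permute(0, 2, 3, 1) * w.unsqueeze(-1)).sum(2)  # (B, N, C)
```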

Implicit tracking simply extends the concept of the object query to multiple frames. We initialize a set of newborn queries at the beginning of each frame, and queries then update themselves frame by frame in an auto-regressive fashion. The decoder head predicts one object candidate from each track query in each frame, and the boxes decoded in different frames from the same track query are directly associated: candidates decoded from the same query should represent the same object across frames. This overall approach is illustrated in Figure 4 below, sourced from the TrackFormer paper, one of the first papers to extend query-based detection to tracking.

Figure 4: Illustration of implicit tracking. Queries with a high enough detection score are persisted to the next frame, thereby implicitly tracking an object.

Each frame has a fixed pool of object queries fed to the decoder. After passing through the Transformer decoder, each object query is decoded into a box and a classification score capturing the probability distribution over a set of possible classes (pedestrian, car, bike, etc.). If the highest classification score is high enough, the object query is persisted into the next frame’s pool of queries (one could call them “track queries” at this point). After going through the decoder in the next frame, the object query updates its internal state, reflecting the movement of the object. At no point do we have to produce a linking matrix that explicitly associates detections in one frame with detections in the next. Tracking is done implicitly, by persisting the object query and updating it with the latest image features.
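Here is a deliberately simplified sketch of that loop. The `encoder` and `decoder` interfaces are hypothetical (the decoder here is assumed to return updated queries, class probabilities, and boxes), and track identities are handled much more carefully in real systems such as TrackFormer.

```python
import torch

def run_implicit_tracker(frames, encoder, decoder, newborn_queries, keep_thresh=0.7):
    """Persist confident queries from frame to frame instead of matching boxes.

    frames: iterable of camera images; newborn_queries: (M, d) learned vectors.
    Returns (frame_index, query_slot, box) triples; a surviving query keeps
    representing the same agent, so no explicit association is ever computed.
    """
    track_queries = torch.empty(0, newborn_queries.size(-1))
    detections = []
    for t, frame in enumerate(frames):
        scene = encoder(frame)                                  # rich scene encoding
        queries = torch.cat([track_queries, newborn_queries])   # persisted + newborn
        updated, class_probs, boxes = decoder(queries, scene)   # one decoding pass
        scores = class_probs[..., :-1].max(-1).values           # best non-"no object" score
        keep = scores > keep_thresh
        for q in keep.nonzero().flatten().tolist():
            detections.append((t, q, boxes[q]))
        track_queries = updated[keep]                           # implicit association
    return detections
```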

This, in a nutshell, is the core of modern tracking in autonomous vehicles. There are additional complications because self-driving cars need to perceive agents in 3D, not 2D, but the fundamentals remain the same. Generally speaking, the detector provides a scene encoder, and an implicit tracker simply adds a decoder head whose object and track queries attend to that scene encoding.

Perception’s GPT2 moment

I alluded to a rapid increase in complexity and engineering after the DETR breakthrough, with subsequent papers adding lots of bells and whistles to deal with efficient training, 3D detections, and better depth perception. This is a common development in the history of AI: a breakthrough simplifies the previous era’s algorithms, only for more complexity to be piled on top to squeeze more performance out of the new paradigm. The temptation of cleverness is a dangerous one, and it points to what one could call the paradox of Sutton’s Bitter Lesson: the bitter lesson (that simplicity coupled with scaling data and compute ultimately beats clever engineering) only applies in the long run, while in the short run one needs cleverness to top the leaderboards. So, at any given moment, cleverness always seems to win, even though it’s usually a bad idea.

The main question when designing an ML system nowadays has become: how clever do we want to be? In the long run, simple architectures such as GPT, BERT, and ViT beat more complex methods, but in the short run complex architectures win. The risk in “going clever” is getting attached to our engineering ingenuity and ultimately spending resources tweaking it rather than getting rid of it. Amazon Alexa is an extremely complex piece of engineering. While successful for a while, it now cannot compete with ChatGPT, which is by contrast a radically simpler system. Alexa’s leadership was not able to see the sea change happening when GPT2 came out and could not abandon all the clever engineering that had gone into the system. The release of GPT2 was the moment they could have seen what was coming, steered the ship in the right direction, and, through bold and difficult decision-making, avoided irrelevance. But they couldn’t do it. This is the risk on the path of cleverness. Only tread it if you can dispassionately abandon your carefully constructed castles and ruthlessly rebuild from scratch.

Alternatively, there is the path of simplicity. You may not achieve state-of-the-art results at any one point in time, but you will be nimble and adaptable in the long run. The path of simplicity bets on simple architectures and scale. The fundamental architecture behind ChatGPT has barely changed since the first GPT paper in 2018. That’s six years, an eternity in machine learning. The steady increase in capabilities displayed by the GPT series has mostly come from riding the wave of improvements in compute. Note that when GPT3 came out, it wasn’t state-of-the-art on most of the benchmarks it was evaluated on; specialized solutions still beat the general solution most of the time. But that is beside the point, as GPT4 went on to blow specialized solutions out of the water on many of the benchmarks where GPT3 still lagged slightly behind.

Just as the GPT paradigm is an extremely simple solution in the realm of natural language processing, DETR and TrackFormer are examples of simplicity in the realm of perception. DETR is essentially the original Transformer plus the elegant and simple innovations of learnable object queries and set prediction. TrackFormer simply extends DETR with auto-regressive object queries, which it calls track queries, along with straightforward data augmentation techniques during training. Neither of these techniques tops the public detection and tracking leaderboards (e.g., nuScenes). Neither is specifically tuned to a 3D environment. And DETR is computationally intensive due to the global attention of its queries.

Table 1: Cleverness vs simplicity.

So what could a GPT2 moment for perception look like?

First, I expect self-supervised pre-training to play a major role. But self-supervision in perception will likely look different than self-supervision in language. In this post, I argue that successful self-supervision in perception will likely operate in embedding space, unlike self-supervision in text, which operates in token space. The joint-embedding predictive architectures (JEPAs) put forward by the FAIR group at Meta are elegant and simple, and as such constitute good candidates for a general pre-training strategy.

Second, the pre-training procedure in perception needs to be multimodal, unlike language, which is unimodal. This is because autonomous vehicles typically leverage more sensors than just camera images, and as such we want pre-training procedures that work across modalities (camera, radar, lidar, etc.). JEPAs also fit the bill here, since they operate in embedding space.

Third, I expect the Transformer architecture to be the right perception backbone, due to its being widely perceived as approximating a universal neural computer. In other words, I expect the Vision Transformer (ViT) to be the right backbone.

Fourth, I expect sparsity of attention to play a role. One of the innovations of GPT3 over GPT2 was alternating global attention layers with sparse attention layers, which significantly reduced computational cost. One clear trend in the perception literature is the variety of attempts at sparse attention, whereby the object or track queries only attend to a few pixels instead of all pixels in the image feature map.

Fifth, pre-training should be done on videos, not still images, as video pre-training has been shown to learn an implicit motion model.

Sixth, pre-training should involve nudging the backbone to develop a 3D world model, at least in the short run. This is only necessary as long as tracking still wants to produce explicit boxes to represent an agent, so that downstream modules like prediction and planning can consume them. Ultimately, though, when the stack truly becomes end-to-end, from sensor inputs all the way to controls, explicit boxes won’t be necessary, and therefore nudging a 3D world model into the image backbone won’t be necessary.
