Teaching AVs the Language of Human Driving Behavior with Trajeglish

Much of the communication between drivers goes beyond turn signals and brake lights. Motioning another car to proceed, looking over to see if another driver is paying attention—even the friendly Jeep wave—all rely on human-based communication rather than vehicle technology.

Because autonomous vehicles (AVs) will coexist with human drivers for the foreseeable future, they must be able to interpret this behavior to make safe decisions that don’t interrupt the flow of traffic.

To address this challenge in training, developers must be able to predict how the future motion of other vehicles is affected by an AV’s actions. In a recently published paper, the NVIDIA Research team outlines Trajeglish, an approach to traffic modeling that tokenizes motion in the same way language models tokenize words and phrases to simulate realistic multi-vehicle driving scenarios.

When compared with 16 other traffic models in the first iteration (V0) of the Waymo Sim Agents Challenge, this tokenization approach resulted in the most realistic traffic trajectories, showing a 3.3% improvement over the previous state-of-the-art model.

Trajeglish models multi-agent traffic scenarios by breaking each scenario down into tokens, in the same way a language model breaks down a paragraph into words and phrases. By doing so, it can consider each agent and trajectory in relation to each other, predicting motions that cover the full range of possible interactions given their initial locations.

Eight different traffic scenarios showing various predicted trajectories based on the initial positions of the vehicles. — *Figure 1. Scenarios modeled by Trajeglish given only the initial timestep of the driving log. The initial state used to prompt the model is shown in black.*

When given only the initial timestep of a real-world scenario, Trajeglish closely models the log data to realistically simulate how other vehicles react to the actions of the AV.

Modeling human behavior

Simulating human driving behavior is relatively straightforward in single-lane highway scenarios, where there are few intersections, objects, or pedestrians.

Modeling multiple vehicles in urban settings, however, is significantly more difficult given the increase in traffic and road variety. To build traffic models that generalize to a wider range of scenarios, recent approaches aim to imitate driving behavior observed in driving logs.

Doing so in simulation requires sampling realistic actions for an agent at each timestep that are compatible with actions chosen by all other agents at the current timestep—a relationship known as intra-timestep dependence.

While actors in the real world behave independently, traffic models need intra-timestep dependence because driving logs are recorded at discrete timesteps, so any interaction that happens between timesteps appears as coordinated behavior. Communication that is not generally recorded in log data, such as eye contact or turn signals, also contributes to the appearance of coordination among actors in a recorded scenario.

Trajeglish explicitly models this intra-timestep dependence. It achieves this by tokenizing a given scenario, in the same manner as a language model, which enables the model to predict only the likely trajectories, or tokens, based on the context of the scene. Trajeglish then models the next actions in a timestep as a distribution over tokens learned from the tokenized scenarios.

Three images showing three different timesteps, where in each one, a future position is chosen based on its proximity to the current location. — *Figure 2. Trajeglish tokenizes trajectories by iteratively finding the token with minimum corner distance to the next state.*
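The tokenization step described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the state layout `(x, y, heading)`, the fixed 4 m by 2 m box, the five-entry `VOCABULARY`, and all function names are hypothetical choices made for this example.

```python
import math

# Hypothetical toy vocabulary of (dx, dy, dheading) deltas in the agent's own frame.
VOCABULARY = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
              (1.0, 0.0, 0.1), (1.0, 0.0, -0.1)]

def corners(x, y, heading, length=4.0, width=2.0):
    """Return the four corners of a box centered at (x, y) with the given heading."""
    c, s = math.cos(heading), math.sin(heading)
    half = [(length / 2, width / 2), (length / 2, -width / 2),
            (-length / 2, -width / 2), (-length / 2, width / 2)]
    return [(x + c * lx - s * ly, y + s * lx + c * ly) for lx, ly in half]

def apply_token(state, token):
    """Apply a delta expressed in the agent's frame to a world-frame state."""
    x, y, h = state
    dx, dy, dh = token
    c, s = math.cos(h), math.sin(h)
    return (x + c * dx - s * dy, y + s * dx + c * dy, h + dh)

def corner_distance(a, b):
    """Mean Euclidean distance between matching corners of two boxes."""
    return sum(math.dist(p, q) for p, q in zip(corners(*a), corners(*b))) / 4

def tokenize(trajectory, vocabulary=VOCABULARY):
    """Greedily encode a list of (x, y, heading) states as token indices."""
    state = trajectory[0]
    tokens = []
    for target in trajectory[1:]:
        best = min(range(len(vocabulary)),
                   key=lambda i: corner_distance(apply_token(state, vocabulary[i]), target))
        tokens.append(best)
        # Continue from the reconstructed state so quantization errors do not pile up.
        state = apply_token(state, vocabulary[best])
    return tokens

print(tokenize([(0, 0, 0), (1, 0, 0), (3, 0, 0), (3, 0, 0)]))  # [1, 2, 0]
```

Continuing from the reconstructed state rather than the logged one is what makes the procedure iterative: each token is chosen to correct whatever error the previous token left behind.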

This process of predicting the next token continuously builds on itself. After a chosen number of tokens are sampled, Trajeglish has enough context to predict scenarios of various lengths with an arbitrary number of agents.
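As a rough illustration of how next-token sampling can respect intra-timestep dependence, the sketch below samples agents one at a time within each timestep, so that each agent's choice can depend on what the others have already picked. Everything here is an assumption for illustration: `toy_next_token_probs` is a hypothetical stand-in for a learned model, and its rule (never reuse a token already taken in the current timestep) is only a crude proxy for real interaction.

```python
import random

def toy_next_token_probs(history, chosen_this_step, agent, vocab_size):
    """Hypothetical stand-in for a learned model: one probability per token.

    Tokens already picked by other agents at this timestep get zero probability.
    This toy rule requires vocab_size >= number of agents.
    """
    weights = [1.0] * vocab_size
    for token in chosen_this_step.values():
        weights[token] = 0.0
    total = sum(weights)
    return [w / total for w in weights]

def rollout(num_agents, num_steps, vocab_size, probs_fn, seed=0):
    """Sample a scenario token by token, agent by agent within each timestep."""
    rng = random.Random(seed)
    history = []  # one dict {agent: token} per completed timestep
    for _ in range(num_steps):
        chosen = {}
        for agent in range(num_agents):
            # The distribution sees all past timesteps and the current partial timestep.
            probs = probs_fn(history, chosen, agent, vocab_size)
            chosen[agent] = rng.choices(range(vocab_size), weights=probs)[0]
        history.append(chosen)
    return history

scenario = rollout(num_agents=3, num_steps=5, vocab_size=4,
                   probs_fn=toy_next_token_probs)
print(scenario)
```

Because the loop only ever appends to `history`, the same function can be run for any number of steps or agents, which mirrors the idea that the process builds on its own previous outputs.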

A leading approach

Trajeglish was compared with 16 other models in the V0 leaderboard of the Waymo Sim Agents Challenge. Each model was tasked with simulating 32 scene-consistent trajectories for up to 128 agents at a time, given 1 second of initial driving information.

The challenge evaluated the realism of each simulation based on distribution matching. Several statistics were computed over the simulated scenarios and compared to the same statistics computed on the recorded scenarios. The closer these statistics matched each other, the higher the score.
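One simple way to picture distribution matching is to build a histogram of a statistic (for example, speed) from the simulated rollouts and ask how much probability it assigns to the value seen in the recorded log. The sketch below follows that idea. It is not the challenge's official scoring code, and the function name, bin range, and smoothing constant are assumptions chosen for illustration.

```python
def histogram_likelihood(simulated, logged, lo, hi, num_bins=10, smoothing=1.0):
    """Probability that a histogram of simulated values assigns to the logged value."""
    width = (hi - lo) / num_bins

    def bin_index(value):
        # Clamp out-of-range values into the first or last bin.
        return min(num_bins - 1, max(0, int((value - lo) / width)))

    # Additive smoothing keeps empty bins from producing a zero likelihood.
    counts = [smoothing] * num_bins
    for value in simulated:
        counts[bin_index(value)] += 1
    return counts[bin_index(logged)] / sum(counts)

simulated_speeds = [9.5, 10.2, 10.1, 9.8]  # made-up values in m/s
close = histogram_likelihood(simulated_speeds, logged=10.0, lo=0.0, hi=20.0)
far = histogram_likelihood(simulated_speeds, logged=19.0, lo=0.0, hi=20.0)
print(close > far)  # True
```

A logged value that falls where the simulated values cluster receives a higher likelihood, which corresponds to a higher realism score for that statistic.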

The only model to use tokenization, Trajeglish produced the most realistic outcomes, according to Waymo’s parameters. Qualitatively, Trajeglish dramatically improved performance in scenarios with dense interaction between agents, such as traffic jams, merging scenarios, and four-way stop intersections.

The Waymo leaderboard evaluated three categories in each simulation: kinematics (such as speed), interactions (such as distance to the nearest vehicle), and whether the trajectory remained in the drivable area. Overall realism was a weighted average across these categories.
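The weighted average itself is simple arithmetic. The sketch below shows the calculation with placeholder numbers: the category names, scores, and weights are invented for illustration and are not the values used by the challenge.

```python
def overall_realism(scores, weights):
    """Weighted average of per-category realism scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Placeholder values, not the challenge's actual scores or weights.
scores = {"kinematics": 0.4, "interactions": 0.6, "drivable_area": 0.8}
weights = {"kinematics": 1.0, "interactions": 2.0, "drivable_area": 1.0}
print(overall_realism(scores, weights))
```

With these placeholder weights, the interaction category counts twice as much as the other two, so an improvement there moves the overall score more than an equal improvement elsewhere.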

According to these metrics, Trajeglish improved over the previous state-of-the-art model by 3.3% in overall realism and by 9.9% in the interaction component.

Bar charts comparing overall realism and interaction realism, with Trajeglish displaying the highest results of the top 10 models. — *Figure 3. Trajeglish results compared with other entrants in the Waymo Sim Agents Challenge. Submissions using ensembling are marked with asterisks.*

Conclusion

Human driving behavior can be incredibly nuanced, posing a significant challenge to recreating it in simulation. However, by taking a page from language modeling, which deals with similar complexities in human language, the task becomes more manageable.

As a result, AV developers can use higher-fidelity traffic models in simulation to accelerate training, testing, and validation.

To learn more, read the full paper and visit the Trajeglish: Learning the Language of Driving Scenarios project page.