
Waymo Open Sim Agent Challenge (WOSAC) benchmark

We provide a re-implementation of the Waymo Open Sim Agent Challenge (WOSAC), which measures the distributional realism of simulated trajectories relative to logged human trajectories. Our version preserves the original logic and metric weighting but computes the metrics with PyTorch on GPU, unlike the original TensorFlow CPU implementation. The exact speedup depends on the setup and hardware, but in practice it is substantial (around 30–100×): evaluating 100 scenarios (32 rollouts plus metrics computation) currently completes in under a minute.
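
To give a sense of the kind of computation involved (this is an illustrative sketch, not the actual PufferDrive code), the snippet below scores a logged per-agent statistic against a histogram built from the simulated rollouts, entirely in PyTorch; the choice of statistic, bin range, and smoothing constant are placeholder assumptions.

import torch

def histogram_log_likelihood(sim_stat, log_stat, num_bins=32, lo=0.0, hi=30.0):
    # sim_stat: (num_agents, num_rollouts) simulated values, e.g. speed per agent.
    # log_stat: (num_agents,) logged ground-truth values for the same agents.
    edges = torch.linspace(lo, hi, num_bins + 1, device=sim_stat.device)
    # Build a smoothed, normalized histogram per agent from the rollouts.
    bins = torch.bucketize(sim_stat.clamp(lo, hi), edges[1:-1])        # (A, R)
    counts = torch.zeros(sim_stat.shape[0], num_bins, device=sim_stat.device)
    counts.scatter_add_(1, bins, torch.ones_like(sim_stat))
    probs = (counts + 1e-4) / (counts + 1e-4).sum(dim=1, keepdim=True)
    # Log-probability of the logged value under the simulated distribution.
    log_bins = torch.bucketize(log_stat.clamp(lo, hi), edges[1:-1])    # (A,)
    return probs.gather(1, log_bins.unsqueeze(1)).log().squeeze(1)     # (A,)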

Besides speed benefits, the code is also simplified to make it easier to understand and extend.

Note: In PufferDrive, agents are conditioned on a “goal” represented as a single (x, y) position, reflecting that drivers typically have a high-level destination in mind. Evaluating whether an agent matches human distributional properties can be decomposed into: (1) inferring a person’s intended direction from context (1 second in WOSAC) and (2) navigating toward that goal in a human-like manner. We focus on the second component, though the evaluation could be adapted to include behavior prediction as in the original WOSAC.

WOSAC implementation in PufferDrive

Figure: Illustration of the WOSAC implementation in PufferDrive (RHS) vs. the original challenge (LHS).

Usage

Running a single evaluation from a checkpoint

The [eval] section in drive.ini contains all relevant configurations. To run the WOSAC eval once:

puffer eval puffer_drive --eval.wosac-realism-eval True --load-model-path <your-trained-policy>.pt

The default configs aim to emulate the WOSAC settings as closely as possible, but you can adjust them:

[eval]
map_dir = "resources/drive/binaries/validation" # Dataset to use
num_maps = 100  # Number of maps to evaluate on (always the first num_maps maps in map_dir)
wosac_num_rollouts = 32      # Number of policy rollouts per scene
wosac_init_steps = 10        # When to start the simulation
wosac_control_mode = "control_wosac"  # Control the tracks to predict
wosac_init_mode = "create_all_valid"  # Initialize from the tracks to predict
wosac_goal_behavior = 2      # Stop when reaching the goal
wosac_goal_radius = 2.0      # Can shrink goal radius for WOSAC evaluation
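
These options can also be overridden on the command line. Assuming the same --eval.<option> flag convention as the command above (underscores in config keys become dashes), a single-run override might look like:

puffer eval puffer_drive --eval.wosac-realism-eval True --eval.num-maps 100 --eval.wosac-num-rollouts 32 --load-model-path <your-trained-policy>.pt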

Log evals to W&B during training

During experimentation, logging key metrics directly to W&B avoids a post-training step. Evaluations can be enabled during training, with results logged under a separate eval/ section. The main configuration options:

[train]
checkpoint_interval = 500    # Set equal to eval_interval to use the latest checkpoint

[eval]
eval_interval = 500          # Run eval every N epochs
map_dir = "resources/drive/binaries/training"  # Dataset to use
num_maps = 20 # Number of maps to evaluate on (always the first num_maps maps in map_dir)
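
Assuming the same puffer CLI used for evaluation also drives training (the train subcommand and flag names here are an assumption; check the CLI help and drive.ini for the exact keys), a run with periodic WOSAC logging might look like:

puffer train puffer_drive --train.checkpoint-interval 500 --eval.eval-interval 500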

Baselines

We provide baselines on a small curated dataset from the WOMD validation set with perfect ground-truth (no collisions or off-road events from labeling mistakes).

| Method               | Realism meta-score | Kinematic metrics | Interactive metrics | Map-based metrics | minADE | ADE    |
|----------------------|--------------------|-------------------|---------------------|-------------------|--------|--------|
| Ground-truth (UB)    | 0.832              | 0.606             | 0.846               | 0.961             | 0      | 0      |
| π_Base self-play RL  | 0.737              | 0.319             | 0.789               | 0.938             | 10.834 | 11.317 |
| SMART-tiny-CLSFT     | 0.805              | 0.534             | 0.830               | 0.949             | 1.124  | 3.123  |
| π_Random             | 0.485              | 0.214             | 0.657               | 0.408             | 6.477  | 18.286 |

Table: WOSAC baselines in PufferDrive on 229 selected clean held-out validation scenarios.

✏️ Download the dataset from Hugging Face to reproduce these results or benchmark your policy.
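
If you use the huggingface_hub client, a snapshot download along these lines should work (the repository id and target directory are placeholders; use the dataset repository linked above and point map_dir in drive.ini at wherever you download it):

from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the dataset repository linked above.
snapshot_download(repo_id="<org>/<wosac-eval-dataset>", repo_type="dataset",
                  local_dir="resources/drive/binaries/validation")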

| Method               | Realism meta-score | Kinematic metrics | Interactive metrics | Map-based metrics | minADE | ADE    |
|----------------------|--------------------|-------------------|---------------------|-------------------|--------|--------|
| Ground-truth (UB)    | 0.833              | 0.574             | 0.864               | 0.958             | 0      | 0      |
| π_Base self-play RL  | 0.737              | 0.323              | 0.792              | 0.930             | 8.530  | 9.088  |
| SMART-tiny-CLSFT     | 0.795              | 0.504             | 0.832               | 0.932             | 1.182  | 2.857  |
| π_Random             | 0.497              | 0.238             | 0.656               | 0.430             | 6.395  | 18.617 |

Table: WOSAC baselines in PufferDrive on the 10k validation dataset.

✏️ Download the dataset from Hugging Face to reproduce these results or benchmark your policy.

Evaluating trajectories

In this section, we describe how we evaluated SMART-tiny-CLSFT in PufferDrive and how you can use this to evaluate your own agent trajectories.

High-level idea

The WOSAC evaluation pipeline takes as input simulated trajectories (sim_trajectories) and ground-truth trajectories, computes summary statistics, and outputs scores based on these statistics (entry point to code here). If you already have simulated trajectories saved as a .pkl file—generated from the same dataset—you can directly use them to compute WOSAC scores.

Command

python pufferlib/ocean/benchmark/evaluate_imported_trajectories.py --simulated-file my_rollouts.pkl

Instructions

  • Rollouts must be generated using the same dataset specified in the config file under [eval] map_dir. The corresponding scenario IDs can be found in the .json files (the scenario_id field).
  • If you have a predefined list of scenario_ids, you can pass them to your dataloader to run inference only on those scenarios.
  • Save the inference outputs in a dictionary with the following fields (a minimal packing sketch follows this list):
x        : (num_agents, num_rollouts, 81)
y        : (num_agents, num_rollouts, 81)
z        : (num_agents, num_rollouts, 81)
heading  : (num_agents, num_rollouts, 81)
id       : (num_agents, num_rollouts, 81)
  • Recompile the code with MAX_AGENTS=256 set in drive.h.
  • Finally, run: python pufferlib/ocean/benchmark/evaluate_imported_trajectories.py --simulated-file my_rollouts.pkl
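
As a reference for the expected file, here is a minimal sketch of packing inference outputs into that dictionary and pickling it (the flat layout and dtypes are assumptions; replace the placeholder arrays with your model's rollouts for the chosen scenarios):

import pickle
import numpy as np

num_agents, num_rollouts, num_steps = 256, 32, 81   # 81 timesteps per trajectory, as above

# Placeholder arrays; fill these with your inference outputs.
rollouts = {
    "x":       np.zeros((num_agents, num_rollouts, num_steps), dtype=np.float32),
    "y":       np.zeros((num_agents, num_rollouts, num_steps), dtype=np.float32),
    "z":       np.zeros((num_agents, num_rollouts, num_steps), dtype=np.float32),
    "heading": np.zeros((num_agents, num_rollouts, num_steps), dtype=np.float32),
    "id":      np.zeros((num_agents, num_rollouts, num_steps), dtype=np.int64),
}

with open("my_rollouts.pkl", "wb") as f:
    pickle.dump(rollouts, f)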