
Waymo Open Sim Agents Challenge (WOSAC) Benchmark

We provide a re-implementation of the Waymo Open Sim Agents Challenge (WOSAC), which measures the distributional realism of simulated trajectories against logged human trajectories. Our version preserves the original logic and metric weighting but computes the metrics with PyTorch on the GPU, unlike the original CPU-bound TensorFlow implementation. The code is also simplified for clarity, making it easier to understand, adapt, and extend.
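The per-feature scores are, roughly, likelihoods of the logged feature values under distributions estimated from the simulated rollouts, and the meta-score is a weighted average of these per-feature scores. The sketch below shows the core histogram-scoring idea in PyTorch; the function name, bin count, and smoothing constant are illustrative, not the benchmark's exact parameters:

import torch

def feature_log_likelihood(sim_values: torch.Tensor,
                           log_value: torch.Tensor,
                           val_min: float,
                           val_max: float,
                           num_bins: int = 128) -> torch.Tensor:
    # Histogram of the simulated feature samples (torch.histc also runs on CUDA)
    counts = torch.histc(sim_values.clamp(val_min, val_max),
                         bins=num_bins, min=val_min, max=val_max)
    # Additive smoothing so the logged value never gets zero probability
    probs = (counts + 1e-6) / (counts + 1e-6).sum()
    # Locate the bin that holds the logged (ground-truth) feature value
    bin_width = (val_max - val_min) / num_bins
    idx = ((log_value.clamp(val_min, val_max) - val_min) / bin_width).long()
    idx = idx.clamp(max=num_bins - 1)
    return torch.log(probs[idx])

# Example: score the logged linear speed against samples from 32 rollouts
# ll = feature_log_likelihood(sim_speeds, log_speed, val_min=0.0, val_max=30.0)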

Note: In PufferDrive, agents are conditioned on a "goal" represented as a single (x, y) position, reflecting that drivers typically have a high-level destination in mind. Evaluating whether an agent matches human distributional properties can be decomposed into: (1) inferring a person's intended destination from context (1 second of observed history in WOSAC) and (2) navigating toward that goal in a human-like manner. We focus on the second component, though the evaluation could be adapted to include behavior prediction as in the original WOSAC.

[TODO: ADD bar graphs]

Usage

Running a single evaluation from a checkpoint

The [eval] section in drive.ini contains all relevant configurations. To run the WOSAC eval once:

puffer eval puffer_drive --eval.wosac-realism-eval True --load-model-path <your-trained-policy>.pt

The default configs aim to emulate the WOSAC settings as closely as possible, but you can adjust them:

[eval]
map_dir = "resources/drive/binaries/validation" # Dataset to use
num_maps = 100  # Number of maps to run evaluation on (always the first num_maps maps of map_dir)
wosac_num_rollouts = 32      # Number of policy rollouts per scene
wosac_init_steps = 10        # Sim step at which policy control begins (10 steps = 1 s of logged history)
wosac_control_mode = "control_wosac"  # Control the tracks to predict
wosac_init_mode = "create_all_valid"  # Initialize from the tracks to predict
wosac_goal_behavior = 2      # Stop when reaching the goal
wosac_goal_radius = 2.0      # Goal radius in meters; can be shrunk for WOSAC evaluation
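These options can also be overridden per run without editing drive.ini. Assuming the same --eval.<option> flag convention as the command above (config underscores mapping to hyphens), a combined invocation looks like:

puffer eval puffer_drive \
    --eval.wosac-realism-eval True \
    --eval.num-maps 100 \
    --eval.wosac-num-rollouts 32 \
    --load-model-path <your-trained-policy>.pt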

Logging evals to W&B during training

During experimentation, logging key metrics directly to W&B avoids a separate post-training evaluation step. Evaluations can be enabled during training, with results logged under a separate eval/ section. The main configuration options are:

[train]
checkpoint_interval = 500    # Set equal to eval_interval to use the latest checkpoint

[eval]
eval_interval = 500          # Run eval every N epochs
map_dir = "resources/drive/binaries/training"  # Dataset to use
num_maps = 20                # Number of maps to run evaluation on (always the first num_maps maps of map_dir)
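With these intervals set, launching training as usual picks up the eval schedule. A sketch assuming the standard puffer train entry point and the same flag convention as the eval command above:

puffer train puffer_drive --train.checkpoint-interval 500 --eval.eval-interval 500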

Baselines

We provide baselines on a small curated dataset from the WOMD validation set with clean ground truth (no collisions or off-road events caused by labeling mistakes).

Method               Realism meta-score  Kinematic metrics  Interactive metrics  Map-based metrics  minADE  ADE
Ground-truth (UB)    0.832               0.606              0.846                0.961              0       0
π_Base self-play RL  0.737               0.319              0.789                0.938              10.834  11.317
SMART-tiny-CLSFT     0.805               0.534              0.830                0.949              1.124   3.123
π_Random             0.485               0.214              0.657                0.408              6.477   18.286

Table: WOSAC baselines in PufferDrive on a curated set of 229 clean held-out validation scenarios.
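The minADE and ADE columns follow the standard displacement-error definitions: ADE averages the per-step distance between simulated and logged positions across all rollouts, while minADE keeps only the closest rollout per agent before averaging. A minimal PyTorch sketch under those definitions (tensor shapes and the valid-mask layout are illustrative, not PufferDrive's internals):

import torch

def ade_and_min_ade(sim_xy: torch.Tensor, log_xy: torch.Tensor, valid: torch.Tensor):
    """Average displacement error over rollouts, and the best-rollout variant.

    sim_xy: (R, A, T, 2) simulated positions for R rollouts, A agents, T steps
    log_xy: (A, T, 2) logged positions
    valid:  (A, T) mask of valid logged timesteps (float, 1.0 where valid)
    """
    dist = torch.linalg.norm(sim_xy - log_xy.unsqueeze(0), dim=-1)   # (R, A, T)
    per_agent = (dist * valid).sum(-1) / valid.sum(-1).clamp(min=1)  # (R, A)
    ade = per_agent.mean()                        # average over rollouts and agents
    min_ade = per_agent.min(dim=0).values.mean()  # best rollout per agent, then average
    return ade, min_ade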


✏️ Download the dataset from Hugging Face to reproduce these results or benchmark your policy.