Simulator Guide¶
Deep dive into how the Drive environment is wired, what it expects as inputs, and how observations/actions/configs are shaped. The environment entrypoint is pufferlib/ocean/drive/drive.py, which wraps the C core in pufferlib/ocean/drive/drive.h via binding.c.
Runtime inputs and lifecycle¶
- Map binaries: The environment scans resources/drive/binaries for map_*.bin files and requires at least one to load. Keep num_maps no larger than what is present on disk. During vectorized setup, binding.shared samples maps until it accumulates at least num_agents controllable entities, skipping maps with no valid agents (set_active_agents in drive.h).
- Episode length: Default scenario_length = 91 to match the Waymo logs (trajectory data is 91 steps), but you can set env.scenario_length (CLI or .ini) to any positive value. Metrics are logged and c_reset is called when timestep == scenario_length.
- Resampling maps: Python-side Drive.step reinitializes the vectorized environments every resample_frequency steps (default 910, ~10 episodes) with fresh map IDs and seeds.
- Initialization controls: init_steps starts agents from a later timestep in the logged trajectory. init_mode (create_all_valid vs create_only_controlled) decides which logged actors are instantiated at reset. control_mode (control_vehicles, control_agents, control_tracks_to_predict, control_sdc_only) selects which instantiated actors are policy-controlled. Non-controlled actors can still appear as static or expert replay agents. goal_behavior chooses what happens on goal reach (0 respawn at start pose, 1 sample new lane-following goals via the lane topology graph, 2 stop in place). goal_radius sets the completion threshold in meters.
See Data for how to produce the .bin inputs, including the binary layout.
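As a quick pre-flight check, you can verify that enough map binaries are on disk for your chosen num_maps before launching training. This is a minimal stand-alone sketch (the num_maps value is illustrative; the wrapper performs its own validation in drive.py):

```python
from pathlib import Path

num_maps = 128  # illustrative value; must not exceed the binaries available on disk

map_files = sorted(Path("resources/drive/binaries").glob("map_*.bin"))
assert map_files, "No map_*.bin files found; see the Data page for generating them."
assert len(map_files) >= num_maps, (
    f"num_maps={num_maps}, but only {len(map_files)} map binaries are available."
)
print(f"Found {len(map_files)} map binaries, e.g. {map_files[0].name}")
```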
Actions and dynamics¶
- Action types (env.action_type; a decoding sketch follows this list):
  - discrete (default): classic dynamics use a single MultiDiscrete([7*13]) index decoded into acceleration (ACCELERATION_VALUES) and steering (STEERING_VALUES); jerk dynamics use MultiDiscrete([4, 3]) over JERK_LONG/JERK_LAT.
  - continuous: a 2-D Box in [-1, 1]. Classic scales to the max accel/steer magnitudes used in the discrete table. Jerk scales asymmetrically: negative values reach up to -15 m/s^3 braking, positives up to 4 m/s^3 acceleration, lateral jerk up to ±4 m/s^3.
- Dynamics models (env.dynamics_model):
  - classic: bicycle model integrating accel/steer with dt (default 0.1).
  - jerk: integrates longitudinal/lateral jerk into accel, then into velocity/pose with steering limited to ±0.55 rad. Speeds are clipped to [0, 20] m/s.
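To make the discrete classic action space concrete, the sketch below decodes a flat MultiDiscrete([7*13]) index into an (acceleration, steering) pair. The 7×13 factorization comes from the description above, but the value tables and the acceleration-major ordering are assumptions for illustration only; the real tables are ACCELERATION_VALUES and STEERING_VALUES in drive.h.

```python
import numpy as np

# Hypothetical value tables for illustration only; the real 7-entry and 13-entry
# tables live in drive.h as ACCELERATION_VALUES and STEERING_VALUES.
ACCELERATION_VALUES = np.linspace(-4.0, 4.0, 7)
STEERING_VALUES = np.linspace(-0.55, 0.55, 13)

def decode_classic_action(index: int) -> tuple[float, float]:
    """Split a flat MultiDiscrete([7 * 13]) index into (accel, steer).

    Assumes acceleration-major ordering; the C core may order the table differently.
    """
    accel_idx, steer_idx = divmod(index, len(STEERING_VALUES))
    return ACCELERATION_VALUES[accel_idx], STEERING_VALUES[steer_idx]

print(decode_classic_action(45))  # index 45 -> middle of both hypothetical tables: (0.0, 0.0)
```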
Observation space¶
Shape is ego_features + 63 * 7 + 200 * 7 = 1848 for classic dynamics (ego_features = 7) or 1851 for jerk dynamics (ego_features = 10). Computed in compute_observations (drive.h):
- Ego block (classic):
  - Goal position in ego frame (x, y) scaled by 0.005 (~200 m range to 1.0)
  - Ego speed / MAX_SPEED (100 m/s)
  - Width / MAX_VEH_WIDTH (15 m)
  - Length / MAX_VEH_LEN (30 m)
  - Collision flag (1 if the agent collided this step)
  - Respawn flag (1 if the agent was respawned this episode)
- Ego block additions (jerk dynamics model only):
  - Steering angle / π
  - Longitudinal acceleration normalized to [-15, 4]
  - Lateral acceleration normalized to [-4, 4]
  - Respawn flag (index 9)
- Partner blocks: Up to 63 other agents (active first, then static experts) within 50 m. Each uses 7 values: relative (x, y) in ego frame scaled by 0.02, width/length normalized as above, relative heading encoded as (cos Δθ, sin Δθ), and speed / MAX_SPEED. Zero-padded when fewer neighbors are present or when agents are in respawn.
- Road blocks: Up to 200 nearby road segments pulled from a precomputed grid (vision_range = 21). Each entry stores relative midpoint (x, y) scaled by 0.02, segment length / MAX_ROAD_SEGMENT_LENGTH (100 m), width / MAX_ROAD_SCALE (100), (cos, sin) of the segment direction in ego frame, and a type ID (ROAD_LANE..DRIVEWAY stored as 0..6). Remaining slots are zero-padded.
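The layout above can be recovered from a flat observation vector with a small helper. This is a sketch that assumes the ego, partner, and road blocks are laid out in that order (consistent with the description of compute_observations, but not a verbatim copy of it):

```python
import numpy as np

def split_observation(obs: np.ndarray, dynamics_model: str = "classic"):
    """Split a flat Drive observation into ego / partner / road blocks.

    Assumes the blocks appear in that order, as described above.
    """
    ego_features = 7 if dynamics_model == "classic" else 10
    ego = obs[:ego_features]
    partners = obs[ego_features:ego_features + 63 * 7].reshape(63, 7)
    roads = obs[ego_features + 63 * 7:].reshape(200, 7)
    return ego, partners, roads

obs = np.zeros(7 + 63 * 7 + 200 * 7, dtype=np.float32)  # classic: 1848 values
ego, partners, roads = split_observation(obs)
print(ego.shape, partners.shape, roads.shape)  # (7,) (63, 7) (200, 7)
```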
Rewards, termination, and metrics¶
- Per-step rewards (c_step; the terms below are summed each step, see the sketch after this list):
  - Collision with another actor: reward_vehicle_collision (default -0.5)
  - Off-road (road-edge intersection): reward_offroad_collision (default -0.2)
  - Goal reached: reward_goal (default 1.0) or reward_goal_post_respawn after a respawn
  - Optional ADE shaping: reward_ade * avg_displacement_error, where ADE is accumulated in compute_agent_metrics
- Termination: No early truncation; episodes roll to scenario_length steps. If goal_behavior is respawn, respawn_agent resets the pose and marks respawn_timestep so the respawn flag shows up in observations.
- Logged metrics (add_log aggregates over all active agents across envs):
  - score: reached goal without collision/off-road
  - collision_rate / offroad_rate: fraction of agents with ≥1 event in the episode
  - avg_collisions_per_agent / avg_offroad_per_agent: counts per agent, capturing repeated events
  - completion_rate: reached goal (even if collided/off-road); dnf_rate: clean but never reached goal
  - lane_alignment_rate, avg_displacement_error, num_goals_reached, plus counts of active/static/expert agents
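Putting the per-step terms together, the reward delivered by c_step is roughly the sum sketched below. The coefficients are the defaults listed above; the function is a simplification rather than the exact C code, and the reward_goal_post_respawn default shown is only a placeholder.

```python
def step_reward(collided: bool, offroad: bool, reached_goal: bool, after_respawn: bool,
                ade: float,
                reward_vehicle_collision: float = -0.5,
                reward_offroad_collision: float = -0.2,
                reward_goal: float = 1.0,
                reward_goal_post_respawn: float = 1.0,  # placeholder; the real default lives in drive.ini
                reward_ade: float = 0.0) -> float:
    """Sketch of the per-step reward terms listed above (not the exact c_step code)."""
    r = 0.0
    if collided:
        r += reward_vehicle_collision
    if offroad:
        r += reward_offroad_collision
    if reached_goal:
        r += reward_goal_post_respawn if after_respawn else reward_goal
    r += reward_ade * ade  # optional ADE shaping; inactive when reward_ade == 0
    return r
```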
collision_behavior, offroad_behavior, reward_vehicle_collision_post_respawn, and spawn_immunity_timer are parsed from the INI but currently unused in the stepping logic.
Configuration files (.ini)¶
pufferlib/config/default.ini supplies global defaults. Environment-specific overrides live in pufferlib/config/ocean/drive.ini and are applied on top of those defaults when you run puffer train puffer_drive; CLI flags (e.g., --env.num-maps 128) override both.
Key sections in pufferlib/config/ocean/drive.ini (an illustrative override snippet follows this list):
- [env]: Simulator knobs: num_agents (policy slots, C core cap 64), num_maps, scenario_length, resample_frequency, action_type, dynamics_model, rewards, goal_radius, goal_behavior, init_steps, init_mode, control_mode; rendering toggles render, render_interval, obs_only, show_grid, show_lasers, show_human_logs, render_map.
- [vec]: Vectorization sizing (num_envs, num_workers, batch_size; backend defaults to multiprocessing).
- [policy]/[rnn]: Model widths for the Torch policy (input_size, hidden_size) and optional LSTM wrapper.
- [train]: PPO-style hyperparameters (timesteps, learning rate, clipping, batch/minibatch, BPTT horizon, optimizer choice) merged with any unspecified defaults from pufferlib/config/default.ini.
- [eval]: WOSAC/human-replay switches and sizing (eval.wosac_*, eval.human_replay_*) mapped directly to the Drive kwargs in evaluation subprocesses.
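An illustrative override file using these sections might look like the snippet below. The key names come from the lists above; the values are placeholders rather than the shipped defaults.

```ini
[env]
num_agents = 64
num_maps = 128
scenario_length = 91
resample_frequency = 910
action_type = discrete
dynamics_model = classic
init_mode = create_all_valid
control_mode = control_vehicles
goal_behavior = 0
goal_radius = 2.0

[vec]
num_envs = 8
num_workers = 8
```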
Model overview¶
Defined in pufferlib/ocean/torch.py:Drive:
- Three MLP encoders (ego, partners, roads) with LayerNorm. Partner and road encodings are max-pooled across instances.
- Concatenated embedding → GELU → linear to hidden_size, then split into actor/value heads.
- Discrete actions are emitted as logits per dimension (MultiDiscrete), continuous actions as Gaussian parameters (softplus std). Value head is a single linear output.
- Recurrent = pufferlib.models.LSTMWrapper can wrap the policy using the rnn config entries; otherwise the policy is feed-forward.
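A condensed PyTorch sketch of that structure is shown below. Layer sizes, the action-head width, and minor ordering details are illustrative; the authoritative module is pufferlib/ocean/torch.py:Drive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DrivePolicySketch(nn.Module):
    """Condensed restatement of the encoder/pool/head structure described above."""

    def __init__(self, ego_dim=7, partner_dim=7, road_dim=7, hidden_size=256):
        super().__init__()

        def encoder(in_dim):
            # One MLP encoder with LayerNorm per observation block
            return nn.Sequential(
                nn.Linear(in_dim, hidden_size),
                nn.LayerNorm(hidden_size),
                nn.GELU(),
            )

        self.ego_enc = encoder(ego_dim)
        self.partner_enc = encoder(partner_dim)
        self.road_enc = encoder(road_dim)
        self.proj = nn.Linear(3 * hidden_size, hidden_size)
        self.actor = nn.Linear(hidden_size, 7 * 13)  # discrete classic logits (illustrative head width)
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, ego, partners, roads):
        # ego: (B, ego_dim), partners: (B, 63, partner_dim), roads: (B, 200, road_dim)
        e = self.ego_enc(ego)
        p = self.partner_enc(partners).max(dim=1).values  # max-pool over partner instances
        r = self.road_enc(roads).max(dim=1).values        # max-pool over road segments
        h = self.proj(F.gelu(torch.cat([e, p, r], dim=-1)))
        return self.actor(h), self.value(h)

policy = DrivePolicySketch()
logits, value = policy(torch.zeros(2, 7), torch.zeros(2, 63, 7), torch.zeros(2, 200, 7))
print(logits.shape, value.shape)  # torch.Size([2, 91]) torch.Size([2, 1])
```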
Drive source files (what lives where)¶
- pufferlib/ocean/drive/drive.py: Python Gymnasium-style wrapper that sets up buffers, validates map availability, seeds the C core via binding.env_init, and handles map resampling.
- pufferlib/ocean/drive/drive.h: Main C implementation of stepping, observations, rewards/metrics, grid map, lane graph, and collision checking.
- pufferlib/ocean/drive/binding.c: Python C-extension glue that exposes Drive to Python, handles shared buffer setup, logging, and reading the .ini config.
- pufferlib/ocean/drive/visualize.c: Raylib-based renderer used by the visualize binary and training video exports.
- pufferlib/ocean/drive/drive.c: Small C demo/perf harness and network parity test runner for the C policy head.
- pufferlib/ocean/drive/drivenet.h: Lightweight C inference network used by the visualizer/demo to mirror the Torch policy outputs.
Drive README (C core notes)¶
Agent initialization and control¶
init_mode¶
Determines which agents are created in the environment.
| Option | Description |
|---|---|
| create_all_valid | Create all entities valid at initialization (traj_valid[init_steps] == 1). |
| create_only_controlled | Create only those agents that are controlled by the policy. |
control_mode¶
Determines which created agents are controlled by the policy.
| Option | Description |
|---|---|
| control_vehicles (default) | Control only valid vehicles (not experts, beyond MIN_DISTANCE_TO_GOAL, under MAX_AGENTS). |
| control_agents | Control all valid agent types (vehicles, cyclists, pedestrians). |
| control_tracks_to_predict (WOMD only) | Control agents listed in the tracks_to_predict metadata. |
Termination conditions (done)¶
Episodes are never truncated before reaching episode_len. The goal_behavior argument controls agent behavior after reaching a goal early:
- goal_behavior=0 (default): Agents respawn at their initial position after reaching their goal (last valid log position).
- goal_behavior=1: Agents receive new goals indefinitely after reaching each goal.
- goal_behavior=2: Agents stop after reaching their goal.
Logged performance metrics¶
We record multiple performance metrics during training, aggregated over all active agents (alive and controlled). Key metrics include:
- score: Goals reached cleanly (goal was achieved without collision or going off-road).
- collision_rate: Binary flag (0 or 1) if the agent hit another vehicle.
- offroad_rate: Binary flag (0 or 1) if the agent left road bounds.
- completion_rate: Whether the agent reached its goal in this episode (even if it collided or went off-road).
Metric aggregation¶
The num_agents parameter in drive.ini defines the total number of agents used to collect experience. At runtime, Puffer uses num_maps to create enough environments to populate the buffer with num_agents, distributing them evenly across num_envs.
Because agents are respawned immediately after reaching their goal, they remain active throughout the episode.
At the end of each episode (i.e., when timestep == TRAJECTORY_LENGTH), metrics are logged once via:
```c
if (env->timestep == TRAJECTORY_LENGTH) {
    add_log(env);
    c_reset(env);
    return;
}
```
Metrics are normalized and aggregated in vec_log (pufferlib/ocean/env_binding.h). They are averaged over all active agents across all environments. For example, the aggregated collision rate is computed as

\[
\text{collision\_rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\text{collisions}_i \geq 1\right],
\]

where \(N\) is the number of controlled agents.
Since these metrics do not capture multiple events per agent, we additionally log the average number of collision and off-road events per episode. This is computed as

\[
\text{avg\_collisions\_per\_agent} = \frac{1}{N} \sum_{i=1}^{N} \text{collisions}_i,
\]

where \(N\) is the total number of controlled agents. For example, an avg_collisions_per_agent value of 4 indicates that, on average, each agent collides four times per episode.
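Restating both statistics over per-agent event counts (a plain NumPy sketch of the formulas above, not the vec_log implementation):

```python
import numpy as np

# Per-agent collision counts for one episode across all envs (illustrative data).
collisions_per_agent = np.array([0, 2, 0, 4, 1, 0])
N = len(collisions_per_agent)  # number of controlled agents

collision_rate = np.mean(collisions_per_agent >= 1)        # fraction with >= 1 event
avg_collisions_per_agent = collisions_per_agent.sum() / N  # average event count

print(collision_rate, avg_collisions_per_agent)
```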

Effect of respawning on metrics¶
By default, agents are reset to their initial position when they reach their goal before the episode ends. Upon respawn, respawn_timestep is updated from -1 to the current step index. After an agent respawns, all other agents are removed from the environment, so collisions with other agents cannot occur post-respawn.