End-to-End Driving

The End-to-End Driving dataset aims to foster research into the generalization and reasoning capabilities of end-to-end modeling approaches on long-tail examples.
Check out the benchmarks based on this dataset:
2025 Challenges: Vision-based E2E Driving
Overview
The End-to-End Driving Dataset includes the following features.
Type | Name |
---|---|
Sensor Data | Camera images from 8 cameras, with camera intrinsics and extrinsics |
Routing | High-level driving command (left, straight, or right) |
Labels | Ego status: past trajectory, velocity, and acceleration; future trajectory (training data only) |
Labels | Rater feedback labels |
Preview of Our Long-Tail Scenarios
We selected six videos from our roughly 4,000 driving segments to demonstrate the long-tail nature of the dataset.
Flock of birds
Cut-in
Deer at night
Car against traffic at 70 mph
Fallen scooter
Marathon, cones, traffic lights
Coordinate Systems
This dataset employs two primary coordinate systems: vehicle coordinates and sensor frames.
Vehicle Coordinates: The vehicle coordinate system is centered at the ego vehicle's center. The x-axis points forward, the y-axis points left, and the z-axis points upward. All trajectory data is referenced to this vehicle coordinate system.
Sensor Frames: Each sensor's frame is related to the vehicle frame by an extrinsic transformation. For cameras, the frame is centered at the lens. The x-axis points out from the lens, the z-axis points upward, and the y/z plane is parallel to the camera's image plane. This is a right-handed coordinate system.
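A point in vehicle coordinates can be mapped into a sensor frame using the sensor's extrinsic transform. The sketch below is a minimal illustration assuming the extrinsic is available as a 4x4 homogeneous matrix that maps sensor-frame points to vehicle-frame points; the actual storage convention is defined by the data proto.

```python
import numpy as np

def vehicle_to_sensor(point_vehicle, extrinsic_sensor_to_vehicle):
    """Map a 3D point from vehicle coordinates into a sensor frame.

    Assumes `extrinsic_sensor_to_vehicle` is a 4x4 homogeneous transform
    taking sensor-frame points to vehicle-frame points, so it is inverted
    here to go the other way. Treat this as a sketch; the dataset's proto
    defines the authoritative convention.
    """
    point_h = np.append(np.asarray(point_vehicle, dtype=np.float64), 1.0)
    vehicle_to_sensor_tf = np.linalg.inv(extrinsic_sensor_to_vehicle)
    return (vehicle_to_sensor_tf @ point_h)[:3]
```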
Camera Data
This dataset includes images from eight cameras, each capturing a different direction: front, front left, front right, side left, side right, rear, rear left, and rear right. For each direction, a single JPEG image is provided. Alongside the image data, we supply camera intrinsics and extrinsics, which define the camera's internal parameters and its position relative to the vehicle's center, respectively. These parameters enable the projection of 3D trajectories onto the camera images. Each driving segment features 10Hz camera video sequences, with training data spanning 20 seconds and testing data spanning 12 seconds.
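As a rough illustration of how a trajectory waypoint could be projected onto an image, the sketch below applies a simple pinhole model to a point already expressed in the camera frame described above (x out of the lens, y left, z up). The (fx, fy, cx, cy) parameterization and sign conventions are assumptions for illustration; the calibration proto is the authoritative definition.

```python
def project_to_image(point_camera, fx, fy, cx, cy):
    """Project a camera-frame point (x forward, y left, z up) to pixels.

    Uses a plain pinhole model with hypothetical intrinsics (fx, fy, cx, cy)
    and assumes image u grows to the right and v grows downward; see the
    dataset's calibration proto for the exact parameterization.
    """
    x, y, z = point_camera
    if x <= 0:
        return None  # point is behind the camera
    u = cx - fx * (y / x)  # camera y points left, image u points right
    v = cy - fy * (z / x)  # camera z points up, image v points down
    return u, v
```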

High-Level Command
We provide a discrete driving command for each frame, indicating whether the intended route goes left, straight, or right.
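The command is a small categorical signal per frame. The snippet below is only a hypothetical encoding to illustrate its discrete nature; the real field name and values are defined in the data proto.

```python
from enum import IntEnum

class DrivingCommand(IntEnum):
    """Hypothetical encoding of the high-level routing command.

    The actual field name and integer values are defined in the dataset's
    proto; this enum only illustrates the discrete left/straight/right signal.
    """
    LEFT = 0
    STRAIGHT = 1
    RIGHT = 2
```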
Labels
Ego Status
Past Trajectory: The ego vehicle's past 4-second trajectory, aligned with the current camera timestamp, is provided as waypoints [(x1, y1), (x2, y2), ...] at 4 Hz. All waypoints are in vehicle coordinates.
Velocity and Acceleration: The ego vehicle's velocity and acceleration, aligned with the past trajectory, are also provided.
(Training Data Only) Future Trajectory: The ego vehicle's future 5-second trajectory, in the same format and frequency as the past trajectory.
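To make the shapes concrete: at 4 Hz, the 4-second history contains 16 (x, y) waypoints and the 5-second future contains 20. The sketch below compares a hypothetical predicted future against a ground-truth future with average displacement error; the arrays are illustrative and not part of any dataset API.

```python
import numpy as np

def average_displacement_error(pred_xy, gt_xy):
    """Mean Euclidean distance between predicted and ground-truth waypoints.

    Both arrays have shape (N, 2) in vehicle coordinates; at 4 Hz a
    5-second future trajectory has N = 20 waypoints.
    """
    pred_xy = np.asarray(pred_xy, dtype=np.float64)
    gt_xy = np.asarray(gt_xy, dtype=np.float64)
    return float(np.linalg.norm(pred_xy - gt_xy, axis=-1).mean())

# Illustrative shapes only: 20 future waypoints at 4 Hz.
gt_future = np.zeros((20, 2))
pred_future = np.full((20, 2), 0.1)
print(average_displacement_error(pred_future, gt_future))
```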
Scenario Clusters
The driving scenarios can be divided into the following categories:
Construction: Construction zone scenarios.
Intersection: Complex intersection scenarios.
Pedestrian: Pedestrian interaction scenarios.
Cyclist: Cyclist interaction scenarios.
Multi-Lane Maneuvers: Scenarios where the ego vehicle is required to change lanes in multi-lane roads.
Single-Lane Maneuvers: Scenarios where the ego vehicle is required to take actions in single-lane roads.
Cut-ins: Scenarios where other on-road agents cut in.
Foreign Object Debris: Scenarios with rare objects on the road, such as animals and furniture.
Special Vehicles: Scenarios involving special vehicles.
Spotlight: Manually selected challenging scenarios.
Others: Scenarios not belonging to any of the clusters above.
Rater Feedback Labels
To capture the diversity of acceptable driving decisions during critical events, this dataset includes rater feedback labels. At specific moments within each driving segment, expert labelers rate three distinct 5-second future trajectories on a scale of 0 to 10, where 0 represents the worst and 10 the best driving trajectory. Importantly, we ensure that at least one of the rater-specified trajectories receives a score higher than 6.
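One simple way to consume these labels is to treat each labeled moment as three (trajectory, score) pairs. The sketch below picks the best-rated reference trajectory and reflects the guarantee that at least one score exceeds 6; the data layout and names are hypothetical.

```python
def best_rated_trajectory(rated):
    """Pick the highest-scored rater trajectory for a labeled frame.

    `rated` is a list of (trajectory, score) pairs with scores in [0, 10].
    The structure and names here are hypothetical, for illustration only.
    """
    trajectory, score = max(rated, key=lambda pair: pair[1])
    assert score > 6, "At least one rated trajectory should score above 6."
    return trajectory, score
```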
Data Composition
This dataset comprises 4,021 driving segments stored in TFRecord format, divided as follows:
Training Data (2,037 segments): Each segment includes a 20-second video and complete driving logs for the entire duration.
Validation Data (479 segments): Each segment includes a 20-second video and complete driving logs for the entire duration. In addition, rater feedback labels are provided for a single frame in each segment.
Testing Data (1,505 segments): Each segment includes a 12-second video. Participants are tasked with predicting the 5-second future trajectory based on the last frame. Rater feedback labels and future driving logs are withheld.
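Because segments are stored as TFRecord files, they can be iterated with tf.data. The filename pattern below is a placeholder, and each record would be parsed with the proto definition published in the GitHub repository.

```python
import tensorflow as tf

# Placeholder pattern; point this at wherever the training segments live.
files = tf.io.gfile.glob("/path/to/training/*.tfrecord*")
dataset = tf.data.TFRecordDataset(files)

for raw_record in dataset.take(1):
    # Each record is a serialized proto; parse it with the message
    # definition from the dataset's GitHub repository.
    print(len(raw_record.numpy()), "bytes in the first record")
```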
Data Proto
Our data proto definition is straightforward; check our [GitHub] for details.
Tutorial
Check out this colab and follow the provided tutorial to get familiar with the dataset.