Navformer is the end-to-end model training and evaluation component of WorldEngine, built on MMDetection3D, the nuPlan / OpenScene datasets, and NAVSIM.
It supports the full workflow (train → open-loop evaluation → rare case extraction → RL fine-tuning), with VADv2 and HydraMDP as the supported model architectures.
- System Requirements
- Installation
- Data
- Quick Reference
- Training
- Evaluation
- Rare Case Extraction
- Configuration
- Model Architectures
- Advanced Training
- Troubleshooting
- Performance Optimization
Minimum:
- GPU: NVIDIA GPU with 8 GB VRAM (e.g., RTX 2080)
- RAM: 32 GB
- Storage: 500 GB SSD
- CPU: 8 cores
Recommended:
- GPU: NVIDIA GPU with 24 GB+ VRAM (e.g., RTX 3090, A100)
- RAM: 64 GB+
- Storage: 5 TB+ SSD
- CPU: 16+ cores
Software:
- OS: Linux (Ubuntu 20.04 / 22.04)
- CUDA: 11.8
- Conda / Miniconda
conda create --name navformer python=3.9 -y
conda activate navformer

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 \
    --index-url https://download.pytorch.org/whl/cu118

Verify:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# Expected: PyTorch: 2.0.1+cu118, CUDA: True

MMCV must be built from source to include custom CUDA operators:
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.6.2
# Build with custom ops (takes 10–15 minutes)
# Downgrade setuptools to ~75.1.0 if you encounter build errors
MMCV_WITH_OPS=1 pip install -v -e .
python .dev_scripts/check_installation.py
cd ..

Verify:
python -c "import mmcv; print(f'MMCV: {mmcv.__version__}')"
# Expected: MMCV: 1.6.2

pip install mmcls==0.25.0
pip install mmdet==2.25.3
pip install mmdet3d==1.0.0rc6
pip install mmsegmentation==0.29.1

pip install -r requirements.txt
pip install shapely==2.0.4

python -c "
import torch, mmcv, mmdet, mmdet3d, numpy, hydra
print('All Navformer dependencies OK')
print(f'PyTorch {torch.__version__}')
print(f'MMCV {mmcv.__version__}')
print(f'MMDetection3D {mmdet3d.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
"

Navformer relies on the NAVSIM devkit v1.1:
git clone -b v1.1 https://github.com/autonomousvision/navsim.git

Add the following to ~/.bashrc or ~/.zshrc:
export NAVSIM_DEVKIT_ROOT="/path/to/navsim"
export NAVFORMER_ROOT="/path/to/Navformer"
export NUPLAN_MAPS_ROOT="/path/to/nuplan/maps"
export PYTHONPATH=$NAVFORMER_ROOT:$NAVSIM_DEVKIT_ROOT:$PYTHONPATH

Apply:
source ~/.bashrc  # or: source ~/.zshrc

Navformer/
├── data/
│ ├── raw/ # nuPlan and OpenScene datasets
│ └── alg_engine/ # Navformer-specific data
└── experiments/ # Experiment outputs (auto-created)
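After reloading the shell, the variables can be sanity-checked with a short Python snippet (a sketch; the variable names are the ones exported above):

```python
import os

REQUIRED_VARS = ["NAVSIM_DEVKIT_ROOT", "NAVFORMER_ROOT", "NUPLAN_MAPS_ROOT"]

def missing_vars(env):
    """Return required variables that are unset or point to a non-existent directory."""
    return [name for name in REQUIRED_VARS
            if not env.get(name) or not os.path.isdir(env[name])]

if __name__ == "__main__":
    problems = missing_vars(os.environ)
    if problems:
        print(f"Missing or invalid: {', '.join(problems)}")
    else:
        print("All Navformer environment variables look good")
```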
Navformer reuses the OpenDriveLab/WorldEngine dataset on Hugging Face, which contains merged annotation PKLs, PDM caches, model checkpoints, and K-means vocab files.
- Hugging Face:

  curl -LsSf https://hf.co/cli/install.sh | bash
  hf download OpenDriveLab/WorldEngine --repo-type dataset --local-dir /path/to/Navformer

- ModelScope (recommended for users in China):

  pip install modelscope
  modelscope download --dataset OpenDriveLab/WorldEngine
data/raw/
├── nuplan/
│ └── dataset/
│ ├── maps/ # HD maps (required)
│ │ ├── nuplan-maps-v1.0.json
│ │ ├── us-nv-las-vegas-strip/
│ │ ├── us-ma-boston/
│ │ ├── us-pa-pittsburgh-hazelwood/
│ │ └── sg-one-north/
│ └── nuplan-v1.1/
│ ├── sensor_blobs/ # Camera images and LiDAR
│ └── splits/
│
└── openscene-v1.1/
├── sensor_blobs/
│ ├── trainval/
│ └── test/
└── meta_datas/
├── trainval/
└── test/
Use symlinks to point at your existing downloads:
cd data/raw
ln -s /path/to/nuplan nuplan
ln -s /path/to/openscene-v1.1 openscene-v1.1

data/alg_engine/
├── ckpts/ # Pre-trained model checkpoints
├── merged_infos_navformer/
│ ├── nuplan_openscene_navtrain.pkl
│ └── nuplan_openscene_navtest.pkl
├── pdms_cache/ # Pre-computed PDM metrics cache
│ ├── pdm_8192_gt_cache_navtrain.pkl
│ └── pdm_8192_gt_cache_navtest.pkl
└── test_8192_kmeans.npy # K-means clustering for PDM vocab
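A broken or partial download is easiest to catch up front; this sketch checks for the files listed above (paths assumed relative to data/alg_engine/):

```python
from pathlib import Path

# Files from the data/alg_engine/ layout above
EXPECTED = [
    "merged_infos_navformer/nuplan_openscene_navtrain.pkl",
    "merged_infos_navformer/nuplan_openscene_navtest.pkl",
    "pdms_cache/pdm_8192_gt_cache_navtrain.pkl",
    "pdms_cache/pdm_8192_gt_cache_navtest.pkl",
    "test_8192_kmeans.npy",
]

def missing_files(root):
    """Return expected files that are absent under the given root directory."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

if __name__ == "__main__":
    for rel in missing_files("data/alg_engine"):
        print(f"missing: {rel}")
```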
conda activate navformer
# Training (8 GPUs)
./scripts/e2e_dist_train.sh <config> <num_gpus> [resume_checkpoint]
# Open-loop navtest evaluation
./scripts/e2e_dist_eval.sh <config> <checkpoint> <num_gpus>
# Full train set evaluation
bash scripts/e2e_dist_eval_navtrain.sh <config> <checkpoint> <num_gpus>
# Rare case extraction
python scripts/rare_case_sampling_by_pdms.py \
--pdm-result <csv_file> \
--base-split <yaml_file> \
--output-dir <output_dir>

conda activate navformer
# Train VADv2 (8 GPUs)
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8

Arguments:
- <config>: path to the configuration file
- <num_gpus>: number of GPUs
- [resume_checkpoint] (optional): checkpoint to resume from
./scripts/e2e_dist_train.sh \
configs/navformer/e2e_vadv2.py \
8 \
experiments/navformer/e2e_vadv2/latest.pth

If latest.pth exists in experiments/navformer/e2e_vadv2/, training auto-resumes when you omit the third argument.
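The resume behavior boils down to: an explicit third argument wins, otherwise fall back to latest.pth if it exists. A sketch of that logic (the helper name is illustrative, not part of the actual script):

```python
from pathlib import Path

def resolve_resume(work_dir, explicit=None):
    """Pick the checkpoint to resume from, mirroring the auto-resume fallback."""
    if explicit:
        return str(explicit)          # explicit argument always wins
    latest = Path(work_dir) / "latest.pth"
    return str(latest) if latest.exists() else None  # None -> train from scratch
```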
# Watch training log
tail -f experiments/navformer/e2e_vadv2/logs/train.*
# TensorBoard
tensorboard --logdir experiments/navformer/e2e_vadv2/tf_logs

Key metrics:
- loss: total training loss (should decrease)
- loss_planning: planning loss
- loss_track: tracking loss
- ade_4s: average displacement error at 4 s
- fde_4s: final displacement error at 4 s
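For quick spot checks without TensorBoard, metrics can be pulled out of a log line with a small parser. The key: value layout below is an assumption; adjust the regex to the actual log format:

```python
import re

# Longer names listed first so "loss" does not shadow "loss_planning"
METRIC_RE = re.compile(
    r"\b(loss_planning|loss_track|loss|ade_4s|fde_4s)[:=]\s*([0-9.]+)")

def parse_metrics(line):
    """Extract known metrics from one log line into a dict of floats."""
    return {name: float(val) for name, val in METRIC_RE.findall(line)}

# Hypothetical log line -- the real format may differ
print(parse_metrics("Epoch [3][500/2000] loss: 2.41, loss_planning: 1.10, ade_4s: 1.83"))
```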
experiments/navformer/e2e_vadv2/
├── e2e_vadv2.py # config backup
├── logs/
│ └── train.*
├── epoch_1.pth
├── ...
├── epoch_20.pth
└── latest.pth # symlink to latest checkpoint
conda activate navformer
./scripts/e2e_dist_eval.sh \
configs/navformer/e2e_vadv2.py \
experiments/navformer/e2e_vadv2/epoch_20.pth \
8

Output: experiments/navformer/e2e_vadv2/navtest.csv
./scripts/e2e_dist_eval_navtest_failures.sh \
configs/navformer/e2e_vadv2.py \
experiments/navformer/e2e_vadv2/epoch_20.pth \
8

Output: experiments/navformer/e2e_vadv2/navtest_failures.csv
Required before Rare Case Extraction. Evaluates on the full navtrain split:
bash scripts/e2e_dist_eval_navtrain.sh \
configs/navformer/e2e_vadv2.py \
experiments/navformer/e2e_vadv2/epoch_20.pth \
8

Output: experiments/navformer/e2e_vadv2/navtrain.csv
token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,ego_progress,comfort,score

| Metric | Description | Direction |
|---|---|---|
| ade_4s | Average trajectory error over 4 s (m) | lower |
| fde_4s | Final position error at 4 s (m) | lower |
| no_at_fault_collisions | Collision avoidance rate (0–1) | higher |
| drivable_area_compliance | Stay in drivable area (0–1) | higher |
| ego_progress | Route completion (0–1) | higher |
| comfort | Comfort metric (0–1) | higher |
| score | Overall PDM score (0–1) | higher |
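Since each CSV row is one scenario token, aggregate statistics take only a few lines of stdlib Python. The two rows below are made up for illustration; the column names match the schema above:

```python
import csv
import io
import statistics

def summarize(csv_text, column="score"):
    """Mean of one metric column across all scenario rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return statistics.mean(float(r[column]) for r in rows)

# Two made-up rows for illustration
sample = """token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,ego_progress,comfort,score
a1b2,1.2,2.5,1.0,1.0,0.91,1.0,0.88
c3d4,2.0,4.1,0.0,1.0,0.45,1.0,0.31
"""
print(summarize(sample))            # mean PDM score
print(summarize(sample, "ade_4s"))  # mean displacement error
```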
Extract failure scenarios from training-set evaluation for targeted fine-tuning.
Prerequisite: complete a Full Train Set Evaluation first.
conda activate navformer
python scripts/rare_case_sampling_by_pdms.py \
--pdm-result experiments/navformer/e2e_vadv2/navtrain.csv \
--base-split configs/navsim_splits/navtrain_split/navtrain_50pct.yaml \
--output-dir configs/navsim_splits/navtrain_split/e2e_vadv2_rare

Output:
configs/navsim_splits/navtrain_split/e2e_vadv2_rare/
├── navtrain_50pct_collision.yaml # collision scenarios
├── navtrain_50pct_off_road.yaml # off-road scenarios
└── navtrain_50pct_ep_1pct.yaml # low ego-progress (bottom 1%)
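The selection behind these splits amounts to simple row filters over navtrain.csv. A stdlib sketch of the idea (reduced column set, illustrative helper, crude percentile; the real script may differ):

```python
import csv
import io

def select_rare(rows, collision_thresh=1.0, ep_quantile=0.01):
    """Bucket scenario rows into the three rare-case splits."""
    collision = [r for r in rows if float(r["no_at_fault_collisions"]) < collision_thresh]
    off_road = [r for r in rows if float(r["drivable_area_compliance"]) < 1.0]
    # Crude quantile cutoff on ego_progress (bottom ep_quantile fraction)
    progress = sorted(float(r["ego_progress"]) for r in rows)
    cutoff = progress[min(int(len(progress) * ep_quantile), len(progress) - 1)]
    low_ep = [r for r in rows if float(r["ego_progress"]) <= cutoff]
    return collision, off_road, low_ep

# Tiny synthetic table with only the columns the filters need
sample = """token,no_at_fault_collisions,drivable_area_compliance,ego_progress
a,1.0,1.0,0.90
b,0.0,1.0,0.80
c,1.0,0.5,0.70
d,1.0,1.0,0.05
"""
rows = list(csv.DictReader(io.StringIO(sample)))
col, off, low = select_rare(rows)
print([r["token"] for r in col], [r["token"] for r in off], [r["token"] for r in low])
```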
Edit scripts/rare_case_sampling_by_pdms.py:
# Change collision threshold
collision_scenarios = df[df['no_at_fault_collisions'] < 0.95] # default 1.0
# Change ego-progress percentile
ep_threshold = df['ego_progress'].quantile(0.05)  # default 0.01 (1% → 5%)

Configs follow the MMDetection3D hierarchical pattern:
configs/
├── _base_/
│ └── default_runtime.py
├── navformer/
│ ├── e2e_vadv2.py
│ ├── e2e_hydramdp.py
│ └── track_map_nuplan_r50_navtrain.py
└── navsim_splits/
├── navtrain_split/
│ ├── navtrain.yaml
│ ├── navtrain_50pct.yaml
│ └── e2e_vadv2_rare/
│ ├── navtrain_50pct_collision.yaml
│ ├── navtrain_50pct_off_road.yaml
│ └── navtrain_50pct_ep_1pct.yaml
└── navtest_split/
├── navtest.yaml
└── navtest_failures.yaml
model = dict(
type='VADv2', # or 'HydraMDP'
num_query=900,
planning_steps=8,
)
bev_h_, bev_w_ = 200, 200
patch_size = [102.4, 102.4] # physical range in meters
input_modality = dict(
use_lidar=False,
use_camera=True, # 8 cameras
use_radar=False,
use_external=True, # CAN bus
)
total_epochs = 20
optimizer = dict(type='AdamW', lr=2e-4, weight_decay=0.01)
data = dict(
samples_per_gpu=1,
workers_per_gpu=4,
train=dict(
ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
scenario_filter='configs/navsim_splits/navtrain_split/navtrain_50pct.yaml',
),
val=dict(
ann_file='merged_infos_navformer/nuplan_openscene_navtest.pkl',
scenario_filter='configs/navsim_splits/navtest_split/navtest.yaml',
),
)

./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8 \
    --cfg-options optimizer.lr=1e-4 total_epochs=30 data.samples_per_gpu=2

| Architecture | Config | Strengths |
|---|---|---|
| VADv2 (default) | configs/navformer/e2e_vadv2.py | Fast inference, general driving |
| HydraMDP | configs/navformer/e2e_hydramdp.py | Multi-modal planning, safety-critical |
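The --cfg-options flag shown earlier uses dot-keys to reach into nested config dicts. A minimal sketch of that convention (illustrative, not MMCV's actual implementation):

```python
def apply_overrides(cfg, overrides):
    """Apply {'a.b.c': value} style overrides to a nested dict, in place."""
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = cfg
        for key in parents:
            node = node.setdefault(key, {})  # walk/create intermediate dicts
        node[leaf] = value
    return cfg

cfg = {"optimizer": {"type": "AdamW", "lr": 2e-4}, "total_epochs": 20}
apply_overrides(cfg, {"optimizer.lr": 1e-4, "total_epochs": 30})
print(cfg["optimizer"]["lr"], cfg["total_epochs"])  # 0.0001 30
```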
# Node 0 (master)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=0
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8
# Node 1 (worker)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=8
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8

# in config
fp16 = dict(loss_scale='dynamic')

# effective batch = samples_per_gpu * num_gpus * gradient_accumulation_steps
# e.g. 1 sample/GPU * 8 GPUs * 4 steps = effective batch of 32
runner = dict(max_epochs=20, gradient_accumulation_steps=4)

CUDA out of memory:
# Reduce batch size: data.samples_per_gpu = 1
# Lower BEV resolution: bev_h_, bev_w_ = 150, 150
# Enable gradient checkpointing: model.img_backbone.with_cp = True

Training loss not decreasing:
grep "load checkpoint" experiments/navformer/*/logs/train.*
./scripts/e2e_dist_train.sh ... --cfg-options optimizer.lr=1e-4

Evaluation hangs:
ps aux | grep python
pkill -f "test.py"
./scripts/e2e_dist_eval.sh ... 4  # try fewer GPUs

ModuleNotFoundError: No module named mmdet3d:
conda activate navformer
python -c "import mmcv; print(mmcv.__version__)"
pip uninstall mmdet3d -y && pip install mmdet3d==1.0.0rc6

Corrupted checkpoint:
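Checkpoints saved by recent PyTorch versions are zip archives, so truncation can often be detected before relaunching training. A stdlib heuristic (not a full validation of the tensor payload):

```python
import zipfile

def checkpoint_looks_ok(path):
    """Heuristic: recent torch.save output is a zip; reject unreadable/corrupt files."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # None means no corrupt members found
    except zipfile.BadZipFile:
        return False
```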
# Use a previous epoch
./scripts/e2e_dist_train.sh ... experiments/navformer/e2e_vadv2/epoch_18.pth

Training speed:
- data.workers_per_gpu = 8 (if CPU/RAM allows)
- Store data on NVMe SSD
- fp16 = dict(loss_scale='dynamic')
- data.persistent_workers = True

Memory:

- data.samples_per_gpu = 1
- bev_h_, bev_w_ = 150, 150
- model.img_backbone.with_cp = True
Multi-node:
- Use homogeneous GPU types across nodes
- InfiniBand for inter-node communication
- Shared NFS/Lustre for data loading