# PPO Alternating Optimization

## Environment Requirements
* `env`: Must be registered in `sdriving.environments.REGISTRY`. The `step` function takes 2 arguments, the first of which represents the `stage`. When `stage` is 0, the call performs the single-step RL action; when `stage` is 1, the returned value should be the observation of the controller. Any further call to `step` follows the same API as PPO Distributed Centralized Critic (see the sketch after this list).
* `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
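To make the two-stage protocol concrete, here is a minimal sketch of one possible reading of the contract. The class name, helper logic, and observation contents are assumptions for illustration and are not taken from the sdriving codebase.

```python
class TwoStageEnv:
    """Hypothetical environment obeying the two-stage step contract;
    all internals here are placeholders, not sdriving code."""

    def __init__(self):
        self._controller_started = False

    def step(self, stage, action=None):
        if stage == 0:
            # Stage 0: perform the single-step RL action
            # (e.g. the spline subpolicy's decision).
            self._spline_action = action
            return None
        if not self._controller_started:
            # First stage-1 call: return the observation of the controller.
            self._controller_started = True
            return {"controller_obs": 0.0}
        # Any further call follows the same (obs, reward, done, info) API
        # as PPO Distributed Centralized Critic.
        return {"controller_obs": 0.0}, 0.0, False, {}


env = TwoStageEnv()
env.step(0, action=[0.1, 0.2])                        # single-step RL action
obs = env.step(1)                                     # controller observation
obs, reward, done, info = env.step(1, action=[0.0])   # usual Gym-style tuple
```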
## Logging / Checkpointing
* `log_dir`: Path to the directory for storing the logs and checkpoints
* `wandb_id`: ID with which to log to wandb. If the ID is the same as an earlier run, the new logs are appended to that run
* `load_path`: Checkpoint from which to load a previously trained model
* `save_freq`: The frequency with which to checkpoint models
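The exact mechanism for supplying these values depends on how the trainer is invoked, so treat the fragment below as an assumed example with made-up values rather than a real configuration.

```python
# Assumed example values; none of these paths or IDs come from the source.
logging_config = {
    "log_dir": "logs/ppo_alt_opt",  # directory for logs and checkpoints
    "wandb_id": "run-0001",         # reuse an earlier ID to append to that run
    "load_path": None,              # or a checkpoint path to resume training
    "save_freq": 10,                # how often to checkpoint models
}
```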
## Configurable HyperParameters
* `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those
* `actor_kwargs`: Arguments passed to `PPOWaypointCategoricalActor` / `PPOWaypointGaussianActor`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those
* `seed`: Random seed
* `number_steps_per_controller_update`: Number of observation and action pairs to be collected before training the model. This is split equally across all the processes
* `number_episodes_per_spline_update`: Number of episodes for rollout generation before updating the spline subpolicy
* `epochs`: Total number of epochs
* `gamma`: Discount factor
* `clip_ratio`: Clip ratio in the PPO algorithm
* `pi_lr`: Learning rate for the actor
* `vf_lr`: Learning rate for the critic
* `spline_lr`: Learning rate for training the spline subpolicy
* `train_iters`: If `>1`, the training data is split into mini-batches, where `batch_size = steps_per_epoch / (train_iters * num_of_processes)` (a worked example follows this list)
* `entropy_coeff`: Coefficient for entropy regularization
* `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
* `target_kl`: Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
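To make the mini-batch arithmetic for `train_iters` concrete, the fragment below works through assumed example numbers; it also assumes `steps_per_epoch` corresponds to `number_steps_per_controller_update` above.

```python
# Illustrative numbers only; nothing here is taken from an actual config.
steps_per_epoch = 4096     # obs/action pairs collected per update (total)
num_of_processes = 4       # collection is split equally across processes
train_iters = 8            # number of mini-batch passes per update

per_process_steps = steps_per_epoch // num_of_processes          # 1024
batch_size = steps_per_epoch // (train_iters * num_of_processes)
print(batch_size)          # 4096 // (8 * 4) = 128 samples per mini-batch
```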