# PPO Alternating Optimization

## Environment Requirements
- `env`: Must be registered in `sdriving.environments.REGISTRY`. The `step` function takes 2 arguments. The first one represents `stage`; when it is 0, the call performs the single-step RL action. When `stage` is 1, the returned value should be the observation of the controller. Any further call to `step` follows the same API as PPO Distributed Centralized Critic (see the sketch below).
- `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
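Concretely, the two-stage contract might look like the following minimal sketch. Everything other than the `stage` semantics described above (the class name, observation shapes, and helper methods) is a hypothetical placeholder, not the actual sdriving implementation:

```python
import numpy as np

class TwoStageEnvSketch:
    """Minimal sketch of the two-stage ``step`` contract (illustrative only)."""

    def step(self, stage, action=None):
        if stage == 0:
            # stage == 0: perform the single-step RL action
            # (the spline subpolicy's action).
            return self._apply_spline_action(action)
        if stage == 1:
            # stage == 1: return the observation for the controller.
            return self._controller_observation()
        # Any further call follows the same (obs, reward, done, info) API
        # as PPO Distributed Centralized Critic.
        return self._controller_observation(), 0.0, False, {}

    def _apply_spline_action(self, action):
        # Placeholder: integrate the chosen spline/waypoint plan into the scene.
        return np.zeros(4)

    def _controller_observation(self):
        # Placeholder: observation consumed by the controller subpolicy.
        return np.zeros(4)
```

Registration in `sdriving.environments.REGISTRY` is required as noted above; since the exact registration mechanism is not documented on this page, it is omitted from the sketch.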
## Logging / Checkpointing
- `log_dir`: Path to the directory for storing the logs and checkpoints
- `wandb_id`: ID with which to log to wandb. If the ID is the same as a previous run's, the logs are appended to that run
- `load_path`: Checkpoint from which to load a previously trained model
- `save_freq`: The frequency with which to checkpoint models
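As a quick illustration, these four options could be collected into a config like the one below. All values are made up, and the unit of `save_freq` (assumed here to be epochs) should be checked against the trainer:

```python
# Illustrative values only; the four keys are the options documented above.
logging_config = {
    "log_dir": "logs/ppo_alt_opt/run1",  # where logs and checkpoints are written
    "wandb_id": "run1",                  # reusing a previous ID appends to its logs
    "load_path": None,                   # or a checkpoint path to resume training
    "save_freq": 10,                     # checkpoint interval (unit assumed: epochs)
}
```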
## Configurable Hyperparameters
- `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those
- `actor_kwargs`: Arguments passed to `PPOWaypointCategoricalActor`/`PPOWaypointGaussianActor`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those
- `seed`: Random seed
- `number_steps_per_controller_update`: Number of observation-action pairs to collect before training the model. This is split equally across all the processes
- `number_episodes_per_spline_update`: Number of episodes of rollout generation before updating the spline subpolicy
- `epochs`: Total number of epochs
- `gamma`: Discount factor
- `clip_ratio`: Clip ratio in the PPO algorithm
- `pi_lr`: Learning rate for the actor
- `vf_lr`: Learning rate for the critic
- `spline_lr`: Learning rate for training the spline subpolicy
- `train_iters`: If `>1`, the training data is split into mini-batches for training, where `batch_size = steps_per_epoch / (train_iters * num_of_processes)` (see the worked example after this list)
- `entropy_coeff`: Coefficient for entropy regularization
- `lam`: Lambda for GAE-Lambda (always between 0 and 1, close to 1)
- `target_kl`: Roughly the KL divergence we think is appropriate between new and old policies after an update; used for early stopping (usually small, 0.01 or 0.05)
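To make the `train_iters` mini-batch arithmetic concrete, here is a worked example with made-up numbers, assuming `steps_per_epoch` in the formula above refers to the steps collected per controller update:

```python
# Illustrative numbers only.
number_steps_per_controller_update = 4096  # total steps collected per update
num_of_processes = 8                       # data is split equally across these
train_iters = 4

# Each process holds an equal share of the collected steps.
steps_per_process = number_steps_per_controller_update // num_of_processes  # 512

# batch_size = steps_per_epoch / (train_iters * num_of_processes)
batch_size = number_steps_per_controller_update // (train_iters * num_of_processes)

print(steps_per_process, batch_size)  # 512 128 -> 4 mini-batches of 128 per process
```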