PPO One Step
Environment Requirements
- `env`: Must be registered in `sdriving.environments.REGISTRY`. The `step` function must return the reward tensor of size `N x 1`. By design it will assume the horizon is of size 1; this model will most likely never converge for any other horizon size. (See the sketch after this list.)
- `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
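A minimal sketch of these requirements is given below. The class name, constructor arguments, and observation shape are hypothetical, and the exact registration API of `sdriving.environments.REGISTRY` is not shown since it is not documented here; only the `N x 1` reward shape, the horizon-1 assumption, and the `env(**env_params)` construction come from this page.

```python
import torch


# Hypothetical environment sketch; the real base class, constructor signature,
# and registration mechanism in sdriving.environments may differ.
class MyOneStepEnv:
    def __init__(self, nagents: int = 4):
        self.nagents = nagents

    def reset(self):
        # One observation per agent (shape is illustrative only).
        return torch.zeros(self.nagents, 8)

    def step(self, action):
        # Reward tensor of size N x 1, as the trainer expects.
        reward = torch.randn(self.nagents, 1)
        # Horizon is 1 by design, so the episode ends after a single step.
        done = True
        return None, reward, done, {}


# env_params are forwarded as env(**env_params), so the keys must match
# the environment's constructor arguments.
env_params = {"nagents": 4}
env = MyOneStepEnv(**env_params)
```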
Logging / Checkpointing
- `log_dir`: Path to the directory for storing the logs and checkpoints
- `wandb_id`: ID with which to log to wandb. If the id is the same as a previous one, the logs will be appended to that run
- `load_path`: Checkpoint from which to load a previously trained model
- `save_freq`: The frequency with which to checkpoint models

(An example grouping of these options follows this list.)
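The sketch below only groups the documented keys for illustration; the trainer's actual configuration format (CLI flags, YAML, or a dict) is not specified on this page, and all values are placeholders.

```python
# Hypothetical grouping of the logging/checkpointing options; the trainer's
# actual config format may differ.
logging_config = dict(
    log_dir="./logs/ppo_one_step",  # logs and checkpoints are written here
    wandb_id="ppo-one-step-demo",   # reuse an existing id to append to that run's logs
    load_path=None,                 # or a checkpoint path to resume a previously trained model
    save_freq=10,                   # checkpoint every 10 epochs (illustrative value)
)
```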
Configurable HyperParameters
- `actor_kwargs`: Arguments passed to `PPOWaypointCategoricalActor`/`PPOWaypointGaussianActor`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those
- `seed`: Random seed
- `steps_per_epoch`: Number of observation and action pairs to be collected before training the model. This is split equally across all the processes
- `epochs`: Total number of epochs
- `gamma`: Discount factor
- `clip_ratio`: Clip ratio in the PPO algorithm
- `pi_lr`: Learning rate for the actor
- `train_pi_iters`: Total number of times the same set of data is passed through for training
- `entropy_coeff`: Coefficient for entropy regularization
- `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- `target_kl`: Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)

(A hedged example configuration follows this list.)
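The dictionary below collects the documented hyperparameter names in one place. The values are placeholders for illustration, not recommendations from this page (except `target_kl`, for which the text suggests 0.01 or 0.05), and the `hidden_sizes` key inside `actor_kwargs` is an assumption about what the actor constructors accept.

```python
# Hypothetical hyperparameter dictionary; values are placeholders.
hyperparams = dict(
    actor_kwargs=dict(hidden_sizes=(64, 64)),  # observation_space / action_space are injected from env
    seed=0,
    steps_per_epoch=4000,   # split equally across all processes
    epochs=100,
    gamma=0.99,
    clip_ratio=0.2,
    pi_lr=3e-4,
    train_pi_iters=80,
    entropy_coeff=0.01,
    lam=0.97,               # GAE-Lambda, between 0 and 1, close to 1
    target_kl=0.01,         # early-stopping threshold, usually 0.01 or 0.05
)
```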