# PPO Distributed Centralized Critic
## Environment Requirements
- `env`: Must be registered in `sdriving.environments.REGISTRY`. The environment's `step` function takes an action whose batch dim is of size `N` (the number of agents). Variable `N` over time is not currently supported, but it is very simple to implement (so open an issue if needed). `step` needs to return the observation for the next timestep, a `BoolTensor` of size `N x 1` specifying whether simulation for that agent is complete, a reward tensor of size `N x 1`, and `info`, similar to OpenAI Gym environments. A minimal sketch of this interface is given after this list.
- `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
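For concreteness, below is a minimal sketch of an environment satisfying this interface. The class name and dynamics are hypothetical; only the `step` contract (action batch dim and the four return values) comes from the description above, and the exact registration mechanism of `sdriving.environments.REGISTRY` should be checked in the source.

```python
import torch

class DummyMultiAgentEnv:
    """Hypothetical environment following the interface described above."""

    def __init__(self, nagents: int = 4, horizon: int = 200):
        self.nagents = nagents  # N is fixed; variable N over time is unsupported
        self.horizon = horizon
        self.t = 0

    def reset(self) -> torch.Tensor:
        self.t = 0
        return torch.zeros(self.nagents, 2)  # observation with batch dim N

    def step(self, action: torch.Tensor):
        # `action` has a batch dim of size N (number of agents)
        assert action.shape[0] == self.nagents
        self.t += 1
        next_obs = torch.randn(self.nagents, 2)
        done = torch.full(
            (self.nagents, 1), self.t >= self.horizon, dtype=torch.bool
        )  # BoolTensor of size N x 1
        reward = torch.zeros(self.nagents, 1)  # reward tensor of size N x 1
        info = {}  # free-form, as in OpenAI Gym
        return next_obs, done, reward, info
```

If `REGISTRY` is a plain name-to-class mapping, construction with `env_params` would then look like `REGISTRY["dummy"](**env_params)`.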
## Logging / Checkpointing
- `log_dir`: Path to the directory for storing the logs and checkpoints.
- `wandb_id`: ID with which to log to wandb. If the ID is the same as that of an earlier run, the logs are appended to that run.
- `load_path`: Checkpoint from which to load a previously trained model.
- `save_freq`: The frequency with which to checkpoint models.
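As a hedged illustration, these options might appear together in a config like the following. The values are made up, and the actual config format used by the training script may differ:

```python
# Illustrative values only; paths and the save_freq unit are assumptions.
logging_config = {
    "log_dir": "logs/ppo_distributed_centralized_critic",
    "wandb_id": "run-01",  # reuse a previous id to append to that run
    "load_path": None,     # or a path to a checkpoint to resume from
    "save_freq": 10,       # checkpoint every 10 epochs (unit assumed)
}
```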
## Configurable Hyperparameters
- `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those.
- `seed`: Random seed.
- `steps_per_epoch`: Number of observation-action pairs to collect before training the model. This is split equally across all the processes.
- `epochs`: Total number of training epochs.
- `gamma`: Discount factor.
- `clip_ratio`: Clip ratio in the PPO algorithm.
- `pi_lr`: Learning rate for the actor.
- `vf_lr`: Learning rate for the critic.
- `train_iters`: If `> 1`, the training data is split into mini-batches for training, where `batch_size = steps_per_epoch / (train_iters * num_of_processes)` (see the sketch after this list for a worked example).
- `entropy_coeff`: Coefficient for entropy regularization.
- `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- `target_kl`: Roughly what KL divergence we think is appropriate between new and old policies after an update. This is used for early stopping. (Usually small, 0.01 or 0.05.)
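To make the roles of these hyperparameters concrete, here is a minimal sketch of GAE-Lambda advantage estimation and a clipped PPO update with entropy regularization and `target_kl` early stopping. This is illustrative code against a placeholder `policy`/`optimizer` API, not the repository's actual implementation:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """GAE-Lambda over one trajectory; `values` has length T + 1 (bootstrap)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last  # lam in [0, 1], usually close to 1
        adv[t] = last
    return adv

def ppo_policy_update(policy, optimizer, data, clip_ratio=0.2,
                      entropy_coeff=0.01, train_iters=4, target_kl=0.01):
    obs, act, adv, logp_old = data["obs"], data["act"], data["adv"], data["logp"]

    # With train_iters > 1, each process splits its share of the data into
    # mini-batches: batch_size = steps_per_epoch / (train_iters * num_of_processes).
    # E.g. steps_per_epoch = 16000 with 4 processes and train_iters = 4 gives
    # 4000 local steps per process and a batch_size of 1000.
    batch_size = obs.shape[0] // train_iters

    for i in range(train_iters):
        idx = slice(i * batch_size, (i + 1) * batch_size)
        dist = policy(obs[idx])                 # action distribution
        logp = dist.log_prob(act[idx])
        ratio = torch.exp(logp - logp_old[idx])

        # Clipped surrogate objective: clip_ratio bounds the policy step
        clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv[idx]
        loss = -torch.min(ratio * adv[idx], clipped).mean()
        loss = loss - entropy_coeff * dist.entropy().mean()  # entropy bonus

        # Early stopping: stop updating once approximate KL exceeds target_kl
        if (logp_old[idx] - logp).mean().item() > 1.5 * target_kl:
            break

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```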