PPO Distributed Centralized Critic
Environment Requirements
- `env`: Must be registered in `sdriving.environments.REGISTRY`. The environment's `step` function takes an action whose batch dimension is of size `N` (the number of agents). Variable `N` over time is currently not supported, but it is simple to implement (so open an issue if needed). `step` must return the observation for the next timestep, a `BoolTensor` of size `N x 1` specifying whether simulation for each agent is completed, a reward tensor of size `N x 1`, and `info`, similar to OpenAI Gym environments (see the interface sketch after this list).
- `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
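As a rough illustration of that contract, here is a minimal sketch of an environment exposing the expected `step` signature. The class name `DummyMultiAgentEnv`, the observation dimension, and the reward logic are made up for illustration; only the return types and shapes follow the description above.

```python
import torch


class DummyMultiAgentEnv:
    """Hypothetical environment illustrating the expected `step` contract."""

    def __init__(self, nagents: int = 4, obs_dim: int = 8):
        self.nagents = nagents  # N: number of agents, fixed over the episode
        self.obs_dim = obs_dim

    def reset(self) -> torch.Tensor:
        # Observation with batch dimension N (one row per agent)
        return torch.zeros(self.nagents, self.obs_dim)

    def step(self, action: torch.Tensor):
        # `action` has a batch dimension of size N (number of agents)
        assert action.size(0) == self.nagents

        next_obs = torch.randn(self.nagents, self.obs_dim)     # observation for the next timestep
        done = torch.zeros(self.nagents, 1, dtype=torch.bool)  # N x 1 BoolTensor: per-agent completion
        reward = torch.zeros(self.nagents, 1)                  # N x 1 reward tensor
        info = {}                                              # extra diagnostics, as in OpenAI Gym
        return next_obs, done, reward, info
```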
Logging / Checkpointing
- `log_dir`: Path to the directory for storing logs and checkpoints (a hypothetical example of these settings follows this list).
- `wandb_id`: ID with which to log to wandb. If the ID is the same as an earlier run's, the logs are appended to that run.
- `load_path`: Checkpoint from which to load a previously trained model.
- `save_freq`: The frequency with which to checkpoint models.
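For orientation, the snippet below groups these options into a plain dictionary. The values are placeholders rather than defaults from the repository, and the assumption that `save_freq` is measured in epochs is mine.

```python
# Hypothetical values -- not defaults shipped with the repository.
logging_config = dict(
    log_dir="/tmp/ppo_runs/exp1",  # logs and checkpoints are written here
    wandb_id="exp1-run0",          # reusing an existing id appends to that wandb run
    load_path=None,                # or a path to a previously saved checkpoint
    save_freq=5,                   # checkpointing frequency (assumed here to be in epochs)
)
```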
Configurable HyperParameters
- `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those.
- `seed`: Random seed.
- `steps_per_epoch`: Number of observation-action pairs to collect before training the model. This is split equally across all the processes.
- `epochs`: Total number of epochs.
- `gamma`: Discount factor.
- `clip_ratio`: Clip ratio in the PPO algorithm (see the sketch after this list).
- `pi_lr`: Learning rate for the actor.
- `vf_lr`: Learning rate for the critic.
- `train_iters`: If `> 1`, the training data is split into mini-batches for training, with batch_size = `steps_per_epoch / (train_iters * num_of_processes)`.
- `entropy_coeff`: Coefficient for entropy regularization.
- `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- `target_kl`: Roughly what KL divergence we think is appropriate between new and old policies after an update; used for early stopping. (Usually small, 0.01 or 0.05.)
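To make the roles of `gamma`, `lam`, `clip_ratio`, `entropy_coeff`, and `target_kl` concrete, here is a generic, textbook-style sketch of how these hyperparameters usually enter a PPO update. This is not the code used by this repository; all function names are illustrative.

```python
import torch


def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation for one trajectory (illustrative).

    `rewards` has shape (T,); `values` has shape (T + 1,), with a bootstrap value at the end.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def ppo_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2, entropy=None, entropy_coeff=0.01):
    """Clipped PPO surrogate loss with an optional entropy bonus (illustrative)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss = -torch.min(ratio * adv, clipped).mean()
    if entropy is not None:
        loss = loss - entropy_coeff * entropy.mean()
    return loss


def should_stop_early(logp_new, logp_old, target_kl=0.01):
    """Approximate KL between old and new policy; signals early stopping of the update iterations."""
    approx_kl = (logp_old - logp_new).mean().item()
    return approx_kl > 1.5 * target_kl
```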