PPO Distributed Centralized Critic
Environment Requirements
- `env`: Must be registered in `sdriving.environments.REGISTRY`. The environment's `step` function takes an action whose batch dimension is of size `N` (the number of agents). Variable `N` over time is currently not supported, but it is simple to implement (so open an issue if needed). `step` must return the observation for the next timestep, a `BoolTensor` of size `N x 1` specifying whether simulation for each agent is completed, a reward tensor of size `N x 1`, and `info`, similar to OpenAI Gym environments (see the interface sketch after this list).
- `env_params`: These are passed to `env` as `env(**env_params)`, so ensure compatibility with the corresponding environment.
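As a rough illustration of that contract, here is a minimal sketch of an environment exposing the expected `step` signature. The class name `DummyMultiAgentEnv`, the observation dimension, and the reward logic are made up for illustration; only the return types and shapes follow the description above.

```python
import torch


class DummyMultiAgentEnv:
    """Hypothetical environment illustrating the expected `step` contract."""

    def __init__(self, nagents: int = 4, obs_dim: int = 8):
        self.nagents = nagents  # N: number of agents, fixed over the episode
        self.obs_dim = obs_dim

    def reset(self) -> torch.Tensor:
        # Observation with batch dimension N (one row per agent)
        return torch.zeros(self.nagents, self.obs_dim)

    def step(self, action: torch.Tensor):
        # `action` has a batch dimension of size N (number of agents)
        assert action.size(0) == self.nagents

        next_obs = torch.randn(self.nagents, self.obs_dim)     # observation for the next timestep
        done = torch.zeros(self.nagents, 1, dtype=torch.bool)  # N x 1 BoolTensor: per-agent completion
        reward = torch.zeros(self.nagents, 1)                  # N x 1 reward tensor
        info = {}                                              # extra diagnostics, as in OpenAI Gym
        return next_obs, done, reward, info
```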
Logging / Checkpointing
- `log_dir`: Path to the directory for storing logs and checkpoints (a hypothetical example of these settings follows this list).
- `wandb_id`: ID with which to log to wandb. If the ID is the same as an earlier run's, the logs are appended to that run.
- `load_path`: Checkpoint from which to load a previously trained model.
- `save_freq`: The frequency with which to checkpoint models.
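For orientation, the snippet below groups these options into a plain dictionary. The values are placeholders rather than defaults from the repository, and the assumption that `save_freq` is measured in epochs is mine.

```python
# Hypothetical values -- not defaults shipped with the repository.
logging_config = dict(
    log_dir="/tmp/ppo_runs/exp1",  # logs and checkpoints are written here
    wandb_id="exp1-run0",          # reusing an existing id appends to that wandb run
    load_path=None,                # or a path to a previously saved checkpoint
    save_freq=5,                   # checkpointing frequency (assumed here to be in epochs)
)
```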
Configurable HyperParameters
- `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so there is no need to pass those.
- `seed`: Random seed.
- `steps_per_epoch`: Number of observation-action pairs to collect before training the model. This is split equally across all the processes.
- `epochs`: Total number of epochs.
- `gamma`: Discount factor.
- `clip_ratio`: Clip ratio in the PPO algorithm (see the sketch after this list).
- `pi_lr`: Learning rate for the actor.
- `vf_lr`: Learning rate for the critic.
- `train_iters`: If `> 1`, the training data is split into mini-batches for training, with batch_size = `steps_per_epoch / (train_iters * num_of_processes)`.
- `entropy_coeff`: Coefficient for entropy regularization.
- `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- `target_kl`: Roughly what KL divergence we think is appropriate between new and old policies after an update; used for early stopping. (Usually small, 0.01 or 0.05.)
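To make the roles of `gamma`, `lam`, `clip_ratio`, `entropy_coeff`, and `target_kl` concrete, here is a generic, textbook-style sketch of how these hyperparameters usually enter a PPO update. This is not the code used by this repository; all function names are illustrative.

```python
import torch


def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation for one trajectory (illustrative).

    `rewards` has shape (T,); `values` has shape (T + 1,), with a bootstrap value at the end.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def ppo_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2, entropy=None, entropy_coeff=0.01):
    """Clipped PPO surrogate loss with an optional entropy bonus (illustrative)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss = -torch.min(ratio * adv, clipped).mean()
    if entropy is not None:
        loss = loss - entropy_coeff * entropy.mean()
    return loss


def should_stop_early(logp_new, logp_old, target_kl=0.01):
    """Approximate KL between old and new policy; signals early stopping of the update iterations."""
    approx_kl = (logp_old - logp_new).mean().item()
    return approx_kl > 1.5 * target_kl
```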