PPO Alternating Optimization

Environment Requirements

  • env: Must be registered in sdriving.environments.REGISTRY. Its step function takes two arguments, the first of which represents the stage. When stage = 0, step performs the single-step RL action; when stage = 1, the returned value should be the observation of the controller. Any further call to step follows the same API as PPO Distributed Centralized Critic (see the sketch after this list).

  • env_params: These are passed to the environment as env(**env_params), so ensure compatibility with the corresponding environment.
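
A minimal sketch of how the two-stage step API described above might be driven. The environment name, env_params, and the action placeholders are hypothetical, and the exact return signature of each stage should be verified against the concrete environment class in sdriving.environments.

```python
from sdriving.environments import REGISTRY

env_params = {}  # hypothetical; must be compatible with the chosen environment
env = REGISTRY["SomeRegisteredEnvironment"](**env_params)  # placeholder name

obs = env.reset()

# stage = 0: the second argument is the single-step RL action.
spline_action = ...  # placeholder, e.g. sampled from the spline subpolicy
env.step(0, spline_action)

# stage = 1: the returned value should be the observation of the controller.
controller_action = ...  # placeholder, e.g. sampled from the controller policy
controller_obs = env.step(1, controller_action)

# Any further call to step follows the same API as PPO Distributed
# Centralized Critic (here assumed to return observation, reward, done, info).
controller_obs, reward, done, info = env.step(1, controller_action)
```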

Logging / Checkpointing

  • log_dir: Path to the directory for storing the logs and checkpoints

  • wandb_id: ID with which to log to wandb. If the ID is the same as a previous run's, the logs are appended to that run

  • load_path: Checkpoint from which to load a previously trained model

  • save_freq: The frequency with which to checkpoint models
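
A hypothetical sketch of the logging / checkpointing options above, written as a plain Python mapping; the key names come from this page, while the values and the surrounding config format are assumptions.

```python
logging_config = {
    "log_dir": "logs/ppo_alt_opt",   # logs and checkpoints are written here
    "wandb_id": "ppo-alt-opt-run1",  # reusing an earlier id appends to that run's logs
    "load_path": None,               # or a checkpoint of a previously trained model to load
    "save_freq": 10,                 # how often models are checkpointed (illustrative value)
}
```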

Configurable HyperParameters

  • ac_kwargs: Arguments passed to PPOLidarActorCritic. observation_space and action_space are automatically passed from the env, so no need to pass those

  • actor_kwargs: Arguments passed to PPOWaypointCategoricalActor/PPOWaypointGaussianActor. observation_space and action_space are automatically passed from the env, so no need to pass those

  • seed: Random Seed

  • number_steps_per_controller_update: Number of observation-action pairs to collect before each controller update. This is split equally across all the processes

  • number_episodes_per_spline_update: Number of episodes for rollout generation before updating the spline subpolicy

  • epochs: Total number of training epochs

  • gamma: Discount Factor

  • clip_ratio: Clip Ratio in the PPO Algorithm

  • pi_lr: Learning Rate for the Actor

  • vf_lr: Learning Rate for the Critic

  • spline_lr: Learning Rate for training the Spline SubPolicy

  • train_iters: If > 1, the training data is split into mini-batches for training, where batch_size = steps_per_epoch / (train_iters * num_of_processes)

  • entropy_coeff: Coefficient for entropy regularization

  • lam: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)

  • target_kl: Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
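
A generic PPO-update sketch (not the repository's exact code) showing where several of the hyperparameters above enter: clip_ratio in the clipped surrogate objective, entropy_coeff as an entropy bonus, and target_kl for early stopping. The function and tensor names, the assumption that pi(obs) returns a torch distribution, and the 1.5 * target_kl threshold are illustrative choices, not taken from this page.

```python
import torch

def ppo_policy_loss(pi, obs, act, adv, logp_old, clip_ratio, entropy_coeff):
    # pi(obs) is assumed to return a torch.distributions.Distribution.
    dist = pi(obs)
    logp = dist.log_prob(act)
    ratio = torch.exp(logp - logp_old)
    clip_adv = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss = -torch.min(ratio * adv, clip_adv).mean()
    loss = loss - entropy_coeff * dist.entropy().mean()
    # Approximate KL between old and new policies, used for early stopping.
    approx_kl = (logp_old - logp).mean().item()
    return loss, approx_kl

# Inside the update loop, the KL-based early stopping might roughly look like:
#
#     for i in range(train_iters):
#         loss, kl = ppo_policy_loss(pi, obs, act, adv, logp_old,
#                                    clip_ratio=0.2, entropy_coeff=0.01)
#         if kl > 1.5 * target_kl:
#             break  # stop updating the policy for this round of training
#         pi_optimizer.zero_grad()
#         loss.backward()
#         pi_optimizer.step()
```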
