PPO One Step
Environment Requirements
- env: Must be registered in sdriving.environments.REGISTRY. The step function must return the reward tensor of size N x 1. By design, the algorithm assumes a horizon of 1 and will most likely not converge for any other horizon.
- env_params: Passed to env as env(**env_params), so make sure they are compatible with the corresponding environment.
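The requirements above can be illustrated with a toy environment. This is only a sketch: the REGISTRY dictionary below is a local stand-in, not the real sdriving.environments.REGISTRY, and ToyOneStepEnv, nagents, target, and the Gym-style return tuple of step are all hypothetical names and conventions chosen for illustration.

```python
# Hypothetical stand-in for a name -> class registry (NOT the real sdriving one).
REGISTRY = {}


class ToyOneStepEnv:
    """Illustrative one-step environment with N parallel agents."""

    def __init__(self, nagents=4, target=0.5):
        self.nagents = nagents
        self.target = target

    def reset(self):
        # One observation row per agent.
        return [[0.0] for _ in range(self.nagents)]

    def step(self, actions):
        # Reward has shape N x 1: one single-element row per agent.
        rewards = [[-abs(a - self.target)] for a in actions]
        done = True  # horizon of 1: the episode ends after a single step
        # Gym-style tuple, assumed here purely for illustration.
        return self.reset(), rewards, done, {}


REGISTRY["ToyOneStepEnv"] = ToyOneStepEnv

# env_params is unpacked directly into the constructor, so its keys
# must match the constructor's keyword arguments.
env_params = {"nagents": 3, "target": 0.5}
env = REGISTRY["ToyOneStepEnv"](**env_params)

obs = env.reset()
_, rewards, done, _ = env.step([0.5, 0.0, 1.0])
```

A key in env_params that the constructor does not accept raises a TypeError at construction time, which is usually the first symptom of a mismatched configuration.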
Logging / Checkpointing
- log_dir: Path to the directory where logs and checkpoints are stored
- wandb_id: ID under which to log to wandb. If it matches the ID of an earlier run, the new logs are appended to that run
- load_path: Checkpoint from which to load a previously trained model
- save_freq: The frequency with which to checkpoint models
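One common way to interpret a checkpoint frequency is "every save_freq epochs, plus the final epoch". The helper below is a minimal sketch under that assumption. should_checkpoint is a hypothetical name, and the library's actual checkpoint schedule may differ.

```python
def should_checkpoint(epoch, epochs, save_freq):
    """Return True if a checkpoint would be written at this epoch.

    Assumes epochs are numbered from 0 and that the last epoch is
    always saved; both are illustrative assumptions.
    """
    return epoch % save_freq == 0 or epoch == epochs - 1


# Epochs at which a 10-epoch run with save_freq=4 would checkpoint.
saved_epochs = [e for e in range(10) if should_checkpoint(e, 10, 4)]
```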
Configurable HyperParameters
- actor_kwargs: Arguments passed to PPOWaypointCategoricalActor / PPOWaypointGaussianActor. observation_space and action_space are taken from the env automatically, so there is no need to pass them
- seed: Random Seed
- steps_per_epoch: Number of observation-action pairs collected in each epoch before the model is trained. This is split equally across all processes
- epochs: Total epochs
- gamma: Discount Factor
- clip_ratio: Clip Ratio in the PPO Algorithm
- pi_lr: Learning Rate for the Actor
- train_pi_iters: Maximum number of training passes made over the same batch of collected data in each epoch
- entropy_coeff: Coefficient for entropy regularization
- lam: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- target_kl: Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
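To show how clip_ratio, entropy_coeff, and target_kl typically interact, here is a minimal sketch of the standard PPO clipped objective with an entropy bonus and a KL-based early-stopping check. It uses plain Python lists rather than tensors, and it is not the library's implementation. The function names and the 1.5 multiplier on target_kl (a convention borrowed from OpenAI Spinning Up) are assumptions.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, entropies,
                  clip_ratio=0.2, entropy_coeff=0.01):
    """Return (loss, approx_kl) for one batch of samples."""
    n = len(logp_new)
    surrogate = 0.0
    for lp, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Pessimistic bound: take the smaller of the two objectives.
        surrogate += min(ratio * adv, clipped * adv)
    surrogate /= n
    mean_entropy = sum(entropies) / n
    # Minimizing the loss maximizes the surrogate plus the entropy bonus.
    loss = -(surrogate + entropy_coeff * mean_entropy)
    # Sample-based estimate of KL(old || new).
    approx_kl = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n
    return loss, approx_kl


def should_stop_early(approx_kl, target_kl, kl_factor=1.5):
    """Stop the train_pi_iters loop once the policy has moved too far."""
    return approx_kl > kl_factor * target_kl


# Two samples whose action probabilities moved from 0.5 to 0.6 and 0.4.
logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.6), math.log(0.4)]
advantages = [1.0, -1.0]
entropies = [0.69, 0.69]

loss, kl = ppo_clip_loss(logp_new, logp_old, advantages, entropies)
stop_strict = should_stop_early(kl, target_kl=0.01)
stop_loose = should_stop_early(kl, target_kl=0.05)
```

In this example the estimated KL is about 0.02, so the update loop would stop early with target_kl=0.01 but continue with target_kl=0.05. A smaller target_kl therefore trades fewer policy updates per epoch for more conservative steps.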