# PPO Alternating Optimization

## Environment Requirements

* `env`: Must be registered in `sdriving.environments.REGISTRY`. The `step` function takes two arguments, the first of which is the `stage`. When `stage` is 0, the call performs the single-step RL action; when `stage` is 1, the returned value is the observation for the controller. Any further call to `step` follows the same API as `PPO Distributed Centralized Critic` (a rollout sketch follows this list)
* `env_params`: These are passed to `env` as `env(**env_params)`. So, ensure compatibility with the corresponding environment.
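
For concreteness, here is a minimal rollout sketch assuming a gym-style interface. The environment name, its parameters, and the sampled actions are hypothetical placeholders; the exact stage semantics are defined by each registered environment:

```python
from sdriving.environments import REGISTRY

# "MyRegisteredEnv", its parameters, and the actions below are
# hypothetical placeholders; real names depend on the environment.
env = REGISTRY["MyRegisteredEnv"](nagents=4)
obs = env.reset()

spline_action = env.action_space.sample()      # placeholder spline action
controller_action = env.action_space.sample()  # placeholder control action

# Stage 0: perform the single-step RL action (the spline sub-policy).
env.step(0, spline_action)

# Stage 1: the returned value is the observation for the controller.
controller_obs = env.step(1, controller_action)

# Any further call to `step` follows the same API as
# PPO Distributed Centralized Critic.
next_obs, reward, done, info = env.step(1, controller_action)
```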

## Logging / Checkpointing

* `log_dir`: Path to the directory for storing logs and checkpoints
* `wandb_id`: ID with which to log to wandb. If it matches a previously used ID, the new logs are appended to that run
* `load_path`: Checkpoint from which to load a previously trained model
* `save_freq`: The frequency with which to checkpoint models (see the illustrative snippet after this list)
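
For illustration, one hypothetical way these four options might be collected in a config; the values are placeholders, not documented defaults:

```python
# Illustrative values only; none of these are defaults of the trainer.
logging_config = {
    "log_dir": "./logs/ppo_altopt",  # logs and checkpoints are stored here
    "wandb_id": "ppo-altopt-run-1",  # reusing an ID appends to that run
    "load_path": None,               # or a path to a checkpoint to resume
    "save_freq": 10,                 # checkpoint interval (placeholder)
}
```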

## Configurable HyperParameters

* `ac_kwargs`: Arguments passed to `PPOLidarActorCritic`. `observation_space` and `action_space` are automatically passed from the `env`, so they do not need to be specified here
* `actor_kwargs`: Arguments passed to `PPOWaypointCategoricalActor`/`PPOWaypointGaussianActor`. `observation_space` and `action_space` are automatically passed from the `env`, so they do not need to be specified here
* `seed`: Random Seed
* `number_steps_per_controller_update`: Number of observation–action pairs to collect before each controller update. This count is split equally across all the processes
* `number_episodes_per_spline_update`: Number of episodes for rollout generation before updating the spline subpolicy
* `epochs`: Total epochs
* `gamma`: Discount Factor
* `clip_ratio`: Clip Ratio in the PPO Algorithm
* `pi_lr`: Learning Rate for the Actor
* `vf_lr`: Learning Rate for the Critic
* `spline_lr`: Learning Rate for training the Spline SubPolicy
* `train_iters`: If `> 1`, the training data is split into mini-batches, with `batch_size = steps_per_epoch / (train_iters * num_of_processes)`
* `entropy_coeff`: Coefficient for entropy regularization
* `lam`: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
* `target_kl`: Roughly the KL divergence we consider appropriate between the new and old policies after an update; used for early stopping. (Usually small, 0.01 or 0.05.) A consolidated config example and a sketch of the clipped update follow this list
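
To tie the list above together, here is a hypothetical hyperparameter dictionary. The keys come from this page, while the values are common PPO choices rather than documented defaults:

```python
# Values are illustrative; only the keys are taken from the documentation.
hyperparams = {
    "seed": 0,
    "number_steps_per_controller_update": 4000,  # split across processes
    "number_episodes_per_spline_update": 20,
    "epochs": 100,
    "gamma": 0.99,          # discount factor
    "clip_ratio": 0.2,      # PPO clipping parameter
    "pi_lr": 3e-4,          # actor learning rate
    "vf_lr": 1e-3,          # critic learning rate
    "spline_lr": 3e-4,      # spline sub-policy learning rate
    "train_iters": 4,       # batch_size = steps_per_epoch / (4 * n_procs)
    "entropy_coeff": 0.01,  # entropy regularization weight
    "lam": 0.97,            # GAE-lambda
    "target_kl": 0.01,      # early-stopping KL threshold
}
```

And a minimal sketch of how `clip_ratio`, `entropy_coeff`, and `target_kl` typically interact in a clipped PPO policy update. This is generic PPO, assuming a policy that returns a `torch.distributions` object with per-sample `log_prob`, not the exact sdriving implementation:

```python
import torch

def ppo_policy_loss(pi, obs, act, adv, logp_old, clip_ratio, entropy_coeff):
    dist = pi(obs)                      # action distribution of new policy
    logp = dist.log_prob(act)
    ratio = torch.exp(logp - logp_old)  # pi_new(a|s) / pi_old(a|s)

    # Clipped surrogate objective: take the pessimistic (min) estimate.
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss = -torch.min(ratio * adv, clipped).mean()

    # Entropy regularization, weighted by entropy_coeff.
    loss = loss - entropy_coeff * dist.entropy().mean()

    # Sample-based approximation of the KL between old and new policies.
    approx_kl = (logp_old - logp).mean().item()
    return loss, approx_kl

# Typical early-stopping loop (pseudocode):
#   for i in range(train_iters):
#       loss, kl = ppo_policy_loss(...)
#       if kl > 1.5 * target_kl:
#           break  # stop updating the policy this epoch
#       optimizer.zero_grad(); loss.backward(); optimizer.step()
```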

