The training has been tested with `horovodrun`. To run with only a single process, launch it with `horovodrun -np 1`.
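For example (assuming the training entry point is a script such as `train.py`; substitute your own launch script):

```bash
# Single-process run (no data-parallel workers beyond one):
horovodrun -np 1 python train.py

# Typical multi-process run, e.g. 4 workers on the local machine:
horovodrun -np 4 python train.py
```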
The training module is expected to work seamlessly in a distributed setup, even under preemption. After a preemption, the latest checkpoint is loaded from the save directory. For this to work as expected, save checkpoints in a directory that is unique to a particular SLURM job ID (e.g. `<base path>/$SLURM_JOB_ID`).
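As a rough sketch of the resume-from-checkpoint pattern this relies on (the directory layout, helper names, and PyTorch-style `torch.save`/`torch.load` calls below are illustrative assumptions, not the module's actual API):

```python
import glob
import os

import torch  # assumes a PyTorch-based training module

# Checkpoints live in a directory unique to this SLURM job, so a requeued
# job after preemption finds only its own files.
BASE_PATH = "/path/to/checkpoints"  # placeholder: replace with your base path
save_dir = os.path.join(BASE_PATH, os.environ.get("SLURM_JOB_ID", "local"))
os.makedirs(save_dir, exist_ok=True)


def save_checkpoint(model, optimizer, epoch):
    """Write an epoch-numbered checkpoint into the job-specific directory."""
    path = os.path.join(save_dir, f"checkpoint_{epoch:06d}.pt")
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_latest_checkpoint(model, optimizer):
    """Resume from the newest checkpoint in save_dir, if one exists.

    Returns the epoch to start from (0 when training from scratch).
    """
    checkpoints = sorted(glob.glob(os.path.join(save_dir, "checkpoint_*.pt")))
    if not checkpoints:
        return 0  # no checkpoint yet: fresh start
    state = torch.load(checkpoints[-1], map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1  # continue with the next epoch
```

Because the zero-padded epoch number sorts lexicographically, `sorted(...)[-1]` always picks the most recent checkpoint; a requeued job with the same `$SLURM_JOB_ID` therefore resumes exactly where the preempted run left off.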