Training with 2000 synthetic failures every ~15 seconds and no checkpoints on Crusoe L40S – PyTorch
Collaborators: Less Wright, Howard Huang, Chien-Chin Huang; Crusoe: Martin Cala, Ethan Petersen
tl;dr: we used torchft and torchtitan to train a model in a real-world environment with extreme synthetic failure rates to prove the reliability and correctness of fault-tolerant training.
Training loss across 1200 failures with no checkpoints.
NOTE: Each small spike is a non-participating worker recovering, which affects the metrics but not the model.
Introduction
We want to demonstrate torchft in worst cas...
Read more at pytorch.org