PyTorch's torchft and torchtitan train a 1B-parameter Llama model through 2000 synthetic failures, injected roughly every 15 seconds, on 300 Crusoe L40S GPUs, demonstrating the reliability of fault-tolerant training

training with 2000 synthetic failures every ~15 seconds and no checkpoints on Crusoe L40S – PyTorch

Collaborators: Less Wright, Howard Huang, Chien-Chin Huang; Crusoe: Martin Cala, Ethan Petersen

tl;dr: we used torchft and torchtitan to train a model in a real-world environment with extreme synthetic failure rates to prove reliability and correctness of fault tolerant training

Training loss across 1200 failures with no checkpoints. NOTE: Each small spike is a non-participating worker recovering which affects the metrics but not the model

Introduction

We want to demonstrate torchft in worst cas...