Keeping 20,000 GPUs healthy
Engineering • December 28, 2025 • 8 minute read

Modal runs a globally distributed, autoscaling GPU worker pool by sourcing compute from all cloud giants: AWS, GCP, Azure, OCI. We’ve scaled the worker pool to well over 20,000 concurrent GPUs, and launched over four million cloud instances in the last couple years. At this scale, you see almost every GPU reliability problem there is. Today, we’re sharing our GPU reliability system as both a demonstration of our commitment to Modal customers and as...
Read more at modal.com