NVIDIA DGX Spark + Apple Mac Studio Combo Achieves 4x Faster LLM Inference; EXO 1.0 Optimizes Prefill and Decode Phases

Combining NVIDIA DGX Spark + Apple Mac Studio for 4x Faster LLM Inference with EXO 1.0

We recently received early access to two NVIDIA DGX Spark™ units. NVIDIA calls it the world's smallest AI supercomputer. Each unit has ~100 TFLOPs of FP16 performance with 128GB of CPU-GPU coherent memory at 273 GB/s. With EXO, we've been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips. The Mac Studio has 512GB of unified memory at 819 GB/s, but its GPU only has ~26 TFLOPs of FP16 performance. The DGX Spark has roughly 4x the compute; the Mac Studio has 3x the memory bandwidth. What if we combi...
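
The rounded compute and bandwidth multiples follow directly from the figures quoted above. Here is a quick check, as a minimal sketch that uses only those approximate numbers; the dictionary and variable names are illustrative and not part of EXO:

```python
# Approximate figures quoted in the paragraph above (not measured here).
dgx_spark = {"fp16_tflops": 100.0, "mem_bw_gb_per_s": 273.0}
mac_studio = {"fp16_tflops": 26.0, "mem_bw_gb_per_s": 819.0}

# How much more FP16 compute the DGX Spark has than the Mac Studio GPU.
compute_ratio = dgx_spark["fp16_tflops"] / mac_studio["fp16_tflops"]

# How much more memory bandwidth the Mac Studio has than the DGX Spark.
bandwidth_ratio = mac_studio["mem_bw_gb_per_s"] / dgx_spark["mem_bw_gb_per_s"]

print(f"DGX Spark compute advantage: {compute_ratio:.2f}x")
print(f"Mac Studio bandwidth advantage: {bandwidth_ratio:.2f}x")
```

With these inputs the compute ratio comes out to about 3.85x, which the text rounds to 4x, and the bandwidth ratio is exactly 3x.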