FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Abstract: Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, s...
Read more at arxiv.org