# NTransformer
High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
## Key Results
| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB ... |