# NTransformer
High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
## Key Results
| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB ... |