News Score: Score the News, Sort the News, Rewrite the Headlines

GitHub - xaskasdf/ntransformer: High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.

NTransformer: a high-efficiency C++/CUDA LLM inference engine. It runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory over PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.

Key results:

| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB ... |

Read more at github.com
