⚡ Gemma‑4‑31B‑IT NVFP4 Turbo: 68% Smaller, 2.5× Faster Text Generation
LilaRest/gemma-4-31B-it-NVFP4-turbo ↗
LilaRest’s *Gemma 4 31B IT NVFP4 Turbo* is a repackaged, quantized version of Google DeepMind’s Gemma‑4 31B‑IT model. Built on the NVIDIA NVFP4 checkpoint, it quantizes the self‑attention weights to FP4 using round‑to‑nearest (RTN) quantization with a group size of 16, while keeping the embeddings in BF16 and the MLP layers in calibrated FP4. The result occupies only 18.5 GiB of GPU memory (about 68% less than the original 58.9 GiB) and runs roughly 2.5× faster on GPUs with Blackwell FP4 tensor cores (e.g., RTX 5090, RTX PRO 6000, B200).
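RTN quantization at group size 16 is simple enough to sketch: each group of 16 weights shares one scale that maps the group's largest magnitude onto FP4's maximum representable value (6.0 in the E2M1 format), and every weight is then snapped to the nearest representable FP4 value. The following NumPy sketch illustrates the idea only; it is not NVIDIA's ModelOpt implementation, and the exact scaling scheme used by the checkpoint is an assumption:

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_fp4_quantize(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Round-to-nearest FP4 quantize-dequantize with per-group scales."""
    w = weights.reshape(-1, group_size)
    # One scale per group maps the group's max magnitude onto FP4's max (6.0).
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    normed = w / scales
    # Snap each normalized value to the nearest representable FP4 value.
    idx = np.abs(normed[..., None] - np.sign(normed)[..., None] * FP4_GRID).argmin(-1)
    q = np.sign(normed) * FP4_GRID[idx]
    return (q * scales).reshape(weights.shape)

w = np.random.randn(4, 16).astype(np.float32)
w_q = rtn_fp4_quantize(w)
```

A real kernel would store the 4-bit codes and scales separately; the quantize-dequantize round trip above is just the standard way to measure the error such a scheme introduces.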
The model is text‑only (the vision and audio components have been stripped) and targets the **text‑generation** pipeline of the 🤗 Transformers library (≥ 5.5.0) together with **vLLM** (≥ 0.19) compiled for CUDA 13.0. Benchmarks on a 1K‑token input / 200‑token output workload show prefill speeds of 15,359 tok/s and per‑stream decode speeds of 51 tok/s, delivering up to 6.22 requests per second on an RTX PRO 6000. Accuracy on GPQA‑Diamond and MMLU‑Pro drops only 2–3% relative to the base model, indicating that the aggressive quantization preserves near‑identical quality.
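These throughput numbers can be sanity‑checked with simple arithmetic: a single un‑batched request takes roughly (prompt tokens ÷ prefill speed) plus (output tokens ÷ decode speed) seconds, and the gap between that latency and the 6.22 req/s figure reveals how much concurrent batching the benchmark sustains. The concurrency estimate below is an inference from the quoted numbers, not a figure from the model card:

```python
# Back-of-the-envelope latency for the benchmarked 1K-in / 200-out workload.
PREFILL_TOK_S = 15_359   # prompt-processing throughput
DECODE_TOK_S = 51        # per-stream generation speed
PROMPT_TOKENS = 1_000
OUTPUT_TOKENS = 200

prefill_s = PROMPT_TOKENS / PREFILL_TOK_S   # ≈ 0.07 s
decode_s = OUTPUT_TOKENS / DECODE_TOK_S     # ≈ 3.92 s
latency_s = prefill_s + decode_s            # ≈ 3.99 s per un-batched request

# Reaching 6.22 req/s at ~4 s per request implies ~25 requests in flight,
# i.e. the benchmark throughput comes from vLLM's continuous batching.
implied_concurrency = 6.22 * latency_s
```

Note that prefill is effectively free at this scale; single-stream latency is dominated by the decode phase.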
The README provides detailed deployment instructions, recommending the `vllm/vllm-openai:cu130-nightly` Docker image for out‑of‑the‑box Blackwell support. Key runtime flags include `--quantization modelopt` to enable NVIDIA’s optimized CUTLASS kernels, `--kv-cache-dtype fp8` to halve KV‑cache memory, and a configurable `--max-model-len` of up to 16k tokens (and up to ~25k on an RTX 5090). The model is licensed under Apache 2.0, matching the original Gemma 4 license, and credits both Google DeepMind and NVIDIA for the base model and the quantization work.
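Assembling the recommended image and flags, a launch command might look like the sketch below. The port mapping, `--ipc=host`, and the exact way the model is passed are illustrative assumptions based on common vLLM Docker usage; consult the repository README for the authoritative command:

```shell
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 16384
```

Once up, the container exposes an OpenAI‑compatible API on the mapped port, so existing OpenAI client code can point at it without changes.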
Project Ideas
- Deploy a high‑throughput chat API on an RTX 5090 using vLLM and the Turbo model to serve thousands of concurrent conversational requests.
- Build a low‑latency code‑completion assistant that leverages the model's 51 tok/s single‑decode speed for interactive IDE integrations.
- Create a long‑context document summarizer that utilizes the 16 k–25 k token context window for summarizing extensive reports or books.
- Set up a Docker‑based inference service that can be orchestrated in Kubernetes, providing a plug‑and‑play endpoint for any text‑generation application.
- Run a prompt‑engineering benchmark comparing quality and speed on QA datasets (GPQA Diamond, MMLU‑Pro) to guide optimal prompt designs for the Turbo model.