The bottom line: DiffusionGemma denoises up to 256 tokens in parallel per step rather than generating them one at a time, and reaches 1,000 tokens/second on an NVIDIA H100 at batch size 1, with no cloud dependency.

Google DeepMind has released DiffusionGemma, an open language model that generates text by refining many tokens in parallel instead of producing them one by one. NVIDIA has optimized the implementation for RTX, RTX PRO and DGX systems, reporting up to 4x higher throughput than comparable autoregressive models in single-user operation.

DiffusionGemma is positioned as an experimental model. Unlike standard LLMs, which generate text autoregressively one token at a time, it uses a diffusion-based approach: the model refines up to 256 tokens in parallel at each step, similar to how diffusion models progressively reduce noise in image generation. This removes the usual token-by-token wait of sequential decoding.
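
The difference in step count can be illustrated with a toy sketch. It is not DiffusionGemma's actual algorithm: the "model" below is a hypothetical stand-in that already knows the target tokens, and the choice of 16 refinement steps is an assumption made purely for illustration. It only shows the general idea that a block of 256 positions can be filled in far fewer passes when many positions are committed per step.

```python
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_model(block, target, rng):
    """Hypothetical stand-in for a denoising network.

    Returns a (token, confidence) guess for every position. A real model
    would predict tokens from context; here they are read from `target`.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def diffusion_decode(target, num_steps, seed=0):
    """Fill a fully masked block in `num_steps` parallel refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    per_step = -(-len(target) // num_steps)  # ceiling division
    steps = 0
    while MASK in block:
        guesses = toy_model(block, target, rng)
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit the most confident guesses among the still-masked positions.
        open_positions.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_positions[:per_step]:
            block[i] = guesses[i][0]
        steps += 1
    return block, steps


def autoregressive_decode(target):
    """Baseline: one model pass per generated token."""
    out = []
    for tok in target:
        out.append(tok)
    return out, len(out)


target = [f"tok{i}" for i in range(256)]
block, diff_steps = diffusion_decode(target, num_steps=16)
_, ar_steps = autoregressive_decode(target)
print(diff_steps, ar_steps)  # prints "16 256"
```

In this toy setup the block is completed in 16 passes instead of 256. How many refinement steps a real diffusion language model needs depends on the model and the quality target.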

The technical foundation is the Gemma-4 architecture, a Mixture-of-Experts model with 26 billion parameters, of which 3.8 billion are active per step. NVIDIA has optimized the model for GeForce RTX GPUs, RTX PRO 6000 workstations, DGX Spark and DGX Station. At batch size 1, classical LLM decoding is memory-bound; processing 256 tokens in parallel per step shifts the workload toward compute, which is exactly what NVIDIA Tensor Cores accelerate efficiently. Reported throughput is 1,000 tokens/second on an H100 Tensor Core GPU, approximately 150 tokens/second on DGX Spark and up to 800 tokens/second on DGX Station, around 4x faster than comparable autoregressive models in single-user setups.
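
The figures above can be turned into a few back-of-the-envelope numbers. The sketch below is a rough estimate, not a benchmark: the throughput values and the roughly 4x factor come from the article, while the 2-bytes-per-parameter weight format and the simple "each weight is read once per pass" model are assumptions. That model also ignores the fact that, in a Mixture-of-Experts network, different tokens may route to different experts.

```python
TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters active per step (from the article)
BLOCK_TOKENS = 256     # tokens refined in parallel per step
SPEEDUP = 4.0          # approximate factor versus autoregressive models

# Reported tokens/second at batch size 1; the DGX Station value is an upper figure.
THROUGHPUT = {"H100": 1000.0, "DGX Spark": 150.0, "DGX Station": 800.0}


def active_fraction():
    """Share of the model's parameters that is active per step."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


def block_latency_s(tokens_per_second, tokens=BLOCK_TOKENS):
    """Seconds needed to emit `tokens` tokens at a given throughput."""
    return tokens / tokens_per_second


def implied_baseline(tokens_per_second, speedup=SPEEDUP):
    """Autoregressive throughput implied by the quoted speedup."""
    return tokens_per_second / speedup


def flops_per_weight_byte(tokens_per_pass, bytes_per_param=2.0):
    """Rough arithmetic intensity of one forward pass.

    Assumes 2 FLOPs (multiply and add) per weight per token and that each
    weight is read from memory once per pass (bytes_per_param is assumed).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


print(f"active share: {active_fraction():.1%}")
for name, tps in THROUGHPUT.items():
    print(f"{name}: {block_latency_s(tps):.2f} s per 256 tokens, "
          f"implied baseline {implied_baseline(tps):.1f} tokens/s")
print(flops_per_weight_byte(1), flops_per_weight_byte(BLOCK_TOKENS))
```

Under these assumptions, a single-token pass performs about 1 FLOP per weight byte read, while a 256-token pass performs about 256. This is the sense in which parallel denoising moves the workload from memory-bound toward compute-bound.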

DiffusionGemma is available as open weights under the Apache 2.0 license and runs entirely locally, with no cloud dependency or token-based billing. NVIDIA offers day-one support in Hugging Face Transformers, vLLM and Unsloth; llama.cpp support is to follow. The model targets low-latency scenarios: interactive chat applications, agent loops and on-device assistants that require fast response times.

Source: blogs.nvidia.com · Published June 10, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.

