NVIDIA Releases Nemotron-Labs-TwoTower for Accelerated Inference

Models & research

July 1, 2026 3 min read

Curated by Oleksandr Kuzmenko, AI Product Engineer · Updated July 1, 2026 · Sources cited on every story

AI-assisted · editor-reviewed

NVIDIA's new TwoTower model pairs an autoregressive backbone with a diffusion-based denoiser to improve throughput. NVIDIA reports 2.42x faster generation than standard autoregressive decoding while retaining 98.7% of baseline quality.

Impact: Medium

Why it matters

Developers can use this architecture to cut latency in high-throughput text generation tasks while keeping a self-reported 98.7% of the autoregressive baseline's benchmark quality.

TL;DR

01 Achieves 2.42x generation speed via parallel block-wise diffusion.
02 Maintains 98.7% quality compared to standard autoregressive models.
03 Supports hybrid inference modes for flexible deployment.
04 Requires 2x H100 GPUs for full diffusion mode.

Key facts

Throughput improvement: 2.42x (self-reported)
Quality retention: 98.7% (self-reported)
Memory requirements: ~59GB per GPU (BF16)

Architecture Details

The model is based on the Nemotron-3-Nano-30B-A3B hybrid backbone, which interleaves Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers. A second tower, the diffusion-based denoiser, refines blocks of tokens in parallel rather than emitting them one at a time, which is where the reported speedup comes from.

Performance Metrics

Speedup: 2.42x over AR baseline at γ=0.8.
Quality Retention: 98.7% of AR baseline benchmark scores.
Parameters: ~60B total parameters across both towers.
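To put the self-reported figures in concrete terms, the arithmetic can be sketched as follows. Only the 2.42x and 98.7% values come from the article; the 10-second baseline is a hypothetical number chosen for illustration.

```python
# Figures reported in the article (self-reported by NVIDIA).
SPEEDUP = 2.42             # throughput relative to the AR baseline
QUALITY_RETENTION = 0.987  # fraction of the AR baseline benchmark score


def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed number of tokens."""
    return 1.0 - 1.0 / speedup


def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Generation time after applying the speedup to a baseline time."""
    return baseline_seconds / speedup


baseline_seconds = 10.0  # hypothetical AR generation time, not a measurement
print(f"time saved: {time_saved_fraction(SPEEDUP):.1%}")
print(f"10 s job becomes: {accelerated_seconds(baseline_seconds, SPEEDUP):.2f} s")
print(f"quality drop: {1.0 - QUALITY_RETENTION:.1%}")
```

Under these numbers, a 2.42x speedup corresponds to roughly 59% less wall-clock time for the same output length, in exchange for about a 1.3% drop in benchmark score.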

Implementation Note

Full diffusion mode requires placing the two towers on separate devices. Load the weights in torch.bfloat16 to keep memory usage down; this needs ~59GB per GPU.
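As a rough sanity check on the memory figure, the weight footprint can be estimated from the parameter count. This is a back-of-envelope sketch that assumes the two towers are about the same size and counts weights only; activations, caches, and framework overhead are not included.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10**9 bytes); BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 60e9  # "~60B total parameters across both towers" (from the article)
NUM_TOWERS = 2       # one tower per GPU

total = weights_gb(TOTAL_PARAMS)
per_gpu = total / NUM_TOWERS  # assumption: towers are roughly equal in size
print(f"total weights: ~{total:.0f} GB, per GPU: ~{per_gpu:.0f} GB")
```

The estimate of about 60 GB per GPU is in line with the article's ~59GB figure and leaves limited headroom on an 80 GB H100, so long sequences or large batches may need to be sized with care.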

Try it in 2 minutes

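import torch
from transformers import AutoModelForCausalLM
# Load in BF16, per the implementation note (~59GB per GPU)
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16", torch_dtype=torch.bfloat16)
# Put each tower on its own GPU to enable full diffusion mode
model.place_towers_on_devices("cuda:0", "cuda:1")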

✓ When to use

High-throughput synthetic text generation.
Scenarios where the GPU budget allows for a 2-card setup.
When a small quality drop (about 1.3% on the reported benchmarks) is acceptable in exchange for roughly 2.4x faster generation.

What to do today

Download the model from Hugging Face.
Place the towers on separate CUDA devices using the provided API.
Benchmark your specific workload against standard AR decoding.
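For the benchmarking step, a minimal timing harness might look like the sketch below. The two `fake_*` generators are hypothetical stand-ins so the code runs without a GPU; in practice each would wrap a real generation call and return the number of new tokens it produced.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompts: list[str], warmup: int = 1) -> float:
    """Time `generate` over the prompts; `generate` returns the count of new tokens."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    total_tokens = sum(generate(prompt) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Hypothetical stand-ins: replace with calls into the AR and TwoTower decoding paths.
def fake_ar(prompt: str) -> int:
    time.sleep(0.02)
    return 64


def fake_two_tower(prompt: str) -> int:
    time.sleep(0.01)
    return 64


prompts = ["example prompt"] * 5
ar_tps = tokens_per_second(fake_ar, prompts)
tt_tps = tokens_per_second(fake_two_tower, prompts)
print(f"measured speedup: {tt_tps / ar_tps:.2f}x")
```

When timing real GPU work, call `torch.cuda.synchronize()` before reading the clock so that queued kernels are included in the measurement, and use prompts and output lengths that match your own workload instead of relying on the published 2.42x figure.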

#Nemotron-Labs-TwoTower #H100 #Mamba-2

Sources

NVIDIA Releases Nemotron-Labs-TwoTower
