NVIDIA Releases Nemotron-Labs-TwoTower for Accelerated Inference
NVIDIA's new TwoTower model pairs an autoregressive backbone with a diffusion-based denoiser to improve throughput. NVIDIA reports 2.42x faster generation than standard autoregressive decoding while retaining 98.7% of the baseline's benchmark quality.
Impact: Medium
Why it matters
Developers can use this architecture to speed up high-throughput text generation at a small quality cost (about 1.3% of baseline benchmark scores, as reported).
TL;DR
- Achieves 2.42x generation speed via parallel block-wise diffusion.
- Maintains 98.7% quality compared to standard autoregressive models.
- Supports hybrid inference modes for flexible deployment.
- Requires 2x H100 GPUs for full diffusion mode.
Key facts
- Throughput improvement: 2.42x (self-reported)
- Quality retention: 98.7% (self-reported)
- Memory requirements: ~59GB per GPU (BF16)
Architecture Details
The model is based on the Nemotron-3-Nano-30B-A3B hybrid backbone, which interleaves Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers. The denoiser tower refines blocks of tokens in parallel rather than emitting one token at a time, which is where the speedup comes from.
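To see why refining a block in parallel can reduce the number of sequential model calls, here is a toy step-count comparison. It is a sketch of the general idea behind block-wise decoding, not NVIDIA's algorithm: `block_size` and `denoise_steps` are hypothetical parameters whose real values are not given here, and the count ignores the cost of each individual pass.

```python
import math


def sequential_steps_autoregressive(num_tokens: int) -> int:
    """Standard AR decoding: one sequential forward pass per generated token."""
    return num_tokens


def sequential_steps_blockwise(num_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Block-wise refinement: every block of `block_size` tokens is produced by
    `denoise_steps` passes, each of which updates the whole block at once.

    Both parameters are hypothetical illustration values.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


def step_ratio(num_tokens: int, block_size: int, denoise_steps: int) -> float:
    """How many times fewer sequential passes the block-wise scheme needs."""
    ar = sequential_steps_autoregressive(num_tokens)
    blockwise = sequential_steps_blockwise(num_tokens, block_size, denoise_steps)
    return ar / blockwise
```

With 256 tokens, a block size of 16, and 4 refinement passes per block, the block-wise scheme needs 64 sequential passes instead of 256. The ratio of passes is not the same thing as wall-clock speedup, since each pass in a two-tower setup does different work than a single AR step.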
Performance Metrics
- Speedup: 2.42x over AR baseline at γ=0.8.
- Quality Retention: 98.7% of AR baseline benchmark scores.
- Parameters: ~60B total parameters across both towers.
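The figures above translate into concrete expectations with simple arithmetic. The helper names below are illustrative, and the defaults are the self-reported numbers from this article, so treat the results as rough planning estimates rather than guarantees.

```python
REPORTED_SPEEDUP = 2.42      # self-reported, at gamma = 0.8
REPORTED_RETENTION = 0.987   # self-reported fraction of AR baseline scores


def projected_tokens_per_second(baseline_tps: float, speedup: float = REPORTED_SPEEDUP) -> float:
    """Throughput you would see if the reported speedup carried over to your workload."""
    return baseline_tps * speedup


def projected_score(baseline_score: float, retention: float = REPORTED_RETENTION) -> float:
    """Benchmark score you would see if the reported retention carried over."""
    return baseline_score * retention


def wall_time_reduction(speedup: float = REPORTED_SPEEDUP) -> float:
    """Fraction of generation time saved for a fixed number of tokens."""
    return 1.0 - 1.0 / speedup
```

For example, a baseline of 1,000 tokens/s would project to about 2,420 tokens/s, a baseline score of 70.0 to about 69.1, and a fixed-size job would take roughly 59% less generation time.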
Implementation Note
To use full diffusion mode, you must place the two towers on separate devices. Load the weights in torch.bfloat16, which requires ~59GB of memory per GPU.
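A quick back-of-the-envelope check shows where a figure near 59GB comes from and whether a given pair of GPUs can hold the towers. This sketch assumes the ~60B parameters split roughly evenly (about 30B per tower) and counts weights only, ignoring activations and caches; the function names are hypothetical.

```python
from typing import Sequence

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


def towers_fit(gpu_memory_gb: Sequence[float], per_tower_gb: float = 59.0) -> bool:
    """True if at least two GPUs are present and the first two each exceed
    the per-tower requirement quoted in this article (~59GB)."""
    if len(gpu_memory_gb) < 2:
        return False
    return all(mem > per_tower_gb for mem in gpu_memory_gb[:2])
```

Under the even-split assumption, `weight_memory_gb(30e9)` gives 60 GB, in line with the quoted ~59GB per GPU. Two 80 GB cards pass `towers_fit([80.0, 80.0])`, while a single card or a pair that includes a 40 GB card does not. On a real machine the memory values could be read with `torch.cuda.get_device_properties(i).total_memory`.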
Try it in 2 minutes
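import torch
from transformers import AutoModelForCausalLM
# Load in BF16 (~59GB per GPU). Custom model code may also need trust_remote_code=True (assumption).
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16", torch_dtype=torch.bfloat16)
# Put each tower on its own GPU to enable full diffusion mode.
model.place_towers_on_devices("cuda:0", "cuda:1")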
✓ When to use
- High-throughput synthetic text generation.
- Scenarios where GPU budget allows for 2-card setup.
- When a minor quality drop (about 1.3% of baseline benchmark scores, as reported) is acceptable in exchange for a 2.42x speedup.
What to do today
- Clone the model from Hugging Face.
- Place the towers on separate CUDA devices using the provided API.
- Benchmark your specific workload against standard AR decoding.
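For the benchmarking step, a small timing harness is enough to compare the two decoding modes on your own prompts. This is a minimal sketch: `generate` is a hypothetical zero-argument callable that you supply, wrapping whichever decoding mode you are testing and returning the number of new tokens it produced.

```python
import time
from typing import Callable


def measure_tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Run `generate` several times and return average tokens per second.

    `generate` performs one full generation and returns the count of new tokens.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps
```

Measure the AR baseline and the diffusion mode with the same prompts and output lengths, then compare the result of `speedup` against the reported 2.42x. When timing GPU work, run a warm-up generation first and call `torch.cuda.synchronize()` inside `generate` before returning so that queued kernels are included in the measurement.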