AI Today Brief

The daily AI-engineering brief. Built in public. EN · UA.

© 2026 AI Today Brief. All rights reserved.

Models & research

NVIDIA Releases Nemotron-Labs-TwoTower for Accelerated Inference

July 1, 2026 · 3 min read
Curated by Oleksandr Kuzmenko, AI Product Engineer · Updated July 1, 2026 · Sources cited on every story
AI-assisted · editor-reviewed
NVIDIA's new TwoTower model pairs an autoregressive backbone with a diffusion-based denoiser tower to improve throughput. NVIDIA reports 2.42x faster generation than standard autoregressive decoding while retaining 98.7% of the baseline's benchmark quality.

Impact: Medium

Why it matters

Developers can use this architecture to cut generation time in high-throughput text generation tasks, at the cost of a small reported quality drop and a two-GPU deployment.

TL;DR

  • 01 Achieves 2.42x faster generation via parallel block-wise diffusion (self-reported).
  • 02 Retains 98.7% of the quality of standard autoregressive decoding (self-reported).
  • 03 Supports hybrid inference modes for flexible deployment.
  • 04 Requires two H100 GPUs for full diffusion mode.

Key facts

Throughput improvement: 2.42x (self-reported)
Quality retention: 98.7% (self-reported)
Memory requirements: ~59GB per GPU (BF16)

Architecture Details

The model builds on the Nemotron-3-Nano-30B-A3B hybrid backbone, which interleaves Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers. The denoiser tower refines blocks of tokens in parallel rather than one token at a time, which is the source of the reported speedup.

Performance Metrics

  • Speedup: 2.42x over AR baseline at γ=0.8.
  • Quality Retention: 98.7% of AR baseline benchmark scores.
  • Parameters: ~60B total parameters across both towers.
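
To put the self-reported figures in concrete terms, the sketch below applies the 2.42x speedup and 98.7% retention to an autoregressive baseline. The baseline throughput, benchmark score, and token count are hypothetical placeholders chosen for illustration, not measurements from NVIDIA, and the function name `project` is ours.

```python
# Self-reported figures from the announcement.
SPEEDUP = 2.42             # vs. the autoregressive (AR) baseline at gamma = 0.8
QUALITY_RETENTION = 0.987  # fraction of the AR baseline benchmark score


def project(baseline_tokens_per_s: float, baseline_score: float, num_tokens: int) -> dict:
    """Project throughput, wall-clock time, and score from an AR baseline."""
    fast_tokens_per_s = baseline_tokens_per_s * SPEEDUP
    return {
        "tokens_per_s": fast_tokens_per_s,
        "ar_seconds": num_tokens / baseline_tokens_per_s,
        "twotower_seconds": num_tokens / fast_tokens_per_s,
        "score": baseline_score * QUALITY_RETENTION,
    }


# Hypothetical baseline: 100 tokens/s, benchmark score 70.0, 1M tokens to generate.
result = project(100.0, 70.0, 1_000_000)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The projection assumes the reported speedup carries over unchanged to your workload, which is exactly what a local benchmark should confirm or refute.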

Implementation Note

To use the full diffusion mode, place the two towers on separate GPUs. Load the weights in torch.bfloat16; at that precision each GPU needs ~59GB of memory.
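
The per-GPU figure can be sanity-checked with simple arithmetic. The sketch below is a rough estimate that assumes the ~60B total parameters are split evenly across the two towers (about 30B each) and that BF16 uses 2 bytes per parameter. The helper names and the two-card 80 GB machine are hypothetical.

```python
BYTES_PER_BF16_PARAM = 2


def weights_gb(num_params: float) -> float:
    """Weight memory in decimal gigabytes for BF16 parameters."""
    return num_params * BYTES_PER_BF16_PARAM / 1e9


def gpus_that_fit(total_memory_bytes: list[int], required_gb: float) -> list[int]:
    """Return the indices of GPUs whose total memory meets the requirement."""
    return [i for i, b in enumerate(total_memory_bytes) if b / 1e9 >= required_gb]


# Assumption: ~60B total parameters split evenly, so ~30B per tower.
per_tower_gb = weights_gb(30e9)
print(f"Estimated weights per tower: {per_tower_gb:.1f} GB")

# Hypothetical machine with two 80 GB cards, checked against the article's ~59GB.
memory = [80 * 10**9, 80 * 10**9]
usable = gpus_that_fit(memory, required_gb=59)
print(f"GPUs with enough memory: {usable}")
```

The estimate lands close to the ~59GB quoted above. On a real machine the memory list can be built from `torch.cuda.get_device_properties(i).total_memory` for each `i` in `range(torch.cuda.device_count())`. Weights are only part of the footprint, so leave headroom for activations and cached state.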

Try it in 2 minutes

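import torch
from transformers import AutoModelForCausalLM
# BF16 weights per the note above; trust_remote_code=True is an assumption for the custom two-tower class
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16", torch_dtype=torch.bfloat16, trust_remote_code=True)
# One tower per GPU, as required for full diffusion mode
model.place_towers_on_devices("cuda:0", "cuda:1")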

✓ When to use

  • High-throughput synthetic text generation.
  • Scenarios where the GPU budget allows a two-card setup.
  • When a small quality drop (98.7% retention, self-reported) is acceptable in exchange for a roughly 2.4x speedup.

What to do today

  • → Clone the model from Hugging Face.
  • → Place the towers on separate CUDA devices using the provided API.
  • → Benchmark your specific workload against standard AR decoding.
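
For the benchmarking step, a minimal timing harness might look like the sketch below. It is not part of NVIDIA's release: the helper names are ours, and `fake_ar` and `fake_twotower` are stand-in workloads that only sleep. Swap in callables that run your own AR and TwoTower generation on the same prompts.

```python
import time
from statistics import median
from typing import Callable


def median_seconds(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> float:
    """Median wall-clock time of fn over several runs, after warmup calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


def speedup(baseline: Callable[[], object], candidate: Callable[[], object], runs: int = 5) -> float:
    """Ratio of baseline time to candidate time; above 1.0 means candidate is faster."""
    return median_seconds(baseline, runs=runs) / median_seconds(candidate, runs=runs)


# Stand-in workloads (hypothetical); replace with real generation calls.
def fake_ar() -> None:
    time.sleep(0.02)


def fake_twotower() -> None:
    time.sleep(0.01)


ratio = speedup(fake_ar, fake_twotower, runs=3)
print(f"Measured speedup: {ratio:.2f}x")
```

When timing real GPU work, call `torch.cuda.synchronize()` inside each callable before it returns, because CUDA kernels launch asynchronously and the clock would otherwise stop too early. Compare output quality on your own evaluation set alongside the timing, since the 98.7% figure is self-reported.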

Sources

  • NVIDIA Releases Nemotron-Labs-TwoTower