Google Introduces Gemini 3.5 Live Translate for Real-Time Multimodal Voice Applications

Google has launched Gemini 3.5 Live Translate, focusing on low-latency, end-to-end voice translation. This release provides a direct API capability for developers looking to build seamless speech-to-speech agents.
Impact: High
Why it matters
You can now bypass separate speech-to-text, translation, and text-to-speech pipelines by leveraging Gemini's native multimodal voice capabilities.
TL;DR
- End-to-end audio modeling cuts latency to sub-second conversational responses.
- The system retains emotional prosody and natural pauses during real-time speech translation.
- Developers can integrate the model directly through the Gemini API SDKs over real-time streaming connections.
Key facts
- Supported Languages: 70+
- Google Meet Combinations: 2,000+
- Grab Tested Volume: 10M+ monthly calls
- Watermarking Standard: SynthID
Continuous Multimodal Voice Translation
Google has launched Gemini 3.5 Live Translate, a model providing near real-time speech-to-speech translation across more than 70 languages. Unlike traditional turn-by-turn voice systems, 3.5 Live Translate continuously streams audio, preserving the original speaker's intonation, pitch, and pacing while staying only a few seconds behind the speaker.
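To make the contrast with turn-by-turn systems concrete, here is a minimal sketch of how a client might cut captured audio into small fixed-duration frames and send each one as soon as it is ready, rather than waiting for the speaker to finish. The 16 kHz sample rate, 16-bit mono PCM format, and 100 ms frame length are assumptions chosen for illustration, not values confirmed for this model, and the function names are hypothetical.

```python
from typing import Iterator

# Illustrative assumptions, not confirmed settings for Gemini 3.5 Live Translate.
SAMPLE_RATE_HZ = 16_000  # assumed 16 kHz mono capture
BYTES_PER_SAMPLE = 2     # assumed 16-bit PCM
FRAME_MS = 100           # assumed duration of each streamed frame


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
    frame_ms: int = FRAME_MS,
) -> int:
    """Return the number of bytes in one frame of raw mono PCM audio."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last frame may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


if __name__ == "__main__":
    # One second of silence stands in for microphone input.
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    frames = list(iter_frames(one_second, frame_size_bytes()))
    print(len(frames), "frames of", len(frames[0]), "bytes")
```

In a real client, each yielded frame would be handed to the streaming connection immediately. Smaller frames reduce the delay before the model hears new audio but increase the number of messages sent.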
Broad Integration and Partners
The model is available in public preview via the Gemini Live API and Google AI Studio, as well as in private preview for enterprise customers in Google Meet. Key real-time streaming partners include Agora, Fishjam, LiveKit, Pipecat, and Vision Agents. Ride-hailing giant Grab is currently testing the technology to facilitate communications for over 10 million monthly voice calls between drivers and passengers.
Security and New Mobile Features
All audio generated by the model is watermarked using Google’s SynthID technology so that it can be identified as AI-generated, a measure intended to help limit misinformation. For mobile users, Android is receiving a new "listening mode" that lets users hold the phone to their ear, as in a standard call, to hear the incoming audio translation privately.
✓ When to use
- When developing natural conversational translation apps with sub-second perceived response lag.
- When requiring continuous background audio translation for multi-party meetings.
✕ When NOT to use
- When offline, local-only execution without internet access is required.
- When application requirements prohibit watermarked output.
What to do today
- Explore the Gemini API documentation for the new live audio streaming endpoints.
- Test the model's performance on industry-specific domain terminology to verify translation accuracy.
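One way to approach the terminology check is a small glossary test: for each test utterance, record the domain term you expect in the translation and measure how often it appears. The sketch below assumes you already have text transcripts of the translated audio (how you obtain them is outside its scope). The `TermCase` structure, the function names, and the Spanish sample sentences are hypothetical illustrations, not part of any Google tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TermCase:
    transcript: str     # text transcript of the translated audio (assumed available)
    expected_term: str  # domain term that should appear in the translation


def _contains(case: TermCase) -> bool:
    """Case-insensitive substring check for the expected term."""
    return case.expected_term.casefold() in case.transcript.casefold()


def term_hit_rate(cases: List[TermCase]) -> float:
    """Fraction of cases whose transcript contains the expected term."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if _contains(case)) / len(cases)


def missed_terms(cases: List[TermCase]) -> List[str]:
    """Expected terms that did not appear, for manual review."""
    return [case.expected_term for case in cases if not _contains(case)]


if __name__ == "__main__":
    # Hypothetical medical-domain samples.
    cases = [
        TermCase("El paciente sufrió un infarto de miocardio.", "infarto de miocardio"),
        TermCase("Se colocó un tubo en la arteria.", "stent"),
    ]
    print("hit rate:", term_hit_rate(cases))
    print("missed:", missed_terms(cases))
```

Substring matching is deliberately simple. It will miss inflected forms and acceptable synonyms, so treat the missed list as a queue for human review, not as a final accuracy score.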