Nemotron 3.5 ASR: 600M‑parameter Multilingual Streaming Speech‑to‑Text
nvidia/nemotron-3.5-asr-streaming-0.6b
NVIDIA’s Nemotron 3.5 ASR (model ID nvidia/nemotron-3.5-asr-streaming-0.6b) is a 600‑million‑parameter, cache‑aware FastConformer‑RNNT model built with the NeMo framework. It performs low‑latency streaming speech recognition across 40 language locales, with native punctuation and capitalization and optional automatic language detection. Because the encoder reuses cached state from earlier audio instead of recomputing it for every new chunk, the model can process chunks as small as 80 ms efficiently. Reported word error rates (WER) on the FLEURS test sets range from 4.11% to 9.03%.
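For readers unfamiliar with the metric, WER is the word‑level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `wer` and the plain whitespace tokenization are assumptions for illustration. This summary does not state how text was normalized for the FLEURS figures, so the sketch is not a reproduction of that evaluation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()  # assumption: whitespace tokenization, no text normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic Levenshtein dynamic program over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("turn the lights off", "turn the light off"))  # 0.25
```

Because the model emits punctuation and capitalization, a comparison like this is sensitive to whether both strings are lowercased and stripped of punctuation first.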
The model is released for commercial use under the OpenMDW‑1.1 license and runs on a range of NVIDIA hardware (Ampere, Hopper, Jetson, and others). It can be loaded through NeMo’s ASRModel class and run in streaming mode either with an explicit target language or with `target_lang=auto`. The README highlights flexible runtime settings, multiple chunk sizes (80 ms to 1.12 s), and a prompt‑conditioned language‑ID mechanism that tags each transcript with its detected language code. Training data came from diverse multilingual corpora, including NVIDIA Granary, multilingual LibriSpeech, FLEURS, Common Voice, VoxPopuli, and Europarl.
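As a rough sketch of what these settings mean in practice, the Python below loads the checkpoint with NeMo and shows how a chunk duration maps to a sample count when cutting a buffer into fixed‑size chunks. The 16 kHz sample rate and the helper names `load_model`, `chunk_samples`, and `iter_chunks` are assumptions for illustration and do not come from the README. `ASRModel.from_pretrained` is NeMo's general loader; how `target_lang` is passed to the streaming decoder is left out because this summary does not spell it out.

```python
from typing import Iterator, Sequence

MODEL_ID = "nvidia/nemotron-3.5-asr-streaming-0.6b"
SAMPLE_RATE_HZ = 16_000  # assumption: common ASR input rate, not stated in the source


def load_model():
    """Load the checkpoint via NeMo (needs the NeMo ASR package; downloads weights)."""
    # Imported lazily so the chunking helpers below work without NeMo installed.
    import nemo.collections.asr as nemo_asr
    return nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_ID)


def chunk_samples(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: Sequence[float], chunk_ms: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-size chunks; the final chunk may be shorter."""
    step = chunk_samples(chunk_ms)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


# Smallest and largest chunk sizes mentioned above, at the assumed 16 kHz rate.
print(chunk_samples(80))    # 1280
print(chunk_samples(1120))  # 17920
```

Smaller chunks lower the delay before text appears but mean more decoder calls per second of audio, so the chunk size is the main latency knob to tune per deployment.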
Project Ideas
- Deploy a real‑time multilingual transcription service for call‑center audio that streams speech to text with language‑tagging for each utterance.
- Create live captioning for video‑conference platforms that automatically detects and displays subtitles in the speaker's language, using the model's streaming and auto‑detect features.
- Build a podcast indexing pipeline that streams episodes, extracts transcripts, and stores language codes for searchable, multilingual content archives.
- Fine‑tune the adaptation‑ready locales on domain‑specific medical dictation data to produce a low‑latency, multilingual clinical transcription assistant.
- Integrate the model into NVIDIA Jetson edge devices for on‑device streaming ASR in smart home assistants supporting multiple languages without cloud connectivity.