Model · February 19, 2026

Qwen3‑TTS 1.7B CustomVoice: Real‑Time Multilingual Speech with Instruction‑Driven Style

Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice is a 1.7B‑parameter text‑to‑speech model released by the Qwen team. It supports ten major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) and offers nine premium timbres covering various gender, age, language, and dialect combinations. The model can be driven by natural‑language instructions, allowing users to control timbre, emotion, and prosody on the fly. Its architecture pairs a discrete multi‑codebook language model with a lightweight non‑DiT backbone, enabling extremely low‑latency streaming generation: audio packets begin streaming after a single character of input, with end‑to‑end synthesis latency as low as 97 ms.
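As a back‑of‑envelope illustration of the compression this enables, assume the "12Hz" in the model name denotes the codec frame rate of the tokenizer, i.e. 12 discrete frames per codebook per second of audio (a reading suggested by the tokenizer's name, not stated explicitly here):

```python
# Assumption: "12Hz" means 12 discrete codec frames per second of audio.
FRAME_RATE_HZ = 12

def codec_frames(duration_s: float) -> int:
    """Number of codec frames (per codebook) covering duration_s seconds of audio."""
    return round(duration_s * FRAME_RATE_HZ)

print(codec_frames(5.0))  # 60 frames for a 5-second clip
print(codec_frames(0.1))  # ~1 frame, consistent with sub-100 ms first-packet latency
```

At 12 frames per second, the language model has to emit only a handful of tokens before the first audio packet can be decoded, which is consistent with the sub‑100 ms latency figure above.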

The model is part of the broader Qwen3‑TTS family, which also includes VoiceDesign and Base models for voice design and rapid voice cloning. Qwen3‑TTS‑CustomVoice integrates the self‑developed Qwen3‑TTS‑Tokenizer‑12Hz for efficient acoustic compression and high‑dimensional semantic modeling, preserving paralinguistic cues and environmental features. Users can access the model via the `qwen-tts` Python package, which provides a `generate_custom_voice` API for single or batch inference, as well as utilities to list supported speakers and languages. The repository also offers a Gradio web UI for interactive demos and DashScope APIs for real‑time cloud inference.

Since its launch on 2026‑01‑21, the model has amassed over 770k downloads and more than 1k likes, reflecting strong community interest. Its Apache‑2.0 open‑source license and compatibility with FlashAttention 2 make it suitable for both research and production deployments, especially where real‑time, multilingual, and expressive speech synthesis is required.

Project Ideas

  1. Build a multilingual virtual assistant that answers queries in the user’s language while adapting tone and emotion via natural‑language instructions.
  2. Create an interactive game NPC dialogue system that streams character speech in real time, using custom timbres to match each NPC’s personality.
  3. Develop a podcast‑style audiobook generator that produces chapters in different languages, applying instructional prompts to convey distinct moods for each segment.
  4. Implement a live‑streaming commentary tool for esports, allowing commentators to switch languages and adjust excitement levels on the fly with low‑latency streaming output.
  5. Design a personalized voice messaging app where users select a premium timbre and describe desired emotion, producing custom voice notes instantly.
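Idea 3 above can be sketched as a simple batch plan: one request per chapter, each with its own language and mood instruction. The field names and the batch‑capable call are assumptions; the real `qwen-tts` batch interface may differ.

```python
# Minimal sketch of the audiobook idea: per-chapter requests with different
# languages and mood instructions. Field names are assumptions.

chapters = [
    ("English",  "Narrate calmly and slowly.",  "Chapter 1: The Harbor."),
    ("French",   "Sound tense and urgent.",     "Chapitre 2 : La Tempête."),
    ("Japanese", "Warm and nostalgic tone.",    "第3章：帰郷。"),
]

def plan_batch(chapters, speaker="premium_timbre_1"):
    """Turn (language, instruction, text) tuples into per-chapter request dicts."""
    return [
        {"text": text, "language": lang, "instruction": instr, "speaker": speaker}
        for lang, instr, text in chapters
    ]

batch = plan_batch(chapters)
# Each dict could then feed the package's batch inference, e.g. (assumed):
# generate_custom_voice(batch)
```

Keeping the plan as plain data like this makes it easy to review or edit the per‑segment instructions before spending any synthesis time.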