model April 23, 2026

VoxCPM2: 2B‑Parameter Multilingual Diffusion TTS with Voice Design & Cloning

VoxCPM2, released by the OpenBMB VoxCPM team, is a tokenizer‑free, diffusion‑autoregressive text‑to‑speech model with 2 billion parameters and support for 30 languages, including many Chinese dialects. Trained on over 2 million hours of multilingual speech, it produces studio‑quality 48 kHz audio using an internal AudioVAE V2 super‑resolution pipeline, and it synthesizes faster than real time on modern GPUs: the real‑time factor (RTF, synthesis time divided by audio duration) is about 0.3 on an RTX 4090, and lower with Nano‑VLLM.
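To make the RTF figure concrete, here is a minimal sketch of the arithmetic. It uses the standard definition of real‑time factor (synthesis time divided by the duration of the audio produced); the function names are illustrative and are not part of the `voxcpm` library.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock synthesis time for a clip at a given RTF."""
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("audio_seconds and rtf must be non-negative")
    return audio_seconds * rtf


# At the quoted RTF of about 0.3, one minute of speech needs roughly
# 18 seconds of compute; any RTF below 1.0 is faster than real time.
one_minute_estimate = synthesis_time(60.0, 0.3)
```

The estimate is only as good as the RTF figure itself, which depends on the GPU, batch size, and inference settings.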

The model offers three distinct generation modes: (1) standard TTS for any supported language, with no language tags required; (2) voice design, where a natural‑language description (gender, age, tone, emotion, pace, etc.) creates a novel synthetic voice without any reference audio; and (3) controllable cloning, ranging from basic voice cloning with a short reference clip to “ultimate cloning”, which uses both a reference audio file and its transcript for near‑perfect timbre preservation. All capabilities are accessible through the open‑source `voxcpm` Python library, released under the Apache‑2.0 license, which permits commercial use.
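The three modes differ mainly in which inputs accompany the text. The sketch below models that decision in plain Python. `SynthesisRequest` and its field names are hypothetical and do not reflect the `voxcpm` API, which should be checked in the project's own documentation; the rule that reference audio takes precedence over a description is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynthesisRequest:
    """Hypothetical request container; not part of the voxcpm library."""

    text: str
    voice_description: Optional[str] = None     # e.g. "calm, older male voice"
    reference_audio: Optional[str] = None       # path to a short reference clip
    reference_transcript: Optional[str] = None  # transcript of that clip

    def mode(self) -> str:
        """Name the generation mode implied by the supplied inputs."""
        if self.reference_transcript and not self.reference_audio:
            raise ValueError("a transcript requires its reference audio")
        # Assumption: when reference audio is present, treat the request
        # as cloning regardless of any voice description.
        if self.reference_audio and self.reference_transcript:
            return "ultimate_cloning"
        if self.reference_audio:
            return "basic_cloning"
        if self.voice_description:
            return "voice_design"
        return "standard_tts"


# Usage: a description alone, with no reference clip, selects voice design.
request = SynthesisRequest(text="Hello!", voice_description="warm, young, fast")
selected_mode = request.mode()
```

Keeping the mode selection separate from the synthesis call makes it easy to validate inputs before loading a 2B‑parameter model.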

Technical details include a MiniCPM‑4‑based backbone, a LocEnc → TSLM → RALM → LocDiT architecture, a token rate of 6.25 Hz (one token per 160 ms of audio), a maximum sequence length of 8192 tokens, and a modest VRAM footprint of roughly 8 GB. VoxCPM2 also supports both full‑parameter fine‑tuning and LoRA adaptation with as little as 5–10 minutes of audio, enabling developers to tailor the model to niche domains or low‑resource languages. The repository provides extensive documentation, demos, and safety guidelines that emphasize responsible use and prohibit impersonation and disinformation.
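The token rate and sequence limit together bound how much audio fits in one sequence. The following back‑of‑the‑envelope sketch works through that arithmetic. It assumes that audio tokens count against the 8192‑token limit and that any text or prompt tokens share the same budget; the source does not state how the budget is split, so treat the result as an upper bound.

```python
import math

TOKEN_RATE_HZ = 6.25        # audio tokens per second, as quoted above
MAX_SEQUENCE_TOKENS = 8192  # maximum sequence length, as quoted above


def audio_tokens(seconds: float) -> int:
    """Number of tokens needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * TOKEN_RATE_HZ)


def max_audio_seconds(reserved_tokens: int = 0) -> float:
    """Upper bound on audio duration after reserving tokens for text/prompt.

    `reserved_tokens` is a hypothetical knob for this sketch, not a model setting.
    """
    if not 0 <= reserved_tokens <= MAX_SEQUENCE_TOKENS:
        raise ValueError("reserved_tokens must be within the sequence length")
    return (MAX_SEQUENCE_TOKENS - reserved_tokens) / TOKEN_RATE_HZ


seconds_per_token = 1.0 / TOKEN_RATE_HZ  # 0.16 s, i.e. 160 ms per token
upper_bound = max_audio_seconds()        # 8192 / 6.25 = 1310.72 s, about 21.8 minutes
ten_second_clip = audio_tokens(10.0)     # 62.5 rounds up to 63 tokens
```

In practice the usable duration is shorter, since a reference clip, its transcript, and the target text would all consume part of the same sequence under this assumption.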

Project Ideas

  1. Generate multilingual audiobooks by feeding chapter text to VoxCPM2, which needs no language tags to pick the right language.
  2. Create a custom virtual assistant that speaks with a brand‑specific voice designed via a short natural‑language description.
  3. Build a real‑time translation system that converts spoken input in one language to synthesized speech in another language using VoxCPM2's streaming mode.
  4. Produce personalized podcasts by cloning a host's voice from a brief sample and then synthesizing new episode scripts in the same timbre.
  5. Fine‑tune VoxCPM2 with LoRA on a small dataset of a low‑resource language to enable high‑quality TTS for that language.