March 20, 2026

Fish Audio S2 Pro – Multilingual TTS with Fine‑Grained Inline Control

Fish Audio S2 Pro is a state‑of‑the‑art text‑to‑speech (TTS) model released by the Fish Audio research team. It supports more than 80 languages, including tier‑1 coverage for English, Chinese, and Japanese, and offers free‑form, inline prosody and emotion control via a simple ``[tag]`` syntax. The model is built on a decoder‑only transformer paired with a residual‑vector‑quantized (RVQ) audio codec and a dual‑autoregressive architecture: a 4 B‑parameter “slow” AR branch predicts the primary semantic codebook, while a 400 M‑parameter “fast” AR branch fills in nine residual codebooks to produce high‑fidelity audio. This asymmetric design enables low‑latency streaming—real‑time factor of 0.195 on a single NVIDIA H200 GPU—and efficient serving using SGLang’s continuous batching and KV‑cache optimizations.
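The real-time-factor figure above can be made concrete with a little arithmetic: RTF is synthesis time divided by output audio duration, so an RTF of 0.195 means roughly one second of audio is produced in about 0.195 seconds, leaving headroom for several concurrent real-time streams per worker. A minimal sketch (the function names here are illustrative, not part of any Fish Audio API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.

    Lower is faster; RTF < 1.0 means faster than real time, i.e. streamable.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def streaming_headroom(rtf: float) -> float:
    """Rough upper bound on concurrent real-time streams one worker sustains."""
    return 1.0 / rtf


# Example: 10 s of audio synthesized in 1.95 s reproduces the reported RTF.
rtf = real_time_factor(1.95, 10.0)
print(f"RTF = {rtf:.3f}")                                  # RTF = 0.195
print(f"~{streaming_headroom(rtf):.1f} streams per worker")
```

In practice, continuous batching in a serving engine like SGLang trades some of this headroom for higher aggregate throughput, so the single-stream RTF is a lower bound on per-request latency rather than a fixed budget.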

The release includes model weights, fine‑tuning scripts, and an SGLang‑based streaming inference engine, making it ready for both research and production deployments. Users can embed expressive instructions directly in the input text, such as ``[whisper]``, ``[laughing]``, or custom phrases like ``[pitch up]``, allowing word‑level modulation of tone, volume, and emotion without a predefined tag set. The model is distributed under the Fish Audio Research License, which permits free research and non‑commercial use; commercial usage requires a separate agreement. Its combination of multilingual coverage, granular control, and streaming performance has made it a trending model on Hugging Face.
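Because the ``[tag]`` control syntax is free-form inline text, tagging can be done as plain string preprocessing before the text ever reaches the model. The helper below is a hypothetical sketch of that idea (``tag_words`` and its arguments are illustrative, not part of the Fish Audio tooling); only the ``[tag]`` bracket format itself comes from the model's documented syntax:

```python
import re


def tag_words(text: str, tags: dict[str, str]) -> str:
    """Insert an inline control tag before each target word.

    `tags` maps a lowercase word to a tag name, e.g. {"hello": "whisper"}.
    The [tag] bracket format is S2 Pro's inline control syntax; this helper
    is just string preprocessing and implies nothing about the model's API.
    """
    def replace(match: re.Match) -> str:
        word = match.group(0)
        tag = tags.get(word.lower())
        return f"[{tag}] {word}" if tag else word

    return re.sub(r"\b\w+\b", replace, text)


script = "She leaned closer and said hello before bursting out laughing."
tagged = tag_words(script, {"hello": "whisper", "laughing": "laughing"})
print(tagged)
# She leaned closer and said [whisper] hello before bursting out [laughing] laughing.
```

The tagged string would then be passed to the TTS engine as ordinary input text, since the model reads the tags inline rather than through a separate control channel.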

Project Ideas

  1. Create a multilingual audiobook generator that inserts emotion tags like ``[whisper]`` or ``[excited]`` to produce expressive narration for each language.
  2. Build a low‑latency voice assistant that streams responses in real time, using the model’s fast AR branch for on‑device inference.
  3. Develop an interactive storytelling web app where writers can embed inline tags to control character voices, pauses, and laughter during playback.
  4. Design a language‑learning tool that highlights pronunciation by adding ``[emphasis]`` or ``[slow]`` tags to target words in any of the supported 80+ languages.
  5. Implement a live dubbing system for video streams that leverages the model’s streaming performance to translate and vocalize dialogue with custom tone tags.