Spark-TTS
Official PyTorch code for inference of
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Spark-TTS 🔥
Overview
Spark-TTS is an advanced text-to-speech system that leverages large language models (LLMs) for highly accurate and natural-sounding speech synthesis. It is designed to be efficient, flexible, and powerful for both research and production use.
Key Features
- Simplicity and Efficiency: Built entirely on Qwen2.5, Spark-TTS eliminates the need for additional generation models such as flow matching. Instead of relying on separate models to generate acoustic features, it reconstructs audio directly from the code predicted by the LLM. This approach streamlines the pipeline, improving efficiency and reducing complexity.
- High-Quality Voice Cloning: Supports zero-shot voice cloning, meaning it can replicate a speaker's voice even without training data specific to that voice. This is ideal for cross-lingual and code-switching scenarios, allowing seamless transitions between languages and voices without separate training for each.
- Bilingual Support: Supports both Chinese and English, and is capable of zero-shot voice cloning in cross-lingual and code-switching scenarios, enabling the model to synthesize speech in multiple languages with high naturalness and accuracy.
- Controllable Speech Generation: Supports creating virtual speakers by adjusting parameters such as gender, pitch, and speaking rate.
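For intuition, the single-stream design described above can be sketched in Python. Everything below is a conceptual stand-in — the function names, token rate, and return values are illustrative assumptions, not the repository's actual API:

```python
# Conceptual sketch of Spark-TTS's single-stream pipeline.
# All names, rates, and implementations are illustrative stand-ins,
# NOT the repository's real API.

def llm_predict_speech_tokens(text: str) -> list[int]:
    """Stand-in for the Qwen2.5-based LM, which maps text (plus an
    optional speaker prompt) directly to one stream of speech tokens."""
    return [ord(ch) % 1024 for ch in text]  # dummy tokens

def decode_tokens_to_audio(tokens: list[int], sample_rate: int = 16000,
                           tokens_per_second: int = 50) -> list[float]:
    """Stand-in for the codec decoder: the waveform is reconstructed
    directly from the predicted tokens, with no separate acoustic model
    (e.g. flow matching) in between."""
    num_samples = len(tokens) * sample_rate // tokens_per_second
    return [0.0] * num_samples  # silent placeholder waveform

tokens = llm_predict_speech_tokens("hello")
audio = decode_tokens_to_audio(tokens)
print(len(tokens), len(audio))
```

The point of the sketch is the absence of an intermediate acoustic-feature model: text goes to a single token stream, and the token stream goes straight to audio.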
Inference Overview of Voice Cloning
Inference Overview of Controlled Generation
🚀 News
- [2025-03-04] Our paper on this project has been published! You can read it here: Spark-TTS.
- [2025-03-12] Nvidia Triton Inference Serving is now supported. See the Runtime section below for more details.
Install
Clone and Install
Here are instructions for installing on Linux. If you're on Windows, please refer to the Windows Installation Guide.
(Thanks to @AcTePuKc for the detailed Windows instructions!)
- Clone the repo
git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Model Download
Download via python:
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
Download via git clone:
mkdir -p pretrained_models
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B
Basic Usage
You can simply run the demo with the following commands:
cd example
bash infer.sh
Alternatively, you can directly execute the following command in the command line to perform inference:
python -m cli.inference \
    --text "text to synthesize" \
--device 0 \
--save_dir "path/to/save/audio" \
--model_dir pretrained_models/Spark-TTS-0.5B \
--prompt_text "transcript of the prompt audio" \
--prompt_speech_path "path/to/prompt_audio"
Web UI Usage
You can start the UI by running python webui.py --device 0, which allows you to perform Voice Cloning and Voice Creation. Voice Cloning supports uploading reference audio or recording audio directly.
Voice Cloning | Voice Creation |
Optional Methods
For additional CLI and Web UI methods, including alternative implementations and extended functionalities, you can refer to:
Runtime
Nvidia Triton Inference Serving
We now provide a reference for deploying Spark-TTS with Nvidia Triton and TensorRT-LLM. The table below presents benchmark results on a single L20 GPU, using 26 different prompt_audio/target_text pairs (totalling 169 seconds of audio):
Model | Note | Concurrency | Avg Latency | RTF |
---|---|---|---|---|
Spark-TTS-0.5B | Code Commit | 1 | 876.24 ms | 0.1362 |
Spark-TTS-0.5B | Code Commit | 2 | 920.97 ms | 0.0737 |
Spark-TTS-0.5B | Code Commit | 4 | 1611.51 ms | 0.0704 |
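RTF (real-time factor) is the ratio of synthesis time to the duration of the audio produced, so values below 1.0 mean faster-than-real-time synthesis. A minimal helper illustrating the metric (not part of the repository):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1.0 means synthesis runs faster than real time."""
    return processing_seconds / audio_seconds

# e.g. producing 10 s of audio in 1.362 s gives an RTF of 0.1362
print(round(real_time_factor(1.362, 10.0), 4))
```

Note that RTF drops with concurrency in the table above even though per-request latency rises, because the GPU synthesizes several requests' audio in parallel.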
Please see the detailed instructions in runtime/triton_trtllm/README.md for more information.
Demos
Here are some demos generated by Spark-TTS using zero-shot voice cloning. For more demos, visit our demo page.
Donald Trump | Zhongli (Genshin Impact) |
Donald_Trump.webm | Zhong_Li.webm |
陈鲁豫 Chen Luyu | 杨澜 Yang Lan |
Chen_Luyu.webm | Yang_Lan.webm |
余承东 Richard Yu | 马云 Jack Ma |
Yu_Chengdong.webm | Ma_Yun.webm |
刘德华 Andy Lau | 徐志胜 Xu Zhisheng |
Liu_Dehua.webm | Xu_Zhisheng.webm |
哪吒 Nezha | 李靖 Li Jing |
Ne_Zha.webm | Li_Jing.webm |
To-Do List
- Release the Spark-TTS paper.
- Release the training code.
- Release the training dataset, VoxBox.
Citation
@misc{wang2025sparktts,
title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
author={Xinsheng Wang and Mingqi Jiang and Ziyang Ma and Ziyu Zhang and Songxiang Liu and Linqin Li and Zheng Liang and Qixi Zheng and Rui Wang and Xiaoqin Feng and Weizhen Bian and Zhen Ye and Sitong Cheng and Ruibin Yuan and Zhixian Zhao and Xinfa Zhu and Jiahao Pan and Liumeng Xue and Pengcheng Zhu and Yunlin Chen and Zhifei Li and Xie Chen and Lei Xie and Yike Guo and Wei Xue},
year={2025},
eprint={2503.01710},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2503.01710},
}
⚠️ Usage Disclaimer
This project provides a zero-shot voice cloning TTS model intended for academic research, educational purposes, and legitimate applications such as personalized speech synthesis, assistive technologies, and linguistic research.
Please note:
- Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities.
- Ensure compliance with local laws and regulations when using this model, and uphold ethical standards.
- The developers assume no liability for any misuse of this model.
We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.