
Spark-TTS: Inference Code for an Efficient LLM-Based Text-to-Speech Model



Spark-TTS

Official PyTorch code for inference of
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens


Spark-TTS 🔥

Overview

Spark-TTS is an advanced text-to-speech system that uses large language models (LLMs) for highly accurate, natural-sounding voice synthesis. It is designed to be efficient, flexible, and powerful for both research and production use.

Key Features


[Figure: Inference Overview of Voice Cloning]
[Figure: Inference Overview of Controlled Generation]

🚀 News

Install

Clone and Install

Here are instructions for installing on Linux. If you're on Windows, please refer to the Windows Installation Guide.
(Thanks to @AcTePuKc for the detailed Windows instructions!)

git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

Model Download

Download via Python:

from huggingface_hub import snapshot_download

snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")

Download via git clone:

mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B
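Either route leaves the weights in pretrained_models/Spark-TTS-0.5B. The following is a minimal sketch that skips the download when that directory already contains files. The helper name ensure_model is hypothetical (it is not part of Spark-TTS), and the non-empty-directory check is only a heuristic: it does not detect partial or corrupted downloads.

```python
from pathlib import Path

REPO_ID = "SparkAudio/Spark-TTS-0.5B"
DEFAULT_DIR = "pretrained_models/Spark-TTS-0.5B"


def ensure_model(local_dir: str = DEFAULT_DIR, repo_id: str = REPO_ID) -> Path:
    """Return the model directory, downloading only if it is missing or empty.

    Hypothetical helper: a non-empty directory is assumed to be a finished
    download, which is a heuristic rather than an integrity check.
    """
    path = Path(local_dir)
    if path.is_dir() and any(path.iterdir()):
        return path
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id, local_dir=str(path))
    return path
```

Calling ensure_model() with no arguments uses the same repository ID and target directory as the commands above.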

Basic Usage

You can run the demo with the following commands:

cd example
bash infer.sh

Alternatively, you can run inference directly from the command line:

python -m cli.inference \
    --text "text to synthesize." \
    --device 0 \
    --save_dir "path/to/save/audio" \
    --model_dir pretrained_models/Spark-TTS-0.5B \
    --prompt_text "transcript of the prompt audio" \
    --prompt_speech_path "path/to/prompt_audio"
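To synthesize several sentences with the same voice prompt, the command above can be wrapped in a short script. This is a sketch under stated assumptions, not part of the project: the function names and the repo_root parameter are hypothetical, only the flags shown above are used, and the module is assumed to be invoked from the repository root.

```python
import subprocess
import sys
from typing import Iterable, List


def build_inference_command(
    text: str,
    save_dir: str,
    prompt_text: str,
    prompt_speech_path: str,
    model_dir: str = "pretrained_models/Spark-TTS-0.5B",
    device: int = 0,
) -> List[str]:
    """Build the argument list for `python -m cli.inference` (hypothetical helper)."""
    return [
        sys.executable, "-m", "cli.inference",
        "--text", text,
        "--device", str(device),
        "--save_dir", save_dir,
        "--model_dir", model_dir,
        "--prompt_text", prompt_text,
        "--prompt_speech_path", prompt_speech_path,
    ]


def synthesize_all(
    texts: Iterable[str],
    save_dir: str,
    prompt_text: str,
    prompt_speech_path: str,
    repo_root: str = ".",  # assumption: directory containing the cli package
) -> None:
    """Run one inference process per input text, stopping on the first failure."""
    for text in texts:
        cmd = build_inference_command(text, save_dir, prompt_text, prompt_speech_path)
        subprocess.run(cmd, cwd=repo_root, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or spaces. Each call starts a new process and therefore reloads the model, so this suits small batches only.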

Web UI Usage

You can start the web UI by running python webui.py --device 0, which lets you perform Voice Cloning and Voice Creation. Voice Cloning supports either uploading reference audio or recording it directly.

[Screenshots: Voice Cloning (Image 1) and Voice Creation (Image 2)]

Optional Methods

For additional CLI and Web UI methods, including alternative implementations and extended functionalities, you can refer to:

Runtime

Nvidia Triton Inference Serving

We now provide a reference for deploying Spark-TTS with Nvidia Triton and TensorRT-LLM. The table below presents benchmark results on a single L20 GPU, using 26 different prompt_audio/target_text pairs (totalling 169 seconds of audio):

| Model | Note | Concurrency | Avg Latency | RTF |
|---|---|---|---|---|
| Spark-TTS-0.5B | Code Commit | 1 | 876.24 ms | 0.1362 |
| Spark-TTS-0.5B | Code Commit | 2 | 920.97 ms | 0.0737 |
| Spark-TTS-0.5B | Code Commit | 4 | 1611.51 ms | 0.0704 |

Please see the detailed instructions in runtime/triton_trtllm/README.md for more information.
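To read the RTF column, it helps to convert it back into processing time. The sketch below assumes the usual definition, RTF = total processing time / total audio duration; the source does not say exactly how the benchmark measured it, so the derived figures are approximate. The function names are illustrative.

```python
AUDIO_SECONDS = 169.0  # total audio duration stated for the 26 benchmark pairs

# (concurrency, average latency in ms, reported RTF), copied from the table above
RESULTS = [
    (1, 876.24, 0.1362),
    (2, 920.97, 0.0737),
    (4, 1611.51, 0.0704),
]


def implied_wall_seconds(rtf: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Processing time implied by RTF = processing_time / audio_duration."""
    return rtf * audio_seconds


def speedup_over(baseline_rtf: float, rtf: float) -> float:
    """Throughput gain relative to a baseline run (lower RTF is faster)."""
    return baseline_rtf / rtf


baseline = RESULTS[0][2]
for concurrency, latency_ms, rtf in RESULTS:
    print(
        f"concurrency={concurrency}: "
        f"~{implied_wall_seconds(rtf):.1f} s to synthesize {AUDIO_SECONDS:.0f} s of audio, "
        f"{speedup_over(baseline, rtf):.2f}x the concurrency-1 throughput, "
        f"avg latency {latency_ms:.2f} ms"
    )
```

Under that assumption, the full 169 seconds of audio takes roughly 23.0 s at concurrency 1, 12.5 s at concurrency 2, and 11.9 s at concurrency 4. Going from 2 to 4 concurrent requests improves throughput only slightly while average latency rises from 920.97 ms to 1611.51 ms.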

Demos

Here are some demos generated by Spark-TTS using zero-shot voice cloning. For more demos, visit our demo page.


Donald Trump: Donald_Trump.webm
Zhongli (Genshin Impact): Zhong_Li.webm
Chen Luyu (陈鲁豫): Chen_Luyu.webm
Yang Lan (杨澜): Yang_Lan.webm
Richard Yu (余承东): Yu_Chengdong.webm
Jack Ma (马云): Ma_Yun.webm
Andy Lau (刘德华): Liu_Dehua.webm
Xu Zhisheng (徐志胜): Xu_Zhisheng.webm
Nezha (哪吒): Ne_Zha.webm
Li Jing (李靖): Li_Jing.webm

To-Do List

Citation

@misc{wang2025sparktts,
      title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens}, 
      author={Xinsheng Wang and Mingqi Jiang and Ziyang Ma and Ziyu Zhang and Songxiang Liu and Linqin Li and Zheng Liang and Qixi Zheng and Rui Wang and Xiaoqin Feng and Weizhen Bian and Zhen Ye and Sitong Cheng and Ruibin Yuan and Zhixian Zhao and Xinfa Zhu and Jiahao Pan and Liumeng Xue and Pengcheng Zhu and Yunlin Chen and Zhifei Li and Xie Chen and Lei Xie and Yike Guo and Wei Xue},
      year={2025},
      eprint={2503.01710},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.01710}, 
}

⚠️ Usage Disclaimer

This project provides a zero-shot voice cloning TTS model intended for academic research, educational purposes, and legitimate applications, such as personalized speech synthesis, assistive technologies, and linguistic research.

Please note:

We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.

