IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
👉🏻 IndexTTS 👈🏻
[HuggingFace Demo] [ModelScope Demo]
[Paper] [Demos]
IndexTTS is a GPT-style text-to-speech (TTS) model based mainly on XTTS and Tortoise. It can correct the pronunciation of Chinese characters using pinyin and control pauses at any position through punctuation marks. We enhanced several modules of the system, including an improved speaker condition feature representation and the integration of BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.
Experience IndexTTS: Please contact [email protected] for more detailed information.
Contact
QQ group (group 2): 1048202584
Discord: https://discord.gg/uT32E7KDmy
Resumes: [email protected]
Everyone is welcome to join the discussion!
📣 Updates
- 2025/05/14 🔥🔥 We released IndexTTS-1.5, which significantly improves the model's stability and its performance in English.
- 2025/03/25 🔥 We released the IndexTTS-1.0 model parameters and inference code.
- 2025/02/12 🔥 We submitted our paper to arXiv and released our demos and test sets.
🖥️ Method
The overview of IndexTTS is shown as follows.
The main improvements and contributions are summarized as follows:
- In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
- IndexTTS incorporates a conformer conditioning encoder and a BigVGAN2-based speechcode decoder. This improves training stability, voice timbre similarity, and sound quality.
- We release all test sets here, including the test set for polysyllabic words and the subjective and objective test sets.
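The character-pinyin hybrid idea from the first point can be illustrated with a minimal sketch. The function name `apply_pinyin_overrides`, the index-to-pinyin override format, and the uppercase-with-tone-digit spelling are all hypothetical choices made for this illustration; this is not IndexTTS's actual tokenizer or input syntax.

```python
def apply_pinyin_overrides(text: str, overrides: dict[int, str]) -> list[str]:
    """Build a mixed token sequence of characters and pinyin.

    `overrides` maps a character index to the pinyin that should replace
    the character at that position (hypothetical format, for illustration).
    """
    for index in overrides:
        if not 0 <= index < len(text):
            raise IndexError(f"override index {index} is outside the text")
    # Keep each character unless an override supplies a pinyin token for it.
    return [overrides.get(i, ch) for i, ch in enumerate(text)]


# "行" is polyphonic (xing2 / hang2); force the reading "hang2" in "银行".
tokens = apply_pinyin_overrides("银行", {1: "HANG2"})
print(tokens)
```

The point of the sketch is only that a single input sequence can carry both characters and pinyin, so a mispronounced character can be corrected locally without changing the rest of the text.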
Model Download
🤗HuggingFace | ModelScope |
---|---|
IndexTTS | IndexTTS |
😁IndexTTS-1.5 | IndexTTS-1.5 |
📑 Evaluation
Word Error Rate (WER) Results for IndexTTS and Baseline Models on the seed-test
Model | test_zh | test_en | test_hard |
---|---|---|---|
Human | 1.26 | 2.14 | - |
SeedTTS | 1.002 | 1.945 | 6.243 |
CosyVoice 2 | 1.45 | 2.57 | 6.83 |
F5TTS | 1.56 | 1.83 | 8.67 |
FireRedTTS | 1.51 | 3.82 | 17.45 |
MaskGCT | 2.27 | 2.62 | 10.27 |
Spark-TTS | 1.2 | 1.98 | - |
MegaTTS 3 | 1.36 | 1.82 | - |
IndexTTS | 0.937 | 1.936 | 6.831 |
IndexTTS-1.5 | 0.821 | 1.606 | 6.565 |
Word Error Rate (WER) Results for IndexTTS and Baseline Models on Other Open-Source Test Sets
Model | aishell1_test | commonvoice_20_test_zh | commonvoice_20_test_en | librispeech_test_clean | avg |
---|---|---|---|---|---|
Human | 2.0 | 9.5 | 10.0 | 2.4 | 5.1 |
CosyVoice 2 | 1.8 | 9.1 | 7.3 | 4.9 | 5.9 |
F5TTS | 3.9 | 11.7 | 5.4 | 7.8 | 8.2 |
Fishspeech | 2.4 | 11.4 | 8.8 | 8.0 | 8.3 |
FireRedTTS | 2.2 | 11.0 | 16.3 | 5.7 | 7.7 |
XTTS | 3.0 | 11.4 | 7.1 | 3.5 | 6.0 |
IndexTTS | 1.3 | 7.0 | 5.3 | 2.1 | 3.7 |
IndexTTS-1.5 | 1.2 | 6.8 | 3.9 | 1.7 | 3.1 |
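For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is typically computed: the edit distance between the reference transcript and the ASR transcript of the synthesized audio, divided by the reference length. It assumes the table values are percentages and that Chinese test sets are scored per character, which is the usual convention but is not stated in this document. The function names are illustrative, and the ASR step and text normalization used for the reported numbers are not reproduced here.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Error rate in percent, over words or over characters."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:  # character level, e.g. for Chinese text
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(error_rate("the cat sat on the mat", "the cat sat on mat"))
print(error_rate("你好世界", "你好视界", unit="char"))
```

Lower is better: a score near the "Human" row means the synthesized speech is transcribed about as accurately as real recordings.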
Speaker Similarity (SS) Results for IndexTTS and Baseline Models
Model | aishell1_test | commonvoice_20_test_zh | commonvoice_20_test_en | librispeech_test_clean | avg |
---|---|---|---|---|---|
Human | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
CosyVoice 2 | 0.796 | 0.743 | 0.742 | 0.837 | 0.788 |
F5TTS | 0.743 | 0.747 | 0.746 | 0.828 | 0.779 |
Fishspeech | 0.488 | 0.552 | 0.622 | 0.701 | 0.612 |
FireRedTTS | 0.579 | 0.593 | 0.587 | 0.698 | 0.631 |
XTTS | 0.573 | 0.586 | 0.648 | 0.761 | 0.663 |
IndexTTS | 0.744 | 0.742 | 0.758 | 0.823 | 0.776 |
IndexTTS-1.5 | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
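This document does not say how SS is computed. A common approach, assumed here, is the cosine similarity between speaker embeddings of the prompt audio and the synthesized audio. The sketch below shows only that final similarity step on toy vectors; extracting embeddings with a speaker-verification model is out of scope, and the function name is illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real speaker embeddings have far more dimensions.
prompt_embedding = [0.2, 0.9, 0.1]
synth_embedding = [0.25, 0.85, 0.05]
print(round(cosine_similarity(prompt_embedding, synth_embedding), 3))
```

Under this assumption, higher is better, and the "Human" row gives a practical ceiling, since two real recordings of the same speaker do not score 1.0 either.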
MOS Scores for Zero-Shot Cloned Voice
Model | Prosody | Timbre | Quality | AVG |
---|---|---|---|---|
CosyVoice 2 | 3.67 | 4.05 | 3.73 | 3.81 |
F5TTS | 3.56 | 3.88 | 3.56 | 3.66 |
Fishspeech | 3.40 | 3.63 | 3.69 | 3.57 |
FireRedTTS | 3.79 | 3.72 | 3.60 | 3.70 |
XTTS | 3.23 | 2.99 | 3.10 | 3.11 |
IndexTTS | 3.79 | 4.20 | 4.05 | 4.01 |
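The AVG column appears to be the unweighted mean of the three sub-scores. The short check below, using only the numbers from the MOS table, confirms that every row agrees with that reading to within 0.01; the small gaps are presumably rounding, which is an assumption rather than something the document states.

```python
# (Prosody, Timbre, Quality, reported AVG), copied from the MOS table.
MOS = {
    "CosyVoice 2": (3.67, 4.05, 3.73, 3.81),
    "F5TTS": (3.56, 3.88, 3.56, 3.66),
    "Fishspeech": (3.40, 3.63, 3.69, 3.57),
    "FireRedTTS": (3.79, 3.72, 3.60, 3.70),
    "XTTS": (3.23, 2.99, 3.10, 3.11),
    "IndexTTS": (3.79, 4.20, 4.05, 4.01),
}


def mean_matches(sub_scores: tuple[float, ...], reported: float, tol: float = 0.01) -> bool:
    """True if the mean of the sub-scores is within `tol` of the reported average."""
    return abs(sum(sub_scores) / len(sub_scores) - reported) < tol


for model, (*sub_scores, reported) in MOS.items():
    mean = sum(sub_scores) / len(sub_scores)
    print(f"{model}: mean={mean:.3f} reported={reported:.2f} ok={mean_matches(tuple(sub_scores), reported)}")
```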
Usage Instructions
Environment Setup
- Download this repository:
git clone https://github.com/index-tts/index-tts.git
- Install dependencies:
Create a new conda environment and install dependencies:
conda create -n index-tts python=3.10
conda activate index-tts
apt-get install ffmpeg
# or use conda to install ffmpeg
conda install -c conda-forge ffmpeg
Install PyTorch, e.g.:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
Note
If you are using Windows, you may encounter an error when installing pynini: ERROR: Failed building wheel for pynini
In this case, please install pynini via conda:
# after conda activate index-tts
conda install -c conda-forge pynini==2.1.6
pip install WeTextProcessing --no-deps
Install IndexTTS as a package:
cd index-tts
pip install -e .
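After the installation steps above, a quick sanity check can confirm that the key packages are importable in the active environment. This is a minimal sketch: the helper name `missing_packages` is illustrative, and the module names are taken from the commands and imports in this README (`torch`, `torchaudio`, `pynini`, `indextts`). It checks only that each package can be found, not that it works.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["torch", "torchaudio", "pynini", "indextts"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Run it inside the `index-tts` conda environment; if `pynini` is listed as missing on Windows, the conda-based install described in the note is the suggested fix.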
- Download models:
Download by huggingface-cli:
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
Recommended for users in China: if the download is slow, you can use a mirror:
export HF_ENDPOINT="https://hf-mirror.com"
Or by wget:
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_discriminator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_generator.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bpe.model -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/dvae.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/gpt.pth -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/unigram_12000.vocab -P checkpoints
wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/config.yaml -P checkpoints
Note
If you prefer to use the IndexTTS-1.0 model, please replace IndexTeam/IndexTTS-1.5 with IndexTeam/IndexTTS in the above commands.
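Before running inference, it can help to confirm that every file from the download commands above actually landed in the `checkpoints` directory. The sketch below uses the file list from those commands; the helper name `missing_checkpoint_files` is illustrative, and it checks only for presence, not for complete or uncorrupted downloads.

```python
from pathlib import Path

# File names taken from the download commands in this README.
REQUIRED_FILES = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoint_files(model_dir: str = "checkpoints") -> list[str]:
    """Return the required file names that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_checkpoint_files("checkpoints")
print("Missing files:", missing if missing else "none")
```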
- Run test script:
# Please put your prompt audio in 'test_data' and rename it to 'input.wav'
python indextts/infer.py
- Use as command line tool:
# Make sure pytorch has been installed before running this command
indextts "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav
Use --help to see more options.
indextts --help
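To drive the command line tool from a script, for example to synthesize several sentences with the same reference voice, the command can be assembled in Python. This is a minimal sketch that uses only the flags shown in the example above (`--voice`, `--model_dir`, `--config`, `--output`); the helper names are illustrative and it assumes the `indextts` command is on the PATH.

```python
import shlex
import subprocess


def build_indextts_command(text: str, voice: str, output: str,
                           model_dir: str = "checkpoints") -> list[str]:
    """Assemble the indextts CLI call as an argument list (no shell quoting needed)."""
    return [
        "indextts", text,
        "--voice", voice,
        "--model_dir", model_dir,
        "--config", f"{model_dir}/config.yaml",
        "--output", output,
    ]


def synthesize(text: str, voice: str, output: str) -> None:
    """Run the CLI once; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_indextts_command(text, voice, output), check=True)


# Print the commands that would be run for a small batch, without executing them.
for i, sentence in enumerate(["Hello there.", "This is a second sentence."]):
    cmd = build_indextts_command(sentence, "reference_voice.wav", f"output_{i}.wav")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with punctuation in the input text. Each CLI call is a separate process, so the Python API shown under Sample Code is likely the better fit for large batches.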
Web Demo
pip install -e ".[webui]" --no-build-isolation
python webui.py
# use another model version:
python webui.py --model_dir IndexTTS-1.5
Open your browser and visit http://127.0.0.1:7860 to see the demo.
Sample Code
from indextts.infer import IndexTTS

# Load the model from the downloaded checkpoints
tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
voice = "reference_voice.wav"  # reference (prompt) audio of the voice to clone
text = "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!比如说,现在正在说话的其实是B站为我现场复刻的数字分身,简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能,可以访问 bilibili studio,相信我,你们也会吃惊的。"
output_path = "output.wav"  # where the synthesized audio is written
tts.infer(voice, text, output_path)
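When several sentences share one reference voice, the model can be loaded once and reused. The sketch below assumes only the `infer(voice, text, output_path)` call shown in the sample code; the helper name `synthesize_batch`, the `outputs` directory, and the `utt_000.wav` naming scheme are illustrative choices.

```python
from pathlib import Path


def synthesize_batch(tts, voice: str, texts: list[str], out_dir: str = "outputs") -> list[Path]:
    """Synthesize each text with the same reference voice and return the output paths.

    `tts` is any object exposing infer(voice, text, output_path),
    such as the IndexTTS instance from the sample code.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = out / f"utt_{i:03d}.wav"
        tts.infer(voice, text, str(path))
        paths.append(path)
    return paths


# Usage (requires the installed package and downloaded checkpoints):
#   from indextts.infer import IndexTTS
#   tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
#   synthesize_batch(tts, "reference_voice.wav", ["First sentence.", "Second sentence."])
```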
Acknowledgements
📚 Citation
🌟 If you find our work helpful, please leave us a star and cite our paper.
@article{deng2025indextts,
title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},
author={Wei Deng and Siyi Zhou and Jingchen Shu and Jinchao Wang and Lu Wang},
journal={arXiv preprint arXiv:2502.05512},
year={2025}
}