
Seed-VC: A Zero-Shot Voice Conversion and Singing Voice Conversion Tool with Real-Time Support



Seed-VC



(Demo video: real-time-demo.webm)

The currently released models support zero-shot voice conversion 🔊, zero-shot real-time voice conversion 🗣️ and zero-shot singing voice conversion 🎶. Without any training, they can clone a voice given a reference speech of 1~30 seconds.

We support further fine-tuning on custom data to improve performance on specific speakers, with an extremely low data requirement (a minimum of 1 utterance per speaker) and extremely fast training (a minimum of 100 steps, about 2 minutes on a T4)!

Real-time voice conversion is supported, with an algorithm delay of ~300ms and a device-side delay of ~100ms, making it suitable for online meetings, gaming and live streaming.

To find a list of demos and comparisons with previous voice conversion models, please visit our demo page🌐 and Evaluation📊.

We are continuously improving the model quality and adding more features.

Evaluation📊

See EVAL.md for objective evaluation results and comparisons with other baselines.

Installation📥

Python 3.10 is suggested on Windows, Mac M Series (Apple Silicon) or Linux.

Windows and Linux:

pip install -r requirements.txt

Mac M Series:

pip install -r requirements-mac.txt
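
If you need to set up a suitable environment first, here is one minimal sketch, assuming conda is installed (conda itself is not required by the project; any Python 3.10 environment works):

conda create -n seed-vc python=3.10
conda activate seed-vc
pip install -r requirements.txt  # or requirements-mac.txt on Mac M Series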

Usage🛠️

We have released 3 models for different purposes:

| Version | Name | Purpose | Sampling Rate | Content Encoder | Vocoder | Hidden Dim | N Layers | Params | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v1.0 | seed-uvit-tat-xlsr-tiny (🤗📄) | Voice Conversion (VC) | 22050 | XLSR-large | HIFT | 384 | 9 | 25M | suitable for real-time voice conversion |
| v1.0 | seed-uvit-whisper-small-wavenet (🤗📄) | Voice Conversion (VC) | 22050 | Whisper-small | BigVGAN | 512 | 13 | 98M | suitable for offline voice conversion |
| v1.0 | seed-uvit-whisper-base (🤗📄) | Singing Voice Conversion (SVC) | 44100 | Whisper-small | BigVGAN | 768 | 17 | 200M | strong zero-shot performance, singing voice conversion |

Checkpoints of the latest model release will be downloaded automatically the first time inference is run.
If you are unable to access Hugging Face for network reasons, try using a mirror by adding HF_ENDPOINT=https://hf-mirror.com before every command.
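
For example, on Linux or macOS the variable can be prefixed directly (on Windows, set the environment variable first instead); the arguments here are just the placeholders from the command below:

HF_ENDPOINT=https://hf-mirror.com python inference.py --source <source-wav> --target <reference-wav> --output <output-dir>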

Command line inference:

python inference.py --source <source-wav>
--target <reference-wav>
--output <output-dir>
--diffusion-steps 25 # recommended 30~50 for singing voice conversion
--length-adjust 1.0
--inference-cfg-rate 0.7
--f0-condition False # set to True for singing voice conversion
--auto-f0-adjust False # set to True to auto adjust source pitch to target pitch level, normally not used in singing voice conversion
--semi-tone-shift 0 # pitch shift in semitones for singing voice conversion
--checkpoint <path-to-checkpoint>
--config <path-to-config>
--fp16 True

where:
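
As a concrete sketch of a singing voice conversion run (the audio file names are hypothetical, the checkpoint path is left as a placeholder, and the 44.1kHz SVC preset config listed in the Training section is assumed):

python inference.py --source vocal.wav \
  --target singer_reference.wav \
  --output ./converted \
  --diffusion-steps 50 \
  --length-adjust 1.0 \
  --inference-cfg-rate 0.7 \
  --f0-condition True \
  --auto-f0-adjust False \
  --semi-tone-shift 0 \
  --checkpoint <path-to-checkpoint> \
  --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml \
  --fp16 True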

Voice Conversion Web UI:

python app_vc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --fp16 True

Then open the browser and go to http://localhost:7860/ to use the web interface.

Singing Voice Conversion Web UI:

python app_svc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --fp16 True

Integrated Web UI:

python app.py

This will only load pretrained models for zero-shot inference. To use custom checkpoints, please run app_vc.py or app_svc.py as above.

Real-time voice conversion GUI:

python real-time-gui.py --checkpoint-path <path-to-checkpoint> --config-path <path-to-config>
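
For example, pairing the GUI with the real-time (xlsr-tiny) preset config listed in the Training section below (the checkpoint path is left as a placeholder):

python real-time-gui.py --checkpoint-path <path-to-checkpoint> --config-path ./configs/presets/config_dit_mel_seed_uvit_xlsr_tiny.yml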

Important

It is strongly recommended to use a GPU for real-time voice conversion. Some performance testing has been done on an NVIDIA RTX 3060 Laptop GPU; results and recommended parameter settings are listed below:

| Model Configuration | Diffusion Steps | Inference CFG Rate | Max Prompt Length (s) | Block Time (s) | Crossfade Length (s) | Extra context (left) (s) | Extra context (right) (s) | Latency (ms) | Inference Time per Chunk (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| seed-uvit-xlsr-tiny | 10 | 0.7 | 3.0 | 0.18 | 0.04 | 2.5 | 0.02 | 430 | 150 |

You can adjust the parameters in the GUI according to your own device performance; the voice conversion stream should work well as long as the Inference Time is less than the Block Time.
Note that inference speed may drop if you are running other GPU-intensive tasks (e.g. gaming, watching videos).
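
For example, with the recommended settings above, the Inference Time per Chunk (150 ms) is below the Block Time (0.18 s = 180 ms), so the real-time condition is satisfied.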

Explanations for real-time voice conversion GUI parameters:

The algorithm delay is approximately calculated as Block Time * 2 + Extra context (right); device-side delay is usually ~100ms. The overall delay is the sum of the two.
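
Plugging in the recommended settings above gives an algorithm delay of roughly 0.18 s * 2 + 0.02 s = 0.38 s (about 380 ms).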

You may wish to use VB-CABLE to route audio from GUI output stream to a virtual microphone.

(The GUI and audio chunking logic are modified from RVC; thanks for their brilliant implementation!)

Training🏋️

Fine-tuning on custom data allows the model to clone someone’s voice more accurately. It will largely improve speaker similarity on particular speakers, but may slightly increase WER.
A Colab tutorial is available for you to follow: Open In Colab

  1. Prepare your own dataset. It has to satisfy the following:

    • File structure does not matter
    • Each audio file should range from 1 to 30 seconds, otherwise it will be ignored
    • All audio files should be in one of the following formats: .wav .flac .mp3 .m4a .opus .ogg
    • Speaker label is not required, but make sure that each speaker has at least 1 utterance
    • Of course, the more data you have, the better the model will perform
    • Training data should be as clean as possible; background music (BGM) or noise is not desired
  2. Choose a model configuration file from configs/presets/ for fine-tuning, or create your own to train from scratch.

    • For fine-tuning, it should be one of the following:

      • ./configs/presets/config_dit_mel_seed_uvit_xlsr_tiny.yml for real-time voice conversion
      • ./configs/presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml for offline voice conversion
      • ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml for singing voice conversion
  3. Run the following command to start training:

python train.py 
--config <path-to-config> 
--dataset-dir <path-to-data>
--run-name <run-name>
--batch-size 2
--max-steps 1000
--max-epochs 1000
--save-every 500
--num-workers 0

where:
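
For instance, a fine-tuning run on the real-time (xlsr-tiny) preset might look like the following sketch (the dataset directory and run name are hypothetical):

python train.py \
  --config ./configs/presets/config_dit_mel_seed_uvit_xlsr_tiny.yml \
  --dataset-dir ./data/my_speaker \
  --run-name my_speaker_tiny \
  --batch-size 2 \
  --max-steps 1000 \
  --max-epochs 1000 \
  --save-every 500 \
  --num-workers 0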

  1. If training accidentally stops, you can resume it by running the same command again; training will continue from the last checkpoint. (Make sure the run-name and config arguments are the same so that the latest checkpoint can be found.)

  2. After training, you can use the trained model for inference by specifying the path to the checkpoint and config file.

    • They should be under ./runs/<run-name>/, with the checkpoint named ft_model.pth and a config file with the same name as the training config file.
    • You still have to specify a reference audio file of the speaker you’d like to use during inference, similar to zero-shot usage (see the example below).
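
Continuing the hypothetical run above, inference with the fine-tuned checkpoint could look like this (the run name is made up, and the config file name is assumed to mirror the training preset):

python inference.py --source <source-wav> \
  --target <reference-of-target-speaker> \
  --output <output-dir> \
  --diffusion-steps 25 \
  --checkpoint ./runs/my_speaker_tiny/ft_model.pth \
  --config ./runs/my_speaker_tiny/config_dit_mel_seed_uvit_xlsr_tiny.yml \
  --fp16 True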

TODO📝

Known Issues

CHANGELOGS🗒️

