EMOTTS: Multilingual Emotion-Controlled Voice Cloning Text-to-Speech System

EMOTTS is a TTS model based on VITS that controls the emotion of the output speech through natural-language descriptions and the speaker identity through reference audio.

Create Env

conda create -n emo python=3.8
conda activate emo

Install packages

pip install -r requirements.txt
python env.py

Download Pre-trained Model

Download the model from this link, then put the files into /chinese-roberta-wwm-ext
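If the checkpoint behind the link is the publicly released hfl/chinese-roberta-wwm-ext (an assumption; the repo does not confirm the exact checkpoint here), it can also be fetched with Hugging Face transformers and saved into the expected folder:

```python
# Hedged sketch: fetch chinese-roberta-wwm-ext from the Hugging Face Hub.
# Assumes the repo expects the public "hfl/chinese-roberta-wwm-ext" weights;
# if the linked checkpoint differs, use the repo's download link instead.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

tokenizer.save_pretrained("chinese-roberta-wwm-ext")
model.save_pretrained("chinese-roberta-wwm-ext")
```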

Collecting Data

Collect the data by this.

Preprocessing

Use this code to complete the following preprocessing steps (a hedged sketch of steps 1 and 2 follows the commands below):

  1. Convert the audio to a single channel, resample it to 22050 Hz, and save it in WAV format.
  2. Merge and slice the audio into 10-second segments.
  3. Use ASR to transcribe the speech to text.
  4. Store the audio, emotion, and text in three folders with matching file names.
# Store each audio path with its corresponding text and emotion, then split into training and validation sets.
python getdata.py
python split.py
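For reference, here is a minimal sketch of steps 1 and 2 (downmixing/resampling and 10-second slicing), assuming librosa and soundfile are available. It is an illustration only, not the repo's getdata.py, and the ASR step is omitted:

```python
# Hedged preprocessing sketch: convert audio to mono 22050 Hz WAV and
# slice it into 10 s segments. Illustrative only; not the repo's getdata.py.
from pathlib import Path

import librosa
import soundfile as sf

def preprocess(in_path: str, out_dir: str, sr: int = 22050, seg_s: float = 10.0):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # librosa resamples to `sr` and downmixes to mono on load.
    wav, _ = librosa.load(in_path, sr=sr, mono=True)
    hop = int(sr * seg_s)
    for k in range(0, len(wav), hop):
        sf.write(out / f"seg_{k // hop:04d}.wav", wav[k:k + hop], sr)
```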

Build Monotonic Alignment Search

cd monotonic_align
python setup.py build_ext --inplace
cd ..
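The extension compiled here is the monotonic alignment search (MAS) used by VITS and Glow-TTS: a dynamic program that finds the highest-likelihood monotonic alignment between text tokens and spectrogram frames. A slow pure-Python illustration of the same idea (not the repo's Cython implementation):

```python
# Hedged sketch of monotonic alignment search; the repo's Cython extension
# performs the same dynamic program in compiled loops.
import numpy as np

def monotonic_alignment_search(log_p):
    """Best monotonic path through a (T_text, T_mel) log-likelihood matrix:
    each mel frame is assigned to one text token, indices never decrease."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # token i unreachable before frame i
            stay = Q[i, j - 1]
            move = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack from the last token/frame to recover the binary path.
    path = np.zeros(log_p.shape, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return path
```

The pure-Python version is far too slow for training-time use, which is why the Cython extension must be built before training.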

Training

python train.py -c path/to/json -m model

Here -c points to the training configuration JSON and -m names the model/run directory (the usual VITS train.py convention; check train.py for the exact flags).

Inference

python infer.py

Per the model description, inference conditions on a reference audio for the speaker identity and a natural-language description for the emotion; see infer.py for the expected inputs.
