This is an implementation of the model proposed by the SberDevices team, extended by me with a faster TP-GST module.
The training process was carried out on a Russian-language dataset.
- Tacotron2 Encoder + Decoder
- Global Style Tokens module
- 3 Text-predicting style embedding models
- BERT model
- NVIDIA GPU + CUDA cuDNN
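The text-predicting style embedding idea above can be sketched minimally. Following Stanton et al. (TP-GST), a small head runs over the Tacotron2 text-encoder outputs and predicts the style embedding directly, so no reference audio is needed at inference. All dimensions and names below are illustrative assumptions, not this repo's actual modules:

```python
import torch
import torch.nn as nn

class TPSE(nn.Module):
    """Minimal text-predicted style embedding (TPSE) sketch.

    A GRU summarizes the text-encoder outputs; its final hidden
    state is projected to a style embedding in place of one
    computed from reference audio.
    """
    def __init__(self, text_dim=512, gru_dim=128, style_dim=256):
        super().__init__()
        self.gru = nn.GRU(text_dim, gru_dim, batch_first=True)
        self.proj = nn.Linear(gru_dim, style_dim)

    def forward(self, text_encodings):
        # text_encodings: (batch, time, text_dim)
        _, h = self.gru(text_encodings)              # h: (1, batch, gru_dim)
        return torch.tanh(self.proj(h.squeeze(0)))   # (batch, style_dim)

tpse = TPSE()
dummy = torch.randn(2, 50, 512)   # batch of 2, 50 encoder steps
style = tpse(dummy)
print(style.shape)                # torch.Size([2, 256])
```

The `tanh` keeps the predicted embedding in the same bounded range as a GST-style attention output.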
- Clone this repo:
git clone https://github.com/lightbooster/TP-GST-BERT-Tacotron2.git
- `cd` into this repo:
cd TP-GST-BERT-Tacotron2
- Initialize submodule:
git submodule init; git submodule update
- Install [PyTorch]
- Install [Apex]
- Install python requirements or build docker image
- Install python requirements:
pip install -r requirements.txt
NOTE: an elaborated setup example is provided in the notebook demo.ipynb
- Download BERT checkpoint (I used RuBERT from deeppavlov.ai)
- Move the BERT checkpoint, config and vocabulary into the /bert folder, or set up the related paths in hparams.py
- Modify the BERT hyperparameters in hparams.py if needed
- Update the filelists inside the filelists folder to point to your data
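In NVIDIA's Tacotron2 codebase, each filelist line pairs an audio path with its transcript, separated by `|`. A minimal check of that format (the path and sentence below are placeholders):

```python
# One filelist entry: "audio_path|transcript" (placeholder values).
line = "data/wavs/utt_0001.wav|Пример обучающей фразы."
wav_path, text = line.strip().split("|", 1)
print(wav_path)  # data/wavs/utt_0001.wav
print(text)      # Пример обучающей фразы.
```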
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) Monitor training with Tensorboard:
tensorboard --logdir=outdir/logdir
Training from a pre-trained model can lead to faster convergence.
By default, the speaker embedding layer is [ignored]
- Download my pretrained model checkpoint, trained on a Russian-language dataset. NOTE: the checkpoint doesn't contain BERT model weights; use a separate checkpoint for those
python train.py --output_directory=outdir --log_directory=logdir -c {PATH_TO_CHECKPOINT} --warm_start
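Warm-starting typically loads the checkpoint's weights while dropping the layers to be re-initialized (the speaker embedding by default, as noted above). A minimal sketch of that filtering, with dummy tensors and illustrative layer names that are my assumptions, not the repo's exact keys:

```python
import torch

def warm_start_filter(state_dict, ignore_layers):
    """Drop the listed layers so they are re-initialized after loading."""
    return {k: v for k, v in state_dict.items() if k not in ignore_layers}

# Dummy checkpoint: a speaker embedding plus one decoder weight.
ckpt = {
    "speaker_embedding.weight": torch.zeros(10, 16),
    "decoder.linear.weight": torch.ones(4, 4),
}
filtered = warm_start_filter(ckpt, ignore_layers=["speaker_embedding.weight"])
print(sorted(filtered))  # ['decoder.linear.weight']
```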
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
M-AILABS data preprocessing, the training configuration and inference demos are presented in the notebook demo.ipynb
WaveGlow: a faster-than-real-time, flow-based generative network for speech synthesis.
- Nvidia's Mellotron (Tacotron2 + GST) is the basis of this work
- Stanton, D., Wang, Y., & Skerry-Ryan, R. J. (2018). Predicting Expressive Speaking Style from Text in End-to-End Speech Synthesis. https://arxiv.org/abs/1808.01410
- Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., & Saurous, R. A. (2018). Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. https://arxiv.org/abs/1803.09047
- SberDevices. Speech synthesis for the Salute virtual assistants. https://habr.com/ru/company/sberdevices/blog/548812/