Connectionist Temporal Classification (CTC) automatic speech recognition. Training and decoding are extremely fast.
kaldi-ctc is based on kaldi, warp-ctc and cudnn.
Components | Role |
---|---|
kaldi | Parent project: data preparation and building the decoding WFST |
warp-ctc | Fast parallel implementation of CTC |
cudnn (v5.x) | Fast recurrent neural networks (LSTM, GRU, ReLU, Tanh) |
```bash
# Install dependencies
cd tools
make -j
make openblas
# Install cuDNN; see the script `extras/install_cudnn.sh`
bash extras/install_cudnn.sh  # only downloads cuDNN; copy the include/lib[64] dirs into your CUDA path yourself (see the sketch below)
```
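The manual copy step might look like the following sketch; the archive name and the `/usr/local/cuda` prefix are assumptions, so adjust them to your download and CUDA installation.

```bash
# Hypothetical example: extract the downloaded cuDNN v5.x archive and copy
# its headers and libraries into the system CUDA directory.
# Archive name and /usr/local/cuda are assumptions; adjust as needed.
tar -xzf cudnn-8.0-linux-x64-v5.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo ldconfig
```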
```bash
cd ../src
# Replace `YOUR_CUDNN_ROOT` with the path to your cuDNN installation
./configure --cudnn-root=YOUR_CUDNN_ROOT --openblas-root=../tools/OpenBLAS/install
make depend -j
make -j
```
Make sure the GPU has enough memory; the default settings run on a GTX TITAN X / GTX 1080 (>= 8 GB). If your GPUs are older, use a smaller `minibatch_size` (default 16) or `max_allow_frames` (default 2000), or a bigger `frame_subsampling_factor` (default 1), as in the sketch below.
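For example, a run on a smaller-memory GPU might look like the following. This is only a sketch: the flag names assume `run.sh` exposes these variables through Kaldi's standard `utils/parse_options.sh` mechanism, so verify them against the script (run from the recipe directory, e.g. `egs/librispeech/ctc`).

```bash
# Hypothetical invocation for an older / smaller-memory GPU.
# Flag names assume run.sh parses these variables with Kaldi's
# utils/parse_options.sh; check the script before relying on them.
bash run.sh --stage -2 --num-gpus 1 \
    --minibatch-size 8 \
    --max-allow-frames 1000 \
    --frame-subsampling-factor 3
```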
```bash
cd egs/librispeech/ctc
bash run.sh --stage -2 --num-gpus 4   # set --num-gpus to the number of GPUs you have
steps/ctc/report/generate_plots.py exp/ctc/cudnn_google_fs3 reports/ctc-google
```
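Because the recipe is staged, a partially completed run can be resumed from a later stage. The numbering below is an assumption; inspect `run.sh` to confirm what each stage covers.

```bash
# Resume after the early data-preparation stages (stage numbering is an
# assumption; read run.sh to see what each stage actually does).
bash run.sh --stage 0 --num-gpus 4
```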
Results are WER (%) on LibriSpeech:

Models | Real Time Factor (RTF) | test_clean | dev_clean | test_other | dev_other
---|---|---|---|---|---
chain | | 6.20 | 5.83 | 14.73 | 14.56
CTC-monophone | (0.05 ~ 0.06) / frame_subsampling_factor | 8.63 | 9.02 | 20.75 | 22.16
CTC-character | | | | |
- There are currently many out-of-vocabulary words (OOVs) in the training transcriptions; the command below counts the unique OOV types:
```bash
# Count unique training words missing from the word list: the first file
# builds the vocabulary table T, then every non-ID field of the
# transcriptions that is not in T is printed and counted.
awk 'FNR==NR{T[$1]=1;} FNR<NR{for(i=2;i<=NF;i++) {if (!($i in T)) print $i;}}' \
    data/lang_nosp/words.txt data/train_960/text | sort -u | wc -l
# => 14291
```
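The same pipeline can be adapted to inspect the OOVs instead of just counting them, for example listing the most frequent ones:

```bash
# List the 20 most frequent OOV tokens with their counts
awk 'FNR==NR{T[$1]=1;} FNR<NR{for(i=2;i<=NF;i++) {if (!($i in T)) print $i;}}' \
    data/lang_nosp/words.txt data/train_960/text | sort | uniq -c | sort -rn | head -20
```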
- The CTC system gets better results than the chain system on a larger corpus.