The dataset has two versions of transcriptions: one for word recognition and the other for phone recognition. We use both transcriptions in the experiments, so the performance of speech recognition can be measured by both character error rate (CER) and phone error rate (PER).
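As a concrete illustration, the sketch below shows how such token-level error rates are computed: an edit distance over character tokens gives the CER, and over phone tokens the PER. This is a generic Python illustration, not the scoring script used in our experiments; the reported numbers come from the standard Kaldi scoring.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution or match
            prev, d[j] = d[j], cur
    return d[-1]

def error_rate(ref, hyp):
    """Error rate in percent: edit distance divided by the number of reference tokens."""
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

# CER uses character tokens, PER uses phone tokens.
ref = list("今天天气很好")
hyp = list("今天天汽好")
print("CER = %.2f%%" % error_rate(ref, hyp))  # one substitution + one deletion over 6 characters: 33.33%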
The implementation of per-element dropout is illustrated in Fig. 1 with the red color.
B. Per-frame Dropout
For per-frame dropout, we apply dropout to the three gates: a mask is applied to each gate. Thus, Equations (7), (8) and (11) of the LSTMP are changed into Equations (22), (23) and (24).
$i_t = m_t^{(i)} \cdot \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i)$    (22)

$f_t = m_t^{(f)} \cdot \sigma(W_{fx} x_t + W_{fr} r_{t-1} + W_{fc} c_{t-1} + b_f)$    (23)

$o_t = m_t^{(o)} \cdot \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)$    (24)
Since the dataset we use is not very large, data augmentation may improve performance. Here, we use the simplest data augmentation, speed perturbation [18], which changes the speed of the audio signal to produce three versions of the original signal with speed factors of 0.9, 1.0 and 1.1.
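As an illustration of the idea, the sketch below produces the three speed copies of a waveform by resampling. The function and array names are ours, and the actual perturbation in our setup is performed by the standard Kaldi data-preparation scripts (which use sox), so this is only a sketch of the effect.

import numpy as np

def speed_perturb(wave, factor):
    """Resample a 1-D waveform so that it plays back `factor` times faster.

    factor = 0.9 lengthens (slows down) the signal, factor = 1.1 shortens
    (speeds up) it; the pitch changes together with the speed.
    """
    n_out = int(round(len(wave) / factor))
    src_pos = np.arange(n_out) * factor          # where each output sample is read from
    return np.interp(src_pos, np.arange(len(wave)), wave)

# 3-way perturbation: every utterance yields three training copies.
wave = np.random.randn(16000)                    # one second of dummy audio at 16 kHz
copies = {f: speed_perturb(wave, f) for f in (0.9, 1.0, 1.1)}
for f, w in copies.items():
    print(f, len(w))                             # 0.9 -> 17778, 1.0 -> 16000, 1.1 -> 14545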
B. Neural Networks
The baseline LSTM is the LSTMP described in Section II. The baseline LSTMP has three LSTMP layers plus a final fully-connected output layer. The LSTMP cell dimension is 1024, and the dimensions of the LSTMP output and recurrent connection are 512 and 256 respectively. The total number of parameters is about 12M.
In Equations (22)-(24), $m_t^{(g)}$, $g \in \{i, f, o\}$, are the per-frame masks for the input gate, forget gate and output gate respectively:

$m_t^{(g)} = \mathrm{Bernoulli}(1 - p), \quad g \in \{i, f, o\}$    (25)

where $p$ is the dropout probability.
The implementation of per-frame dropout is illustrated in
Fig. 1 with the green color.
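To make Equations (22)-(25) concrete, the following is a minimal numpy sketch of a single LSTMP time step with per-frame dropout on the three gates. The weight names and the tiny dimensions in the example are ours (the real model is the Kaldi nnet3 LSTMP described in Section IV); the point is that each gate is multiplied by one Bernoulli(1 - p) scalar per frame, shared across all cell units.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, W, p_drop=0.3, train=True):
    """One LSTMP time step with per-frame dropout on the input, forget and output gates."""
    if train:
        # One Bernoulli(1 - p) scalar per gate per frame, Eq. (25).
        m_i, m_f, m_o = np.random.binomial(1, 1.0 - p_drop, size=3).astype(float)
    else:
        m_i = m_f = m_o = 1.0                                  # no dropout at test time
    i = m_i * sigmoid(W["W_ix"] @ x + W["W_ir"] @ r_prev + W["w_ic"] * c_prev + W["b_i"])  # (22)
    f = m_f * sigmoid(W["W_fx"] @ x + W["W_fr"] @ r_prev + W["w_fc"] * c_prev + W["b_f"])  # (23)
    c = f * c_prev + i * np.tanh(W["W_cx"] @ x + W["W_cr"] @ r_prev + W["b_c"])
    o = m_o * sigmoid(W["W_ox"] @ x + W["W_or"] @ r_prev + W["w_oc"] * c + W["b_o"])       # (24)
    m_out = o * np.tanh(c)
    r = W["W_rm"] @ m_out                                      # recurrent projection
    p_proj = W["W_pm"] @ m_out                                 # non-recurrent projection
    return np.concatenate([r, p_proj]), r, c                   # layer output, recurrent state, cell

# Tiny random example (cell = 8, recurrent = 2, input = 4) just to show the shapes.
rng = np.random.default_rng(0)
cell, rec, inp = 8, 2, 4
W = {k: rng.standard_normal((cell, inp)) * 0.1 for k in ("W_ix", "W_fx", "W_cx", "W_ox")}
W.update({k: rng.standard_normal((cell, rec)) * 0.1 for k in ("W_ir", "W_fr", "W_cr", "W_or")})
W.update({k: rng.standard_normal(cell) * 0.1 for k in ("w_ic", "w_fc", "w_oc", "b_i", "b_f", "b_c", "b_o")})
W.update({k: rng.standard_normal((rec, cell)) * 0.1 for k in ("W_rm", "W_pm")})
y, r, c = lstmp_step(rng.standard_normal(inp), np.zeros(rec), np.zeros(cell), W)
print(y.shape, r.shape, c.shape)                               # (4,) (2,) (8,)

For comparison, per-element dropout would use a mask vector of the same dimension as the gate instead of a single scalar per frame; per-frame dropout drops an entire gate for the whole frame at once.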
The label delays of the LSTM layers in the neural network are -1 for the first layer, -2 for the second layer and -3 for the third layer.
C. Dropout Schedule
Another question about dropout is the dropout schedule, i.e., how the dropout probability should vary over the course of training. In this work, we adopt the dropout schedule in [15], which is the common practice in Kaldi [17].
The input to the neural network is 40-dimensional Mel-frequency cepstral coefficients (MFCC) of 5 consecutive frames (±2) together with a 100-dimensional i-vector. The MFCC features are extracted every 10 ms with a window size of 25 ms. The i-vector features are used for speaker adaptation [19], so cepstral mean and variance normalization is no longer needed.
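As a quick illustration of the resulting input dimensionality, the sketch below splices ±2 frames of 40-dim MFCCs and appends a 100-dim i-vector, giving 300 dimensions per frame. The function is ours and purely illustrative; in our setup the splicing and the i-vector input are handled inside the Kaldi nnet3 configuration.

import numpy as np

def splice_with_ivector(mfcc, ivector, context=2):
    """Stack each frame with +-`context` neighbours and append the utterance i-vector.

    mfcc: (T, 40) feature matrix; ivector: (100,) vector.
    Returns a (T, 40 * 5 + 100) = (T, 300) input matrix; edges are padded by repetition.
    """
    T, _ = mfcc.shape
    padded = np.vstack([mfcc[:1]] * context + [mfcc] + [mfcc[-1:]] * context)
    spliced = np.hstack([padded[t : t + T] for t in range(2 * context + 1)])
    return np.hstack([spliced, np.tile(ivector, (T, 1))])

x = splice_with_ivector(np.random.randn(200, 40), np.random.randn(100))
print(x.shape)                                   # (200, 300)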
With this schedule, the dropout probability starts at zero and stays at zero for the first 20% of training; it then increases linearly until 50% of training is completed, after which it decreases linearly, reaching zero again at the end of training.
The output of the neural network is 3456 tied triphone states [20], obtained by forced alignment with a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) [21] acoustic model. The HMM-GMM baseline is trained with the default Kaldi recipe.
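As a sanity check on the "about 12M parameters" figure, the short sketch below counts the parameters of the architecture described above: a 300-dimensional input, three LSTMP layers with cell dimension 1024, a 512-dimensional layer output taken as the concatenation of 256-dimensional recurrent and non-recurrent projections, and the 3456-way output layer. The split of the 512-dimensional output into two 256-dimensional projections is our assumption based on the usual LSTMP layout, and the exact bookkeeping of the Kaldi component may differ slightly.

def lstmp_params(input_dim, cell, rec, nonrec):
    """Rough parameter count of one LSTMP layer with peepholes and projection."""
    gates = 4 * cell * (input_dim + rec)     # the four input/recurrent weight blocks
    peep_bias = 3 * cell + 4 * cell          # diagonal peephole weights + biases
    proj = (rec + nonrec) * cell             # projection from the cell output to the 512-dim output
    return gates + peep_bias + proj

input_dim, cell, rec, nonrec, out_dim, targets = 300, 1024, 256, 256, 512, 3456
total = lstmp_params(input_dim, cell, rec, nonrec)        # first layer, 300-dim input
total += 2 * lstmp_params(out_dim, cell, rec, nonrec)     # second and third layers, 512-dim input
total += out_dim * targets + targets                      # final fully-connected output layer
print("%.1fM parameters" % (total / 1e6))                 # about 11.9M, i.e. roughly 12M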
We refer to the peak dropout probability simply as the dropout probability p. With p = 0.3, the dropout schedule looks like the one in Fig. 2.
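The schedule can be written as a simple function of the fraction of training completed: zero for the first 20%, a linear ramp up to the peak value p at 50%, and a linear ramp back down to zero at the end of training. The sketch below is only an illustration; in our experiments the schedule is passed to the Kaldi nnet3 training script through its dropout-schedule option.

def dropout_prob(progress, peak=0.3):
    """Dropout probability as a function of training progress in [0, 1].

    Zero for the first 20% of training, rises linearly to `peak` at 50%,
    then falls linearly back to zero at the end of training.
    """
    if progress < 0.2:
        return 0.0
    if progress < 0.5:
        return peak * (progress - 0.2) / 0.3
    return peak * (1.0 - progress) / 0.5

for frac in (0.0, 0.2, 0.35, 0.5, 0.75, 1.0):
    print(frac, round(dropout_prob(frac), 3))    # 0.0, 0.0, 0.15, 0.3, 0.15, 0.0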
C. Toolkit and Hyper-parameters
All the experiments are done with Kaldi [17], a widely used open-source speech recognition toolkit. We use Kaldi's nnet3 architecture.
The networks were trained on a cluster with several CPU and GPU machines. We use the parallel training strategy with parameter averaging in [22], starting with 2 GPUs and finishing with 8 GPUs.
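The following is a minimal illustration of the parameter-averaging idea behind this parallel strategy [22]: several workers train copies of the model on different data for one iteration and the copies are then averaged, while the number of workers grows from 2 to 8 over training. The model and the training step here are placeholders, not the actual Kaldi implementation.

import numpy as np

def average_models(models):
    """Average corresponding parameters of several model copies (as in [22])."""
    return {name: np.mean([m[name] for m in models], axis=0) for name in models[0]}

def train_one_iter(model, seed):
    """Placeholder for one SGD iteration on one worker's shard of the data."""
    rng = np.random.default_rng(seed)
    return {name: w - 0.001 * rng.standard_normal(w.shape) for name, w in model.items()}

model = {"W": np.zeros((4, 4)), "b": np.zeros(4)}
for it, num_jobs in enumerate([2, 2, 4, 4, 8, 8]):   # the number of GPUs grows from 2 to 8
    copies = [train_one_iter(model, seed=100 * it + j) for j in range(num_jobs)]
    model = average_models(copies)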
Figure 2. A dropout schedule with p = 0.3.
The learning rate is 0.0003 at the beginning of training and 0.00003 at the end. The momentum value is 0.5. The training is based on context-sensitive chunks as in [23], with a chunk width of 20 frames and a left context of 40 frames. The right context is set to 0 frames, since we train only unidirectional LSTMs. On every GPU, the batch size is 100 chunks, and a label delay of 5 is used to improve the performance [3].
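To illustrate these settings, the sketch below cuts one training chunk with a chunk width of 20 frames, a left context of 40 frames and no right context, and applies the label delay of 5 by pairing the output at frame t with the label of frame t - 5, so the network sees 5 extra frames of future before committing to a label. The indexing conventions and the function name are ours, not Kaldi's.

import numpy as np

def cut_chunk(feats, labels, start, chunk_width=20, left_context=40, label_delay=5):
    """Cut one context-sensitive training chunk (right context is 0)."""
    lo = start - left_context
    chunk_feats = feats[lo : start + chunk_width]                    # 40 + 20 = 60 input frames
    chunk_labels = labels[start - label_delay : start + chunk_width - label_delay]
    return chunk_feats, chunk_labels                                 # 20 supervised frames

feats = np.random.randn(500, 300)            # 300-dim network inputs (spliced MFCC + i-vector)
labels = np.random.randint(0, 3456, 500)     # tied triphone-state targets from forced alignment
f, l = cut_chunk(feats, labels, start=100)
print(f.shape, l.shape)                      # (60, 300) (20,)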
IV. EXPERIMENTS SETUP

This section gives details of the experimental setup, including the dataset, data augmentation, the neural network architecture and the hyper-parameters.

A. Dataset and Data Augmentation

The corpus used in our experiments is THCHS-30 [16], the Tsinghua 30-hour Chinese corpus, which is open-sourced by Tsinghua University. The data is split into a 25-hour training set and a 6-hour testing set, with 50 and 10 speakers respectively.

V. RESULTS AND DISCUSSIONS

This section presents our results and gives some discussion. The results are shown in Table I.

TABLE I. THE CER AND PER RESULTS OF SPEECH RECOGNITION

Acoustic Model   Dropout Method   CER (%)   PER (%)
LSTMP            None             22.02     8.91
LSTMP            Per-element      21.77     8.17
LSTMP            Per-frame        20.85     7.92
LSTMP            Both             20.74     7.76

The baseline LSTMP-based acoustic model achieves a