
The implementation of per-element dropout is illustrated in Fig. 1 with the red color.

B. Per-frame Dropout

For per-frame dropout, we implement the dropout on the three gates: a mask is applied to the activation of every gate. Thus, Equations (7), (8) and (11) of the LSTMP become Equations (22), (23) and (24):

$i_t = m_i^t \odot \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i)$  (22)

$f_t = m_f^t \odot \sigma(W_{fx} x_t + W_{fr} r_{t-1} + W_{fc} c_{t-1} + b_f)$  (23)

$o_t = m_o^t \odot \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)$  (24)

where $m_i^t$, $m_f^t$ and $m_o^t$ are the masks for the input gate, forget gate and output gate respectively. Each mask is drawn once per frame:

$m_g^t = \mathrm{Bernoulli}(1 - p), \quad g = i, f, o$  (25)

The implementation of per-frame dropout is illustrated in Fig. 1 with the green color.

C. Dropout Schedule

Another question about dropout is the dropout schedule, i.e. how to design the probability schedule over the course of training. In this work we adopt the dropout schedule of [15], which is the standard setting in Kaldi [17]. In this schedule, the dropout probability starts at zero; after 20% of the training is completed, the probability increases linearly, and after 50% of the training it decreases linearly, reaching zero again when training finishes. We will refer to the peak dropout probability simply as the dropout probability $p$. With $p = 0.3$, the dropout schedule looks like the one in Fig. 2.

Figure 2. A dropout schedule with $p = 0.3$.
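To make Eqs. (22)-(25) and the schedule of Section III.C concrete, here is a minimal NumPy sketch. It is not the Kaldi implementation used in the experiments: the function and variable names are illustrative, only the 20%/50% breakpoints and the Bernoulli(1 - p) draw per gate per frame are taken from the text above, and any rescaling of the kept activations is deliberately omitted.

```python
# Minimal sketch of per-frame gate dropout (Eqs. 22-25) and the
# trapezoidal dropout schedule of Section III.C.  Names are illustrative,
# not taken from Kaldi; rescaling of kept activations is omitted.
import numpy as np


def dropout_schedule(progress, p_peak=0.3):
    """Dropout probability as a function of training progress in [0, 1]:
    zero until 20% of training, rising linearly to p_peak at 50%,
    then falling linearly back to zero at the end of training."""
    if progress < 0.2:
        return 0.0
    if progress < 0.5:
        return p_peak * (progress - 0.2) / 0.3
    return p_peak * (1.0 - progress) / 0.5


def per_frame_gate_masks(p, rng):
    """Draw one scalar Bernoulli(1 - p) mask per gate for this frame;
    the same scalar multiplies every unit of that gate (per-frame dropout)."""
    return {g: float(rng.random() >= p) for g in ("i", "f", "o")}


# Example: masking already-computed gate activations of one frame.
rng = np.random.default_rng(0)
cell_dim = 1024                                   # cell dimension from Section IV.B
i_t, f_t, o_t = (rng.random(cell_dim) for _ in range(3))  # stand-ins for the sigmoid outputs

p = dropout_schedule(progress=0.35, p_peak=0.3)   # probability in mid-training
masks = per_frame_gate_masks(p, rng)
i_t, f_t, o_t = masks["i"] * i_t, masks["f"] * f_t, masks["o"] * o_t
```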
IV. EXPERIMENTS SETUP

This section will give some details about the experimental setup, including the dataset, data augmentation, the neural network architecture and some hyper-parameters.

A. Dataset and Data Augmentation

The corpus used in our experiments is the THCHS-30 [16] corpus, i.e. the Tsinghua 30-hour Chinese corpus, open-sourced by Tsinghua University. The data is split into a training set and a testing set of 25 hours and 6 hours respectively, with 50 speakers in the training set and 10 speakers in the testing set. The dataset has two versions of transcriptions, one for word recognition and one for phone recognition. We use both transcriptions in the experiments, so recognition performance can be measured by both the phone error rate (PER) and the character error rate (CER).

Since the dataset we use is not very big, data augmentation may give better performance. Here we use the simplest data augmentation, speed perturbation [18], which changes the speed of the audio signal to produce three versions of the original signal with speed factors of 0.9, 1.0 and 1.1.

B. Neural Networks

The baseline LSTM is the LSTMP described in Section II. The baseline LSTMP has three LSTMP layers plus a final fully-connected output layer. The dimension of the LSTMP cell is 1024, and the dimensions of the LSTMP output and recurrent connections are 512 and 256 respectively. The total number of parameters is about 12M. The recurrent delay of each LSTM layer is -1 for the first layer, -2 for the second layer and -3 for the third layer.

The input to the neural network is 40-dimensional Mel-frequency cepstral coefficients (MFCC) over 5 consecutive frames (±2) plus a 100-dimensional i-vector. The MFCC features are extracted every 10 ms with a window size of 25 ms. The i-vector features are used for speaker adaptation [19], so cepstral mean and variance normalization is no longer needed.

The output of the neural network is 3456 tied triphone states [20], obtained by forced alignment using a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) [21] acoustic model. The GMM-HMM baseline is trained with the default Kaldi recipe.

C. Toolkit and Hyper-parameters

All the experiments are done with Kaldi [17], a widely used open-source speech recognition toolkit; we use Kaldi's nnet3 architecture. The networks were trained on a cluster with several CPU and GPU machines, using the parallel training strategy with parameter averaging of [22], starting with 2 GPUs and finishing with 8 GPUs.

The learning rate is 0.0003 at the very beginning and 0.00003 at the end, and the momentum value is 0.5. The training is based on context-sensitive chunks as in [23], with a chunk width of 20 frames and a left context of 40 frames. The right context is set to 0 frames, since we only train unidirectional LSTMs. On every GPU the batch size is 100 chunks, and a label delay of 5 is used to improve performance [3].

V. RESULTS AND DISCUSSIONS

This section will present our results and give some discussion. The results are shown in Table I.

TABLE I. THE CER AND PER RESULTS OF SPEECH RECOGNITION

Acoustic Model    Dropout Method    CER (%)    PER (%)
LSTMP             None              22.02      8.91
LSTMP             Per-element       21.77      8.17
LSTMP             Per-frame         20.85      7.92
LSTMP             Both              20.74      7.76

The baseline LSTMP-based acoustic model achieves a 29
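As a small aid to the discussion of Table I, here is a quick sketch that uses only the CER/PER values from the table above (nothing else is taken from the paper) and prints the absolute and relative reductions of each dropout method over the dropout-free baseline.

```python
# Absolute and relative CER/PER reductions over the no-dropout LSTMP
# baseline, computed from the Table I values above.
results = {             # method: (CER %, PER %)
    "None":        (22.02, 8.91),
    "Per-element": (21.77, 8.17),
    "Per-frame":   (20.85, 7.92),
    "Both":        (20.74, 7.76),
}

base_cer, base_per = results["None"]
for method, (cer, per) in results.items():
    if method == "None":
        continue
    print(f"{method:11s}  "
          f"CER -{base_cer - cer:.2f} abs ({100 * (base_cer - cer) / base_cer:.1f}% rel)  "
          f"PER -{base_per - per:.2f} abs ({100 * (base_per - per) / base_per:.1f}% rel)")
```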