The baseline LSTMP achieves a CER of 22.02% and a PER of 8.91%. With per-element dropout, the model achieves a CER of 21.77% and a PER of 8.17%, relative reductions of 1% and 8% respectively.
Per-frame dropout performs better than per-element dropout, giving a CER of 20.85% and a PER of 7.92%, relative reductions of 5% and 11% respectively compared with the baseline LSTMP.
Combining the two dropout methods gives the best performance, with a CER of 20.74% and a PER of 7.76%. Compared with the baseline LSTMP, this is a 5.8% reduction in CER and a 9.8% reduction in PER.
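To make the difference between the two variants concrete, the following NumPy sketch shows one way to realise per-element, per-frame, and combined dropout as multiplicative masks on a layer's hidden activations. The array shapes, the dropout rate p, and the inverted-dropout scaling are illustrative assumptions for this sketch only, not the exact recipe of our training setup.

    import numpy as np

    def per_element_mask(shape, p, rng):
        # Independent Bernoulli draw for every activation (per-element dropout).
        keep = 1.0 - p
        return rng.binomial(1, keep, size=shape) / keep  # inverted-dropout scaling

    def per_frame_mask(shape, p, rng):
        # One Bernoulli draw per frame (time step); a dropped frame zeroes the
        # whole hidden vector at that time step (per-frame dropout).
        T, H = shape
        keep = 1.0 - p
        frame_keep = rng.binomial(1, keep, size=(T, 1)) / keep
        return np.broadcast_to(frame_keep, (T, H))

    # Toy example: hidden activations of an LSTMP layer for T frames and H units.
    rng = np.random.default_rng(0)
    T, H, p = 6, 4, 0.2
    h = rng.standard_normal((T, H))

    h_elem  = h * per_element_mask((T, H), p, rng)   # per-element dropout
    h_frame = h * per_frame_mask((T, H), p, rng)     # per-frame dropout
    h_both  = (h * per_element_mask((T, H), p, rng)
                 * per_frame_mask((T, H), p, rng))   # combined dropout

In this view, per-frame dropout is the coarser regulariser: it removes whole frames of the recurrent signal instead of individual activations, and the combined setting simply applies both masks to the same activations.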
These experiments show that both dropout methods can improve the performance of a neural-network-based acoustic model; applied to LSTMP, they yield our best result on the THCHS-30 task.
VI. CONCLUSIONS
In this work, a baseline LSTMP acoustic model was built for Chinese continuous speech recognition, and two kinds of dropout were explored: per-element dropout and per-frame dropout. Experiments on the THCHS-30 corpus show that both methods improve the performance of the LSTMP-based acoustic model, and that combining the two gives the best performance.
In future work, we plan to run experiments on English corpora and on larger corpora, and to apply these dropout methods to bidirectional LSTMs (BLSTMs) [24, 25].
REFERENCES
[1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436.
[2] Dahl, George E., et al. "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition." IEEE Transactions on Audio, Speech, and Language Processing 20.1 (2012): 30-42.
[3] Sak, Haşim, Andrew Senior, and Françoise Beaufays. "Long short-term memory recurrent neural network architectures for large scale acoustic modeling." Fifteenth Annual Conference of the International Speech Communication Association. 2014.
[4] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. "Speech recognition with deep recurrent neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[5] Xiong, Wayne, et al. "The Microsoft 2016 conversational speech recognition system." Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
[6] Saon, George, et al. "English conversational telephone speech recognition by humans and machines." Eighteenth Annual Conference of the International Speech Communication Association. 2017.
[7] Bengio, Y., P. Simard, and P. Frasconi. "Learning long-term dependencies with gradient descent is difficult." IEEE Transactions on Neural Networks 5.2 (1994): 157-166.
[8] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
[9] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual prediction with LSTM." Neural Computation 12.10 (2000): 2451-2471.
[10] Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning precise timing with LSTM recurrent networks." Journal of Machine Learning Research 3.Aug (2002): 115-143.
[11] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." NIPS 2014 Workshop on Deep Learning, December 2014.
[12] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
[13] Gal, Yarin, and Zoubin Ghahramani. "A theoretically grounded application of dropout in recurrent neural networks." Advances in Neural Information Processing Systems. 2016.
[14] Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014.
[15] Cheng, Gaofeng, et al. "An exploration of dropout with LSTMs." Proc. Interspeech. 2017.
[16] Wang, Dong, and Xuewei Zhang. "THCHS-30: A free Chinese speech corpus." arXiv preprint arXiv:1512.01882 (2015).
[17] Povey, Daniel, et al. "The Kaldi speech recognition toolkit." IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. No. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[18] Ko, Tom, et al. "Audio augmentation for speech recognition." Sixteenth Annual Conference of the International Speech Communication Association. 2015.
[19] Saon, George, et al. "Speaker adaptation of neural network acoustic models using i-vectors." ASRU. 2013.
[20] Young, Steve J., Julian J. Odell, and Philip C. Woodland. "Tree-based state tying for high accuracy acoustic modelling." Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 1994.
[21] Huang, Xuedong D., Yasuo Ariki, and Mervyn A. Jack. "Hidden Markov models for speech recognition." (1990): 60-80.
[22] Povey, Daniel, Xiaohui Zhang, and Sanjeev Khudanpur. "Parallel training of DNNs with natural gradient and parameter averaging." arXiv preprint arXiv:1410.7455 (2014).
[23] Chen, Kai, and Qiang Huo. "Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.7 (2016): 1185-1193.
[24] Graves, Alex, and Jürgen Schmidhuber. "Framewise phoneme classification with bidirectional LSTM and other neural network architectures." Neural Networks 18.5-6 (2005): 602-610.
[25] Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.