
In recent years, some works have been done to improve the standard LSTM. The most famous one in the speech recognition area is the projected LSTM (LSTMP) [3]. Based on the standard LSTM, LSTMP adds two projection layers to reduce the dimensions of the output and of the recurrence connections. Here, as in Kaldi [17], which is one of the most popular open-source speech recognition toolkits, we use one big matrix to replace the above two separate matrices. The equations of LSTMP can be written as Equations (7)-(14):

i_t = σ(W_ix x_t + W_ir r_{t−1} + w_ic ⊙ c_{t−1} + b_i)    (7)
f_t = σ(W_fx x_t + W_fr r_{t−1} + w_fc ⊙ c_{t−1} + b_f)    (8)
g_t = tanh(W_cx x_t + W_cr r_{t−1} + b_c)    (9)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t    (10)
o_t = σ(W_ox x_t + W_or r_{t−1} + w_oc ⊙ c_t + b_o)    (11)
m_t = o_t ⊙ tanh(c_t)    (12)
y_t = W_ym m_t    (13)
r_t = y_t[1:R]    (14)

Where x_t is the input at time t, σ is the logistic sigmoid, and ⊙ denotes element-wise multiplication. r_t is the recurrent connection, which consists of the first R elements of the output y_t, and W_ym is the projection matrix, which projects the output activations m_t of the cell to a lower dimension. Generally, the dimension R of r_t is 1/2 of the dimension of m_t, and r_t is the first half of y_t. With this modification, the parameters of the recurrent matrices W_{∗r} in the LSTM can be reduced by 3/4 in the LSTMP. Also, the output dimension is reduced by 1/2, which halves the parameters of the input matrices W_{∗x} of the next layer. The illustration of LSTMP is shown in Fig. 1.

LSTM and LSTMP are RNNs, so they can model long-time dependencies of the input sequence, which is very helpful in tasks like speech recognition and machine translation, because in these tasks the history helps to make decisions for the current time.

B. Dropout

Dropout was proposed to prevent neural-network-based models from overfitting [12]. Also, dropout was found to introduce noise during training, which is somewhat equivalent to data augmentation. Because LSTM models the dependencies of the input sequence along both the time axis and the feature axis, the application of dropout with LSTM can have many choices.

1) Per-element Dropout. Generally, consider an output activation h of a certain layer and a given dropout probability p. Per-element dropout first generates a mask vector d whose dimension is the same as that of h; the elements of the mask vector consist of 0s and 1s, where the probability of 0 is p:

h̃ = h ⊙ d    (15)
d_j = Bernoulli(1 − p),  j = 1, 2, …, D    (16)

Where d is the mask, every element of which obeys a Bernoulli distribution with success probability 1 − p, as shown in Equation (16), and D is the dimension of h.

2) Per-frame Dropout. The dropout method above only deprecates part of the activations, which is the usual case in DNNs and CNNs. However, the LSTM-based RNN is more complicated and contains several gates, so it is also possible to randomly select certain gates and deprecate all the elements of the selected gates. In this case, we do not need to generate a mask vector; a single value of 0 or 1 suffices:

h̃ = h · d
d = Bernoulli(1 − p)    (17)

Where · is the multiplication between a number and a vector, and d is the mask that determines whether to deprecate all or none of the elements.

However, per-frame dropout is not suitable for simple neural networks like DNNs, because dropping the whole frame leads to a zero output. This is not the case in the LSTM architecture: there are three gates, and as long as not all of them are fully deprecated, the layer still produces non-zero outputs.
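For concreteness, the following NumPy sketch applies Equations (15)-(17) to a toy activation vector; it is not part of the original recipe, and the function names and the dropout probability used here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_element_dropout(h, p):
    """Eqs. (15)-(16): drop each element of h independently with probability p."""
    d = rng.binomial(1, 1.0 - p, size=h.shape)  # d_j = Bernoulli(1 - p)
    return h * d

def per_frame_dropout(h, p):
    """Eq. (17): keep or drop the whole vector with a single Bernoulli draw."""
    d = rng.binomial(1, 1.0 - p)                # one scalar 0/1 mask
    return h * d

h = rng.standard_normal(8)           # a toy activation vector
print(per_element_dropout(h, 0.2))   # some elements are zeroed
print(per_frame_dropout(h, 0.2))     # either all of h is kept or all is zeroed
```

The only difference between the two variants is whether the Bernoulli mask is drawn per element or once per frame.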
III. IMPLEMENTATION OF DROPOUT WITH LSTMP

This section discusses how to implement dropout in the LSTMP architecture.

A. Per-element Dropout

Since we want to implement the per-frame dropout on the three gates, and per-frame and per-element dropout cannot be applied to the same activations simultaneously, we cannot implement the per-element dropout on the gates. A reasonable and simple place to implement the per-element dropout is on the output activations y_t.

Figure 1. Per-frame and per-element dropout for LSTMP

With this implementation, Equations (13) and (14) of LSTMP are changed into Equations (18)-(21):

y_t = W_ym m_t    (18)
ŷ_t = y_t ⊙ d_t    (19)
r_t = ŷ_t[1:R]    (20)
d_{t,j} = Bernoulli(1 − p),  j = 1, 2, …, D    (21)

The implementation of per-element dropout is illustrated in Fig. 1.
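To make the combination shown in Fig. 1 concrete, here is a minimal NumPy sketch of one LSTMP time step in the Kaldi-style formulation of Equations (7)-(14), with per-frame dropout multiplying each of the three gates by its own scalar Bernoulli mask (Equation 17) and per-element dropout masking the projected output (Equations 18-21). This is an assumption-laden illustration, not the authors' Kaldi code; all weight names, shapes, and the use of a single dropout probability for every mask are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step_with_dropout(W, x_t, r_prev, c_prev, drop_p, train=True):
    """One LSTMP step (Eqs. 7-14) with per-frame dropout on the gates (Eq. 17)
    and per-element dropout on the projected output (Eqs. 18-21).
    W is a dict of illustrative weights; cell dim N, projection dim D_y, R = D_y // 2."""
    i_t = sigmoid(W["W_ix"] @ x_t + W["W_ir"] @ r_prev + W["w_ic"] * c_prev + W["b_i"])  # (7)
    f_t = sigmoid(W["W_fx"] @ x_t + W["W_fr"] @ r_prev + W["w_fc"] * c_prev + W["b_f"])  # (8)
    g_t = np.tanh(W["W_cx"] @ x_t + W["W_cr"] @ r_prev + W["b_c"])                       # (9)
    if train:
        # Per-frame dropout: one scalar 0/1 mask per gate, as in Eq. (17).
        i_t = i_t * rng.binomial(1, 1.0 - drop_p)
        f_t = f_t * rng.binomial(1, 1.0 - drop_p)
    c_t = f_t * c_prev + i_t * g_t                                                       # (10)
    o_t = sigmoid(W["W_ox"] @ x_t + W["W_or"] @ r_prev + W["w_oc"] * c_t + W["b_o"])     # (11)
    if train:
        o_t = o_t * rng.binomial(1, 1.0 - drop_p)
    m_t = o_t * np.tanh(c_t)                                                             # (12)
    y_t = W["W_ym"] @ m_t                                                                # (13)/(18)
    if train:
        # Per-element dropout on the projected output, Eqs. (19) and (21).
        y_t = y_t * rng.binomial(1, 1.0 - drop_p, size=y_t.shape)
    r_t = y_t[: y_t.shape[0] // 2]                                                       # (14)/(20)
    return y_t, r_t, c_t

# Tiny usage example with random weights: input dim 4, cell dim 6, projection dim 6.
I, N, Dy = 4, 6, 6
W = {k: rng.standard_normal((N, I)) for k in ("W_ix", "W_fx", "W_cx", "W_ox")}
W.update({k: rng.standard_normal((N, Dy // 2)) for k in ("W_ir", "W_fr", "W_cr", "W_or")})
W.update({k: rng.standard_normal(N) for k in ("w_ic", "w_fc", "w_oc", "b_i", "b_f", "b_c", "b_o")})
W["W_ym"] = rng.standard_normal((Dy, N))
y, r, c = lstmp_step_with_dropout(W, rng.standard_normal(I),
                                  np.zeros(Dy // 2), np.zeros(N), drop_p=0.2)
```

At test time the masks are simply omitted (`train=False`); even when a gate is fully deprecated during training, the remaining gates keep the step's output non-zero, as discussed in Section II.B.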