In recent years, some work has been done to improve the standard LSTM. The most famous variant in the speech recognition area is the projected LSTM (LSTMP) [3]. Based on the standard LSTM, the LSTMP adds two projection layers to reduce the dimensions of the output and the recurrence connections. Here, as in Kaldi [17], which is one of the most popular open-source speech recognition toolkits, we use one big matrix to replace the above two separate matrices. The equations of the LSTMP can be written as Equations (7)-(14):

$i_t = \sigma(W_{ix} x_t + W_{ir} r_{t-1} + b_i)$  (7)
$f_t = \sigma(W_{fx} x_t + W_{fr} r_{t-1} + b_f)$  (8)
$g_t = \tanh(W_{cx} x_t + W_{cr} r_{t-1} + b_c)$  (9)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$  (10)
$o_t = \sigma(W_{ox} x_t + W_{or} r_{t-1} + b_o)$  (11)
$m_t = o_t \odot \tanh(c_t)$  (12)
$h_t = W_{hm} m_t$  (13)
$r_t = h_t[1:n_r]$  (14)

where $r_t$ is the recurrent connection, which consists of the first $n_r$ elements of the output $h_t$, and $W_{hm}$ is the projection matrix, which projects the output activations of the cell, $m_t$, to the lower-dimensional $h_t$. Generally, the dimension of $h_t$ is 1/2 of the dimension of $m_t$, and $r_t$ is the first half of $h_t$. With this modification, the parameters of the recurrent weight matrices $W_{*r}$ in the LSTM can be reduced by 3/4 in the LSTMP. Also, the output dimension is reduced by 1/2, which halves the parameters of the weight matrices of the next layer.

The illustration of the LSTMP is shown in Fig. 1.

LSTM and LSTMP are RNNs, so they can model long-time dependencies of the input sequence, which is very helpful in tasks such as speech recognition and machine translation, because in these tasks the history helps to make decisions at the current time.
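To make the data flow of Equations (7)-(14) concrete, the following is a minimal NumPy sketch of one LSTMP time step. The function and parameter names (lstmp_step, W_ix, W_hm, and so on) are illustrative assumptions, not the variables used in Kaldi.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x_t, r_prev, c_prev, p, n_r):
    # One LSTMP time step following Equations (7)-(14).
    # p is a dict holding the weight matrices W_*x, W_*r, W_hm and biases b_*.
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ r_prev + p["b_i"])   # (7)  input gate
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ r_prev + p["b_f"])   # (8)  forget gate
    g_t = np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ r_prev + p["b_c"])   # (9)  cell candidate
    c_t = f_t * c_prev + i_t * g_t                                   # (10) cell state
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ r_prev + p["b_o"])   # (11) output gate
    m_t = o_t * np.tanh(c_t)                                         # (12) cell output
    h_t = p["W_hm"] @ m_t                                            # (13) one big projection matrix
    r_t = h_t[:n_r]                                                  # (14) first n_r elements feed the recurrence
    return h_t, r_t, c_t

In this sketch the single matrix W_hm plays the role of the two separate projection matrices: its output h_t is passed to the next layer, and its first n_r elements form the recurrent connection r_t.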
B. Dropout

Dropout was proposed to prevent neural-network-based models from overfitting [12]. Dropout has also been found to introduce noise during training, which is somewhat equivalent to data augmentation. Because the LSTM models the dependencies of the input sequence along both the time axis and the feature axis, there are many choices for applying dropout to the LSTM.

1) Per-element Dropout.

Generally, consider an output activation $h$ of a certain layer. Given a specific probability $p$, the per-element dropout first generates a mask vector $m$ whose dimension is the same as that of $h$. The elements of the mask vector consist of 0s and 1s, where the probability of a 0 is $p$:

$\hat{h} = h \odot m$  (15)

where $m$ is the mask and every element of it obeys a Bernoulli distribution with success probability $1-p$, as shown in Equation (16):

$m_i = \mathrm{Bernoulli}(1-p), \quad i = 1, 2, \ldots, D$  (16)

where $D$ is the dimension of $h$.
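As a concrete illustration of Equations (15)-(16), the following is a minimal NumPy sketch of per-element dropout; the function name and the use of NumPy's random generator are illustrative choices, not the paper's implementation.

import numpy as np

def per_element_dropout(h, p, rng=None):
    # Per-element dropout, Equations (15)-(16): each element of the
    # activation h is kept with probability 1 - p and zeroed with probability p.
    if rng is None:
        rng = np.random.default_rng()
    m = rng.binomial(1, 1.0 - p, size=h.shape)  # m_i ~ Bernoulli(1 - p)
    return h * m                                # element-wise product, h "dot-circle" m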
2) Per-frame Dropout.

The dropout method above only deprecates part of the activations, which is the usual case in the DNN and the CNN. However, in the more complicated LSTM-based RNN there are several gates, so it is also possible to randomly select certain gates and deprecate all the elements of the selected gates. In this case, we do not need to generate a mask vector; a single value of 0 or 1 will work:

$\hat{h} = h \cdot m$

where $\cdot$ is the multiplication between a scalar and a vector, and $m$ is the mask that determines whether to deprecate all or none of the elements:

$m = \mathrm{Bernoulli}(1-p)$  (17)

However, the per-frame dropout is not suitable for simple neural networks like the DNN, because the full dropout leads to a zero output. This is not the case in the LSTM architecture: there are three gates, and as long as not all of them are fully deprecated, the cell still outputs non-zero values.
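Below is a minimal NumPy sketch of per-frame dropout applied independently to the three gates. The function and variable names are illustrative, and the gate vectors are random placeholders rather than the outputs of a real LSTM.

import numpy as np

def per_frame_dropout(h, p, rng):
    # Per-frame dropout, Equation (17): a single Bernoulli draw decides
    # whether the whole vector h is kept (m = 1) or deprecated (m = 0).
    m = rng.binomial(1, 1.0 - p)  # scalar mask
    return h * m                  # scalar-times-vector multiplication

rng = np.random.default_rng(0)
i_t, f_t, o_t = (rng.standard_normal(4) for _ in range(3))  # placeholder gate activations
i_t = per_frame_dropout(i_t, 0.2, rng)
f_t = per_frame_dropout(f_t, 0.2, rng)
o_t = per_frame_dropout(o_t, 0.2, rng)
# With independent draws, all three gates are deprecated at the same time
# only with probability p**3, so the cell output is usually non-zero.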
III. IMPLEMENTATION OF DROPOUT WITH LSTMP

This section discusses how to implement dropout in the LSTMP architecture.
A. Per-element Dropout

Since we want to implement the per-frame dropout on the three gates, and per-frame and per-element dropout cannot be implemented on the same activations simultaneously, we cannot implement the per-element dropout on the gates. A reasonable and simple place to implement the per-element dropout is on the output activations $h_t$.

Figure 1. Per-frame and per-element dropout for LSTMP

With this implementation, Equations (13) and (14) of the LSTMP are changed into Equations (18)-(21):

$h_t = W_{hm} m_t$  (18)
$\hat{h}_t = h_t \odot d$  (19)
$r_t = \hat{h}_t[1:n_r]$  (20)
$d_i = \mathrm{Bernoulli}(1-p), \quad i = 1, 2, \ldots, D$  (21)
The implementation of the per-element dropout is illustrated in Fig. 1.
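The following is a minimal NumPy sketch of the projection step with per-element dropout on the output activations, following the form of Equations (18)-(21); as above, the function and parameter names are illustrative assumptions rather than the paper's Kaldi-based implementation.

import numpy as np

def lstmp_output_with_dropout(m_t, W_hm, n_r, p, rng):
    # Per-element dropout on the LSTMP output activations, Equations (18)-(21).
    h_t = W_hm @ m_t                               # (18) projection, as in Equation (13)
    d = rng.binomial(1, 1.0 - p, size=h_t.shape)   # (21) d_i ~ Bernoulli(1 - p)
    h_hat = h_t * d                                # (19) dropped-out output sent to the next layer
    r_t = h_hat[:n_r]                              # (20) recurrent slice of the dropped-out output
    return h_hat, r_t

In this form, the recurrent connection r_t is taken from the dropped-out output, so the dropout affects both the next layer and the recurrence.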