recurrent neural network.

One of the most common examples is predicting the next word in "the clouds are in the (...)". In this case the interval between the relevant information and the position of the predicted word is very small, so the RNN can use the previous information to predict the word "sky". But to predict "I grew up in France ... I speak fluent (...)", the language model can infer that the next word is probably the name of a language, while identifying the specific language requires the word "France", which appears much earlier in the text. In this case the RNN cannot use information separated by such long intervals because of the "gradient disappearance" (vanishing gradient) problem, whereas the LSTM is designed explicitly to avoid the problem of long-term dependence.

The standard RNN structure is a chain of repeating neural network modules, typically a single tanh layer (Figure 1), while in the LSTM (Figure 2) the repeating module contains four special structures. The horizontal line across the top of the figure carries the cell state, the yellow blocks are the neural network layers obtained by learning, the pink circles are arithmetic operations, and the black arrows represent vector transmission. Overall, not only the hidden state h flows over time, but the cell state c also flows over time, representing the long-term memory.
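Concretely, each repeating module of the standard RNN applies only a single transformation of the form below (the usual textbook formulation; the paper does not write it out explicitly):

h_t = \tanh(W_h h_{t-1} + W_x x_t + b)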
Figure 1. RNN Network Basic Structure.
Figure 2. LSTM Network Basic Structure.
The reason why the LSTM can remember long-term information lies in the design of the "gate" structure, which is a mechanism for the selective passage of information. In the LSTM, the first stage is the forgetting gate: the forgetting layer determines which information needs to be discarded from the cell state. The next stage is the input gate, which determines which new information can be stored in the cell state. The last stage is the output gate, which determines what value to output.
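For reference, the three gates described above are usually written as follows; this is the standard LSTM formulation, and the paper does not spell out the exact variant it uses:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forgetting gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     (candidate cell state)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (hidden state / output)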
A bi-directional RNN consists of two ordinary RNNs: a forward (positive-sequence) RNN that uses past information and a backward (negative-sequence) RNN that uses future information, so that at moment t the information of both moment t-1 and moment t+1 can be utilized. The forward chain A_1 → A_2 → ... → A_i participates in the forward calculation: its input at moment t is the sequence data x_t of moment t together with the output A_{t-1} of moment t-1. The backward chain A'_1 → A'_2 → ... → A'_i participates in the backward calculation: its input at moment t is the sequence data x_t of moment t together with the output A'_{t+1} of moment t+1. The final output value at moment t therefore depends on both A_{t-1} and A'_{t+1}. In general, because the bi-directional LSTM can simultaneously utilize information from past and future moments, its final prediction is more accurate than that of a single-directional LSTM. Figure 3 shows the structure of a bi-directional LSTM.

Figure 3. Bi-directional LSTM Network Structure.
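As an illustration only (this sketch is not from the paper; the use of PyTorch and all sizes are assumptions), the combination of the two directions can be written as:

import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
input_dim, hidden_dim, seq_len, batch = 50, 32, 6, 1

bilstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                 bidirectional=True, batch_first=True)
x = torch.randn(batch, seq_len, input_dim)   # one sequence of 6 input vectors

out, (h_n, c_n) = bilstm(x)
# At every moment t, `out` concatenates the forward state (computed from
# x_1 ... x_t) and the backward state (computed from x_T ... x_t), so the
# prediction at t can use both past and future information.
print(out.shape)   # torch.Size([1, 6, 64]) = (batch, seq_len, 2 * hidden_dim)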
B. Word Embedding

Word embedding is an important concept in Natural Language Processing (NLP). A word embedding converts a word into a fixed-length vector representation, which is convenient for mathematical processing.
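A minimal sketch of such a lookup, assuming a tiny vocabulary, an 8-dimensional embedding, and PyTorch purely for illustration:

import torch
import torch.nn as nn

# Toy dictionary; in practice it would contain the whole vocabulary.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Every word id is mapped to a fixed-length (here 8-dimensional) dense vector.
ids = torch.tensor([vocab[w] for w in "the cat sat on the mat".split()])
vectors = embedding(ids)
print(vectors.shape)   # torch.Size([6, 8])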
The first step in using a mathematical model to process
a text corpus is to convert the text into a mathematical
representation. There are two methods. The first is to represent a word by a one-hot matrix, in which each row has exactly one element equal to 1 and all other elements equal to 0. Assign a number to each word in the dictionary; when encoding a sentence, convert each word into the one-hot row corresponding to its number in the dictionary. For example, "the cat sat on the mat" can be expressed by the matrix shown in Figure 4.
Figure 4. One-hot Number Method.

The one-hot representation is very intuitive, but it has two drawbacks. First, each dimension of the matrix has the length of the dictionary. If the dictionary contains 10,000 words, then the one-hot vector corresponding to each word is a 1 × 10,000 vector with only one non-zero position, which wastes space and is not conducive to calculation. Second, the one-hot matrix amounts to simply numbering each word, so the relationships between words are completely unrecognizable. For example, "cat" and "mouse" are more closely related than "cat" and "cellphone", but this relationship is not reflected in the one-hot representation.
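A minimal sketch of the one-hot method and its first drawback, using the toy dictionary from the example above (the numbering is arbitrary):

import numpy as np

# Toy dictionary: assign a number to each word.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
sentence = "the cat sat on the mat".split()

# One row per word; each row has a single 1 at the word's dictionary index.
one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, vocab[word]] = 1
print(one_hot)

# With a real 10,000-word dictionary every row would be 1 x 10,000 and almost
# entirely zeros, and no row says anything about how related two words are.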