
recurrent neural network. One of the most common examples is that when predicting "the clouds are in the (...)", the interval between the relevant information and the position of the predicted word is very small, and the RNN can use the previous information to predict the word "sky". But to predict "I grew up in France ... I speak fluent (...)", the language model can infer that the next word is probably the name of a language, yet determining the specific language requires the word "France", which appears much earlier. In this case the RNN cannot use information separated by long intervals because of the vanishing gradient problem, while the LSTM is designed explicitly to avoid the problem of long-term dependence.

The standard RNN has a chain of repeating neural network modules, typically a single tanh layer that is learned repeatedly (Figure 1), while in the LSTM (Figure 2) the repeating module contains four special structures. The horizontal line running along the top of the figure is the cell state, the yellow boxes are the neural network layers obtained by learning, the pink circles are element-wise arithmetic operations, and the black arrows represent vector transmission. Over the whole network, not only does the hidden state h flow over time, the cell state c also flows over time, representing the long-term memory.

Figure 1. RNN Network Basic Structure.

Figure 2. LSTM Network Basic Structure.

The reason why the LSTM can remember long-term information lies in the design of the "gate" structure, which is a mechanism for letting information pass selectively. In the LSTM, the first stage is the forget gate, which determines which information should be discarded from the cell state. The next stage is the input gate, which determines which new information can be stored in the cell state. The last stage is the output gate, which determines what value to output.

A bi-directional RNN consists of two ordinary RNNs: a forward RNN that uses past information and a backward RNN that uses future information, so that at moment t both the information of moment t-1 and that of moment t+1 can be utilized. In general, because the bi-directional LSTM can simultaneously use information from past and future times, its final prediction is more accurate than that of a single-directional LSTM. Figure 3 shows the structure of a bi-directional LSTM.

Figure 3. Bi-directional LSTM Network Structure.

The forward chain A_1 → A_2 → ... → A_i participates in the forward calculation: its input at moment t is the sequence data x_t of moment t together with the output A_{t-1} of moment t-1. The backward chain A'_1 → A'_2 → ... → A'_i participates in the backward calculation: its input at moment t is the sequence data x_t of moment t together with the output A'_{t+1} of moment t+1. The final output value at moment t depends on both A_{t-1} and A'_{t+1}.
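As a concrete illustration of the gate computations and of the forward and backward chains described above, the sketch below implements a single LSTM step and a minimal bi-directional pass in NumPy. It is only a sketch under the assumptions stated in the comments; the function names (lstm_step, init_params, run_lstm) and the parameter layout are illustrative choices, not code from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM time step: forget gate f, input gate i, candidate g, output gate o.
    z = np.concatenate([x_t, h_prev])      # current input plus previous hidden state
    f = sigmoid(p["Wf"] @ z + p["bf"])     # forget gate: what to discard from the cell state
    i = sigmoid(p["Wi"] @ z + p["bi"])     # input gate: what new information to store
    g = np.tanh(p["Wg"] @ z + p["bg"])     # candidate values for the cell state
    c_t = f * c_prev + i * g               # cell state c carries the long-term memory
    o = sigmoid(p["Wo"] @ z + p["bo"])     # output gate: what part of the cell to expose
    h_t = o * np.tanh(c_t)                 # hidden state h
    return h_t, c_t

def init_params(input_dim, hidden_dim, rng):
    d = input_dim + hidden_dim
    p = {name: 0.1 * rng.standard_normal((hidden_dim, d)) for name in ("Wf", "Wi", "Wg", "Wo")}
    p.update({name: np.zeros(hidden_dim) for name in ("bf", "bi", "bg", "bo")})
    return p

def run_lstm(xs, p, hidden_dim):
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs

# Bi-directional pass: the forward chain reads the sequence left to right (past
# information), the backward chain reads it right to left (future information),
# and the outputs at each moment t are concatenated.
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 8, 16, 5
xs = [rng.standard_normal(input_dim) for _ in range(T)]
forward = run_lstm(xs, init_params(input_dim, hidden_dim, rng), hidden_dim)
backward = run_lstm(xs[::-1], init_params(input_dim, hidden_dim, rng), hidden_dim)[::-1]
bi_outputs = [np.concatenate([fw, bw]) for fw, bw in zip(forward, backward)]
print(bi_outputs[0].shape)  # (32,): forward h and backward h joined at each time step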
B. Word Embedding

Word embedding is an important concept in Natural Language Processing (NLP). A word embedding converts a word into a fixed-length vector representation, which is convenient for mathematical processing. The first step in using a mathematical model to process a text corpus is to convert the text into a mathematical representation, and there are two methods. The first method represents a word by a one-hot matrix, in which each row has exactly one element equal to 1 and all other elements equal to 0. A number is assigned to each word in the dictionary; when encoding a sentence, each word is converted into the one-hot row corresponding to its number in the dictionary. For example, to express "the cat sat on the mat", the matrix shown in Figure 4 can be used.

Figure 4. One-hot Number Method.

The one-hot representation is very intuitive, but it has two drawbacks. First, the length of each row of the matrix equals the size of the dictionary. If the dictionary contains 10,000 words, then the one-hot vector corresponding to each word is a 1 x 10,000 vector with a single non-zero position, which wastes space and is not conducive to calculation. Second, the one-hot matrix simply numbers each word, so the relationships between words are completely unrecognizable. For example, "cat" and "mouse" are more related than "cat" and "cellphone", but this relationship is not reflected in the one-hot representation.
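To make the two representations concrete, the short sketch below builds the one-hot matrix for "the cat sat on the mat" and then replaces it with a dense embedding lookup. The variable names and the random embedding table are illustrative assumptions only; in practice the embedding table is learned rather than random.

import numpy as np

sentence = "the cat sat on the mat".split()
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}  # toy dictionary: the=0, cat=1, ...

# One-hot representation: each word becomes a row whose length equals the
# dictionary size, with a single 1 at the word's index.
one_hot = np.zeros((len(sentence), len(vocab)))
for row, word in enumerate(sentence):
    one_hot[row, vocab[word]] = 1.0
print(one_hot.shape)  # (6, 5); with a 10,000-word dictionary each row would be 1 x 10,000

# Embedding lookup: a (vocabulary size x embedding dimension) table maps each word
# to a short dense vector, so related words can end up close together.
embedding_dim = 4
embedding_table = np.random.default_rng(0).standard_normal((len(vocab), embedding_dim))
dense = one_hot @ embedding_table  # multiplying a one-hot row by the table is just a lookup
print(dense.shape)                 # (6, 4)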