frame features, superimposed with the 1-dimensional, 200-frame short-term energy, for a total of 40 dimensions over 200 frames; that is, the feature array format is 40*200. The second set of two-dimensional features consists of MFCC and the short-time zero-crossing rate: 13-dimensional, 200-frame MFCC features are extracted, and first-order and second-order differences are computed to obtain 39-dimensional, 200-frame features, which are superimposed with the 1-dimensional, 200-frame short-time zero-crossing rate feature, again giving 40 dimensions over 200 frames, i.e. a 40*200 feature array. The two two-dimensional feature arrays are then stacked into a three-dimensional array by adding a channel dimension with value 2, so the resulting feature array has the format 2*40*200. The combination method is shown in Figure 6.
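As a concrete illustration, a minimal sketch of how such a 2*40*200 array could be assembled is given below. It is not the authors' implementation: the sampling rate, frame length and hop size are assumed values chosen only so that a 4 s clip yields roughly 200 frames, and librosa is used for the feature extraction.

# Sketch only: build the combined 2*40*200 feature array for one clip.
import numpy as np
import librosa

def combined_features(path, sr=22050, n_fft=1024, hop=441, n_frames=200):
    y, _ = librosa.load(path, sr=sr, duration=4.0)

    # 13-dimensional MFCC plus first- and second-order differences -> 39 x T
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    mfcc39 = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])

    # 1-dimensional short-term energy and short-time zero-crossing rate -> 1 x T each
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    energy = np.sum(frames ** 2, axis=0, keepdims=True)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)

    # Pad or trim every stream to exactly 200 frames, stack each 39-dim MFCC block
    # with one 1-dim feature to get two 40 x 200 planes, then add the channel axis.
    def fit(a):
        return librosa.util.fix_length(a, size=n_frames, axis=1)

    plane_energy = np.vstack([fit(mfcc39), fit(energy)])   # MFCC + short-term energy
    plane_zcr = np.vstack([fit(mfcc39), fit(zcr)])         # MFCC + zero-crossing rate
    return np.stack([plane_energy, plane_zcr])             # shape (2, 40, 200)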
2) The second layer uses the same settings as the first layer, except that max pooling is added to reduce the feature dimension.
3) The third layer uses 64 convolution kernels with a 3×3 receptive field and a 2×2 stride, followed by batch normalization and the ReLU activation function; this layer also performs max pooling.
4) The fourth layer is a fully connected layer with 1024 hidden units, and its activation function is Sigmoid.
5) In accordance with the 10 categories of the data set, the network is finally connected to an output layer that uses Softmax as the activation function.
During training, in order to prevent overfitting, a dropout rate of 0.5 is applied to the second, third and fully connected layers. The batch size is set to 64, all weight parameters are regularized with L2, and the learning rate is set to 0.001.
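Putting the layer settings and training parameters described in this section together, a minimal sketch of a matching network is shown below. It is not the authors' exact implementation: the pooling size, the L2 coefficient and the choice of Adam as the optimizer are assumptions, and the 2*40*200 input is laid out channels-last.

# Sketch of the described model (assumed details: 2x2 "same" pooling,
# L2 coefficient 0.001, Adam optimizer; input is the 2*40*200 feature array).
from tensorflow.keras import layers, models, optimizers, regularizers

l2 = regularizers.l2(0.001)  # assumed L2 coefficient

model = models.Sequential([
    layers.Input(shape=(40, 200, 2)),                      # channels-last 2*40*200 input
    # 1) 32 kernels, 3x3 receptive field, 3x3 stride, batch normalization, ReLU
    layers.Conv2D(32, 3, strides=3, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    # 2) same settings as the first layer, plus max pooling and dropout
    layers.Conv2D(32, 3, strides=3, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    layers.MaxPooling2D(2, padding="same"), layers.Dropout(0.5),
    # 3) 64 kernels, 3x3 receptive field, 2x2 stride, batch norm, ReLU, max pooling, dropout
    layers.Conv2D(64, 3, strides=2, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    layers.MaxPooling2D(2, padding="same"), layers.Dropout(0.5),
    # 4) fully connected layer with 1024 hidden units and Sigmoid activation, dropout
    layers.Flatten(),
    layers.Dense(1024, activation="sigmoid", kernel_regularizer=l2),
    layers.Dropout(0.5),
    # 5) 10-way Softmax output for the 10 sound categories
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])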
IV. EXPERIMENTAL RESULTS AND ANALYSIS
The base data set for this article is UrbanSound8K [15], which contains 10 ambient sound events. Each recording is about 4 s long and is placed in one of 10 fold folders, and each audio file is named with a label so that the neural network can be trained.
Figure.6 Feature combination
C. Convolutional Neural Network
Convolutional Neural Network (CNN) can be regarded as a special form of the standard neural network and has been widely used in the field of speech recognition to improve on the weak robustness, poor real-time performance, low recognition accuracy and other disadvantages of traditional acoustic models [13]. Two points are particularly noteworthy: one is the local receptive field, i.e. each neuron only needs to perceive a local feature, and the other is weight sharing, in which one convolution kernel processes every position of the feature with the same weights. Through these two points, the convolutional neural network greatly reduces the number of parameters that need to be trained [14].
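To make the saving concrete, the short calculation below compares the parameter count of a 32-kernel, 3×3 convolutional layer with that of a 1024-unit fully connected layer applied to the same 2*40*200 input; the comparison is illustrative only, and the dense layer is not part of the model used in this paper.

# Illustrative arithmetic: weight sharing makes the convolutional layer's
# parameter count independent of the input size.
in_h, in_w, in_c = 40, 200, 2          # the 2*40*200 feature array
kernels, k = 32, 3                     # 32 kernels, 3x3 receptive field

conv_params = kernels * (k * k * in_c + 1)          # weights + bias per kernel
dense_params = (in_h * in_w * in_c) * 1024 + 1024   # a 1024-unit dense layer on the same input

print(conv_params)   # 608
print(dense_params)  # 16385024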
The experiment used the data in 9 of the 10 fold folders for training and the data in the remaining fold folder for testing. To keep the results credible for this data set, the ten-fold cross-validation published by the official website is used: the model is trained on the data from 9 of the 10 predefined folders and tested on the data from the remaining folder [16], and the process is repeated 10 times.
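A minimal sketch of this evaluation loop is shown below; load_fold and build_model are hypothetical helpers standing in for the feature-extraction and model-construction steps, and the number of training epochs is an assumption.

# Sketch of the official 10-fold protocol (helpers and epochs are assumptions).
import numpy as np

def cross_validate(folds, build_model, load_fold):
    """folds: fold identifiers, e.g. [1, ..., 10].
    load_fold(f) -> (features, labels); build_model() -> compiled Keras model."""
    accuracies = []
    for test_fold in folds:
        train = [f for f in folds if f != test_fold]
        x_train = np.concatenate([load_fold(f)[0] for f in train])
        y_train = np.concatenate([load_fold(f)[1] for f in train])
        x_test, y_test = load_fold(test_fold)

        model = build_model()
        model.fit(x_train, y_train, batch_size=64, epochs=50, verbose=0)  # epochs assumed
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies))   # average recognition rate over the 10 folds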
A. Data noise recognition test
It can be seen from Table I that, after noise is added, the recognition rate in the single-feature case improves only slightly compared with the un-noised data; some sounds even have a lower recognition rate because of the noise, so the overall effect is not obvious. However, on test sounds with real ambient sound recorded using third-party equipment, the noise-added model performs significantly better than the one trained on un-noised data. The data set test recognition rate was 73.6%, and the actual scene test recognition rate was 70.1%.
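Noise addition here means mixing background recordings into the training audio before feature extraction. The sketch below is only a hypothetical illustration, since the mixing level (expressed as a signal-to-noise ratio) and the noise source are not specified in this section.

# Hypothetical sketch of additive noise augmentation (SNR-based gain is an assumption).
import numpy as np

def add_noise(clean, noise, snr_db=10.0):
    """Mix a noise recording into a clean clip at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)                 # repeat or trim noise to clip length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise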
The convolutional neural network used in this paper has the model structure shown in Figure 7.
Figure.7 Convolutional neural network model
In order to evaluate the applicability of our proposed features and their combinations in practical scenarios, we use a convolutional neural network to classify them. The model architecture is set as follows:
1) The first layer uses 32 convolution kernels, the receptive field is 3×3 and the stride is set to 3×3, followed by batch normalization; ReLU is used as the activation function.
TABLE I IDENTIFICATION RATE TEST BEFORE AND AFTER NOISE ADDITION UNDER A SINGLE FEATURE

Sound category    | Pre-noise test | Test after noise | Real recording test before noise | Real recording test after noise
air conditioner   | 0.52           | 0.51             | 0.44                              | 0.50
car horn          | 0.71           | 0.61             | 0.77                              | 0.65
children playing  | 0.70           | 0.60             | 0.63                              | 0.51
dog bark          | 0.71           | 0.64             | 0.81                              | 0.72
drilling          | 0.64           | 0.60             | 0.62                              | 0.60
engine idling     | 0.62           | 0.60             | 0.32                              | 0.58
gun shot          | 0.79           | 0.81             | 0.74                              | 0.75
jackhammer        | 0.60           | 0.60             | 0.55                              | 0.54
police siren      | 0.70           | 0.71             | 0.68                              | 0.72
street music      | 0.72           | 0.70             | 0.58                              | 0.62
As can be seen from the table, the accuracy is hardly
changed in the case of using only the test set data