frame features, superimposed with the 1-dimensional, 200-frame short-term energy, for a total of 40 dimensions over 200 frames; that is, the feature array format is 40*200. The second set of two-dimensional features consists of MFCC and the short-time zero-crossing rate: 13-dimensional, 200-frame MFCC features are extracted, and first-order and second-order differences are computed to obtain 39-dimensional, 200-frame features, which are superimposed with the 1-dimensional, 200-frame short-time zero-crossing rate feature, again giving 40 dimensions over 200 frames, i.e. a 40*200 feature array. The two two-dimensional feature arrays are then stacked into a three-dimensional array by adding a channel dimension with value 2, so the resulting feature array has the format 2*40*200. The combination method is shown in Figure 6.
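As a concrete illustration, a minimal sketch of how such a 2*40*200 array could be assembled is given below. It is not the authors' implementation: the sampling rate, frame length and hop size are assumed values chosen only so that a 4 s clip yields roughly 200 frames, and librosa is used for the feature extraction.

# Sketch only: build the combined 2*40*200 feature array for one clip.
import numpy as np
import librosa

def combined_features(path, sr=22050, n_fft=1024, hop=441, n_frames=200):
    y, _ = librosa.load(path, sr=sr, duration=4.0)

    # 13-dimensional MFCC plus first- and second-order differences -> 39 x T
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    mfcc39 = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])

    # 1-dimensional short-term energy and short-time zero-crossing rate -> 1 x T each
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    energy = np.sum(frames ** 2, axis=0, keepdims=True)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)

    # Pad or trim every stream to exactly 200 frames, stack each 39-dim MFCC block
    # with one 1-dim feature to get two 40 x 200 planes, then add the channel axis.
    def fit(a):
        return librosa.util.fix_length(a, size=n_frames, axis=1)

    plane_energy = np.vstack([fit(mfcc39), fit(energy)])   # MFCC + short-term energy
    plane_zcr = np.vstack([fit(mfcc39), fit(zcr)])         # MFCC + zero-crossing rate
    return np.stack([plane_energy, plane_zcr])             # shape (2, 40, 200)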
2) The second layer uses the same settings as the first layer, except that max pooling is added to reduce the feature dimension.
3) The third layer uses 64 convolution kernels with a 3×3 receptive field and a 2×2 stride, followed by batch normalization and the ReLU activation function; this layer also performs max pooling.
4) The fourth layer is a fully connected layer with 1024 hidden units, and its activation function is Sigmoid.
5) In accordance with the 10 categories of the data set, the network is finally connected to an output layer that uses Softmax as the activation function.
During training, in order to prevent overfitting, a dropout rate of 0.5 is applied to the second, third and fully connected layers. The batch size is set to 64, all weight parameters are regularized with L2, and the learning rate is set to 0.001.
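Putting the layer settings and training parameters described in this section together, a minimal sketch of a matching network is shown below. It is not the authors' exact implementation: the pooling size, the L2 coefficient and the choice of Adam as the optimizer are assumptions, and the 2*40*200 input is laid out channels-last.

# Sketch of the described model (assumed details: 2x2 "same" pooling,
# L2 coefficient 0.001, Adam optimizer; input is the 2*40*200 feature array).
from tensorflow.keras import layers, models, optimizers, regularizers

l2 = regularizers.l2(0.001)  # assumed L2 coefficient

model = models.Sequential([
    layers.Input(shape=(40, 200, 2)),                      # channels-last 2*40*200 input
    # 1) 32 kernels, 3x3 receptive field, 3x3 stride, batch normalization, ReLU
    layers.Conv2D(32, 3, strides=3, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    # 2) same settings as the first layer, plus max pooling and dropout
    layers.Conv2D(32, 3, strides=3, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    layers.MaxPooling2D(2, padding="same"), layers.Dropout(0.5),
    # 3) 64 kernels, 3x3 receptive field, 2x2 stride, batch norm, ReLU, max pooling, dropout
    layers.Conv2D(64, 3, strides=2, padding="same", kernel_regularizer=l2),
    layers.BatchNormalization(), layers.Activation("relu"),
    layers.MaxPooling2D(2, padding="same"), layers.Dropout(0.5),
    # 4) fully connected layer with 1024 hidden units and Sigmoid activation, dropout
    layers.Flatten(),
    layers.Dense(1024, activation="sigmoid", kernel_regularizer=l2),
    layers.Dropout(0.5),
    # 5) 10-way Softmax output for the 10 sound categories
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])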
IV. EXPERIMENTAL RESULTS AND ANALYSIS
The base data set for this article is UrbanSound8K [15], which contains 10 ambient sound events. Each recording is about 4 s long and is placed in one of 10 fold folders, and each audio file is named with a label so that the neural network can be trained.
Figure.6 Feature combination
C. Convolutional Neural Network
Convolutional Neural Network (CNN) can be regarded as a special form of the standard neural network and has been widely used in the field of speech recognition to improve on the weak robustness, poor real-time performance, low recognition accuracy and other disadvantages of traditional acoustic models [13]. Two points are particularly noteworthy: one is the local receptive field, i.e. each neuron only needs to perceive a local feature, and the other is weight sharing, in which one convolution kernel processes every position of the feature with the same weights. Through these two points, the convolutional neural network greatly reduces the number of parameters that need to be trained [14].
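To make the saving concrete, the short calculation below compares the parameter count of a 32-kernel, 3×3 convolutional layer with that of a 1024-unit fully connected layer applied to the same 2*40*200 input; the comparison is illustrative only, and the dense layer is not part of the model used in this paper.

# Illustrative arithmetic: weight sharing makes the convolutional layer's
# parameter count independent of the input size.
in_h, in_w, in_c = 40, 200, 2          # the 2*40*200 feature array
kernels, k = 32, 3                     # 32 kernels, 3x3 receptive field

conv_params = kernels * (k * k * in_c + 1)          # weights + bias per kernel
dense_params = (in_h * in_w * in_c) * 1024 + 1024   # a 1024-unit dense layer on the same input

print(conv_params)   # 608
print(dense_params)  # 16385024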
The experiment used the data in 9 of the 10 fold folders for training and the data in the remaining fold folder for testing. To keep the results credible for this data set, the ten-fold cross-validation published by the official website is used: the model is trained on the data from 9 of the 10 predefined folders and tested on the data from the remaining folder [16], and the process is repeated 10 times.
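A minimal sketch of this evaluation loop is shown below; load_fold and build_model are hypothetical helpers standing in for the feature-extraction and model-construction steps, and the number of training epochs is an assumption.

# Sketch of the official 10-fold protocol (helpers and epochs are assumptions).
import numpy as np

def cross_validate(folds, build_model, load_fold):
    """folds: fold identifiers, e.g. [1, ..., 10].
    load_fold(f) -> (features, labels); build_model() -> compiled Keras model."""
    accuracies = []
    for test_fold in folds:
        train = [f for f in folds if f != test_fold]
        x_train = np.concatenate([load_fold(f)[0] for f in train])
        y_train = np.concatenate([load_fold(f)[1] for f in train])
        x_test, y_test = load_fold(test_fold)

        model = build_model()
        model.fit(x_train, y_train, batch_size=64, epochs=50, verbose=0)  # epochs assumed
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies))   # average recognition rate over the 10 folds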
A. Data noise recognition test
It can be seen from Table I that, after noise is added, the recognition rate in the single-feature case improves only slightly compared with the un-noised data; some sounds even have a lower recognition rate because of the noise, so the overall effect is not obvious. However, on test sounds with real ambient sound recorded using third-party equipment, the noise-added model performs significantly better than the one trained on un-noised data. The data set test recognition rate was 73.6%, and the actual scene test recognition rate was 70.1%.
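Noise addition here means mixing background recordings into the training audio before feature extraction. The sketch below is only a hypothetical illustration, since the mixing level (expressed as a signal-to-noise ratio) and the noise source are not specified in this section.

# Hypothetical sketch of additive noise augmentation (SNR-based gain is an assumption).
import numpy as np

def add_noise(clean, noise, snr_db=10.0):
    """Mix a noise recording into a clean clip at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)                 # repeat or trim noise to clip length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise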
The convolutional neural network used in this paper has the model structure shown in Figure 7.
Figure.7 Convolutional neural network model
In order to evaluate the applicability of our proposed features and their combinations in practical scenarios, we use a convolutional neural network to classify them. The model architecture is set as follows:
1) The first layer uses 32 convolution kernels, the receptive field is 3×3 and the stride is set to 3×3, followed by batch normalization; ReLU is used as the activation function.
TABLE I IDENTIFICATION RATE TEST BEFORE AND AFTER NOISE ADDITION UNDER A SINGLE FEATURE

Sound category    | Pre-noise test | Test after noise | Real recording test before noise | Real recording test after noise
air conditioner   | 0.52           | 0.51             | 0.44                              | 0.50
car horn          | 0.71           | 0.61             | 0.77                              | 0.65
children playing  | 0.70           | 0.60             | 0.63                              | 0.51
dog bark          | 0.71           | 0.64             | 0.81                              | 0.72
drilling          | 0.64           | 0.60             | 0.62                              | 0.60
engine idling     | 0.62           | 0.60             | 0.32                              | 0.58
gun shot          | 0.79           | 0.81             | 0.74                              | 0.75
jackhammer        | 0.60           | 0.60             | 0.55                              | 0.54
police siren      | 0.70           | 0.71             | 0.68                              | 0.72
street music      | 0.72           | 0.70             | 0.58                              | 0.62
As can be seen from the table, the accuracy is hardly
changed in the case of using only the test set data