International Core Journal of Engineering 2020-26 | Page 132
three-dimensional array. Finally, we use a convolutional
neural network, which is currently popular and offers a high
recognition rate, to identify it.
the recognition result may contain errors. Thus, based on
Figure 1, Figure 2 shows the flow chart for identifying a
sound event with a convolutional neural network in the
presence of ambient sound. On the basis of Figure 1, the
training data is subjected to the necessary noise-adding
processing, where the added noise is the noise of the actual
scene. At the same time, the test data is audio played back in
the actual scene, and its ambient sound has a direct impact
on event recognition.
II. SOUND EVENT DETECTION PROCESS
Figure 1 shows the basic process by which a convolutional
neural network uses a single feature to identify sound
events. As shown in Figure 1, training extracts features from
the sound signals of a public data set using the classic
MFCC; the resulting sound features are then fed into the
convolutional neural network (CNN) for training and
recognition.
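The core operation the CNN applies to such a feature map is a convolution layer followed by a nonlinearity. As an illustration only (not the paper's actual network, whose architecture is not given here), a minimal NumPy sketch of one convolution-plus-ReLU step on a synthetic MFCC-shaped feature map:

```python
import numpy as np

def conv2d(feature_map, kernel):
    """Valid-mode 2-D convolution (cross-correlation, as in most
    deep-learning frameworks) of a single-channel feature map."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation."""
    return np.maximum(x, 0.0)

# A synthetic "MFCC feature map": 20 coefficients x 50 frames.
mfcc = np.random.default_rng(0).standard_normal((20, 50))
# One 3x3 averaging kernel stands in for a learned filter.
activation = relu(conv2d(mfcc, np.ones((3, 3)) / 9.0))
```

In a real network many such filters are learned jointly and stacked with pooling and fully connected layers; this sketch only shows how a single filter slides over the time-frequency feature map.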
At the same time, the features are combined into a
composite feature. Four short-term features, namely the
Log-Mel spectrogram, the Mel-Frequency Cepstral
Coefficients (MFCC), the short-time zero-crossing rate, and
the short-time energy, make up the new feature, whose name
is formed from the first letter of each feature keyword and is
referred to as LMCE for convenience.
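Two of the four LMCE components, the short-time zero-crossing rate and the short-time energy, can be computed directly by framing the waveform; a minimal NumPy sketch follows. The frame and hop lengths are assumptions (the paper does not specify them here), and the Log-Mel spectrogram and MFCC components would typically come from an audio library rather than hand-rolled code:

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_zcr(x, frame_len=1024, hop_len=512):
    """Short-time zero-crossing rate per frame, as a fraction in [0, 1]."""
    frames = frame_signal(x, frame_len, hop_len)
    signs = np.sign(frames)
    # Count sign changes between adjacent samples within each frame.
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(x, frame_len=1024, hop_len=512):
    """Short-time energy per frame (mean squared amplitude)."""
    frames = frame_signal(x, frame_len, hop_len)
    return np.mean(frames ** 2, axis=1)
```

The per-frame vectors produced here would then be stacked with the Log-Mel and MFCC matrices (which share the same framing) to form the combined LMCE input.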
Under normal circumstances, when the neural network
trained as above recognizes event sounds in the actual
scene, the recognition rate is greatly reduced, and
[Figure 1 diagram: input x(t) → MFCC feature extraction → CNN; test audio in data set y(t) → MFCC feature extraction → well-trained network → result]
Figure 1. Classic sound recognition process under a single feature
[Figure 2 diagram: input x(t) plus noise n(t) → combined LMCE feature extraction → CNN; audio signal with noise y'(t) = y(t) + n(t) → combined LMCE feature extraction → well-trained network → result]
Figure 2. The noise recognition process under the combined features
recording in the actual scene, collected at a sampling
frequency of 44.1 kHz with a sampling accuracy of 16 bit.
In this paper, the noise set of the training set is built as
follows: the test data set is obtained by first selecting a
signal-to-noise ratio and then adding noise to the original
sound at that ratio, as shown in formula (1).
III. AMBIENT SOUND RECOGNITION IN ACTUAL SCENE
A. Sound data set enhancement
In general, most well-regarded neural networks require a
large amount of data for training. For example, when the
Baidu Speech Recognition Online API Service [10] is
trained, the audio data reaches the order of one million
samples. In practical projects it is difficult to obtain data of
this order of magnitude, so some enhancement of the data is
necessary. This article performs data enhancement on public
data sets from the Internet. At the same time, adding noise
to the events in the public data set makes them closer to real
environmental sound, so the data enhancement in this article
is also a noise-adding process, which in turn increases the
generalization ability of the network.
SNR = 10 lg (P_signal / P_noise)    (1)
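Formula (1) can be inverted to find the scale factor that makes a noise recording hit a chosen SNR before mixing. A minimal sketch of that mixing step, assuming signal and noise are already aligned NumPy arrays (the function name is illustrative, not from the paper):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR per formula (1),
    SNR = 10 lg(P_signal / P_noise), then add it to `signal`.
    `noise` must be at least as long as `signal`."""
    noise = noise[:len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10 * lg(p_signal / (k**2 * p_noise)) = snr_db for the scale k.
    k = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + k * noise

# Example: mix a 440 Hz tone with white noise at 5 dB SNR.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
white = np.random.default_rng(1).standard_normal(sr)
mixed = add_noise_at_snr(tone, white, 5.0)
```

Repeating this for each clip and each chosen SNR yields the noise-augmented training and test sets described above.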
Taking the dog bark in the data set as an example, Fig.
3(a) is the waveform of the dog bark before noise is added,
and Fig. 3(b) is the waveform of the dog bark after
environmental noise is added at a signal-to-noise ratio of
5 dB. It can be clearly seen that the latter has significantly
increased complexity compared with the former, which
provides better generalization ability for the identification
network.
The noise added in this paper is obtained by