International Core Journal of Engineering 2020-26 | Page 132
three-dimensional array. Finally, we use a convolutional
neural network, which is currently popular and offers a high
recognition rate, to identify it.
the recognition result may contain errors. Thus, based on
Figure 1, Figure 2 shows the flow chart for identifying a
sound event with a convolutional neural network in the
presence of ambient sound. On the basis of Figure 1, the
training data is subjected to the necessary noise-adding
processing, where the added noise is the noise of the actual
scene. At the same time, the test data is audio played back in
the actual scene, and its ambient sound has a direct impact
on event recognition.
II. SOUND EVENT DETECTION PROCESS
Figure 1 shows the basic process by which a convolutional
neural network uses a single feature to identify sound
events. As shown in Figure 1, training extracts features from
the sound signals of a public data set using the classic
MFCC; the resulting sound features are then fed into the
convolutional neural network (CNN) for training and
recognition.
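The core operation the CNN applies to such a feature map is a convolution layer followed by a nonlinearity. As an illustration only (not the paper's actual network, whose architecture is not given here), a minimal NumPy sketch of one convolution-plus-ReLU step on a synthetic MFCC-shaped feature map:

```python
import numpy as np

def conv2d(feature_map, kernel):
    """Valid-mode 2-D convolution (cross-correlation, as in most
    deep-learning frameworks) of a single-channel feature map."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation."""
    return np.maximum(x, 0.0)

# A synthetic "MFCC feature map": 20 coefficients x 50 frames.
mfcc = np.random.default_rng(0).standard_normal((20, 50))
# One 3x3 averaging kernel stands in for a learned filter.
activation = relu(conv2d(mfcc, np.ones((3, 3)) / 9.0))
```

In a real network many such filters are learned jointly and stacked with pooling and fully connected layers; this sketch only shows how a single filter slides over the time-frequency feature map.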
At the same time, the features are combined into a
composite feature. Four short-term features, namely the
Log-Mel spectrogram, the Mel-Frequency Cepstral
Coefficients (MFCC), the short-time zero-crossing rate, and
the short-time energy, make up the new feature, whose name
is formed from the first letter of each feature keyword and is
referred to as LMCE for convenience.
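Two of the four LMCE components, the short-time zero-crossing rate and the short-time energy, can be computed directly by framing the waveform; a minimal NumPy sketch follows. The frame and hop lengths are assumptions (the paper does not specify them here), and the Log-Mel spectrogram and MFCC components would typically come from an audio library rather than hand-rolled code:

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_zcr(x, frame_len=1024, hop_len=512):
    """Short-time zero-crossing rate per frame, as a fraction in [0, 1]."""
    frames = frame_signal(x, frame_len, hop_len)
    signs = np.sign(frames)
    # Count sign changes between adjacent samples within each frame.
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(x, frame_len=1024, hop_len=512):
    """Short-time energy per frame (mean squared amplitude)."""
    frames = frame_signal(x, frame_len, hop_len)
    return np.mean(frames ** 2, axis=1)
```

The per-frame vectors produced here would then be stacked with the Log-Mel and MFCC matrices (which share the same framing) to form the combined LMCE input.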
Under normal circumstances, when the neural network
trained as above recognizes event sounds in the actual
scene, the recognition rate is greatly reduced, and
[Figure 1 diagram: input x(t) → MFCC feature extraction → CNN; test audio in data set y(t) → MFCC feature extraction → well-trained network → result]
Figure 1. Classic sound recognition process under a single feature
[Figure 2 diagram: input x(t) plus noise n(t) → combined LMCE feature extraction → CNN; audio signal with noise y'(t) = y(t) + n(t) → combined LMCE feature extraction → well-trained network → result]
Figure 2. The noise recognition process under the combined features
recording in the actual scene, collected at a sampling
frequency of 44.1 kHz with a sampling accuracy of 16 bit.
In this paper, the noise set of the training set is built as
follows: the test data set is obtained by first selecting a
signal-to-noise ratio and then adding noise to the original
sound at that ratio, as shown in formula (1).
III. AMBIENT SOUND RECOGNITION IN ACTUAL SCENE
A. Sound data set enhancement
In general, most well-regarded neural networks require a
large amount of data for training. For example, when the
Baidu Speech Recognition Online API Service [10] is
trained, the audio data reaches the order of one million
samples. In practical projects it is difficult to obtain data of
this order of magnitude, so some enhancement of the data is
necessary. This article performs data enhancement on public
data sets from the Internet. At the same time, adding noise
to the events in the public data set makes them closer to real
environmental sound, so the data enhancement in this article
is also a noise-adding process, which in turn increases the
generalization ability of the network.
SNR = 10 lg (P_signal / P_noise)    (1)
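Formula (1) can be inverted to find the scale factor that makes a noise recording hit a chosen SNR before mixing. A minimal sketch of that mixing step, assuming signal and noise are already aligned NumPy arrays (the function name is illustrative, not from the paper):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR per formula (1),
    SNR = 10 lg(P_signal / P_noise), then add it to `signal`.
    `noise` must be at least as long as `signal`."""
    noise = noise[:len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10 * lg(p_signal / (k**2 * p_noise)) = snr_db for the scale k.
    k = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + k * noise

# Example: mix a 440 Hz tone with white noise at 5 dB SNR.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
white = np.random.default_rng(1).standard_normal(sr)
mixed = add_noise_at_snr(tone, white, 5.0)
```

Repeating this for each clip and each chosen SNR yields the noise-augmented training and test sets described above.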
Taking the dog bark in the data set as an example, Fig.
3(a) is the waveform of the dog bark before noise is added,
and Fig. 3(b) is the waveform of the dog bark after
environmental noise is added at a signal-to-noise ratio of
5 dB. It can be clearly seen that the latter has significantly
increased complexity compared with the former, which
provides better generalization ability for the identification
network.
The noise added in this paper is obtained by