International Core Journal of Engineering 2020-26 | Page 195

stands for the collection of code fragments that ICCV reports as clones but that are not real clone fragments. FN represents the collection of real clone code fragments not detected by ICCV.

C. Experimental results and analysis

1) Clone code detection results and analysis

The final result of clone code detection is output in XML (Extensible Markup Language), which provides a data exchange basis for further analysis of the clone code data. Figure 3 shows part of the XML output.

Figure 3. Clone code detection result

In the experiment, 494 clone code fragments and 72 non-clone code fragments (noise) were combined into a single project under test, which was then analyzed with ICCV. According to the detection results, ICCV reported 414 clone code fragments in the data set. To verify whether the detected clone code fragments are evenly distributed among the different projects, Table II shows the distribution of the 494 real clone code fragments in the source projects and of the 414 detected clone code fragments across the 6 projects.

The recall and precision of the detection results obtained by ICCV are calculated using formula (8) and formula (9). The recall of this experiment is 430 / 494 = 87.04%, and the precision is 412 / 414 = 96.38%. The precision is relatively high because noise fragments make up only a small proportion of the data set and because the noise fragments have low similarity to the variation-inserted code fragments. In future work, the numbers of clone code fragments and noise code fragments in the data set will be further expanded.
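As a minimal sketch of how such metrics can be computed, the snippet below treats the detected fragments and the real clone fragments as sets. It assumes that formulas (8) and (9) are the usual set-based definitions, precision = |TP| / (|TP| + |FP|) and recall = |TP| / (|TP| + |FN|); the formulas themselves are not reproduced in this section. The function name `precision_recall` and the fragment identifiers are hypothetical and are not part of ICCV.

```python
def precision_recall(detected, real):
    """Return (precision, recall) for a set of detected fragments
    measured against the set of real clone fragments."""
    tp = detected & real   # real clones that were detected
    fp = detected - real   # reported as clones, but not real clones
    fn = real - detected   # real clones that were missed
    precision = len(tp) / (len(tp) + len(fp)) if detected else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if real else 0.0
    return precision, recall


# Toy data with hypothetical fragment IDs ("n1" plays the role of a noise fragment).
real = {"f1", "f2", "f3", "f4"}
detected = {"f1", "f2", "f3", "n1"}
p, r = precision_recall(detected, real)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.75

# The recall reported in the text: 430 detected out of 494 real clone fragments.
print(f"{430 / 494:.2%}")  # 87.04%
```

Working with sets rather than raw counts keeps the FP and FN collections available for inspection, which is what the error and missed-detection analysis relies on.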
TABLE II. DISTRIBUTION OF TEST RESULTS OF DIFFERENT PROJECTS.

Project                          pandas  scipy  django  pytorch  scikit-learn  keras
Actual code fragments /piece
  Source                             16     46      42       89            14     12
  Variation code insertion           20     51      45      132            15     12
Detect code fragments /piece
  Successful detection               34     92      74      178            28     24
  Error detection                     0      3       1       11             0      0
  Missed detection                    2      5      13       43             1      0

Analyzing the test results against the recorded information of the variation insertion process shows that the 414 detected clone fragments contain 412 real clone fragments. That is, these fragments come from the 494 clone code fragments, which comprise source code fragments and variation-inserted code fragments, and one detected clone fragment comes from a noise code fragment. These counts are the basis for calculating the recall and precision of the clone detection results.

According to the analysis of the error detections and missed detections in Table II, the main cause of missed detection is that replacing identifiers changes their length considerably relative to the corresponding clone fragment, so that a large number of pixels are shifted and misaligned as a whole. In addition, some statements are added or deleted, which changes the overall structural contour of the code fragment. As a result, only a few pixels match when the similarity between the code images is computed. The main cause of error detection is that two code fragments that are not a clone pair have similar structural outlines, so that most of their pixels match and the computed distance between the two images is smaller than the actual distance. This situation is relatively rare, and the algorithm will be improved in subsequent experiments to address it.

To analyze the effect of ICCV on the detection of type-1, type-2 and type-3 clones, this paper counts the three types of clone code in the final test result according to the variation insertion records in the data set, and calculates the recall and precision of each clone type detected by Jaccard and by pHash. Details are shown in Table III.

TABLE III. RECALL AND PRECISION FOR DIFFERENT CLONE TYPES.

Type                  Type-1  Type-2  Type-3
Jaccard  precision      1.00    0.96    1.00
         recall         1.00    0.88    0.60
pHash    precision      1.00    1.00    0.92
         recall         1.00    0.65    0.10

According to the results of Table III, the Jaccard method can accurately detect all type-1 clones and 60% of type-2 clones, while only 10% of type-3 clones can be detected. Using the pHash algorithm, 100% of type-1 clones, 88% of type-2 clones, and 60% of type-3 clones can be detected. It can be concluded that detecting clone code with the pHash algorithm is more accurate and comprehensive than detecting with Jaccard; that is, the pHash algorithm further detects clones that cannot be detected by Jaccard.

2) Visualization results and analysis

The visualization of the detected clone code information is output as a PDF, which makes it easier for developers to analyze the clone data. Figure 4 shows a visualization of the clone code detected in the Django software. The detected clone information was extracted for Django: there are 1639 nodes and 1150 edges, that is, 1639 clone fragments and 1150 clone relationships. The lines in the figure clearly show the structural relationships between the clone code fragments; the larger the node, the more frequently the code fragment is cloned. Zooming in on the image reveals the details between the nodes and the properties of each node. When maintaining