International Core Journal of Engineering 2020-26 | Page 195
stands for the collection of code fragments that ICCV
reports as clone fragments but that are not real clone
fragments. FN represents the collection of real clone code
fragments not detected by ICCV.
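The sets above can be turned into precision and recall as in the minimal sketch below. It assumes that formulas (8) and (9) are the standard definitions, precision = |TP| / (|TP| + |FP|) and recall = |TP| / (|TP| + |FN|), where TP is the set of real clone fragments that were detected. The function name and the fragment identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(detected, real):
    """Return (precision, recall) of detected fragments against the real clones."""
    tp = detected & real   # real clone fragments that were detected
    fp = detected - real   # reported as clones, but not real clones
    fn = real - detected   # real clone fragments that were missed
    precision = len(tp) / (len(tp) + len(fp)) if detected else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if real else 0.0
    return precision, recall


# Hypothetical fragment identifiers, for illustration only.
real = {"f1", "f2", "f3", "f4"}
detected = {"f1", "f2", "f3", "n1"}  # "n1" is a noise fragment reported by mistake
print(precision_recall(detected, real))  # (0.75, 0.75)
```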
C. Experimental results and analysis
1) Clone code detection results and analysis
The final result of clone code detection is output in
XML (Extensible Markup Language), which provides a data
exchange basis for further analysis of the clone code data.
Figure 3 shows part of the XML result.
Figure 3. Clone code detection result
In the experiment, 494 clone code fragments and 72
non-clone (noise) code fragments were integrated into a
single project under test, which was then analyzed with ICCV.
According to the detection results, there are 414 clone code
fragments in the data set. To verify whether the
detected clone code fragments are evenly distributed among
the different projects, Table II shows the distribution of the
494 real clone code fragments across the source projects and
of the 414 detected clone code fragments across the 6 projects.
The recall and precision of the clone detection results
obtained by ICCV are calculated using formula (8) and formula (9). The
recall of this experiment is 430 / 494 = 87.04%, precision is
412 / 414 = 96.38%. The precision of this experiment is
relatively high because noise fragments make up only a small
proportion of the code fragments in the data set, and because
the similarity between the noise fragments and the variation
insertion code fragments is low. In future work, the numbers
of clone code fragments and noise code fragments in the data
set will be expanded further.
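The recall figure can be cross-checked against the per-project counts in Table II. The sketch below is only an arithmetic check under one assumption: that each row's successful and missed detections together account for all of that row's actual fragments. The variable names are illustrative, and the rows are listed without project names because only the totals matter here.

```python
# One tuple per project row of Table II:
# (source code, variation insertion, successful, error, missed)
rows = [
    (16, 20, 34, 0, 2),
    (46, 51, 92, 3, 5),
    (42, 45, 74, 1, 13),
    (89, 132, 178, 11, 43),
    (14, 15, 28, 0, 1),
    (12, 12, 24, 0, 0),
]

# Assumed consistency rule: actual fragments = successful + missed detections.
for src, var, ok, err, miss in rows:
    assert src + var == ok + miss

actual = sum(src + var for src, var, ok, err, miss in rows)
successful = sum(ok for src, var, ok, err, miss in rows)
missed = sum(miss for src, var, ok, err, miss in rows)
recall = successful / actual

print(actual, successful, missed, f"{recall:.2%}")  # 494 430 64 87.04%
```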
According to the analysis of the erroneous and missed
detections in Table II, the main cause of missed detection
is that identifier replacement changed the identifier lengths
substantially between a fragment and its corresponding clone
code fragment, so that a large number of pixels shifted and
the code image as a whole became misaligned. Some of these
fragments also contain added or deleted statements, which
changes the overall structural contour of the code snippet.
As a result, only a few pixels match when the similarity
between the code images is computed. The main cause of
erroneous detection is that two code fragments that are not
a clone pair nevertheless have similar structural outlines,
so that most of their pixels match and the distance
calculated between the two images is smaller than the actual
distance. This situation is relatively rare, and the
algorithm will be improved in subsequent experiments to
address it.
In order to analyze the effect of ICCV on the detection of
type-1, type-2 and type-3 clone code, this paper counts the
number of clones of each of the three types in the final test
result according to the variation insertion records in the
data set, and calculates the recall and precision of each
clone type as detected by Jaccard and pHash. Details are
shown in Table III.
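The missed-detection mechanism described above (pixels shifting when an identifier changes length or a statement is added) can be illustrated with a toy example. The sketch below does not reproduce ICCV's image encoding. It assumes a simplified binary "code image" in which every non-space character is one lit pixel, and the names `code_mask`, `pixel_match_ratio` and `WIDTH` are hypothetical.

```python
WIDTH = 40  # assumed fixed image width, in characters


def code_mask(code):
    """Toy binary code image: 1 for a non-space character, 0 for a blank."""
    return [[0 if ch == " " else 1 for ch in line.ljust(WIDTH)[:WIDTH]]
            for line in code.splitlines()]


def pixel_match_ratio(a, b):
    """Pixels lit in both masks divided by pixels lit in either mask."""
    blank = [0] * WIDTH
    both = either = 0
    for i in range(max(len(a), len(b))):
        row_a = a[i] if i < len(a) else blank
        row_b = b[i] if i < len(b) else blank
        for pa, pb in zip(row_a, row_b):
            both += pa & pb
            either += pa | pb
    return both / either if either else 1.0


original = "total = 0\nfor x in xs:\n    total += x\nprint(total)"
renamed = original.replace("total", "accumulated_sum")  # longer identifier
inserted = "import sys\n" + original                    # one added statement

base = code_mask(original)
print(pixel_match_ratio(base, base))                 # 1.0
print(pixel_match_ratio(base, code_mask(renamed)))   # below 1.0: columns shift
print(pixel_match_ratio(base, code_mask(inserted)))  # below 1.0: rows shift
```

Even in this simplified setting, a longer identifier moves every later character on the line, and an added statement moves every later line, so a position-wise comparison sees far fewer matching pixels than the code's logical similarity would suggest.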
TABLE II. DISTRIBUTION OF TEST RESULTS OF DIFFERENT PROJECTS.
               Actual code fragments /piece   Detected code fragments /piece
Project        Source    Variation            Successful   Error       Missed
               code      insertion            detection    detection   detection
pandas         16        20                   34           0           2
scipy          46        51                   92           3           5
django         42        45                   74           1           13
pytorch        89        132                  178          11          43
scikit-learn   14        15                   28           0           1
keras          12        12                   24           0           0
TABLE III. RECALL AND PRECISION FOR DIFFERENT CLONE TYPES.
           Jaccard                pHash
Type       precision   recall     precision   recall
Type-1     1.00        1.00       1.00        1.00
Type-2     0.96        0.65       1.00        0.88
Type-3     1.00        0.10       0.92        0.60
According to the results in Table III, the Jaccard method
accurately detects all type-1 clones and 60% of type-2 clones,
but only 10% of type-3 clones. Using the pHash algorithm,
100% of type-1 clones, 88% of type-2 clones, and 60% of
type-3 clones are detected. It can be concluded that
detecting clone code with the pHash algorithm is more
accurate and comprehensive than detecting it with Jaccard;
that is, the pHash algorithm can further detect clones that
cannot be detected by Jaccard.
According to the information recorded during the variation
insertion process, analysis of the test results shows that
the 414 detected clone fragments contain 412 real clone
fragments. That is, these clone fragments come from the
494 clone code fragments, which comprise source code
fragments and variation insertion code fragments, and one
detected clone fragment comes from the noise code fragments.
These counts are the basis for the recall and precision of
the clone detection results.
2) Visualization results and analysis
The visualization of the detected clone code information is
output as a PDF, which makes it easier for developers to
analyze the clone data. Figure 4 shows a visualization of the
clone code detected in the django software.
ICCV was run on the django software, and the detected
clone information was extracted. There are 1639 nodes and
1150 edges, that is, 1639 clone fragments and 1150 clone
relationships. The lines in the figure clearly show the
structural relationship between the clone code fragments;
the larger the node, the more frequently the code fragment
is cloned. Zoom in on the image to see the details between
the nodes and the properties of the node. When maintaining