International Core Journal of Engineering 2020-26 | Page 194
modify a constant value, add whitespace, or apply similar
operations to generate a type-2 clone pair. All source code
fragments obtained after mutation insertion are mutation
code fragments. During this process, all mutation-related
information is recorded so that the recall and precision of
the proposed method's detection results can be calculated.
The data set was constructed, and the mutation-related
information recorded, by the author, 5 clone code researchers
from the author's institution, and 6 enterprise software
developers. This paper works at function granularity and
selects 219 source code fragments from 6 open-source
projects.
algorithm to visualize the detected clone code information.
ForceAtlas2 is a force-directed algorithm that combines the
characteristics of several well-known network layout
algorithms [31][32][33] and spatializes the network by
simulating a physical system. While the algorithm runs,
nodes repel each other like charged particles, and edges
attract the nodes they connect like springs. When the
algorithm converges, the system reaches physical
equilibrium, which yields a stable layout. In this paper, each
clone code fragment is a node and each clone relationship is
an edge, and the network analysis software Gephi is used as
the drawing tool. The clone fragment names and clone
relationships are extracted from the detection results for
network drawing. The main steps of the extraction algorithm
are given in Algorithm 4.
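The repulsion and attraction behaviour described for the layout can be illustrated with a small simulation. The Python sketch below is not ForceAtlas2 itself and not Gephi's implementation; it is a simplified force-directed loop that assumes a degree-weighted repulsion inversely proportional to distance and an attraction proportional to distance along edges. The function name `layout`, the parameters `k_repulse`, `step` and `iterations`, and the fragment names in the usage note are hypothetical.

```python
import math


def layout(positions, edges, k_repulse=1.0, step=0.05, iterations=300):
    """Run a simplified force-directed layout (illustrative, not ForceAtlas2)."""
    pos = {n: list(p) for n, p in positions.items()}
    # Node degree is used to weight repulsion (an assumption of this sketch).
    degree = {n: 0 for n in pos}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    nodes = list(pos)
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart, like charged particles.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k_repulse * (degree[a] + 1) * (degree[b] + 1) / dist
                fx = f * dx / dist
                fy = f * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        # Attraction: each edge pulls its endpoints together, like a spring.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += dx
            force[a][1] += dy
            force[b][0] -= dx
            force[b][1] -= dy
        # Move every node a small step along its net force.
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return {n: tuple(p) for n, p in pos.items()}
```

For example, calling `layout({"a.py": (0.0, 0.0), "b.py": (1.0, 0.0), "c.py": (0.0, 1.0)}, [("a.py", "b.py")])` keeps the two linked fragments close together while the unlinked fragment drifts away, which is the visual effect that makes clone groups stand out in the drawing.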
The mutation insertion process covers three cases. The first
case adds or deletes whitespace characters or comments in
the source code fragment, or modifies its comments, so that
the mutated fragment and the source code fragment form a
type-1 clone pair. The second case modifies data types,
identifiers, constant values and similar elements of the
source code fragment, so that the source code fragment and
the mutated fragment form a type-2 clone pair. The third
case deletes, modifies or adds a small number of statements
in the source code fragment while keeping its function
unchanged, so that the two fragments form a type-3 clone
pair. After mutation insertion, 275 mutation fragments were
obtained, so the data set contains 494 clone code fragments
(219 source fragments plus 275 mutation fragments). To make
the results more objective, 72 non-clone code fragments,
called noise fragments, were added; no two noise fragments
are clones of each other. The basic project information is
shown in Table I.
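The three mutation cases can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure used to build the data set: the function names `mutate_type1`, `mutate_type2`, `mutate_type3` and `normalize` are hypothetical, the type-3 helper assumes the first line is a `def` header with a 4-space indented body, and `normalize` strips comments naively (it would also cut a `#` inside a string literal).

```python
import re


def mutate_type1(source):
    """Type-1 mutation: add a comment and trailing whitespace only."""
    mutated = ["# mutated copy (comment inserted)"]
    for line in source.splitlines():
        mutated.append(line.rstrip() + "  ")
    return "\n".join(mutated) + "\n"


def mutate_type2(source, old_name, new_name):
    """Type-2 mutation: rename one identifier consistently."""
    return re.sub(rf"\b{re.escape(old_name)}\b", new_name, source)


def mutate_type3(source, statement):
    """Type-3 mutation: insert one statement right after the def header."""
    lines = source.splitlines()
    return "\n".join([lines[0], "    " + statement] + lines[1:]) + "\n"


def normalize(source):
    """Drop comments, trailing whitespace and blank lines (naive)."""
    kept = []
    for line in source.splitlines():
        line = line.split("#", 1)[0].rstrip()
        if line:
            kept.append(line)
    return kept
```

With `src = "def area(w, h):\n    total = w * h\n    return total\n"`, the type-1 mutant differs from `src` as text but has the same `normalize` output, the type-2 mutant `mutate_type2(src, "total", "result")` differs only in the identifier, and the type-3 mutant `mutate_type3(src, "unused = 0")` gains one statement while computing the same value.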
Algorithm 4 Extract Clone Information Algorithm.
function EXTRACTION(result_xml)
    node = {}, edge = {}, N = [], file = [], clone_file = [], flag = FALSE
    for each line ∈ all lines do
        if "similarity" in line then
            N.append(clone_similarity)
        end if
        if "file" in line then
            if node.get(file_name) is None then
                node[file_name] ← 0
            end if
            node[file_name] ← node[file_name] + 1
            if not flag then
                file.append(file_name)
                flag ← TRUE
            else
                clone_file.append(file_name)
                flag ← FALSE
            end if
        end if
    end for
    for i in range(N.length) do
        edge[file[i]] ← {clone_file[i] : N[i]}
    end for
    return node, edge
end function
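A runnable counterpart to Algorithm 4 might look like the Python sketch below. The layout of the result file is not given in this section, so the sketch assumes, purely for illustration, that each clone pair is reported as one line carrying a `similarity="..."` attribute followed by two lines carrying a `file="..."` attribute. The names `extract`, `SIM_RE` and `FILE_RE` are hypothetical. Unlike the pseudocode, the sketch accumulates several clone partners per file instead of overwriting the previous entry, which is a deliberate deviation.

```python
import re
from collections import defaultdict

# Assumed attribute syntax of the detection report (hypothetical format).
SIM_RE = re.compile(r'similarity="([\d.]+)"')
FILE_RE = re.compile(r'file="([^"]+)"')


def extract(lines):
    """Return (node, edge): clone counts per file and weighted clone pairs."""
    node = defaultdict(int)
    edge = defaultdict(dict)
    sims, first, second = [], [], []
    expecting_second = False  # plays the role of `flag` in Algorithm 4
    for line in lines:
        sim = SIM_RE.search(line)
        if sim:
            sims.append(float(sim.group(1)))
        found = FILE_RE.search(line)
        if found:
            name = found.group(1)
            node[name] += 1
            # Alternate between the first and second fragment of a pair.
            if not expecting_second:
                first.append(name)
            else:
                second.append(name)
            expecting_second = not expecting_second
    # One edge per reported pair, weighted by its similarity.
    for a, b, s in zip(first, second, sims):
        edge[a][b] = s
    return dict(node), {k: dict(v) for k, v in edge.items()}
```

The returned `node` mapping can serve as a node table and `edge` as a weighted edge list for a drawing tool such as Gephi, for instance by writing the pairs out as rows of source, target and weight.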
TABLE I. PROJECT BASIC INFORMATION
Project        Version   Project size/row   Basic function
pandas         0.24.0    384384             Data structures and data analysis tools
scipy          1.2.0     318131             Scientific computing tool
django         2.1.4     333179             A rapidly developed high-level Python Web framework
pytorch        1.0.0     244439             End-to-end deep learning platform
scikit-learn   0.20.2    230534             Data mining and data analysis tools
keras          2.2.4     66156              Advanced neural network API
B. Evaluation index
Clone detection is a binary classification problem: a code
fragment either is or is not a clone fragment. Therefore,
recall and precision are adopted to evaluate the
experimental results [23][36][37]. Recall refers to the ratio
of the real clone code detected to the total number of real
clones.
IV. EXPERIMENTAL ANALYSIS
A. Data set
Since the methods proposed by researchers differ in
detection granularity, target language and system size,
there is no international standard test data set for
evaluating clone code detection tools. Svajlenko et al.
proposed a Java code set, BigCloneBench [34], but it is
limited to one language and only ten functionalities, and
cannot be applied to all detection tools. To evaluate ICCV
accurately and objectively, this paper draws on the mutation
insertion [35] test method proposed by Roy's team. This
method does not rely on any tool or programming language,
so it allows a fairer and more independent evaluation of
detection results. Its core idea is to manually modify, add or
delete statements in a given source code fragment to
generate type-1 to type-3 code clones. For example, change
a type name in the source code,
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
Precision refers to the proportion of real clone code among
the candidate code clones reported by the clone detection
algorithm.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$
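Following the prose definitions of recall and precision, both measures can be computed directly from the set of detected clone fragments and the set of real clone fragments. The Python sketch below is a minimal illustration; the function name `recall_precision` and the fragment labels in the usage note are hypothetical, and FP and FN are taken in their standard sense (detected but not real, and real but not detected).

```python
def recall_precision(detected, actual):
    """Return (recall, precision) for detected vs. real clone fragments."""
    detected, actual = set(detected), set(actual)
    tp = len(detected & actual)   # real clones that were detected
    fp = len(detected - actual)   # reported clones that are not real
    fn = len(actual - detected)   # real clones that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

For instance, with real clones `{"a", "b", "c", "d"}` and detected fragments `{"a", "b", "x"}`, two of the four real clones are found and two of the three reports are correct, giving a recall of 0.5 and a precision of 2/3.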
TP represents the intersection of the clone fragments
detected by ICCV and the real clone code fragments. FP