International Core Journal of Engineering 2020-26 | Page 194

algorithm to visualize the detected clone code information. ForceAtlas2 is a force-directed algorithm that combines the characteristics of several well-known network layout algorithms [31][32][33] and spatializes the network by simulating a physical system. While the algorithm runs, nodes repel each other like charged particles, and edges attract the nodes they connect like springs. Once the algorithm stabilizes, the system reaches physical equilibrium, which yields a stable layout. In this paper, each clone code fragment is a node and each clone relationship is an edge, and the network analysis software Gephi is used as the drawing tool: the clone fragment names and clone relationships are extracted and drawn as a network. The main steps of the extraction algorithm are as follows:

Algorithm 4 Extract Clone Information Algorithm
function EXTRACTION(result_xml)
    node = {}, edge = {}, N = [], file = [], clone_file = [], flag = FALSE
    for each line ∈ all lines do
        if "similarity" in line then
            N.append(clone_similarity)
        end if
        if "file" in line then
            if node.get(file_name) is None then
                node[file_name] ← 0
            end if
            node[file_name] ← node[file_name] + 1
            if not flag then
                file.append(file_name)
                flag ← TRUE
            else
                clone_file.append(file_name)
                flag ← FALSE
            end if
        end if
    end for
    for i in range(N.length) do
        edge[file[i]] ← {clone_file[i] : N[i]}
    end for
    return node, edge
end function

IV. EXPERIMENTAL ANALYSIS

A. Data set

Because the methods proposed by researchers differ in detection granularity, target language and system size, there is no international standard test data set for evaluating clone code detection tools. Svajlenko et al. proposed a Java code set, BigCloneBench [34], but it is limited to one language and covers only ten functionalities, so it cannot be applied to all detection tools. To evaluate ICCV accurately and objectively, this paper draws on the mutation insertion [35] test method proposed by Roy's team. This method does not rely on any tool or programming language, so it allows a fairer and more independent evaluation of detection results. Its core idea is to manually modify, add or delete statements in a given source code fragment to generate type-1 to type-3 code clones. For example, changing a type name, modifying a constant value, adding whitespace and similar operations on the source code generate a type-2 clone pair. All source code fragments obtained after mutation insertion are mutation code fragments. During this process, all mutation-related information is recorded so that the recall and precision of the proposed method's detection results can be calculated. The data set was constructed, and the mutation-related information recorded, by the author, 5 clone code research experts from the university, and 6 enterprise software developers. This paper takes function granularity as the research object and selects 219 source code fragments from 6 open source projects.

The specific mutation insertion process includes three cases. The first case adds or deletes whitespace characters or comments in the source code fragment, or modifies its comments, so that the mutated fragment and the source fragment form a type-1 clone pair. The second case modifies data types, identifiers, constant values and the like in the source code fragment, so that the source fragment and the mutated fragment form a type-2 clone pair. The third case deletes, modifies or adds a small number of statements in the source code fragment while keeping the modified fragment functionally the same as the source fragment, so that the two form a type-3 clone pair. After mutation insertion, 275 mutation fragments were obtained, giving 494 clone code fragments in the data set. To make the results more objective, 72 non-clone code fragments (noise fragments) were added, no two of which are clones of each other. The basic project information is shown in Table I.

TABLE I. PROJECT BASIC INFORMATION

Project        Version   Project size (lines)   Basic function
pandas         0.24.0    384384                 Data structures and data analysis tools
scipy          1.2.0     318131                 Scientific computing tool
django         2.1.4     333179                 A rapidly developed high-level Python Web framework
pytorch        1.0.0     244439                 End-to-end deep learning platform
scikit-learn   0.20.2    230534                 Data mining and data analysis tools
keras          2.2.4     66156                  Advanced neural network API

B. Evaluation index

Detection is a binary classification problem: a code fragment either is a clone fragment or is not. Therefore, recall and precision are adopted to evaluate the experimental results [23][36][37]. Recall is the ratio of the real clone code detected to the total number of real clones.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$

Precision is the proportion of real clone code among the candidate code clones reported by the clone detection algorithm.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$

TP represents the intersection of the clone fragments detected by ICCV and the real clone code fragments. FP