International Core Journal of Engineering 2020-26 | Page 191

detection methods in academia[4][9][10][13][16][17][19][20], which are mainly text-based, lexical-based, grammar-based and semantics-based.

Text-based clone code detection methods mainly include NiCad[9], SDD[10], Dup[11], Duploc[12] and so on. NiCad first preprocesses the source code, then transforms it according to specific rules, and finally uses dynamic matching to find the longest common subsequence; it can detect type-1 to type-3 clones. SDD builds an inverted index directly on the source code and uses the n-nearest neighbor algorithm for detection; it can also detect type-1 to type-3 clones. However, source code carries meaning of its own, so treating it as plain text loses a lot of information.

Lexical-based clone code detection technologies mainly include CCFinder[4], CP-Miner[13], CCLearner[14], Boreas[15], etc. CCFinder converts the source code into a normalized, parameterized token sequence according to certain rules, and then uses a suffix tree matching algorithm for detection. CP-Miner uses frequent subsequence mining to detect clones in large-scale systems. Both can detect type-1 and type-2 clones. The lexical representation of code ignores the syntactic and semantic information of the source code.

Grammar-based clone code detection methods mainly include Deckard[16], CloneDR[17], CDLH[18] and so on. Deckard and CloneDR parse the source code into an abstract syntax tree (AST), and then use tree matching and locality-sensitive hashing to compare similarity. Deckard can detect type-1 to type-3 clones, and CloneDR can detect type-1 and type-2 clones. Because tree matching is complex, the computational overhead of this approach is relatively large.

Semantic-based clone code detection technologies mainly include Duplix[19], ConQAT[20], and clone detection using slices of isomorphic program dependency graphs[21]. Duplix generates a program dependency graph for the source code and uses a slicing-based method that matches isomorphic subgraphs of the program dependency graph; it can detect type-1 to type-4 clones. ConQAT tokenizes the source code and uses a detection algorithm based on a suffix tree and an index to detect type-1 and type-2 clones. The algorithms used by graph-based detection have high time and space complexity and high computational overhead.

B. Clone data visualization

The large amount of detected clone data makes it hard for developers to understand the complex clone information in the source code. It is therefore necessary to visualize the clone code and present it to developers graphically. Most existing visualization methods appear as accessory tools of clone detection technology, including coverage tree map, scatter plot, line chart, bar chart and so on.

A coverage tree map can show the distribution of clone code in the source code. Clone Detective (https://github.com/terrajobst/clonedetective-vs) in Visual Studio uses this technique to visualize clone information. In a coverage tree map, each file is drawn as a rectangular block. The size of the block indicates the number of lines of code in the file, the positional relationship between blocks indicates the inclusion relationship between code units, and the color of the block indicates the clone coverage of the file. Moving the cursor over a block displays more detailed code information. However, a coverage tree map does not reflect the relationships between clones.

A scatter plot can show the relationships between clones. CCFinder, proposed by Kamiya et al., adopts the scatter plot. In a scatter plot, both the horizontal and vertical axes list code elements, and the clone relationship between code fragments is represented by points or lines. But when the system is large, the plot becomes very dense, and little can be read from it except which modules have a large clone proportion.

A line chart can only show the information of a single clone fragment. Its horizontal and vertical coordinates are, respectively, the line numbers where the code is cloned and the location of the file containing the fragment. A bar chart can show the distribution of clone groups across files, with each column representing a file and each clone group represented by a color. Clone Visualizer[22], proposed by Zhang et al., uses line charts and bar charts to present clone code. However, these two methods are only suitable for displaying small amounts of clone data.

III. IMAGE-BASED CLONE CODE DETECTION AND VISUALIZATION

This paper proposes an image-similarity-based code clone detection and visualization method (ICCV). Its flow chart is shown in Figure 1. It consists of five core steps: (1) remove non-functional content such as comments and whitespace from the source code, and extract the function-level code fragments; (2) add highlighting to each preprocessed code fragment and convert it to an image; (3) crop, pad and resize the image; (4) compare the standardized code images using the Jaccard distance and a perceptual hash algorithm to obtain clone information and return the result; (5) visualize the clone data based on FA2.

A. Pre-processing

Source code preprocessing is the groundwork for image-based clone code detection and visualization. Taking Python as the research object, the preprocessing steps mainly include removing non-functional content such as single-line comments, multi-line comments and whitespace from the source code, identifying and marking code fragments at function granularity, and highlighting the code according to the weights assigned to keywords, data types, function names, identifiers, and numbers or strings.

Image-based clone code detection and visualization computes similarity from how well the pixel distributions of the source code images match. If two code fragments are clones, most of their code pixels align, and if they are not, most do not. Code preprocessing therefore directly affects the subsequent steps and the detection results.
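Step (4) of the flow compares standardized code images using the Jaccard distance and a perceptual hash. The text here does not say which hash variant is used or how the two measures are combined, so the following is a minimal sketch under stated assumptions: images are already the same size and given as 2D lists of grayscale values, the "average hash" variant stands in for the perceptual hash, and the Jaccard distance is taken over the sets of non-background pixel positions. All function names and the `background` parameter are hypothetical.

```python
def average_hash(pixels):
    """Average hash (one simple perceptual hash): 1 where a pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_similarity(hash_a, hash_b):
    """Fraction of hash bits that agree (1.0 means identical hashes)."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    same = sum(a == b for a, b in zip(hash_a, hash_b))
    return same / len(hash_a)


def foreground(pixels, background=255):
    """Set of (row, col) positions whose value differs from the background."""
    return {
        (r, c)
        for r, row in enumerate(pixels)
        for c, p in enumerate(row)
        if p != background
    }


def jaccard_distance(pixels_a, pixels_b, background=255):
    """1 - |A ∩ B| / |A ∪ B| over foreground pixel positions."""
    fg_a = foreground(pixels_a, background)
    fg_b = foreground(pixels_b, background)
    union = fg_a | fg_b
    if not union:
        return 0.0  # two blank images are treated as identical
    return 1 - len(fg_a & fg_b) / len(union)


if __name__ == "__main__":
    # Tiny stand-ins for code images: 0 = ink, 255 = background.
    img_a = [[255, 0, 0, 255],
             [255, 0, 0, 255]]
    img_b = [[255, 0, 0, 255],
             [255, 0, 255, 255]]
    print(jaccard_distance(img_a, img_b))
    print(hamming_similarity(average_hash(img_a), average_hash(img_b)))
```

A low Jaccard distance together with a high hash similarity would suggest a clone pair. The thresholds for that decision are not given in this section and are left open here.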