International Core Journal of Engineering 2020-26 | Page 191
information. In the coverage tree map, each file is drawn as
a rectangular block. The size of the block, the positional
relationship between blocks, and the color of the block
indicate, respectively, the number of lines of code in the
file, the inclusion relationship between code units, and the
clone coverage of the file. Moving the cursor over a block
displays more detailed code information. However, the
coverage tree map does not reflect the relationships between
clone code.
detection methods in academia [4][9][10][13][16][17][19][20],
which are mainly text-based, lexical-based, grammar-based and
semantics-based detection methods.
Text-based clone code detection methods mainly include
NiCad[9], SDD[10], Dup[11], Duploc[12] and so on. NiCad
first preprocesses the source code, then transforms it
according to specific rules, and finally uses dynamic
matching to find the longest common subsequence; it can
detect type-1 to type-3 clone code. SDD builds an inverted
index directly on the source code and uses an
n-nearest-neighbor algorithm for detection, which can also
detect type-1 to type-3 clone code. However, since source
code carries meaning of its own, treating it as plain text
loses a lot of information.
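The longest-common-subsequence comparison mentioned above can be illustrated with a minimal sketch. This is not NiCad's implementation: the helper names (`lcs_length`, `normalize`, `line_similarity`) and the similarity formula (twice the LCS length divided by the total number of lines) are assumptions chosen only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalize(source):
    """Strip surrounding whitespace from each line and drop blank lines."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def line_similarity(src_a, src_b):
    """Similarity in [0, 1] based on the LCS of normalized lines (assumed formula)."""
    a, b = normalize(src_a), normalize(src_b)
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

Two fragments that differ by one inserted statement still score high under this measure, which is how a line-level LCS can tolerate the small edits typical of type-3 clones.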
A scatter plot can show the relationships between clone
code. CCFinder, proposed by Kamiya et al., adopts the
scatter plot. In a scatter plot, both the horizontal and
vertical axes list code elements, and the clone relationship
between code fragments is represented by points or lines.
But when the system is large, the plot becomes very dense,
and little can be read from it beyond which modules have a
large clone proportion.
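As a small illustration of the scatter plot idea, the sketch below builds a dot matrix over two lists of code lines and renders it as text. The function names `dot_plot` and `render` are hypothetical, and exact line equality is an assumed stand-in for whatever matching a real tool performs.

```python
def dot_plot(lines_a, lines_b):
    """Boolean matrix: cell (i, j) is True when line i of A equals line j of B."""
    return [[a == b for b in lines_b] for a in lines_a]


def render(matrix):
    """Render the matrix as text, '*' for a match and '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )
```

A run of matches along a diagonal corresponds to a cloned fragment; with thousands of lines on each axis such diagonals become hard to pick out, which is the density problem described above.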
Lexical-based clone code detection techniques mainly
include CCFinder[4], CP-Miner[13], CCLearner[14],
Boreas[15], etc. CCFinder converts the source code into
regularized sequences and parameterized symbols according
to certain rules, and then uses a suffix tree matching
algorithm for detection. CP-Miner uses frequent subsequence
mining to detect clones in large-scale systems. Both tools
can detect type-1 and type-2 clone code. The lexical
approach to representing code ignores the syntactic and
semantic information of the source code.
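The parameterization step used by lexical detectors can be sketched with Python's standard `tokenize` module. This is only an illustration, not CCFinder's transformation rules: the placeholder symbols `ID` and `LIT` and the function name `normalized_tokens` are assumptions, and the suffix tree matching stage is not shown.

```python
import io
import keyword
import tokenize


def normalized_tokens(source):
    """Replace identifiers with ID and literals with LIT; keep keywords and operators."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue  # comments and layout tokens carry no lexical content here
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out
```

Two functions that differ only in identifier names and comments produce the same token sequence, so a type-2 clone reduces to an exact sequence match. Because layout tokens are discarded in this sketch, block structure is lost, which reflects the limitation noted above.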
A line chart can only reflect the information of a single
clone fragment. Its horizontal and vertical coordinates are,
respectively, the number of lines where the code is cloned
and the location of the file containing the fragment. A bar
chart can show the distribution of clone groups across
files, with each column representing a file and each clone
group represented by a color. Clone Visualizer[22], proposed
by Zhang et al., uses line charts and bar charts to present
clone code. However, these two methods are only suitable for
displaying small amounts of clone data.
Grammar-based clone code detection methods mainly
include Deckard[16], CloneDR[17], CDLH[18] and so on.
Deckard and CloneDR parse the source code into an abstract
syntax tree (AST), and then use tree matching and a
locality-sensitive hashing algorithm to compare similarity.
Deckard can detect type-1 to type-3 clone code, and CloneDR
can detect type-1 and type-2 clone fragments. Due to the
complexity of tree matching algorithms, the computational
overhead of these methods is relatively large.
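A rough sketch of the tree-based idea, using Python's standard `ast` module: each fragment is summarized by counts of its AST node types, and two fragments are compared by the distance between these count vectors. The names `characteristic_vector` and `vector_distance` and the use of an L1 distance are assumptions for illustration; the hashing stage of the actual tools is not reproduced here.

```python
import ast
from collections import Counter


def characteristic_vector(source):
    """Count how often each AST node type occurs in the parsed source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def vector_distance(u, v):
    """L1 distance between two node-type count vectors."""
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))
```

Renaming identifiers leaves the vector unchanged, while adding or removing statements changes a few counts, so a small distance suggests a near-miss clone. Counting node types avoids full tree matching, at the price of ignoring how the nodes are arranged.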
III. IMAGE-BASED CLONE CODE DETECTION AND VISUALIZATION
This paper proposes an image-similarity-based code clone
detection and visualization method (ICCV). Its flow chart is
shown in Figure 1, and it consists of five core steps:
(1) remove non-functional content such as comments and
whitespace from the source code, and extract function-level
code fragments; (2) apply highlighting to each preprocessed
code fragment and convert it to an image; (3) crop, pad and
resize the image; (4) compare the standardized code images
using the Jaccard distance and a perceptual hash algorithm
to obtain clone code information and return the result;
(5) visualize the clone code data based on FA2.
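Step (4) can be illustrated with a minimal sketch that operates on small grayscale pixel grids (lists of rows with values 0 to 255). Two assumptions should be stated plainly: the images are taken to be already standardized to the same size, and a simple average hash is swapped in for the perceptual hash, since the exact hash variant is not specified at this point in the text. The function names and the ink threshold of 128 are hypothetical.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming_distance(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))


def ink_pixels(pixels, threshold=128):
    """Coordinates of pixels darker than the threshold (assumed to be text)."""
    return {(r, c) for r, row in enumerate(pixels)
            for c, p in enumerate(row) if p < threshold}


def jaccard_distance(img_a, img_b, threshold=128):
    """1 - |A & B| / |A | B| over the sets of ink pixel coordinates."""
    a, b = ink_pixels(img_a, threshold), ink_pixels(img_b, threshold)
    union = a | b
    if not union:
        return 0.0  # two blank images are treated as identical
    return 1 - len(a & b) / len(union)
```

Under this sketch, a pair of fragments would be reported as a clone candidate when both distances fall below chosen thresholds; the threshold values themselves would have to be set empirically.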
Semantic-based clone code detection techniques mainly
include Duplix[19], ConQAT[20], and clone detection using
slices of isomorphic program dependency graphs[21]. Duplix
generates a program dependency graph for the source code and
uses the isomorphic program dependency graph subgraph
matching method of post-order section for detection, so that
type-1 to type-4 clones can be obtained. ConQAT symbolizes
the source code and uses a detection algorithm based on a
suffix tree and an index to detect type-1 and type-2 clone
code. The algorithms used by graph-based detection
techniques have high time and space complexity and high
computational overhead.
A. Pre-processing
Source code preprocessing is the foundation of image-based
clone code detection and visualization. Taking the Python
language as the research object, the preprocessing steps
mainly include removing non-functional content such as
single-line comments, multi-line comments and white space
from the source code, identifying and marking code fragments
at function granularity, and highlighting the code according
to the weights assigned to keywords, data types, function
names, identifiers, and numbers or strings in the code.
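The first two preprocessing steps can be sketched with Python's standard `ast` module (Python 3.9 or later, for `ast.unparse`). This is an assumed approach, not the paper's implementation: parsing and unparsing discards comments and blank lines, and docstrings are treated here as the multi-line comments to remove. The function name `extract_functions` is hypothetical, and the highlighting step is not covered.

```python
import ast


def extract_functions(source):
    """Return (name, normalized_source) for every function in the source."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    for func in funcs:
        first = func.body[0]
        # A leading string expression is a docstring; drop it.
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            func.body = func.body[1:] or [ast.Pass()]
    # Unparsing regenerates the code without comments or blank lines.
    return [(func.name, ast.unparse(func)) for func in funcs]
```

Because the fragments are regenerated from the syntax tree, spacing and line breaks are also normalized, which helps later image comparison see the same layout for equivalent code.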
B. Clone data visualization
The large amount of detected clone code data makes it
difficult for developers to understand the complex clone
information in the source code. It is therefore necessary to
visualize the clone code and present it to developers
graphically. Most existing visualization methods appear as
accessory tools of clone code detection techniques,
including the coverage tree map, scatter plot, line chart,
bar chart and so on.
Image-based clone code detection and visualization
calculates similarity by matching the pixel distributions of
the source code images. If two code fragments are clones,
most of their code pixels are aligned; if they are not
clones, most pixels are not aligned. Therefore, code
preprocessing directly affects the subsequent steps and the
detection results.
A coverage tree map can be used to show the distribution
of clone code in the source code. Clone Detective¹ in
Visual Studio uses this technique to visualize clone
¹ https://github.com/terrajobst/clonedetective-vs