International Core Journal of Engineering 2020-26 | Page 190
2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM)
Image-based Clone Code Detection and Visualization
Yafang Wang a and Dongsheng Liu b
College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China.
a [email protected]
b Corresponding author: [email protected]
rules, which leads to problems such as strict rule definitions,
complex intermediate processes, and the lack of a uniform
standard for source code normalization. Currently, the
commonly used techniques for visualizing clone detection
data include scatter plots, coverage tree maps, and bar
charts. However, these methods are only suitable for
presenting clone data from small-scale systems: they cannot
reflect the complexity of clone code in software systems
with large data volumes, and they do not exploit the full
value of the detected data. For this purpose, this paper
proposes a Code Clone detection method based on Image
Similarity and Visualization (ICCV). During development,
the first impression a programmer gets when looking at
source code in an IDE is a visual image of the code. To look
for clones manually in a software project, one roughly scans
the code for source code fragments with a similar shape and
layout. Once a similar piece of code is found visually, one
can perform a more fine-grained check of the source code
text, its syntax, or its behaviour. This paper follows that
intuition and uses the visual similarity between source code
images to look for real clones.
Clone code is identified by representing source code
fragments as images and detecting the similarity between
those images, and the ForceAtlas2 algorithm is used to
visualize the code clone data. Because the proposed method
converts the code into images for detection, it is more
flexible and robust than traditional detection techniques
that rely on strictly defined transformation rules, and it
provides a new perspective on source code representation
for code clone detection research.
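The idea of detecting clones through image similarity can be illustrated with a small Python sketch that uses Jaccard distance and a perceptual hash, the two measures this paper names. It is not the ICCV implementation. The `render` function is a hypothetical stand-in that maps each non-blank character to one pixel instead of rendering a real image, `average_hash` is only one simple member of the perceptual hash family, and the grid and block sizes are arbitrary assumptions.

```python
from typing import List

Grid = List[List[int]]


def render(code: str, width: int = 32, height: int = 8) -> Grid:
    """Toy stand-in for image rendering (assumption, not the paper's method):
    one pixel per character, 1 for non-blank, 0 otherwise, fixed size."""
    lines = code.splitlines()
    grid = []
    for r in range(height):
        line = lines[r] if r < len(lines) else ""
        grid.append([
            1 if c < len(line) and not line[c].isspace() else 0
            for c in range(width)
        ])
    return grid


def jaccard_distance(a: Grid, b: Grid) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of 'on' pixel coordinates."""
    on_a = {(r, c) for r, row in enumerate(a) for c, v in enumerate(row) if v}
    on_b = {(r, c) for r, row in enumerate(b) for c, v in enumerate(row) if v}
    union = on_a | on_b
    if not union:
        return 0.0  # two empty images are treated as identical
    return 1.0 - len(on_a & on_b) / len(union)


def average_hash(grid: Grid, block: int = 4) -> List[int]:
    """Average hash: downsample by block means, then threshold each
    cell against the mean of all cells."""
    height, width = len(grid), len(grid[0])
    cells = []
    for r0 in range(0, height, block):
        for c0 in range(0, width, block):
            vals = [
                grid[r][c]
                for r in range(r0, min(r0 + block, height))
                for c in range(c0, min(c0 + block, width))
            ]
            cells.append(sum(vals) / len(vals))
    mean = sum(cells) / len(cells)
    return [1 if v > mean else 0 for v in cells]


def hamming(h1: List[int], h2: List[int]) -> int:
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(h1, h2))


if __name__ == "__main__":
    f1 = "int add(int a, int b) {\n    return a + b;\n}"
    f2 = "int sum(int x, int y) {\n    return x + y;\n}"
    print(jaccard_distance(render(f1), render(f2)))
    print(hamming(average_hash(render(f1)), average_hash(render(f2))))
```

In this toy rendering, two fragments whose identifiers differ but keep the same lengths and layout produce identical pixel grids, so their Jaccard distance is 0. A real rendering with glyph shapes and syntax highlighting would separate such fragments more finely.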
Abstract—Current research on clone code detection mainly
works from four perspectives: text, vocabulary, grammar and
semantics. However, detection effectiveness has seen few
breakthroughs for a long time. For this purpose, a new
image-based clone code detection and visualization method
(ICCV) is proposed, inspired by image processing. First, the
source code is preprocessed by removing comments,
whitespace, etc. to obtain a "clean" function fragment, in
which the identifiers, keywords, etc. can also be highlighted.
Then, the processed source code is converted into images and
these images are normalized. Finally, Jaccard distance and a
perceptual hash algorithm are used to detect and visualize the
clone code information. To validate the method, six
open-source software systems were used to build the
evaluation data set. The experimental results show that ICCV
can detect 100% of type-1 clone code, 88% of type-2 clone
code and 60% of type-3 clone code, which indicates that ICCV
is effective for clone code detection.
Keywords—clone code; clone detection; visualization;
Jaccard distance; perceptual hash.
I. INTRODUCTION
Copying code snippets and then reusing them by pasting and
adjusting (such as adding, deleting, or modifying
statements) is a common practice in software development.
This reuse mechanism produces a large number of identical
or similar code fragments in the code repository, known as
clone code. In addition, using a particular development
framework or reusing design patterns can also result in
clone code. Clone code is an important research topic in
software engineering, and it is relevant to many areas such
as software maintenance, software evolution, software
quality, software reuse, software authorization, and
anti-plagiarism. Empirical studies have shown that 20% to
30% of a software system may be clone code[1][2][3][4], and
sometimes as much as 50%[5]. A large amount of clone code
has a negative impact on the software system[6]; for
example, repeated use of code snippets containing unknown
errors may propagate those errors and reduce the reliability
of the software system. Therefore, detecting, identifying
and presenting the clone code in a software system helps
developers to find and better understand it, thus reducing
the extra labor cost and the maintenance cost of the
software. To solve this problem, many good code clone
detection methods have been proposed in academia. However,
these methods need to convert the source code into tokens,
an AST (Abstract Syntax Tree) or a PDG (Program Dependence
Graph) according to their own
978-1-7281-4691-1/19/$31.00 ©2019 IEEE
DOI 10.1109/AIAM48774.2019.00041
II. RELATED WORK
Researchers have been studying clone code for more than 20
years[7]. The classification most widely accepted in this
field at present is the four-category standard proposed by
Bellon et al., which is based on the degree of similarity
between the cloned fragments[8]. Type-1 refers to code pairs
that are exactly the same once whitespace and comments are
removed; Type-2 refers to code pairs that are identical
except for type names, identifiers, and constants; Type-3
refers to code pairs in which some statements are added,
deleted, or modified, or in which identifiers and types are
replaced, but whose syntactic structure is basically the
same; Type-4 refers to code pairs that are semantically
similar but have different syntactic structures.
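The Type-1 and Type-2 definitions above can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions, not a method from the literature cited here: the tokenizer is a crude regular expression for C-like code, the `TYPE_NAMES` set is a hypothetical and incomplete list, comments inside string literals are not handled, and Type-3 and Type-4 are outside its scope.

```python
import re
from typing import List

# Hypothetical, incomplete list of type names for C-like code (assumption).
TYPE_NAMES = {"int", "float", "double", "char", "long", "short", "void"}
# Other keywords are kept as-is because they carry syntactic structure.
KEYWORDS = {"return", "if", "else", "for", "while", "do", "break", "continue"}

IDENT = re.compile(r"[A-Za-z_]\w*")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def strip_comments(code: str) -> str:
    """Remove /* ... */ and // comments (crude: ignores string literals)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    return re.sub(r"//[^\n]*", " ", code)


def tokens(code: str) -> List[str]:
    """Split into identifiers, numbers and single symbols; whitespace is dropped."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", strip_comments(code))


def abstract(tok: str) -> str:
    """Replace type names, identifiers and numeric constants by placeholders."""
    if tok in TYPE_NAMES:
        return "TYPE"
    if tok in KEYWORDS:
        return tok
    if IDENT.fullmatch(tok):
        return "ID"
    if NUMBER.fullmatch(tok):
        return "LIT"
    return tok


def is_type1(a: str, b: str) -> bool:
    """Identical once comments and whitespace are removed."""
    return tokens(a) == tokens(b)


def is_type2(a: str, b: str) -> bool:
    """Identical except for type names, identifiers and constants."""
    return [abstract(t) for t in tokens(a)] == [abstract(t) for t in tokens(b)]


def classify(a: str, b: str) -> str:
    if is_type1(a, b):
        return "type-1"
    if is_type2(a, b):
        return "type-2"
    return "other"


if __name__ == "__main__":
    x = "int add(int a, int b) { // sum\n  return a + b;\n}"
    y = "int add(int a,int b){return a+b;} /* same */"
    z = "float plus(float p, float q) { return p + q; }"
    print(classify(x, y), classify(x, z))
```

Under these definitions every Type-1 pair also satisfies the Type-2 check, so `classify` tests the stricter category first.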
A. Clone Code Detection Method
At present, there are many excellent code cloning