International Core Journal of Engineering 2020-26 | Page 190

2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM)

Image-based Clone Code Detection and Visualization

Yafang Wang a and Dongsheng Liu b
College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
a [email protected]
b Corresponding author: [email protected]

978-1-7281-4691-1/19/$31.00 ©2019 IEEE
DOI 10.1109/AIAM48774.2019.00041

Abstract—Research on clone code detection currently focuses on four perspectives: text, vocabulary, grammar, and semantics. However, few breakthroughs in detection effectiveness have been made for a long time. To address this, a new image-based clone code detection and visualization method (ICCV), inspired by image processing, is proposed. First, the source code is preprocessed by removing comments, whitespace, etc., which yields a "clean" function fragment in which identifiers, keywords, etc. can also be highlighted. Then the processed source code is converted into images, and these images are normalized. Finally, Jaccard distance and a perceptual hash algorithm are used to detect the clone code, and the clone code information is visualized. To evaluate the method, six open source software systems were used to form the evaluation data set. The experimental results show that ICCV detects 100% of Type-1, 88% of Type-2, and 60% of Type-3 clone code, which demonstrates the effectiveness of ICCV for clone code detection.

Keywords—clone code; clone detection; visualization; Jaccard distance; perceptual hash.

I. INTRODUCTION

Copying code snippets and then reusing them by pasting and adjusting (for example, adding, deleting, or modifying statements) is a common practice in software development. This reuse mechanism produces a large number of identical or similar code fragments in the code repository, namely clone code. In addition, using a particular development framework or reusing design patterns can also result in clone code.

Clone code is an important research topic in software engineering, relevant to many areas such as software maintenance, software evolution, software quality, software reuse, software authorization, and anti-plagiarism. Empirical studies have shown that 20% to 30% of a software system may be clone code[1][2][3][4], sometimes even up to 50%[5]. A large amount of clone code has a negative impact on a software system[6]; for example, repeated use of code snippets containing unknown errors may propagate those errors and reduce the reliability of the software system. Therefore, detecting, identifying, and presenting the clone code in a software system helps developers identify and better understand it, thus reducing extra labor cost and software maintenance cost.

To solve this problem, many good code clone detection methods have been proposed in academia. However, these methods need to convert the source code into tokens, an AST (Abstract Syntax Tree), or a PDG (Program Dependence Graph) according to their own rules, which brings drawbacks such as strict rule definitions and complex intermediate processes, and there is no uniform standard for source code standardization. Currently, the commonly used visualization techniques for clone detection data include scatter plots, coverage tree maps, bar charts, etc. These methods are only suitable for presenting clone data from small-scale systems; they cannot reflect the complexity of clone code in software systems with large data volumes, and they cannot realize the full value of the detected data.

For this purpose, this paper proposes Code Clone detection based on Image Similarity and Visualization (ICCV). During development, the first impression programmers get when looking at source code in an IDE is a visual image of the code. To manually look for clones in a software project, one roughly scans the code, looking for similar shape and layout between two source code fragments. Once a similar piece of code is found visually, one can perform a more fine-grained check of the source code text, its syntax, or its behaviour. This paper follows that intuition and uses visual similarity between source code images to find real clones. Clone code is determined by representing source code fragments as images and detecting similarity between the images, and the ForceAtlas2 algorithm is used to visualize the code clone data. The proposed method thus converts the code into images for detection. Compared with traditional detection techniques that rely on strictly defined transformation rules, this method is more flexible and robust, and it provides a new perspective on source code representation for code clone detection research.

II. RELATED WORK

Scholars have been studying clone code for more than 20 years[7]. The classification standard most widely recognized by researchers at present is the four-category standard proposed by Bellon et al., which is based on the degree of cloning[8]. Type-1 refers to code pairs that are exactly the same once spaces and comments are removed; Type-2 refers to code pairs that are identical except for type names, identifiers, and constants; Type-3 refers to code pairs in which some statements are added, deleted, or modified, or in which identifiers and types are replaced, but whose syntactic structure is basically the same; Type-4 refers to code pairs that are semantically similar but have different syntactic structures.

A. Clone Code Detection Method

At present, there are many excellent code cloning