International Core Journal of Engineering 2020-26 | Page 192

Figure 1. Overall research route. [Pipeline stages: 1. Preprocessing, 2. Image Conversion, 3. Image Normalization, 4. Similarity Measurement, 5. Data visualization. Intermediate artifacts: CODE, PY, HTML, PNG, XML.]

The functions extracted in this paper include the loop nesting depth, not just lexical-level information [23]. The main steps of the function extraction algorithm are as follows:

Algorithm 1 Find Function Algorithm.
function FINDFUNCTION(pyFile)
    F = [], subF = [], Fflag = False, subFflag = False
    for each line ∈ all lines do
        if "def" in each line and not Fflag then
            F.append(each line[pos:])
        end if
        if Fflag then
            if "def" in each line and not subFflag then
                subF.append(each line[pos:])
            end if
            if subFflag then
                subF.append(each line)
            end if
        else
            F.append(each line)
        end if
    end for
    return F, subF
end function

B. Image Conversion
The preprocessed source code is converted into images, and clone detection is then carried out on the calculated similarity between images to determine the code clone pairs. Using the text processing tool Highlight, the function code extracted during preprocessing is highlighted in batches and converted into HTML files, so that code with high weight contributes to image similarity detection, which reduces the proportion of unimportant code such as numbers and strings in the results. This paper uses Python imgkit to convert the HTML file of each function into a Portable Network Graphics (PNG) image. Each image is read into a two-dimensional matrix of size m*n. Each value in the matrix ranges from 0 to 255, representing the 8-bit grayscale version of the original image. The grayscale value is obtained by averaging the red, green and blue (RGB) channel values of each pixel.

C. Image Normalization
The images of function code fragments cannot be used directly for similarity detection, because directly converted images differ in pixel dimensions and proportions and therefore do not meet the detection conditions of the Jaccard distance and perceptual hash algorithms. To solve this problem, the images converted from source code need to be normalized. In this paper, the image background, that is, every non-source pixel, is set to black with a gray value of 0, while the value of a source code pixel is calculated by averaging the RGB values of that pixel. The number of source code pixels in an image can therefore be calculated from the number of non-zero elements in the matrix.

When image similarity detection is performed, the images to be compared must have the same size and cannot be scaled at will, so as to avoid the dislocation of pixels and the loss of clone pairs caused by different font sizes of the code in the images. Based on this, all images need to be normalized. For images of different lengths, the shorter image is padded with black until it is as long as the longer image. Since the generated images all have the same width in pixels, non-code pixels must be cropped substantially in width so that a large proportion of non-code pixels does not produce excessive similarity in the pHash detection process, while the widths of all images are kept the same. The main steps of the image normalization algorithm are as follows:

Algorithm 2 Normalize Image Size Algorithm.
function UNIFORMIMAGESIZE(pngList)
    W = [], maxH = 0, maxW = 0
    for each png ∈ pngList do
        if maxH < png.height then
            maxH ← png.height
        end if
        img ← png.convert(grayscale)
        img ← img.array
        for i in range(img.height) do
            for j in range(decrease(img.width)) do
                if img[i][j] ≠ 0 then
                    W.append(j + 1)
                end if
            end for
        end for
        width ← max(W)
        if maxW < width then
            maxW ← width
        end if
    end for
    return maxH, maxW
end function

D. Similarity Measurement
There are two algorithms for calculating image similarity: Jaccard distance and perceptual hash (pHash). Detecting clone fragments with Jaccard distance is relatively fast and can process a large amount of data, but the effect is not ideal: the method can only detect simpler clone code and cannot accurately identify more complicated cases. Therefore, the pHash algorithm is used for further detection. The pHash algorithm preserves the low-frequency part of the image to the greatest extent by means of the discrete cosine transform. As long as
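As an illustration of the normalization step and the two similarity measures named in this section, the following is a minimal sketch rather than the implementation used in this paper. It assumes each image is already a grayscale NumPy matrix with a black (0) background. The function names (`code_width`, `normalize`, `jaccard_similarity`, `phash`, `hamming`), the 32x32 shrink size, the 8x8 low-frequency block, the median threshold and the nearest-neighbour shrink are all assumptions made for the example. The Jaccard measure here is taken over the sets of non-zero (code) pixel positions, which is one possible reading of the text.

```python
import numpy as np


def code_width(img):
    """1-based index of the rightmost column containing a non-zero (code) pixel."""
    cols = np.nonzero(img.any(axis=0))[0]
    return int(cols[-1]) + 1 if cols.size else 0


def normalize(images):
    """Crop all images to the widest code column and pad heights with black (0)."""
    max_h = max(img.shape[0] for img in images)
    max_w = max(code_width(img) for img in images)
    out = []
    for img in images:
        canvas = np.zeros((max_h, max_w), dtype=img.dtype)  # black background
        h, w = img.shape[0], min(img.shape[1], max_w)
        canvas[:h, :w] = img[:, :w]
        out.append(canvas)
    return out


def jaccard_similarity(a, b):
    """Shared code pixels divided by the union of code pixels (assumed definition)."""
    code_a, code_b = a > 0, b > 0
    union = np.logical_or(code_a, code_b).sum()
    if union == 0:
        return 1.0  # two empty images are treated as identical
    return np.logical_and(code_a, code_b).sum() / union


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m


def phash(img, size=32, keep=8):
    """Perceptual hash: shrink, 2-D DCT, threshold the low-frequency block."""
    h, w = img.shape
    # Nearest-neighbour shrink; a stand-in for a proper image resize.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    small = img[np.ix_(rows, cols)].astype(float)
    d = dct_matrix(size)
    freq = d @ small @ d.T          # 2-D DCT of the shrunken image
    low = freq[:keep, :keep]        # low-frequency coefficients
    return (low > np.median(low)).flatten()


def hamming(h1, h2):
    """Number of differing hash bits; smaller means more similar."""
    return int(np.count_nonzero(h1 != h2))
```

In such a sketch, the cheap Jaccard score could serve as a first filter over many image pairs, with the Hamming distance between pHash values applied only to the remaining candidates. The thresholds for both steps would have to be chosen empirically.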