International Core Journal of Engineering 2020-26 | Page 193

of the i-th row and the j-th column are denoted as dij. The difference between the two images is represented by the number of non-zero elements in matrix D, called diff. As shown in figure 2(a) and figure 2(b), given a pair of source code images bubble_sort1 and bubble_sort2, their differences is shown in figure 2(c), and the diff value is the number of non-zero pixels in the red-boxed areas of the image. the overall structure of the image remains unchanged, its corresponding hash value is basically unchanged, which can better identify relatively complex clones situations. 1) Jaccard Distance Jaccard distance[24][25][26][27] is an indicator used to measure the difference between two sets. It is a complement of Jaccard similarity coefficient, defined as 1 minus Jaccard similarity coefficient. In order to calculate Jaccard distance, the difference between the two code images based on zero- norm is given as follows. Two equal-sized images are given, respectively represented by two-dimensional matrices A and B of size m×n. Matrix D is a element-wise difference matrix derived from matrices A and B, where the elements Fig.2(a)bubble_sort1 D A  B (1) diff Fig.2(b)bubble_sort2 Fig.2(c)diff ¦ d ij ( d ij  D | d ij z 0) (2) Fig.2(d)sum Figure 2. Code segment the matrix of the frequency coefficient. Select the upper left area elements to calculate the image hash value. By comparing the distance between the hash values of two images, we can judge whether the images are similar or not. The main steps of pHash algorithm are as follows: Algorithm 3 pHash Algorithm. In this paper, the matrix obtained by adding matrix A and B is denoted as matrix S. Sij represents the elements of the i-th row and the j-th column in matrix S. The area of source code text contained in these two images is represented by the number of non-zero elements in the matrix S, and is denoted as sum. In order to limit the distance between (0,1), this paper uses sum to normalize the distance. As shown in figure 2(d), the value of sum is the number of non-zero pixels in the yellow-boxed areas of the figure. S sum ¦ s ( s ij ij function pHash(picture) H = [], picture.size. reduce(n×n), picture ĕ picture.convert(grayscale) Matrix_(nhn) ĕ DCT(G_picture), matrix_(khk) ĕTop_left(Matrix_(nhn)) avg ĕ average(Matrix_(nhn)) for i in range(n) do for j in range(n) do if matrix[i][j] ı avg then H.append(1) else H.append(0) end if end for end for return H end function A  B (3)  S | s ij z 0) (4) Since the background occupies a large proportion in the image, and the backgrounds always match each other, in order to prevent the influence caused by this situation, the distance value far smaller than the actual situation is generated, this paper selects the sum of the number of pixels in the code area, not the total number of pixels in the image. Then the normalized distance is calculated according to the diff value and sum value, and the similarity is complementary to the distance. The normalized distance between matrices A and B is denoted as distance, and the similarity between the two images represented by matrices A and B is denoted as similarity. The maximum pHash distance is used to normalize the calculated distance, and the similarity and distance complement each other. The pHash distance between matrix A and B is expressed by pHash(A, B). The maximum pHash distance between any two matrices is denoted as pHashmax, and the similarity between the two images represented by matrix A and B is expressed by variable similarity. distance diff sum (5) similarity 1  distance (6) similarity 1  pHash ( A , B ) pHash max 2) Perceptual hash In this paper, the perceptual hash algorithm [28][29] is used for clone detection of code image. PHash algorithm uses Discrete Cosine Transform (DCT) to convert the image in the pixel domain into the frequency domain, and obtain (7) E. Data visualization This paper selects the ForceAtlas2 (FA2) [30] layout 171