International Core Journal of Engineering 2020-26 | Page 193
of the i-th row and the j-th column are denoted as dij. The
difference between the two images is represented by the
number of non-zero elements in matrix D, called diff. As
shown in figure 2(a) and figure 2(b), given a pair of source
code images bubble_sort1 and bubble_sort2, their
differences is shown in figure 2(c), and the diff value is the
number of non-zero pixels in the red-boxed areas of the
image.
the overall structure of the image remains unchanged, its
corresponding hash value is basically unchanged, which can
better identify relatively complex clones situations.
1) Jaccard Distance
Jaccard distance[24][25][26][27] is an indicator used to
measure the difference between two sets. It is a complement
of Jaccard similarity coefficient, defined as 1 minus Jaccard
similarity coefficient. In order to calculate Jaccard distance,
the difference between the two code images based on zero-
norm is given as follows. Two equal-sized images are given,
respectively represented by two-dimensional matrices A
and B of size m×n. Matrix D is a element-wise difference
matrix derived from matrices A and B, where the elements
Fig.2(a)bubble_sort1
D A B (1)
diff
Fig.2(b)bubble_sort2
Fig.2(c)diff
¦ d
ij
( d ij D | d ij z 0)
(2)
Fig.2(d)sum
Figure 2. Code segment
the matrix of the frequency coefficient. Select the upper left
area elements to calculate the image hash value. By
comparing the distance between the hash values of two
images, we can judge whether the images are similar or not.
The main steps of pHash algorithm are as follows:
Algorithm 3 pHash Algorithm.
In this paper, the matrix obtained by adding matrix A
and B is denoted as matrix S. Sij represents the elements of
the i-th row and the j-th column in matrix S. The area of
source code text contained in these two images is
represented by the number of non-zero elements in the
matrix S, and is denoted as sum. In order to limit the
distance between (0,1), this paper uses sum to normalize the
distance. As shown in figure 2(d), the value of sum is the
number of non-zero pixels in the yellow-boxed areas of the
figure.
S
sum
¦ s ( s
ij
ij
function pHash(picture)
H = [], picture.size. reduce(n×n), picture ĕ
picture.convert(grayscale)
Matrix_(nhn) ĕ DCT(G_picture), matrix_(khk)
ĕTop_left(Matrix_(nhn))
avg ĕ average(Matrix_(nhn))
for i in range(n) do
for j in range(n) do
if matrix[i][j] ı avg then
H.append(1)
else
H.append(0)
end if
end for
end for
return H
end function
A B (3)
S | s ij z 0)
(4)
Since the background occupies a large proportion in the
image, and the backgrounds always match each other, in
order to prevent the influence caused by this situation, the
distance value far smaller than the actual situation is
generated, this paper selects the sum of the number of
pixels in the code area, not the total number of pixels in the
image. Then the normalized distance is calculated according
to the diff value and sum value, and the similarity is
complementary to the distance. The normalized distance
between matrices A and B is denoted as distance, and the
similarity between the two images represented by matrices
A and B is denoted as similarity.
The maximum pHash distance is used to normalize the
calculated distance, and the similarity and distance
complement each other. The pHash distance between matrix
A and B is expressed by pHash(A, B). The maximum
pHash distance between any two matrices is denoted as
pHashmax, and the similarity between the two images
represented by matrix A and B is expressed by variable
similarity.
distance diff sum (5)
similarity 1 distance (6)
similarity 1 pHash ( A , B ) pHash max
2) Perceptual hash
In this paper, the perceptual hash algorithm [28][29] is
used for clone detection of code image. PHash algorithm
uses Discrete Cosine Transform (DCT) to convert the image
in the pixel domain into the frequency domain, and obtain
(7)
E. Data visualization
This paper selects the ForceAtlas2 (FA2) [30] layout
171