International Core Journal of Engineering 2020-26 | Page 192
&2'(
&2'(
PY &2'(
PY
PY
1.Pre
processing
+70/!
+70/!
+70/!
2.Image
Conversion
31*
31*
31*
3.Image
4.Similarity
M easurement
Normaliz ation
!
;0/
5.Da ta
visual ization
Figure 1. Overall research route.
The functions extracted in this paper include the loop
nesting depth, not just the information on the lexical level
[23]. The main steps of the function extraction algorithm
are as follows:
source pixel is set to black, and its gray value is 0, while the
pixel of source code is calculated by averaging the RGB
value of the pixel point, which means that the pixel number
of source code contained in the image can be calculated
according to the number of non-zero elements in the matrix.
Algorithm 1 Find Function Algorithm.
When the image similarity detection is performed, the
original images to be detected have the same size with each
other and cannot be scaled at will, so as to avoid the
dislocation of pixel points in the image and the loss of clone
pairs due to the different font sizes of the code in the image.
Based on this, all images need to be normalized. For images
of different lengths, use black to complement a relatively
short image to be equal to the length of the longer image.
Since the generated image width is uniform pixels, in order
to reduce the problem of excessive similarity due to
excessive proportion of non-code pixels in the pHash
detection process, it is necessary to greatly cut non-code
pixels in width, and also ensure that the widths between the
images are the same. The main steps of the image
normalization algorithm are as follows:
function FINDFUNCTION(pyFile)
F = [], subF = [], Fflag = False, subFflag = False
for each line ę all lines do
if "def" in each line and not Fflag then
F.append(each line[pos:])
end if
if Fflag then
if "def" in each line and not subFflag
then
subF.append(each line[pos:])
end if
if subFflag then
subF.append(each line)
end if
else
F.append(each line)
end if
end for
return F, subF
end function
Algorithm 2 Normalize Image Size Algorithm.
function UNIFORMIMAGESIZE(pngList)
W = []
for each png ę pngList do
if maxH < png.heingt then
maxH ĕ png.height
end if
img ĕ png.convert(grayscale)
img ĕ img.array
for i in range(img.height) do
for j in range(decrease(img.width)) do
if img[i][j] Į 0
W.append(j + 1)
end if
end for
end for
weidth ĕ max(W)
if maxW < weidth then
maxW ĕ weidth
end if
return maxH, maxW
end function
B. Image Conversion
The preprocessed source code needs to be converted
into images, and then according to the calculated similarity
between images, clone code detection is carried out to
determine the code clone pair. In this paper, with the help of
text processing tool Highlight, function codes extracted by
preprocessing are added in batches to be highlighted and
converted into HTML files, so that the codes with high
weight play a role in image similarity detection, thereby
reducing the proportion of unimportant codes such as
numbers and strings in the results.
This paper uses Python imgkit to convert the HTML
files for each function in the file to Portable Network
Graphics (PNG). The image is read into storage in a two-
dimensional matrix of size m*n. Each value in the matrix
ranges from 0 to 255, representing the 8-bit grayscale image
of the original image. The grayscale value is obtained by
averaging the red, green and blue (RGB) values in the color
channel of pixel points.
D. Similarity Measurement
There are two algorithms of calculating image similarity:
Jaccard Distance and perceptual hash (pHash).
C. Image Normalization
The graphical function code fragments cannot be
directly used for similarity detection, which is due to the
difference in pixels and proportions between directly
converted images, which does not meet the detection
conditions of Jaccard distance and perceptual hash
algorithm. In order to solve this problem, the source code
converted image needs to be normalized.
Using Jaccard distance detection code to clone
fragments is relatively fast and can process a large amount
of data, but the effect is not ideal. This method can only
detect simpler clone code, which cannot be accurately
identified for more complicated situations. Therefore,
pHash algorithm is used for further detection. the pHash
algorithm preserves the low-frequency part of the image to
the greatest extent by discrete cosine transform. As long as
In this paper, the image background, that is, the non-
170