Experiments on multiple datasets show that the problem caused by larger eigenvalues is much less severe than that caused by smaller eigenvalues, so we mainly discuss how to solve the problem of one cluster being divided into multiple clusters. To this end we propose cluster similarity: when two clusters are similar according to this measure, we regard them as a whole.
The similarity of two clusters is defined as follows: for any two clusters $C_1$ and $C_2$ in the data set, there are at least two pairs of data points, with one point of each pair in each cluster, that are $NE_k$-nearest neighbors of each other. For instance, suppose there are data points $x_1, x_2$ belonging to cluster $C_1$ and $y_1, y_2$ belonging to cluster $C_2$; if the following conditions are satisfied:

$$y_1 \in NE_kNN(x_1), \quad y_1 \in RNN(x_1)$$
$$y_2 \in NE_kNN(x_2), \quad y_2 \in RNN(x_2)$$

where $NE_kNN(\cdot)$ and $RNN(\cdot)$ represent the points contained in the $NE_k$-nearest neighborhood and the reverse nearest neighborhood, respectively,
we merge the two clusters $C_1$ and $C_2$ into one cluster. To ensure that each cluster is complete and cannot be combined with any other cluster to generate a new cluster, we iterate the similarity step until the assignment of every data point in the data set no longer changes. The similarity step is summarized by the following algorithm:
Algorithm 2.3 Similarity
Input: assign
1)  Flag = 1
2)  this_cluster = 1
3)  pair_count = 0
4)  While (Flag):
5)      For k in points with assign = this_cluster:
6)          For j in points with assign ≠ this_cluster:
7)              If k ∈ NE_kNN(j) and k ∈ RNN(j):
8)                  pair_count++
9)      If (pair_count ≥ 2):
10)         assign[C_k] = assign[C_j]
11)     If notChange(assign):
12)         Flag = 0
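As a minimal sketch of how this merging step could be implemented in Python, assuming the $NE_k$-nearest-neighbor and reverse-nearest-neighbor sets have already been computed and are passed in as dictionaries (the names merge_similar_clusters, ne_knn, and rnn are our own illustration, not from the paper):

def merge_similar_clusters(assign, ne_knn, rnn):
    # assign : dict mapping point id -> cluster label
    # ne_knn : dict mapping point id -> set of its NE_k-nearest neighbors
    # rnn    : dict mapping point id -> set of its reverse nearest neighbors
    changed = True
    while changed:
        changed = False
        labels = set(assign.values())
        for c1 in labels:
            for c2 in labels:
                if c1 == c2:
                    continue
                # Count cross-cluster pairs that are mutual neighbors.
                pair_count = 0
                for k in (p for p, c in assign.items() if c == c1):
                    for j in (p for p, c in assign.items() if c == c2):
                        if k in ne_knn[j] and k in rnn[j]:
                            pair_count += 1
                if pair_count >= 2:
                    # Relabel every point of c2 as c1, merging the clusters.
                    for p, c in assign.items():
                        if c == c2:
                            assign[p] = c1
                    changed = True
    return assign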
Finally, when all points other than noise have been allocated, a similarity judgment is made for the remaining noise points. To decide a noise point x, we first traverse its $NE_k$-nearest neighbors and find the nearest point y that is not a noise point. If x and y are $NE_k$-nearest neighbors of each other, we assign the noise point x to the same cluster as y. If no such point y exists, then x remains a noise point.
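A corresponding sketch of this noise-point reassignment, under the same assumptions as above plus the additional assumption that each neighbor list in ne_knn is ordered nearest-first (again, the names are ours):

def reassign_noise(assign, ne_knn, noise_label=-1):
    # Try to attach each noise point x to the cluster of its nearest
    # non-noise NE_k-neighbor y, but only if x and y are mutual
    # NE_k-nearest neighbors; otherwise x stays a noise point.
    for x, label in list(assign.items()):
        if label != noise_label:
            continue
        for y in ne_knn[x]:  # assumed ordered nearest-first
            if assign.get(y, noise_label) != noise_label:
                if x in ne_knn[y]:  # mutual NE_k-nearest-neighbor check
                    assign[x] = assign[y]
                break  # only the nearest non-noise neighbor is considered
    return assign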
III. EXPERIMENT AND RESULT
A. Data sets and algorithms
In order to verify the effectiveness of the algorithm, we conduct experiments on four synthetic datasets with various shapes and densities, which are available from UCI.
The algorithms used for comparison are DBSCAN, K-Means, and Birch. DBSCAN is the most famous density-based clustering algorithm. It identifies the core part of a cluster by the sparseness of the samples, so clusters of arbitrary shape can be found. It requires two parameters, eps and minpts, to define the density threshold.
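As a hedged illustration of this two-parameter interface (our own sketch using scikit-learn's implementation, with X standing in for one of the loaded data sets):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)  # placeholder for a loaded data set

# eps and min_samples correspond to the eps and minpts parameters
# discussed in the text; points labeled -1 are treated as noise.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(len(set(labels) - {-1}), "clusters found")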
K-means is the most famous partition-based clustering algorithm. Because of its simplicity and efficiency, it has become the most widely used clustering algorithm. Given a set of data points and the required number of clusters K, which is specified by the user, the k-means algorithm repeatedly divides the data into K clusters according to a distance function.

The full name of BIRCH is Balanced Iterative Reducing and Clustering using Hierarchies. It needs to scan the data set only once to cluster. The number of categories is an optional parameter of Birch; if the user specifies it, the clustering result will return the corresponding number of clusters.

B. Experiment and Result
[Figure 1. Clustering results of each approach on the 'flame' set: (a) DBNNN, (b) DBSCAN, (c) K-Means, (d) Birch.]

Figure 1 shows the clustering results of each approach on the 'flame' data set, which is divided into an upper and a lower cluster. Although we provided the correct number of categories (2) to K-Means(c) and Birch(d), they still cannot cluster the data correctly, and they cannot recognize the two noise points. As for DBSCAN(b), with the parameters eps = 1 and minpts = 2, it cannot recognize the two clusters because there is no obvious distinction between them. DBNNN(a) obtains the right number of clusters and correctly clusters all normal points.

Figure 2 shows the clustering results of each approach on the 'compound' data set. DBNNN(a) can reasonably classify all data points. For DBSCAN(b), we set eps = 1.5 and minpts = 2; because eps and minpts cannot be changed once they are set, it is difficult to recognize differences in density correctly, so the uneven part on the right cannot be distinguished, and the DBSCAN algorithm directly ignores the sparse area and regards it as noise. As for K-Means(c) and Birch(d), because they are partition-based clustering methods, they do not perform well on irregularly shaped data sets.
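For reference, a minimal sketch of how this baseline comparison could be reproduced with scikit-learn, assuming the 'flame' points have been loaded into an array X (the file name and loading code are hypothetical; the parameter values follow the text, and this is our illustration, not the paper's code):

import numpy as np
from sklearn.cluster import DBSCAN, KMeans, Birch

X = np.loadtxt("flame.txt")[:, :2]  # hypothetical file holding the 'flame' points

baselines = {
    "DBSCAN": DBSCAN(eps=1.0, min_samples=2),    # parameters used for Figure 1
    "K-Means": KMeans(n_clusters=2, n_init=10),  # correct category number 2
    "Birch": Birch(n_clusters=2),                # category number supplied
}
for name, model in baselines.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels) - {-1})  # exclude DBSCAN's noise label -1
    print(name, "->", n_found, "clusters")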