
Experiments on multiple datasets show that the problem caused by larger eigenvalues is much less severe than that caused by smaller eigenvalues, so we mainly discuss how to solve the problem of one cluster being divided into multiple clusters. To this end we propose cluster similarity: when two clusters are judged similar, we regard them as a whole. The similarity of two clusters is defined as follows: for any two clusters C1 and C2 in the data set, the clusters are similar if there are at least two pairs of data points, one point from each cluster, that are NEk-nearest neighbors of each other. For instance, suppose data points x1, x2 belong to cluster C1 and y1, y2 belong to cluster C2. If the following conditions are satisfied:

y1 ∈ NEkNN(x1), y1 ∈ RNN(x1)
y2 ∈ NEkNN(x2), y2 ∈ RNN(x2)

we merge the two clusters C1 and C2 into one cluster. In order to ensure that each cluster is complete and cannot be combined with any other cluster to generate a new cluster, we iterate the similarity step until the assignment of each data point in the data set no longer changes. The similarity step is described by the following algorithm:

Algorithm 2.3 Similarity
Input: assign
1)  Flag = 1
2)  this_cluster = 1
3)  pair_count = 0
4)  While (Flag):
5)      For k in assign == this_cluster:
6)          For j in assign != this_cluster:
7)              If k ∈ NEkNN(j) and k ∈ RNN(j):
8)                  pair_count++
9)              If pair_count >= 2:
10)                 assign[Ck] = assign[Cj]
11)     If notChange(assign):
12)         Flag = 0

Finally, when all points other than noise have been allocated, a similarity judgment is made for the remaining noise points. To determine a noise point x, we first traverse its NEk-nearest neighbors and find the nearest point y that is not a noise point. If x and y are NEk-nearest neighbors of each other, we assign the noise point x to the same cluster as y. If there is no such point y, then x remains a noise point.
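A minimal runnable sketch of this merge procedure, assuming a plain Euclidean k-nearest-neighbor graph stands in for the NEk-neighborhood; the helper names knn_sets and merge_similar_clusters, and the use of scikit-learn, are ours rather than the paper's:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sets(X, k):
    # Build k-NN and reverse-NN sets; a stand-in for NEkNN and RNN.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    knn = [set(row[1:]) for row in idx]      # drop the point itself
    rnn = [set() for _ in range(len(X))]
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            rnn[j].add(i)                    # i has j as a neighbor
    return knn, rnn

def merge_similar_clusters(assign, knn, rnn):
    # Algorithm 2.3: merge two clusters when at least two cross-cluster
    # pairs are mutual neighbors; repeat until no assignment changes.
    assign = assign.copy()
    changed = True
    while changed:
        changed = False
        labels = np.unique(assign)
        for a in labels:
            for b in labels:
                if a >= b:
                    continue
                pts_a = np.where(assign == a)[0]
                pts_b = np.where(assign == b)[0]
                pair_count = sum(1 for p in pts_a for q in pts_b
                                 if q in knn[p] and q in rnn[p])
                if pair_count >= 2:          # the two-pair threshold
                    assign[assign == b] = a  # merge cluster b into a
                    changed = True
    return assign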
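The final pass over the noise points can be sketched in the same style; the noise label -1 and the reliance on scikit-learn's distance-ordered kneighbors output are assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_noise(X, assign, k, noise_label=-1):
    # For each noise point x, walk its neighbors from nearest to farthest
    # and join the cluster of the first non-noise mutual neighbor y;
    # if no such y exists, x stays a noise point.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                # rows are ordered nearest-first
    neigh = [list(row[1:]) for row in idx]   # drop the point itself
    assign = assign.copy()
    for x in np.where(assign == noise_label)[0]:
        for y in neigh[x]:
            if assign[y] != noise_label and x in neigh[y]:
                assign[x] = assign[y]        # mutual neighbors: same cluster
                break
    return assign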
III. EXPERIMENT AND RESULT

A. Data sets and algorithms

In order to verify the effectiveness of the algorithm, we conduct experiments on four synthetic datasets with various shapes and densities, obtained from UCI. The algorithms used for comparison are DBSCAN, K-Means, and Birch.

DBSCAN is the most famous density-based clustering algorithm. It identifies the core part of a cluster by the sparseness of the samples, and can therefore find clusters of arbitrary shape. It requires two parameters, eps and minpts, which represent the neighborhood radius and the minimum number of points contained within that radius, respectively.

K-means clustering is the most famous partition-based clustering algorithm. Because of its simplicity and efficiency, it has become the most widely used clustering algorithm. Given a set of data points and the required number of clusters K, specified by the user, the K-means algorithm repeatedly divides the data into K clusters according to a distance function.

The full name of BIRCH is Balanced Iterative Reducing and Clustering using Hierarchies. It needs to scan the data set only once to cluster. The number of categories is an optional parameter of Birch; if the user specifies it, the clustering result will return the corresponding number of clusters.

B. Experiment and Result

Figure 1 shows the clustering results of each approach on the 'flame' data. The flame set is divided into an upper and a lower cluster. Although we provided the correct number of categories (2) to the K-Means (c) and Birch (d) algorithms, they still cannot cluster correctly, and they cannot recognize the two noise points. As for DBSCAN (b), we set the parameters eps = 1, minpts = 2; because there is no obvious distinction between the two clusters, it cannot recognize them. DBNNN (a) obtains the right number of clusters and correctly clusters all normal points.

Figure 1. Clustering results of each approach on the flame set: (a) DBNNN, (b) DBSCAN, (c) K-Means, (d) Birch.

Figure 2 shows the clustering results of each approach on the 'compound' data. DBNNN (a) can reasonably classify all the data points, but cannot distinguish the uneven-density part on the right. Because eps and minpts cannot be changed once they are set, it is difficult for DBSCAN (b) to recognize differences in density correctly; here we set eps = 1.5, minpts = 2, and the DBSCAN algorithm ignores the sparse area directly, regarding its points as noise. As for K-Means (c) and Birch (d), because they are partition-based clustering methods, they do not perform well on irregularly shaped data sets.
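The three baselines can be run with their scikit-learn implementations; the parameter values below mirror the experiments above (eps = 1, minpts = 2 on flame; eps = 1.5 on compound; the true category number 2 for K-Means and Birch), while the data-loading line is a placeholder for a local copy of the flame set:

import numpy as np
from sklearn.cluster import DBSCAN, KMeans, Birch

X = np.loadtxt("flame.txt")[:, :2]           # placeholder path: (n, 2) points

labels = {
    "DBSCAN": DBSCAN(eps=1.0, min_samples=2).fit_predict(X),   # eps=1.5 for compound
    "K-Means": KMeans(n_clusters=2, n_init=10).fit_predict(X),
    "Birch": Birch(n_clusters=2).fit_predict(X),
}
for name, y in labels.items():
    found = len(set(y) - {-1})               # DBSCAN marks noise as -1
    print(f"{name}: {found} clusters")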