Parameter-free clustering algorithm based on density and natural nearest neighbor
Yulun Wu
Guilin University of Technology
College of Information Science and Engineering
Guilin, China
e-mail: [email protected]
Abstract—The purpose of a clustering algorithm is to explore the correlations within a large amount of unlabeled data, so as to find valuable information hidden in chaotic data. There are many kinds of clustering algorithms, but most of them require one or more parameters to be selected based on experience, and this selection directly affects clustering accuracy. For example, DBSCAN needs a neighborhood radius and a minimum number of points within that radius; k-means needs to know the number of clusters in advance; and the k-nearest neighbor algorithm requires an appropriate number of neighbors to be chosen. To realize parameter-free clustering, we adopt the concept of the natural nearest neighbor and let data points discover their neighbors autonomously through iteration. To compensate for the possibility that the resulting K value is not appropriate, we propose a measure of similarity between clusters, which is used during clustering to correct misclassified clusters. Finally, a similarity measure for observation points is proposed to distinguish boundary points from outliers. Comparisons with DBSCAN, BIRCH and K-MEANS show that our algorithm achieves better performance than these algorithms.
Keywords—cluster; parameter free

I. INTRODUCTION
In data mining, clustering is one of the main research fields. As the saying goes, "birds of a feather flock together": the purpose of a clustering algorithm is to divide a group of disordered data into several clusters without any prior knowledge, so that the similarity between the data objects within each cluster is as large as possible, while the similarity between different clusters is as small as possible. In today's highly developed networks, people's work and life constantly generate large amounts of data. For example, everything from casual page browsing to online shopping leaves traces that seem to have no relevance; in fact, these small traces contain detailed information about each user's interests and habits, and these details are creating endless value for enterprises. In terms of cluster analysis, different users constitute different data objects, users' interests and habits constitute the attributes of those data objects, and an excellent clustering algorithm can accurately group similar users together. Because clustering algorithms can obtain useful knowledge without supervision, more and more people focus on clustering, eager to discover the rules and value hidden in cluttered data. Traditional clustering algorithms are mainly divided into several categories: partition-based clustering, hierarchical clustering, and density-based clustering.

The general idea of partition-based clustering is to divide the data into k parts. The classical algorithms of this category are K-means [1] and K-medoids [2]; X-means [3], proposed by Pelleg et al., is an improved K-means algorithm that only needs a range for the number of clusters rather than a specific value as its parameter.

In hierarchical clustering, a hierarchical nested clustering tree is created by calculating the similarity between data points of different categories. In the clustering tree, the original data points form the lowest level, and the top level is the root node of the tree. There are two ways to create the clustering tree: bottom-up merging (agglomerative) and top-down splitting (divisive), as sketched below. CURE [4] and BIRCH [5] are representative algorithms of this kind. Chameleon [6] was proposed to take into account both the proximity of the nearest neighbor nodes and the size of the adjacent regions. To address the instability of clustering algorithms with respect to noise, Balcan et al. [7] proposed robust hierarchical clustering.
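As an illustrative sketch of the bottom-up strategy only (not of any specific algorithm cited above), the following starts with each point as its own cluster and repeatedly merges the two closest clusters; the single-linkage distance used here is an assumption made for brevity:

import numpy as np

def agglomerate(data, k):
    """Naive bottom-up merging: start with singleton clusters and
    repeatedly merge the two closest ones until k clusters remain.
    Uses single linkage (minimum point-to-point distance) for brevity."""
    clusters = [[i] for i in range(len(data))]
    def linkage(a, b):  # single-linkage distance between two clusters
        return min(np.linalg.norm(data[i] - data[j]) for i in a for j in b)
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        p, q = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters.pop(q)
    return clusters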
Density-based clustering defines a cluster as a maximal set of density-connected points; it can divide regions of sufficient density into clusters and can therefore find clusters of arbitrary shape. DBSCAN [8] is the most typical representative of this kind of method. It needs two parameters, eps and minpts: if a point has at least minpts points within a radius of eps, it is regarded as a high-density point, as illustrated by the sketch below.
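A minimal sketch of this core-point test (not the full DBSCAN procedure); the data set and the eps and minpts values are hypothetical, and whether the point itself is counted varies between implementations:

import numpy as np

def is_core_point(data, idx, eps, minpts):
    """Return True if point `idx` has at least `minpts` points
    (counting itself) within distance `eps` in `data`."""
    dists = np.linalg.norm(data - data[idx], axis=1)  # distances to all points
    return np.count_nonzero(dists <= eps) >= minpts

# Hypothetical 2-D example
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(data, 0, eps=0.5, minpts=3))  # True: a high-density point
print(is_core_point(data, 3, eps=0.5, minpts=3))  # False: an isolated point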
D. Lian et al. proposed LDBSCAN [9], with techniques for handling heterogeneous data. To improve efficiency, APSCAN [10] and DSet-DBSCAN [11] were proposed, and because the parameters of DBSCAN are difficult to set, OPTICS [12] was developed by Ankerst et al.
Although clustering algorithms have been improved in many respects, the setting of parameters still affects clustering performance in practical applications: different parameters lead to different clustering results, and for a data set without prior knowledge or labels it is difficult to directly give reasonable parameters.
In this paper, we introduce the concept of the natural nearest neighbor (3N), proposed by Dr. Zhou et al. [13], to adaptively generate the nearest-neighbor eigenvalue and the reverse nearest neighbors of each point. Then we extend clusters by density-based clustering, and regard the points whose
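A minimal sketch of the 3N neighbor search, assuming the common formulation in which the neighborhood size grows until every point has at least one reverse nearest neighbor; the exact stopping rule in [13] may differ in detail:

import numpy as np

def natural_nearest_neighbors(data):
    """Enlarge the neighborhood size r until every point has at least
    one reverse nearest neighbor; r is then the adaptively found
    nearest-neighbor eigenvalue, with no parameter supplied by the user."""
    n = len(data)
    # Pairwise distance matrix; sorting each row ranks the neighbors
    # (column 0 of `order` is the point itself).
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    reverse = [set() for _ in range(n)]  # reverse[j]: points that chose j
    r = 0
    while r < n - 1:
        r += 1
        for i in range(n):
            j = order[i, r]    # the r-th nearest neighbor of point i
            reverse[j].add(i)  # so i is a reverse nearest neighbor of j
        if all(reverse[j] for j in range(n)):
            break              # every point was chosen by someone: stop
    return r, reverse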