Parameter-free clustering algorithm based on density and natural nearest neighbor
Yulun Wu
Guilin University of Technology
College of Information Science and Engineering
Guilin, China
e-mail: [email protected]
Abstract—The purpose of a clustering algorithm is to explore the correlations within a large amount of unlabeled data, so as to find valuable information hidden in chaotic data. There are many kinds of clustering algorithms, but most of them require one or more parameters to be selected based on experience, and this selection directly affects clustering accuracy. For example, DBSCAN needs a neighborhood radius and a minimum number of points within that radius; k-means needs to know the number of clusters in advance; and the k-nearest neighbor algorithm requires an appropriate number of neighbors to be chosen. To realize parameter-free clustering, we adopt the concept of the natural nearest neighbor and let data points discover their neighbors autonomously through iteration. To compensate for the possibility that the resulting K value is not appropriate, we propose a measure of similarity between clusters, which is used during clustering to correct misclassified clusters. Finally, a similarity measure for observation points is proposed to distinguish boundary points from outliers. Comparisons with DBSCAN, BIRCH and K-MEANS show that our algorithm achieves better performance than these algorithms.
Keywords—cluster; parameter free

I. INTRODUCTION
In data mining, clustering is one of the main research fields. As the saying goes, "birds of a feather flock together": the purpose of a clustering algorithm is to divide a group of disordered data into several clusters without any prior knowledge, so that the similarity between the data objects within each cluster is as large as possible, while the similarity between different clusters is as small as possible. In today's highly developed networks, people's work and life constantly generate large amounts of data. For example, everything from casual page browsing to online shopping leaves traces that seem to have no relevance; in fact, these small traces contain detailed information about each user's interests and habits, and these details are creating endless value for enterprises. In terms of cluster analysis, different users constitute different data objects, users' interests and habits constitute the attributes of those data objects, and an excellent clustering algorithm can accurately group similar users together. Because clustering algorithms can obtain useful knowledge without supervision, more and more people focus on clustering, eager to discover the rules and value hidden in cluttered data. Traditional clustering algorithms are mainly divided into several categories: partition-based clustering, hierarchical clustering, and density-based clustering.

The general idea of partition-based clustering is to divide the data into k parts. The classical algorithms of this category are K-means [1] and K-medoids [2]; X-means [3], proposed by Pelleg et al., is an improved K-means algorithm that only needs a range for the number of clusters rather than a specific value as its parameter.

In hierarchical clustering, a hierarchical nested clustering tree is created by calculating the similarity between data points of different categories. In the clustering tree, the original data points form the lowest level, and the top level is the root node of the tree. There are two ways to create the clustering tree: bottom-up merging (agglomerative) and top-down splitting (divisive), as sketched below. CURE [4] and BIRCH [5] are representative algorithms of this kind. Chameleon [6] was proposed to take into account both the proximity of the nearest neighbor nodes and the size of the adjacent regions. To address the instability of clustering algorithms with respect to noise, Balcan et al. [7] proposed robust hierarchical clustering.
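As an illustrative sketch of the bottom-up strategy only (not of any specific algorithm cited above), the following starts with each point as its own cluster and repeatedly merges the two closest clusters; the single-linkage distance used here is an assumption made for brevity:

import numpy as np

def agglomerate(data, k):
    """Naive bottom-up merging: start with singleton clusters and
    repeatedly merge the two closest ones until k clusters remain.
    Uses single linkage (minimum point-to-point distance) for brevity."""
    clusters = [[i] for i in range(len(data))]
    def linkage(a, b):  # single-linkage distance between two clusters
        return min(np.linalg.norm(data[i] - data[j]) for i in a for j in b)
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        p, q = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters.pop(q)
    return clusters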
Density-based clustering defines a cluster as a maximal set of density-connected points; it can divide regions of sufficient density into clusters and can therefore find clusters of arbitrary shape. DBSCAN [8] is the most typical representative of this kind of method. It needs two parameters, eps and minpts: if a point has at least minpts points within a radius of eps, it is regarded as a high-density point, as illustrated by the sketch below.
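A minimal sketch of this core-point test (not the full DBSCAN procedure); the data set and the eps and minpts values are hypothetical, and whether the point itself is counted varies between implementations:

import numpy as np

def is_core_point(data, idx, eps, minpts):
    """Return True if point `idx` has at least `minpts` points
    (counting itself) within distance `eps` in `data`."""
    dists = np.linalg.norm(data - data[idx], axis=1)  # distances to all points
    return np.count_nonzero(dists <= eps) >= minpts

# Hypothetical 2-D example
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(data, 0, eps=0.5, minpts=3))  # True: a high-density point
print(is_core_point(data, 3, eps=0.5, minpts=3))  # False: an isolated point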
D. Lian et al. proposed LDBSCAN [9], with techniques for handling heterogeneous data. To improve efficiency, APSCAN [10] and DSet-DBSCAN [11] were proposed, and because the parameters of DBSCAN are difficult to set, OPTICS [12] was developed by Ankerst et al.
Although clustering algorithms have been improved in many respects, the setting of parameters still affects clustering performance in practical applications: different parameters lead to different clustering results, and for a data set without prior knowledge or labels it is difficult to directly give reasonable parameters.
In this paper, we introduce the concept of the natural nearest neighbor (3N), proposed by Dr. Zhou et al. [13], to adaptively generate the nearest-neighbor eigenvalue and the reverse nearest neighbors of each point. Then we extend clusters by density-based clustering, and regard the points whose
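A minimal sketch of the 3N neighbor search, assuming the common formulation in which the neighborhood size grows until every point has at least one reverse nearest neighbor; the exact stopping rule in [13] may differ in detail:

import numpy as np

def natural_nearest_neighbors(data):
    """Enlarge the neighborhood size r until every point has at least
    one reverse nearest neighbor; r is then the adaptively found
    nearest-neighbor eigenvalue, with no parameter supplied by the user."""
    n = len(data)
    # Pairwise distance matrix; sorting each row ranks the neighbors
    # (column 0 of `order` is the point itself).
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    reverse = [set() for _ in range(n)]  # reverse[j]: points that chose j
    r = 0
    while r < n - 1:
        r += 1
        for i in range(n):
            j = order[i, r]    # the r-th nearest neighbor of point i
            reverse[j].add(i)  # so i is a reverse nearest neighbor of j
        if all(reverse[j] for j in range(n)):
            break              # every point was chosen by someone: stop
    return r, reverse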