Releases: farhadabedinzadeh/Auto-CVI-Tool
Auto-CVI-Tool
Auto-CVI-Tool
An Automatic Toolbox for Cluster Validity Indexes (CVI)
Cluster analysis involves identifying clusters' optimal number and natural division through automatic clustering. A cluster validity index (CVI) can easily estimate the number of clusters. Several cluster solutions have been proposed in the literature regarding intra-cluster cohesiveness and inter-cluster separation. Despite this, it is crucial to identify the situations where these CVIs work well and their limitations.
To estimate the number of clusters, this toolbox presents 28 robust CVIs. It is highly user-friendly and does not require any coding knowledge.
Without writing a single line of code, it is possible to compare 28 CVIs and visualize the results comparably.
When the data is loaded, all parameters will be automatically selected by the user, or the default setting will be used. The CVIs can be compared without any additional programming.
It is important to note that one section of this paper was used in developing this toolbox; I would appreciate a citation to both the reference article and myself if you use any part of this toolbox.
A cluster validity index (CVI) estimates the quality of a clustering solution by defining a relationship between intracluster cohesiveness (within-group scatter) and intercluster separation (between-group scatter). Table1
summarizes the 22 CVIs examined in this toolbox. Each CVI is identified by an acronym in the table, followed by an up arrow ↑
or a down arrow ↓
to indicate whether the index is maximized or minimized, respectively.
Table1
no. | Index | Full Name & Accronym | Min\Max |
---|---|---|---|
1 | chindex | Calinski-Harabasz index (ch). | ↑ |
2 | cindex | C index (cind). | ↓ |
3 | copindex | COP index (cop). | ↓ |
4 | csindex | CS index (cs). | ↓ |
5 | cvddindex | Index based on density-involved distance (cvdd). | ↑ |
6 | cvnnindex | Index based on nearest neighbors (cvnn). | ↓ |
7 | dbindex | Davies-Bouldin index (db). | ↓ |
8 | db2index | Enhanced Davies-Bouldin index (db2). | ↓ |
9 | dbcvindex | Density-based index (dbcv). | ↑ |
10 | dunnindex | Dunn index (dunn). | ↑ |
11 | gd31index | Dunn index variant 3,1 (gd31). | ↑ |
12 | gd33index | Dunn index variant 3,3 (gd33). | ↑ |
13 | gd41index | Dunn index variant 4,1 (gd41). | ↑ |
14 | gd43index | Dunn index variant 4,3 (gd43). | ↑ |
15 | gd51index | Dunn index variant 5,1 (gd51). | ↑ |
16 | gd53index | Dunn index variant 5,3 (gd53). | ↑ |
17 | lccvindex | Index based on local cores (lccv). | ↑ |
18 | pbmindex | PBM index (pbm). | ↑ |
19 | sdbwindex | S_Dbw validity index (sdbw). | ↓ |
20 | sfindex | Score Function index (sf). | ↑ |
21 | silindex | Silhouette index (sil). | ↑ |
22 | ssddindex | Index based on shapes, sizes, densities, and separation distances (ssdd). | ↓ |
23 | svindex | SV index (sv). | ↑ |
24 | symindex | Symmetry index (sym). | ↑ |
25 | symdbindex | Davies-Bouldin index based on symmetry (sdb). | ↓ |
26 | symdunnindex | Dunn index based on symmetry (sdunn). | ↑ |
27 | wbindex | WB index (wb). | ↓ |
28 | xbindex | Xie-Beni index (xb). | ↓ |
How to Use?
There are two scripts named KMeans_Evaluation.m
and Hierarchichal_Evaluation.m
; they evaluate the clustering based on KMeans
and Hierarchichal Clustering
, resepctively.
KMeans_Evaluation.m
parameter settingsdata
: dataload data
DistanceKMeans
: Distance Type for k-means clustering(Table2)
-
DistanceKMeans = DistKMeans;
-
Kmax
: Maximum Number of Clusters-
Kmax = 6; % Maximum Number of Cluster clust = zeros(size(data,1),Kmax); for k=1:Kmax clust(:,k) = kmeans(data,k,'distance',DistanceKMeans); end
-
CVI
: Select form(Table1)
-
%% Select CVI CVI = Select_CVI_KMeans; % Evaluation of the clustering solutions eva = evalcvi(clust,CVI, data);
-
Table2
No. | Distance |
---|---|
2 | sqeuclidean |
3 | cityblock |
4 | hamming |
5 | correlation |
6 | cosine |
You may compare multiple CVIs simultaneously by executing the following code:
CVIs = Select_Multiple_CVI_KMeans;
Multiple_Result = Do_Multiple(CVIs,clust,data);
Also, it's possible to visualize the result automatically.
Hierarchichal_Evaluation.m
parameter settingsdata
: dataload data
HierarchichalMethod
: Method for Hierarchical Cluster Tree(Table3)
-
Z = linkage(data, HierarchichalMethod);
-
Kmax
: Maximum Number of Clusters-
Kmax = 6; % Maximum Number of Cluster for k=1:Kmax clust(:,k) = cluster(Z, 'maxclust', k); end
-
DistanceType
: Type of pairwise distance between two sets of observations(Table4)
-
DistanceType = Distance_PDIST2; DXX = pdist2(data,data,DistanceType);
-
CVI
: Select form(Table1)
-
CVI = Select_CVI_Hierarchichal; eva = evalcvi(clust,CVI, DXX);
-
If you wish to compare multiple CVIs,run the following code
CVIs = Select_Multiple_CVI_Hierarchichal;
Multiple_Result = Do_Multiple(CVIs,clust,DXX);
Table3
No. | Method |
---|---|
2 | average |
3 | centroid |
4 | complete |
5 | median |
6 | single |
7 | ward |
Table4
euclidean | seuclidean |
squaredeuclidean | cityblock |
minkowski | jaccard |
chebychev | mahalanobis |
correlation | cosine |
spearman | hamming |
Visualization
Refrences
A. José-García and W. Gómez-Flores.
A survey of cluster validity indices for automatic data clustering using differential evolution.
The Genetic and Evolutionary Computation Conference* (GECCO '21), Lille, France, 2021.
DOI: 10.1145/3449639.3459341