Dunn Index

In unsupervised learning, the labeling information for training samples is unknown, and the goal is to uncover the nature and patterns of the data by examining the unlabeled training samples and to provide a basis for further data analysis.
2021-11-21, by Ted Jackman, Independent Financial Adviser

#Dunn Index || #Clusterization || #Data Analytics

The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms. It belongs to the family of cluster validity indices, along with the Davies-Bouldin index and the Silhouette index, and it is an internal evaluation scheme: the score is computed from the clustered data itself.

Assessing the quality of a clustering is hard in general, for at least two reasons, as noted here: https://python.org/dunn-index-and-db-index-cluster-validity-indices-set-1/.

  • Kleinberg's impossibility theorem: no clustering function can satisfy scale invariance, richness, and consistency at the same time, so there is no single optimal clustering algorithm.
  • Many clustering algorithms cannot determine the true number of clusters in the data. Most often, the number of clusters is an input to the algorithm and is chosen by running the algorithm several times with different values.
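
As a rough illustration of choosing the number of clusters by several runs, the sketch below runs a plain k-means for k = 2 to 5 on a toy data set and keeps the k with the best internal score. Everything in it is an assumption made for illustration: the toy points, the helper names (farthest_first, kmeans, mean_silhouette), the deterministic seeding, and the choice of the mean silhouette as the score. Any other internal validity index could be substituted.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: take the first point, then repeatedly add
    the point that is farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette over all points (assumes at least two clusters
    and no duplicate points); singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: three tight, well-separated groups of three points each.
points = [(0, 0), (1, 0), (0, 1),
          (10, 0), (11, 0), (10, 1),
          (30, 0), (31, 0), (30, 1)]

# One run per candidate k; higher mean silhouette is better.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy data
```

On real data the scores are rarely this clear-cut, and k-means with random seeding would normally be restarted several times for each k.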

Different machine learning tasks call for different performance metrics. For a classification problem, there are many established metrics for gauging how good a model is. For cluster analysis, the analogous question is how to evaluate the quality of the resulting clusters.

The Dunn Index

Why do we need cluster validity indices?

  • Compare clustering algorithms.
  • Compare two sets of clusters.
  • Compare two clusters, that is, decide which one is better in terms of compactness and connectedness.
  • Determine whether apparent structure in the data is merely random, that is, due to noise.

As a rule, cluster validity measures are divided into three classes:

  • Internal cluster validation: the clustering result is evaluated using only the clustered data itself (internal information), without reference to external information.
  • External cluster validation: the clustering result is evaluated against some externally known result, such as externally provided class labels.
  • Relative cluster validation: the clustering result is evaluated by varying parameters of the same algorithm (for example, changing the number of clusters).

Besides the notion of a cluster validity index, two quantities are needed: the inter-cluster distance d(a, b) between two clusters a and b, and the intra-cluster index D(a) of cluster a.
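
With these two quantities, the Dunn index is the smallest inter-cluster distance divided by the largest intra-cluster index: DI = min over a ≠ b of d(a, b), divided by max over a of D(a). A higher value means compact, well-separated clusters. The sketch below is one possible instance, not the only one: it assumes d(a, b) is the smallest point-to-point Euclidean distance between the two clusters and D(a) is the diameter of a (its largest point-to-point distance). Other choices for d and D are common, and the function name dunn_index and the toy points are illustrative.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index of a clustering given as a list of clusters,
    each cluster being a list of points (tuples of numbers).

    Assumes at least two clusters and at least one cluster with
    two or more distinct points (otherwise D(a) is 0 everywhere).
    """
    # d(a, b): smallest distance between a point of a and a point of b,
    # minimized over all pairs of distinct clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # D(a): diameter of cluster a, maximized over all clusters.
    max_diameter = max(
        math.dist(p, q)
        for a in clusters
        for p, q in combinations(a, 2)
    )
    return min_separation / max_diameter

# Two ways of splitting the same four points into two clusters.
compact = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]    # groups nearby points
stretched = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]  # groups distant points

print(dunn_index(compact))    # 5.0  (separation 5, diameter 1)
print(dunn_index(stretched))  # 0.2  (separation 1, diameter 5)
```

Because both the numerator and the denominator are extremes over all point pairs, a single outlier can change the score sharply, and the cost grows quadratically with the number of points.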

Machine Learning Notes - Clustering

In unsupervised learning, the labeling information for training samples is unknown, and the goal is to uncover the nature and patterns of the data by examining the unlabeled training samples and to provide a basis for further data analysis. Clustering is the most widely used technique of this kind.

Clustering attempts to split the samples in a dataset into several generally disjoint subsets, and each subset is called a cluster.

As a stand-alone process, clustering can be used to find the internal structure of the data distribution; it can also serve as a precursor to other learning tasks such as classification.

Clustering algorithms raise two main problems: measuring performance and calculating distance.

Dunn index and DB index

A clustering performance metric is also called a validity index, and it plays a role similar to that of a performance metric in supervised learning. To assess the quality of a clustering result, some performance metric is needed. Conversely, if the metric that will ultimately be used is known, it can serve directly as the optimization objective of the clustering process, so that the result better meets the requirements.

A good clustering result has high intra-cluster similarity and low inter-cluster similarity.

There are roughly two types of clustering performance metrics. An external index compares the clustering result with a specific “reference model”; an internal index examines the clustering result directly, without using any reference model.
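
To make the distinction concrete, here is a minimal sketch with one index of each kind. The Rand index is used as the external example (it needs reference labels), and the Davies-Bouldin (DB) index as the internal example (it needs only the points and the cluster labels). These choices, the function names, and the toy data are assumptions for illustration; the DB index is written in its usual centroid-based form, where lower values are better.

```python
import math
from itertools import combinations

def rand_index(labels, reference):
    """External index: fraction of point pairs on which the clustering
    and the reference agree (both 'same cluster' or both 'different')."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (labels[i] == labels[j]) == (reference[i] == reference[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def davies_bouldin(points, labels):
    """Internal index: for each cluster, take the worst ratio of summed
    scatter to centroid distance against any other cluster, then average.
    Assumes at least two clusters with distinct centroids."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    centers = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
    # Scatter: mean distance from the members of a cluster to its centroid.
    scatter = [sum(math.dist(p, mu) for p in g) / len(g)
               for g, mu in zip(groups, centers)]
    k = len(groups)
    return sum(
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

points = [(0, 0), (0, 2), (6, 0), (6, 2)]
reference = [0, 0, 1, 1]   # hypothetical "reference model" labels
good = [1, 1, 0, 0]        # same grouping as the reference, renamed labels
poor = [0, 1, 0, 1]        # pairs up distant points

print(rand_index(good, reference))   # 1.0
print(rand_index(poor, reference))   # 2 of 6 pairs agree, about 0.33
print(davies_bouldin(points, good))  # about 0.33 (lower is better)
print(davies_bouldin(points, poor))  # 3.0
```

Note that the external score is unaffected by how the clusters are named, only by which points are grouped together, while the internal score never looks at the reference at all.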
