Module for measuring feature importance for any clustering method.

msoczi/clustfeatimp

FEATURE IMPORTANCES FOR CLUSTERING WITH clustfeatimp

clustfeatimp is a module for measuring feature importance for any clustering method.

Table of Contents:
  1. About
  2. Installation
  3. Example

About

The aim of this project was to create a tool for measuring feature importance for any clustering method.

The idea is simple. Given the data used by the clustering model and the resulting cluster labels, the ClusteringExplainer object builds a multiclass classifier that predicts the cluster of each observation. The classifier learns the dependencies in the data, and as a by-product of training it yields a list of variables with their significance.

[Figure: idea_graph]


More specifically:
The idea is to turn an unsupervised learning problem into a supervised one that is easy to interpret, for example with tree-based methods. This implementation uses the XGBoost model.
Because segmentation models often produce unbalanced clusters, the classifier uses a parameter that controls the balance of the class weights. As a result, we do not have to worry that our clusters differ in the number of observations.
Since the XGBoost model has many hyperparameters that must be set before training, I used a simple Bayesian hyperparameter optimization with a small number of iterations (which you can define yourself). This lets you find a good set of hyperparameters for the classifier quickly.
It is also possible to skip the hyperparameter optimization. This speeds up the algorithm, but keep in mind that the default set of parameters does not always give good results.
I therefore recommend using Bayesian optimization (fit_hiperparams = True) even with a small number of iterations (5 by default).

The structure of the XGBoost model makes it possible to measure the significance of variables. This implementation uses the Gain measure.
The clustfeatimp module also lets you validate the quality of the fitted classifier using balanced_accuracy_score and confusion_matrix.
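
To make the two validation measures concrete, here is a small pure-Python sketch of what they compute. It is illustrative only: the helper names and the toy labels are hypothetical, and scikit-learn's sklearn.metrics functions of the same names are what one would normally call. Balanced accuracy is the mean of per-class recall, which is why it is a better fit than plain accuracy when clusters differ in size.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Rows are true clusters, columns are predicted clusters."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def balanced_accuracy(matrix):
    """Mean of per-class recall, skipping classes with no observations."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Toy example: cluster 0 has 8 observations, cluster 1 has only 2.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two small-cluster points is misclassified

cm = confusion_matrix_counts(y_true, y_pred, n_classes=2)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(cm)                     # [[8, 0], [1, 1]]
print(plain_accuracy)         # 0.9
print(balanced_accuracy(cm))  # 0.75
```

Plain accuracy looks high because the large cluster dominates, while balanced accuracy exposes that only half of the small cluster is recovered.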

How to get the feature importances for any clustering method?

  1. Perform clustering with any method.
  2. Use ClusteringExplainer to measure feature importances.
  3. Get the results 😃

Below are plots showing two-dimensional relationships between the clusters for the two most important variables and for two irrelevant variables.
The left plot, which uses the two most important variables, shows a clear division into 3 segments.
With two irrelevant variables, no clear division into 3 segments is possible.
[Figure: featimp2d]

Installation

  1. Use pip to install the module from GitHub:
pip install -e git+https://github.com/msoczi/clustfeatimp#egg=clustfeatimp
  2. Run Python and import the module:
import clustfeatimp as cfi

Example

import clustfeatimp as cfi

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Create dataset
X, _ = make_blobs(n_samples=300, centers=5, n_features=2, random_state=7)

# Clustering with KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

# Assign cluster values to the variable y
y = kmeans.labels_

# Create ClusteringExplainer object and fit to the data
clust_explnr = cfi.ClusteringExplainer()
clust_explnr.fit(X, y)

# Feature importance for clustering variables
print('--- Feature Importance for KMeans clustering ---')
print(clust_explnr.feature_importance)

# Plot with feature importance
clust_explnr.plot_importances()
plt.show()

# Plot 2D
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel('f0')
plt.ylabel('f1')
plt.show()

Here is a notebook with an example.

Contact

Mateusz Soczewka - msoczewkas@gmail.com

Thank you for any comments. 😃
