
FEATURE IMPORTANCES FOR CLUSTERING WITH `clustfeatimp`

clustfeatimp is a module for measuring feature importance for any clustering method.

Table of Contents:

About
Installation
Example

About

The aim of this project was to create a tool for measuring feature importance for any clustering method.

The idea is simple. Given the data used for clustering and the resulting cluster labels, the ClusteringExplainer object fits a multiclass classifier that predicts the cluster of each observation. The classifier learns the dependencies in the data, and as a by-product it yields a list of variables with their importance.

More specifically:
The idea is to turn an unsupervised learning problem into a supervised one that is easy to interpret, for example with tree-based methods. This implementation uses the XGBoost model.
Segmentation models often produce unbalanced clusters, so the classifier uses a parameter that controls the balance of the class weights. As a result, we do not have to worry that our clusters differ in the number of observations.
Since the XGBoost model has many hyperparameters that must be set before training, I used a simple Bayesian hyperparameter optimization with a small number of iterations (which you can define yourself). This lets you quickly find a good set of hyperparameters for the classifier.
It is also possible to skip the hyperparameter optimization. This speeds up the algorithm, but remember that the default set of parameters does not always give good results.
I therefore recommend using Bayesian optimization (fit_hiperparams = True), even with a small number of iterations (5 by default).
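To make the class-balancing point concrete, here is a minimal sketch of one common weighting scheme, the "balanced" heuristic, where each cluster gets the weight n_samples / (n_clusters * cluster_size). Whether clustfeatimp uses exactly this formula is an assumption, and the function name balanced_class_weights is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {cluster: weight}, weight = n_samples / (n_clusters * cluster_size)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_clusters = len(counts)
    return {c: n_samples / (n_clusters * size) for c, size in counts.items()}


# Three clusters of very different sizes (illustrative labels, not real data)
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(labels)

# One weight per observation, as a classifier's sample-weight argument would expect
sample_weights = [weights[c] for c in labels]

# Total weight carried by each cluster: identical for all clusters
totals = {c: sum(w for w, l in zip(sample_weights, labels) if l == c) for c in weights}
print(weights)
print(totals)
```

With these weights every cluster contributes the same total weight to the training loss, so a small cluster is not drowned out by a large one.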

The structure of the XGBoost model makes it possible to measure the importance of each variable. This implementation uses the Gain measure.
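For reference, the gain of a single split under XGBoost's regularized objective can be written as below. Here G and H are the sums of the first- and second-order gradients of the loss over the observations in the left (L) and right (R) child, λ is the L2 regularization term, and γ is the penalty for adding a leaf. XGBoost's gain importance of a feature is the average of this quantity over all splits that use the feature. How clustfeatimp further aggregates or normalizes these values is not stated in this README.

```latex
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```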
The clustfeatimp module also lets you validate the quality of the fitted classifier using balanced_accuracy_score and confusion_matrix.
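As an illustration of why balanced accuracy suits unbalanced clusters, the sketch below computes a confusion matrix and the balanced accuracy (the mean of the per-class recalls) from scratch. The helper names build_confusion_matrix and balanced_accuracy are hypothetical and written for this example only; in practice the scikit-learn functions named above do this work.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true clusters, columns are predicted clusters."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def balanced_accuracy(matrix):
    """Mean recall over the classes that appear in y_true."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Illustrative labels: a large cluster 0 and a small cluster 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two cluster-1 points is misclassified

cm = build_confusion_matrix(y_true, y_pred, n_classes=2)
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(cm)

print(cm)         # [[8, 0], [1, 1]]
print(plain_acc)  # 0.9
print(bal_acc)    # 0.75
```

Plain accuracy looks high because the large cluster dominates, while balanced accuracy exposes that half of the small cluster was misclassified.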

How to get the feature importances for any clustering method?

Perform clustering with any method.
Use ClusteringExplainer to measure feature importances.
Get the results 😃
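The three steps above can be sketched without clustfeatimp to show what happens conceptually. This is a minimal sketch under stated assumptions, not the module's implementation: it swaps in scikit-learn's GradientBoostingClassifier for XGBoost, uses scikit-learn's "balanced" sample weights, skips hyperparameter optimization, and reads impurity-based importances rather than XGBoost's Gain. The dataset and the feature names noise0 and noise1 are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Two informative features: three well-separated blobs of unequal size
X_info, _ = make_blobs(
    n_samples=[200, 80, 20],
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=7,
)
# Two irrelevant features: pure noise, unrelated to the blobs
rng = np.random.default_rng(7)
X = np.hstack([X_info, rng.normal(size=(X_info.shape[0], 2))])
feature_names = ["f0", "f1", "noise0", "noise1"]

# Step 1: perform clustering with any method (KMeans here)
y = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# Step 2: fit a multiclass classifier that predicts the cluster labels,
# weighting observations so that small clusters count as much as large ones
weights = compute_sample_weight("balanced", y)
clf = GradientBoostingClassifier(n_estimators=20, random_state=7)
clf.fit(X, y, sample_weight=weights)

# Step 3: read the feature importances learned by the classifier
importances = clf.feature_importances_
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```

In this setup the clusters are defined by f0 and f1, so those two features should receive nearly all of the importance while the noise features stay close to zero.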

Below are plots showing two-dimensional relationships between the clusters for the two most important variables and for two irrelevant variables.
In the left plot, the two most important variables give a clear division into 3 segments.
With two irrelevant variables, no clear division into 3 segments is possible.

Installation

Use pip to install the module from GitHub:

pip install -e git+https://github.com/msoczi/clustfeatimp#egg=clustfeatimp

Run Python and import the module:

import clustfeatimp as cfi

Example

import clustfeatimp as cfi

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Create dataset
X, _ = make_blobs(n_samples=300, centers=5, n_features=2, random_state=7)

# Clustering with KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

# Assign cluster values to the variable y
y = kmeans.labels_

# Create ClusteringExplainer object and fit to the data
clust_explnr = cfi.ClusteringExplainer()
clust_explnr.fit(X, y)

# Feature importance for clustering variables
print('--- Feature Importance for KMeans clustering ---')
print(clust_explnr.feature_importance)

# Plot with feature importance
clust_explnr.plot_importances()
plt.show()

# Plot 2D
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel('f0')
plt.ylabel('f1')
plt.show()

Here is a notebook with an example.

Contact

Mateusz Soczewka - msoczewkas@gmail.com

Thank you for any comments. 😃
