Data Typology

This library computes the data typology of real-world imbalanced datasets in order to measure the complexity of the associated classification problem:

Minority class examples are categorized into 4 main types: Safe, Borderline, Rare, and Outlier examples. The data typology of each dataset is associated with the difficulty of classification;
To identify the type of each data point, the class labels of its k nearest neighbours are evaluated;
As per the original formulation, a neighbourhood of k = 5 and the HVDM distance (Heterogeneous Value Difference Metric) are used by default. Other distance functions (implemented in the distances folder) can also be explored;
The distances used to compute the neighbourhoods support both numeric and categorical data, as well as missing values.
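To make the idea of a heterogeneous distance concrete, the sketch below computes an HVDM-style distance between two patterns of a toy dataset. It follows one textbook formulation (numeric differences scaled by 4 standard deviations, a value difference metric for categorical features, and a distance of 1 when a value is missing). The toy data, the NaN encoding of missing values, and all variable names are assumptions for illustration; the functions in the distances folder may differ in their details.

```matlab
% Toy data (hypothetical): column 1 is numeric, column 2 is categorical.
X = [1 1; 2 1; 3 2; 4 2; NaN 3];
T = [1; 1; 0; 0; 0];              % class labels
feature_types = [0 1];            % 0 = continuous, 1 = categorical
classes = unique(T);

x = X(1, :);                      % first pattern
y = X(3, :);                      % third pattern
d2 = 0;                           % accumulated squared distance
for a = 1:size(X, 2)
    col = X(:, a);
    if isnan(x(a)) || isnan(y(a))
        da = 1;                   % missing value: assumed maximum distance
    elseif feature_types(a) == 0
        sigma = std(col(~isnan(col)));          % spread of observed values
        da = abs(x(a) - y(a)) / (4 * sigma);    % scaled numeric difference
    else
        da2 = 0;
        for c = classes'
            px = mean(T(col == x(a)) == c);     % P(class c | feature value of x)
            py = mean(T(col == y(a)) == c);     % P(class c | feature value of y)
            da2 = da2 + (px - py)^2;
        end
        da = sqrt(da2);           % value difference metric
    end
    d2 = d2 + da^2;
end
d = sqrt(d2);
fprintf('distance = %.4f\n', d);
```

Categorical values are compared through the class distributions they induce rather than through their codes, which is why two different codes can still be close if they predict the same classes.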

Types of Examples

Depending on their local neighbourhood (typically k = 5), examples can be assigned to one of 4 categories, as follows:

Safe examples have 0 or 1 neighbours of the opposite class;
Borderline examples have 2 or 3 neighbours of the opposite class;
Rare examples have 4 neighbours of the opposite class. Additionally, the only neighbour of the same class should be either an outlier example, or a rare example as well;
Outlier examples have all 5 neighbours of the opposite class.
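The thresholds above can be sketched as a simple lookup from the number of opposite-class neighbours to an example type. This is a minimal illustration for k = 5 with made-up counts, not the library's categorizeDataset implementation; in particular, it does not apply the additional check on the same-class neighbour of rare examples.

```matlab
% Hypothetical counts of opposite-class neighbours (k = 5) for six examples.
n_opposite = [0 1 2 3 4 5];

% Type assigned to 0, 1, 2, 3, 4 and 5 opposite-class neighbours, in order.
labels = 'SSBBRO';

% MATLAB indexing is 1-based, so shift the counts by one.
types = labels(n_opposite + 1);
disp(types)   % prints SSBBRO
```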

Below is a representation of the different example types: A is a safe example, surrounded by neighbours of its class; B is an outlier example, isolated in an area of the opposite class; C and D are rare examples; and E and F are borderline examples, located near the decision border between classes.

Example of use:

The arff-to-mat folder contains a set of files to read .arff files, whereas the data folder contains some .arff files for testing.

The distance functions generally require the following arguments:

X: matrix of data (patterns x features);
T: column vector of classes;
feature_types: 1/0 row vector indicating whether a feature in column i is categorical (1) or continuous (0);
min_class: label of the minority class (1 in this project).
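As a rough illustration of these arguments, the snippet below builds them by hand for a tiny made-up dataset and checks that their shapes agree. The values are invented, and encoding a missing value as NaN is an assumption rather than something the library documents here.

```matlab
% Toy dataset (hypothetical): 4 patterns x 3 features.
% The NaN in row 2 stands for a missing value (assumed encoding).
X = [63 1 2;
     58 0 NaN;
     71 1 3;
     66 0 1];
T = [1; 0; 0; 0];            % column vector of class labels
feature_types = [0 1 1];     % feature 1 continuous, features 2 and 3 categorical
min_class = 1;               % label of the minority class

% Sanity checks on shapes before calling any distance function.
assert(size(T, 1) == size(X, 1) && size(T, 2) == 1);
assert(isequal(size(feature_types), [1, size(X, 2)]));
assert(any(T == min_class));
```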

Consider the thoracic.arff dataset provided in the data folder, which contains continuous features (PRE4, PRE5, AGE) and categorical features (the remaining ones), both binary and nominal. Here's an example of how to compute the data typology:

% Add the project folders to the MATLAB path
addpath('arff-to-mat');
addpath('categorize');
addpath('distances');
addpath('data');
addpath('utils');

% Load the dataset (must contain the .arff extension)
filename = 'thoracic.arff';
data = arff2double(filename);
X = data.X; 
T = data.Y;
min_class = 1; % In this project, 1 is the minority and positive class
feature_types = data.isNomBin; % bool array of nominal (1) or numeric (0) features
distance_metric = 'HVDM-original';

% Calculate data typology
[S,B,R,O,data_types,D] = categorizeDataset(X, T, feature_types, min_class, distance_metric);

The output comprises the following:

S, B, R, O: Percentages of minority class examples that are safe, borderline, rare, and outlier, respectively:

>> [S, B, R, O]

ans =

    1.4300   42.8600   15.7100   40.0000

data_types: A string giving the type of each example in the data: 'S', 'B', 'R', or 'O' for minority class examples, and 'X' for examples of the other class(es).

data_types =

    'XXXXBXORXXXXXBXXXXXXXXXXBXXOXXRXXXXXXXXXBOXSXXXBXXXXXXXXXBXXRXXXXXXOXXXXXXXBXOXXXXXXXXXXXXXXBXXXRXXBXXXXXXXXXXXXXXXBXXXXXXXXXXXOOXXXOXXXXBXXXXXRXXXXXXXOOXXXXXXXXXXXXXXXXBXXXRXOXXXXXXXXXOXXXXXOXXXROXXXXXXXXXXXXBXBXBXXXXXXBXXXXXXOXBXXXXXXXXBBXXBXXXXXXXXOXXXXXXXXOXXXXXXXBXOXXXOXXXXXXXXXXXXXXXXXXXXXRBXXXXOXXRXXXXXXOXXXXXXXXOXXXXXXXXXXXXOXXXXXOXXXXXXXOXBXXXXXXRXXXXBXXXXRXXXXXXXXXXXXXXXXXXBXXXXXXXXXXXXXXXXXOXXXXXOXXXXXXXXBBOXXXXBXXXXXXXXXXXXXXXXXXXXXXBXXXXXXXXXXXXXBXXXXXX'
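Since data_types holds one character per example, the percentages and the positions of a given type can be recovered with plain character comparisons. The short string below is a made-up stand-in for the real output, used only to keep the sketch easy to verify by hand.

```matlab
% Hypothetical data_types string for 10 examples (5 from the minority class).
data_types = 'XXBXOSXRBX';

% Keep only minority class examples ('X' marks the other class(es)).
minority = data_types(data_types ~= 'X');
n_min = numel(minority);

% Percentage of safe, borderline, rare and outlier examples.
pct = 100 * [sum(minority == 'S'), sum(minority == 'B'), ...
             sum(minority == 'R'), sum(minority == 'O')] / n_min;
disp(pct)            % 20 40 20 20

% Row indices of the outlier examples in the original data.
idx_outliers = find(data_types == 'O');
disp(idx_outliers)   % 5
```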

D: The distances between every pair of patterns. Note that although thoracic.arff is originally complete (i.e., it does not contain missing data), the heterogeneous distance functions can also handle missing values internally.
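A pairwise distance matrix such as D is enough to recover each pattern's nearest neighbours and count how many belong to the opposite class. The sketch below does this on a toy one-dimensional dataset with a plain absolute difference and k = 2, both chosen only to keep the numbers easy to check; it assumes D is a square matrix with one row per pattern and is not taken from the library's code.

```matlab
% Toy data (hypothetical): 6 one-dimensional patterns and their classes.
X = [0; 1; 2; 10; 11; 50];
T = [1; 1; 0; 0; 0; 1];
k = 2;                        % small k for the toy data (the library default is 5)

n = size(X, 1);
D = abs(X - X');              % pairwise distances (implicit expansion, R2016b+)
D(1:n+1:end) = Inf;           % a pattern is not its own neighbour

[~, order] = sort(D, 2);      % neighbours of each pattern, closest first
nn = order(:, 1:k);           % indices of the k nearest neighbours

% Number of neighbours whose class differs from the pattern's own class.
n_opposite = sum(T(nn) ~= T, 2);
disp(n_opposite')             % 1 1 2 0 0 2
```

The counts are computed here for every pattern; for the typology, only the rows belonging to the minority class would be of interest.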

Citation Request:

If you plan to use this library, please consider referring to the following papers:

@article{Santos2022,
  title={On the joint-effect of class imbalance and overlap: a critical review},
  author={Santos, Miriam Seoane and Abreu, Pedro Henriques and Japkowicz, Nathalie and Fern{\'a}ndez, Alberto and Soares, Carlos and Wilk, Szymon and Santos, Jo{\~a}o},
  journal={Artificial Intelligence Review},
  pages={1--69},
  year={2022},
  publisher={Springer}
}

@article{Santos2023,
  title={A unifying view of class overlap and imbalance: Key concepts, multi-view panorama, and open avenues for research},
  author={Santos, Miriam Seoane and Abreu, Pedro Henriques and Japkowicz, Nathalie and Fern{\'a}ndez, Alberto and Santos, Jo{\~a}o},
  journal={Information Fusion},
  volume={89},
  pages={228--253},
  year={2023},
  publisher={Elsevier}
}
