Source code of "Semi-Supervised Clustering with Inaccurate Pairwise Annotations" (Gribel, Gendreau and Vidal, 2021).
Semi-Supervised Clustering with Inaccurate Pairwise Annotations: https://arxiv.org/abs/2104.02146
To run the SSC-IPA algorithm, open the Julia REPL and run the following commands:
julia> include("Optimizer.jl")
julia> in = Input(seed, max_it, supervision_flag, prior)
julia> main("dataset", "must_graph", "cannot_graph", in)
For example:
julia> include("Optimizer.jl")
julia> in = Input(1234, 50, 1, 0.9)
julia> main("vertebral.data", "vertebral-must.link", "vertebral-cannot.link", in)
seed: Numerical seed.
max_it: Maximum number of iterations the algorithm will run.
supervision_flag: Determines whether pairwise supervision is used (0: unsupervised algorithm, 1: semi-supervised algorithm).
prior: Prior estimate of the experts' accuracy (between 0 and 1; enter -1 for no prior).
dataset: Dataset file. Important: You must provide a file with the .data extension along with a labels (ground-truth) file. The labels file must have the .label extension. Example: For a dataset named "vertebral.data", you must provide the "vertebral.label" file in the same folder.
must_graph: Must-link graph file.
cannot_graph: Cannot-link graph file.
Important: The dataset, labels, must-link graph, and cannot-link graph files must be within the /data folder inside the project.
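Since a run needs four files in the same folder, it can help to check for them before calling main. The following is only a minimal sketch: `missing_inputs` is a hypothetical helper that is not part of this repository, and the demo uses a temporary directory in place of the project's /data folder.

```julia
# Hypothetical helper (not part of the repository): return the names of the
# required input files that are absent from `data_dir`.
function missing_inputs(data_dir::AbstractString, dataset::AbstractString,
                        must_graph::AbstractString, cannot_graph::AbstractString)
    # The labels file shares the dataset's base name, with the .label extension.
    labels = splitext(dataset)[1] * ".label"
    required = [dataset, labels, must_graph, cannot_graph]
    return [f for f in required if !isfile(joinpath(data_dir, f))]
end

# Demo: a temporary directory stands in for the /data folder.
demo_dir = mktempdir()
touch(joinpath(demo_dir, "vertebral.data"))
touch(joinpath(demo_dir, "vertebral-must.link"))

absent = missing_inputs(demo_dir, "vertebral.data",
                        "vertebral-must.link", "vertebral-cannot.link")
println("Missing files: ", absent)
```

An empty result means all four files are in place.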
Dataset files. The dataset file has N rows and D columns, where N is the number of data samples and D is the number of features. Each line contains the values of the D features of one data sample, where xij corresponds to the j-th feature of the i-th sample. Feature values are separated by a single space, as depicted in the scheme below (with n = N and d = D):
x11 | x12 | x13 | ... | x1d |
---|---|---|---|---|
x21 | x22 | x23 | ... | x2d |
... | ... | ... | ... | ... |
xn1 | xn2 | xn3 | ... | xnd |
Important: The dataset files must have the .data extension.
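As an illustration of this layout, the sketch below writes a small matrix as a space-separated .data file and reads it back using Julia's DelimitedFiles standard library. The file name toy.data and the numeric values are made up for the example, and the file goes to a temporary directory, not the project's /data folder.

```julia
using DelimitedFiles  # Julia standard library

# Toy data (illustrative values only): N = 4 samples, D = 3 features.
X = [5.1 3.5 1.4;
     4.9 3.0 1.4;
     6.2 2.9 4.3;
     5.9 3.0 5.1]

# One sample per line, feature values separated by a single space.
path = joinpath(mktempdir(), "toy.data")
writedlm(path, X, ' ')

# Read the file back and recover N and D from its shape.
X_read = readdlm(path, ' ', Float64)
N, D = size(X_read)
println("N = ", N, ", D = ", D)
```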
Graph files. A graph file (must-link or cannot-link) has m rows and 3 columns, where m is the number of connections (links) in the graph. The first two columns represent the two data samples of an edge, whereas the third column represents the edge weight. The scheme below describes a graph file, where si and ti are two connected samples, and wi is the corresponding edge weight:
s1 | t1 | w1 |
---|---|---|
s2 | t2 | w2 |
... | ... | ... |
sm | tm | wm |
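A graph file with this three-column layout could be produced as sketched below. Several details here are assumptions, not statements about the repository: `write_links` is a hypothetical helper, the columns are assumed to be space-separated like the dataset files, and the sample indices and weights are placeholder values. Check the graph files shipped in the /data folder for the actual index base and weight convention.

```julia
# Hypothetical helper (not part of the repository): write one "s t w" line
# per link. A single space as column separator is an assumption.
function write_links(path::AbstractString, links)
    open(path, "w") do io
        for (s, t, w) in links
            println(io, s, " ", t, " ", w)
        end
    end
end

# Placeholder links: (first sample, second sample, edge weight).
must_links = [(1, 2, 1.0), (3, 4, 0.5)]

path = joinpath(mktempdir(), "toy-must.link")
write_links(path, must_links)

# Read the file back as m rows of 3 columns each.
rows = [split(line) for line in eachline(path)]
println("m = ", length(rows), " links")
```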
Labels files. A labels file lists the ground-truth cluster of each sample of the dataset, one label per line, where yi corresponds to the label of the i-th sample:
y1
y2
...
yn
Important: The labels files must have the .label extension.
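Because each label refers to the sample on the same line of the dataset file, the two files should have the same number of lines. The sketch below writes a toy pair of files to a temporary directory and checks this. The file names, values, and the use of integer labels are illustrative assumptions.

```julia
using DelimitedFiles  # Julia standard library

dir = mktempdir()
data_path = joinpath(dir, "toy.data")
label_path = joinpath(dir, "toy.label")

# Toy dataset with 3 samples and 2 features (illustrative values only).
writedlm(data_path, [0.1 0.2; 0.3 0.4; 0.9 0.8], ' ')

# One ground-truth label per line; integer labels are an assumption.
open(label_path, "w") do io
    for y in [1, 1, 2]
        println(io, y)
    end
end

# The number of labels should match the number of samples.
n_samples = countlines(data_path)
labels = [strip(l) for l in eachline(label_path) if !isempty(strip(l))]
consistent = length(labels) == n_samples
println("labels match samples: ", consistent)
```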