
# (WIP) Entity-Relationship (ER) Diagram

It is hard to stay up to date when so many files and results are produced week after week. In this file, we reflect on potential structures that underlie and represent the core decisions made along the experiment pipeline.

## 1. What components/entities participate in this process?

### Matrix

Metadata concerning the storage of the different data subsets.

An additional consideration when storing the matrices: these should be ordered by name.

| Attribute | Type | Description |
| --- | --- | --- |
| matrix_uuid | str | unique identifier, created from the filepath, matrix_type, and features (and, if possible, a hash of the file) |
| matrix_type | str | purpose of the matrix (e.g., train, validation, or test) |
| filepath | str | full filepath of the file containing the matrix |
| storage_type | str | storage type of the matrix (e.g., local, hdfs, aws) |
| read_classpath | str | classpath of the method used to read the matrix |
| read_parameters | dict | parameters used to read the file (e.g., for compressed files, or custom header/indexing processing) |
| num_rows | int | number of rows/instances/examples in the dataset |
| num_cols | int | number of columns in the dataset |
| name_cols | List[str] | ordered list of the column names |
| target_cols | List[str] | list of columns used as target columns |
| id_cols | List[str] | list of columns used as unique identifiers of the examples |
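To make the attribute list concrete, here is a minimal sketch of the Matrix record as a Python dataclass. The class name, field defaults, and example values are illustrative assumptions, not an existing implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Matrix:
    """Hypothetical sketch of one stored-matrix metadata record."""
    matrix_uuid: str       # derived from filepath, matrix_type, features
    matrix_type: str       # "train", "validation", or "test"
    filepath: str          # full path to the serialized matrix
    storage_type: str      # "local", "hdfs", "aws", ...
    read_classpath: str    # e.g. "pandas.read_csv"
    read_parameters: Dict[str, Any] = field(default_factory=dict)
    num_rows: int = 0
    num_cols: int = 0
    name_cols: List[str] = field(default_factory=list)    # ordered column names
    target_cols: List[str] = field(default_factory=list)  # target columns
    id_cols: List[str] = field(default_factory=list)      # example identifiers
```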

### Split

A composite metadata structure that builds on top of Matrix and represents a set of matrices intended to be evaluated together.

| Attribute | Type | Description |
| --- | --- | --- |
| split_uuid | str | unique identifier, created from the constituent matrices' unique identifiers (matrix_uuid) |
| train_matrix_uuid | str | foreign key to the training matrix |
| test_matrix_uuid | str | foreign key to the test matrix |
| validation_matrix_uuid | Optional[str] | foreign key to the validation matrix (in some cases we may use a simple two-way holdout split) |
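One way to make split_uuid deterministic is to hash the constituent matrix uuids, tagged by role so that swapping train and test does not collide. A minimal sketch, assuming sha256 is an acceptable hash (the helper name is hypothetical):

```python
import hashlib
from typing import Optional


def make_split_uuid(train_matrix_uuid: str,
                    test_matrix_uuid: str,
                    validation_matrix_uuid: Optional[str] = None) -> str:
    """Derive a stable split identifier from the constituent matrix uuids."""
    parts = [f"train:{train_matrix_uuid}", f"test:{test_matrix_uuid}"]
    if validation_matrix_uuid is not None:
        parts.append(f"validation:{validation_matrix_uuid}")
    # Role prefixes keep a train/test swap from producing the same uuid.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```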

### Model config

| Attribute | Type | Description |
| --- | --- | --- |
| model_config_uuid | str | unique identifier, created from the model_classpath and model_hyperparameters |
| model_classpath | str | model classpath (e.g., sklearn.tree.DecisionTreeClassifier) |
| model_hyperparameters | Dict[str, Any] | the set of hyperparameters of this model |
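The model_classpath plus model_hyperparameters pair implies dynamic loading. A hedged sketch of how a config record could be turned into a live estimator, assuming scikit-learn is installed (the helper is an assumption, not part of the schema):

```python
import importlib
from typing import Any, Dict


def instantiate_from_classpath(classpath: str,
                               hyperparameters: Dict[str, Any]) -> Any:
    """Resolve a dotted classpath and instantiate it with stored hyperparameters."""
    module_path, class_name = classpath.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**hyperparameters)


# Example: rebuild the estimator described by a model config record.
model = instantiate_from_classpath(
    "sklearn.tree.DecisionTreeClassifier", {"max_depth": 5}
)
```

The same pattern would serve read_classpath on Matrix, which is one argument for storing classpaths as plain dotted strings.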

### Model

| Attribute | Type | Description |
| --- | --- | --- |
| model_uuid | str | unique identifier, created from the model_config_uuid and the matrix_uuid of the matrix the model was trained on |
| model_config_uuid | str | unique identifier of the model config this model originated from |
| matrix_uuid | str | unique identifier of the matrix the model was trained on |
| model_filepath | str | full filepath of the model's pickle |
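Combining the two foreign keys into model_uuid and pickling the fitted model could look like the sketch below; the file layout and helper names are assumptions:

```python
import hashlib
import os
import pickle


def make_model_uuid(model_config_uuid: str, matrix_uuid: str) -> str:
    """Model identity = config identity + training-matrix identity."""
    return hashlib.sha256(
        f"{model_config_uuid}|{matrix_uuid}".encode("utf-8")
    ).hexdigest()


def persist_model(model, model_uuid: str, base_dir: str = "models") -> str:
    """Pickle the fitted model and return the model_filepath to record."""
    os.makedirs(base_dir, exist_ok=True)
    model_filepath = os.path.join(base_dir, f"{model_uuid}.pkl")
    with open(model_filepath, "wb") as f:
        pickle.dump(model, f)
    return model_filepath
```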

### Predictions

| Attribute | Type | Description |
| --- | --- | --- |
| predictions_uuid | str | unique identifier, a hash of the predictions |
| predictions_filepath | str | filepath of the file containing the predictions |
| model_uuid | str | unique identifier of the model that produced these predictions |
| matrix_uuid | str | unique identifier of the matrix these predictions concern |
| columns | List[str] | list of the prediction columns |
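Since predictions_uuid is described as a hash of the predictions, one natural option is to content-address the predictions file itself. A sketch assuming a streaming sha256 over the file bytes (the helper name is hypothetical):

```python
import hashlib


def make_predictions_uuid(predictions_filepath: str,
                          chunk_size: int = 1 << 20) -> str:
    """Hash the predictions file so identical outputs get identical uuids."""
    digest = hashlib.sha256()
    with open(predictions_filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```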

### Evaluations

| Attribute | Type | Description |
| --- | --- | --- |
| eval_uuid | str | unique identifier |

### Dataset

| Attribute | Type | Description |
| --- | --- | --- |
| dataset_uuid | str | unique identifier representing the dataset |
| dataset_type | str | type of the dataset (e.g., question-answering, qa-ex) |
| features | List[str] | list of the features that comprise this dataset |
| target | List[str] | list of the target values of the dataset |
| preprocessing_classpath | str | classpath of the dataset's preprocessor |
| preprocessing_hyperparameters | List[str] | hyperparameters used by the preprocessor |

### Experiment

| Attribute | Type | Description |
| --- | --- | --- |
| experiment_uuid | str | unique identifier, created from a hash of the experiment's configurations; ideally it should also incorporate some version of the code used to generate it |
| experiment_configs | Dict[str, Any] | all configurations (except user-specific ones) that characterize one experiment |
| task | str | task type (e.g., regression, classification, calibration) |
| description | str | description of the purpose of this experiment (e.g., evaluate calibrators on extractive QA datasets) |
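Hashing experiment configurations requires a canonical serialization, and the description above suggests excluding user-specific keys and mixing in a code version. A sketch under those assumptions (the excluded key names are hypothetical):

```python
import hashlib
import json
from typing import Any, Dict

# Hypothetical user-specific keys that should not affect experiment identity.
USER_SPECIFIC_KEYS = {"user", "output_dir", "credentials"}


def make_experiment_uuid(experiment_configs: Dict[str, Any],
                         code_version: str) -> str:
    """Hash the canonical JSON of shared configs plus a code version string."""
    shared = {k: v for k, v in experiment_configs.items()
              if k not in USER_SPECIFIC_KEYS}
    canonical = json.dumps(shared, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(
        f"{code_version}|{canonical}".encode("utf-8")
    ).hexdigest()
```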

## Questions to reflect on

- How does this schema handle calibration models?
- Encodings?
- Tokenizers?