
Dataset Spec Proposal

Felipe Olmos edited this page Apr 10, 2024 · 3 revisions

Dataset Interface Specification (Last Update: 2024-04-10)

Objectives

  • Provide an easier way to construct a multi-table input dataset.
  • Stay close to the interfaces of popular packages such as FeatureTools.
  • Serve as the pivot object for helper functions such as:
    • Sorting a multi-table dataset
    • Fine-grained and convenient access to samples

Proposal

The Dataset objects (for the moment, FileDataset and PandasDataset) implement a "Builder" pattern by means of an empty constructor and mutator methods. The mutator methods "fail early", so if all calls succeed the dataset should have only minor problems (e.g. dangling tables).

This proposal also provides:

  • Export/import functions to convert a dataset to and from Khiops DictionaryDomain objects.
  • A sort-by-key function.

Dataset Interface

  • Constructor:
    • PandasDataset() : Normal empty constructor.
    • FileDataset(header=True, sep="\t") : Construction options to specify the file format.
  • add_table(self, name, source, key=None, main_table=False):
    • Adds a table to the dataset
    • Parameters:
      • name: str. Name of the table.
      • source:
        • FileDataset: str path (or URL) of the table.
        • PandasDataset: pandas.DataFrame.
      • key: str or list of str, optional. Key column(s) of the table.
      • main_table: bool, default False. Whether the table is the main table of the dataset.
    • Fails if:
      • key is not contained in the column list.
      • main_table == True but key is None.
      • main_table == True but there is already a main table set.
    • Notes:
      • This method obtains the column types with the corresponding heuristics.
  • remove_table(self, name)
    • Removes a table from the dataset. Any relation containing this table is also removed.
    • Fails if: No table named name exists.
  • add_relation(self, parent_table_name, child_table_name, one_to_one=False)
    • Adds a relation to the dataset
    • Parameters:
      • parent_table_name: str. Name of the parent table.
      • child_table_name: str. Name of the child table.
      • one_to_one: bool, default False. Whether the relation is one-to-one.
    • Fails if:
      • No table named parent_table_name exists.
      • No table named child_table_name exists.
      • parent_table_name == child_table_name.
      • The key of either table is None.
      • The key of parent_table_name is not contained in that of child_table_name.
  • remove_relation(self, parent_table_name, child_table_name)
    • Removes a relation from the dataset
    • Parameters:
      • parent_table_name: str. Name of the parent table.
      • child_table_name: str. Name of the child table.
    • Fails if:
      • No table named parent_table_name exists.
      • No table named child_table_name exists.
      • No relation with that pair of tables exists.
  • add_external_relation(self, parent_table_name, foreign_key, child_table_name)
    • Adds an external relation
    • Parameters:
      • parent_table_name: str. Name of the parent table.
      • foreign_key: str or list of str. Column name(s) of the parent table matching the key of child_table_name.
      • child_table_name: str. Name of the child table.
    • Notes:
      • An external relation is always one-to-one
    • Fails if:
      • No table named parent_table_name exists.
      • No table named child_table_name exists.
      • parent_table_name == child_table_name.
      • The key of the child table is None.
      • foreign_key is not equal to the key of child_table_name.
  • remove_external_relation(self, parent_table_name, child_table_name)
    • Removes an external relation from the dataset
    • Parameters:
      • parent_table_name: str. Name of the parent table.
      • child_table_name: str. Name of the child table.
    • Fails if:
      • No table named parent_table_name exists.
      • No table named child_table_name exists.
      • No external relation with that pair of tables exists.
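
To make the "fail early" builder behaviour concrete, here is a minimal sketch of how add_table and add_relation could validate their inputs. It is not Khiops code: the class name SketchDataset is hypothetical, the table sources are assumed to be plain dicts mapping column names to value lists (instead of file paths or pandas.DataFrame objects), the column names are made up, and the choice of ValueError as the error type is also an assumption.

```python
class SketchDataset:
    """Toy stand-in for the proposed Dataset builder (hypothetical, not Khiops code)."""

    def __init__(self):
        self.tables = {}        # table name -> (column names, key columns)
        self.main_table = None  # name of the main table, if any
        self.relations = set()  # (parent table name, child table name) pairs

    def add_table(self, name, source, key=None, main_table=False):
        # Assumption: `source` is a dict mapping column names to value lists.
        columns = list(source)
        key = [key] if isinstance(key, str) else list(key or [])
        missing = [column for column in key if column not in columns]
        if missing:
            raise ValueError(f"key columns {missing} not found in table '{name}'")
        if main_table and not key:
            raise ValueError("the main table must have a key")
        if main_table and self.main_table is not None:
            raise ValueError(f"main table already set: '{self.main_table}'")
        # All checks passed: only now is the dataset mutated.
        self.tables[name] = (columns, key)
        if main_table:
            self.main_table = name

    def add_relation(self, parent_table_name, child_table_name):
        for table_name in (parent_table_name, child_table_name):
            if table_name not in self.tables:
                raise ValueError(f"no table named '{table_name}'")
        if parent_table_name == child_table_name:
            raise ValueError("a table cannot be related to itself")
        parent_key = self.tables[parent_table_name][1]
        child_key = self.tables[child_table_name][1]
        if not parent_key or not child_key:
            raise ValueError("both tables must have a key")
        if not set(parent_key) <= set(child_key):
            raise ValueError("the parent key must be contained in the child key")
        self.relations.add((parent_table_name, child_table_name))


# Build a two-table dataset step by step; each call validates before mutating.
ds = SketchDataset()
ds.add_table(
    "Accidents",
    {"AccidentId": [1, 2], "Light": ["Day", "Night"]},
    key="AccidentId",
    main_table=True,
)
ds.add_table(
    "Vehicles",
    {"AccidentId": [1, 1], "VehicleId": ["A", "B"]},
    key=["AccidentId", "VehicleId"],
)
ds.add_relation("Accidents", "Vehicles")
```

Because every check runs before any state is modified, a failed call leaves the dataset exactly as it was, which is what lets a caller trust a dataset whose construction calls all succeeded.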

Helper Functions

  • sort_dataset(ds, engine="native", **kwargs)

    • Sorts each of the tables of the dataset by their keys.
    • Parameters:
      • ds: FileDataset or PandasDataset
      • engine: str
        • native: Uses the native sorting engine of the dataset type (Khiops for FileDataset, pandas.DataFrame.sort_values for PandasDataset). This is the default.
        • khiops: Uses Khiops as the sorting engine.
      • kwargs: Additional parameters passed to pandas.DataFrame.sort_values
    • Returns: Another Dataset instance with the sorted tables.
    • Fails if:
      • There is a table with no key.
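
    The sorting contract above (sort every table by its key, return a new object, fail on a table without a key) can be sketched as follows. This is only an illustration under stated assumptions: the function name sort_tables_by_key is hypothetical, and tables are represented as lists of row dicts rather than files or pandas.DataFrame objects. For a PandasDataset the per-table step would presumably be a call to pandas.DataFrame.sort_values(by=key).

```python
from operator import itemgetter


def sort_tables_by_key(tables):
    """Return a new dict of tables with each table's rows sorted by its key.

    Assumption: `tables` maps a table name to a (rows, key) pair, where
    `rows` is a list of dicts and `key` is a list of column names.
    """
    sorted_tables = {}
    for name, (rows, key) in tables.items():
        if not key:
            raise ValueError(f"table '{name}' has no key")
        # sorted() builds a new list, so the input tables are left untouched.
        sorted_tables[name] = (sorted(rows, key=itemgetter(*key)), key)
    return sorted_tables


tables = {
    "Vehicles": (
        [
            {"AccidentId": 2, "VehicleId": "A"},
            {"AccidentId": 1, "VehicleId": "B"},
            {"AccidentId": 1, "VehicleId": "A"},
        ],
        ["AccidentId", "VehicleId"],
    )
}
sorted_tables = sort_tables_by_key(tables)
```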
  • create_khiops_dictionary_domain(ds, override_types=None)

    • Creates a DictionaryDomain instance representing the schema of the dataset
    • Parameters:
      • ds: FileDataset or PandasDataset. The input dataset object.
      • override_types: dict, optional. A dictionary whose keys are table names and whose values are dicts mapping column names to Khiops types.
        • Ex: { "Tweets": {"Body": "Text"} }
    • Returns:
      • A DictionaryDomain instance with the schema for the dataset.
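
    One way the override_types parameter could be combined with the heuristically inferred types is sketched below. The helper name apply_type_overrides, the inferred column names and types, and the decision to raise ValueError on unknown table or column names are all assumptions made for illustration; the proposal does not specify them.

```python
def apply_type_overrides(inferred_types, override_types=None):
    """Return a copy of `inferred_types` with the requested overrides applied.

    Both arguments use the shape described for override_types:
    {table name: {column name: Khiops type}}.
    """
    # Copy the nested dicts so that the inferred types are not modified.
    resolved = {table: dict(columns) for table, columns in inferred_types.items()}
    for table, columns in (override_types or {}).items():
        if table not in resolved:
            raise ValueError(f"no table named '{table}'")
        for column, khiops_type in columns.items():
            if column not in resolved[table]:
                raise ValueError(f"no column named '{column}' in table '{table}'")
            resolved[table][column] = khiops_type
    return resolved


# Hypothetical result of the type heuristics for a "Tweets" table.
inferred = {"Tweets": {"Id": "Categorical", "Body": "Categorical", "Likes": "Numerical"}}
resolved = apply_type_overrides(inferred, {"Tweets": {"Body": "Text"}})
```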
  • create_dataset(dictionary_file_path_or_domain, data_table_path, additional_data_tables)

    • Creates a FileDataset instance from a DictionaryDomain and the paths for its tables.
    • Notes:
      • It strips any derivation rules from the Khiops dictionaries.
      • Not sure if it is worth it.
    • Returns: A FileDataset instance.
  • Versions of the core.api functions using FileDataset

    • The FileDataset replaces the following parameters:
      • dictionary_file_path_or_domain (when it describes the input data)
      • dictionary_name
      • data_table_path
      • additional_data_tables
      • header_line
      • field_separator
      • output_additional_data_tables (see below)
    • Examples:
      • train_predictor_ds(ds, target_variable, results_dir, ...)
        • Returns: Same as train_predictor
      • deploy_model_ds(dictionary_file_path_or_domain, ds, results_dir, ...)
        • Returns: An output FileDataset with its files stored in results_dir
  • get_sample_dataset(name, type="pandas")

    • Gets a dataset for one of the Khiops samples
    • Parameters:
      • name: str. Name of the dataset.
      • type: str. Type of the dataset to construct. Either pandas or file.
    • Returns:
      • FileDataset if type="file"
      • PandasDataset if type="pandas"
    • Example:
      • get_sample_dataset("Accidents", type="file")
    • Notes:
      • It is mainly intended to simplify samples and tutorials.
      • It downloads the dataset if not stored locally.