update main README; add datasets README

mlfoundations · Aug 10, 2023 · 3469e2d
1 parent 8213ae7

Showing 2 changed files with 54 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -77,23 +77,23 @@ A list of datasets, their names in TableShift, and the corresponding access
 levels are below. The string identifier is the value that should be passed as the `experiment` parameter
 to `get_dataset()` or the `--experiment` flag of `run_expt.py` and other training scripts.
 
-| Dataset                 | String Identifier         | Availability                                                                                       | Source                                                                                                                 |
-|-------------------------|---------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
-| Voting                  | `anes`                    | Data Use Agreement ([source](https://electionstudies.org))                                         | [American National Election Studies (ANES)](https://electionstudies.org)                                               |
-| ASSISTments             | `assistments`             | Public                                                                                             | [Kaggle](https://www.kaggle.com/datasets/nicolaswattiez/skillbuilder-data-2009-2010)                                   |
-| Childhood Lead          | `nhanes_lead`             | Public                                                                                             | [National Health and Nutrition Examination Survey (NHANES)](https://www.cdc.gov/nchs/nhanes/index.htm)                 |
-| College Scorecard       | `college_scorecard`       | Public                                                                                             | [College Scorecard](http://collegescorecard.ed.gov)                                                                    |
-| Diabetes                | `brfss_diabetes`          | Public                                                                                             | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |
-| Food Stamps             | `acsfoodstamps`           | Public                                                                                             | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org)      |
-| HELOC                   | `heloc`                   | Data Use Agreement ([source](https://community.fico.com/s/explainable-machine-learning-challenge)) | [FICO](https://community.fico.com/s/explainable-machine-learning-challenge)                                            |
-| Hospital Readmission    | `diabetes_readmission`    | Public                                                                                             | [UCI](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)                           |
-| Hypertension            | `brfss_blood_pressure`    | Public                                                                                             | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |
-| ICU Length of Stay      | `mimic_extract_los_3`     | Data Use Agreement ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | [MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |
-| ICU Mortality           | `mimic_extract_mort_hosp` | Data Use Agreement ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | [MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |
-| Income                  | `acsincome`               | Public                                                                                             | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org)      |
-| Public Health Insurance | `acspubcov`               | Public                                                                                             | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org)      |
-| Sepsis                  | `physionet`               | Public                                                                                             | [Physionet](https://physionet.org/content/challenge-2019/)                                                             |
-| Unemployment            | `acsunemployment`         | Public                                                                                             | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org)      |
+| Dataset                 | String Identifier         | Availability                                                                                                 | Source                                                                                                                 |
+|-------------------------|---------------------------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
+| Voting                  | `anes`                    | Public Credentialized Access ([source](https://electionstudies.org))                                         | [American National Election Studies (ANES)](https://electionstudies.org)                                               |
+| ASSISTments             | `assistments`             | Public                                                                                                       | [Kaggle](https://www.kaggle.com/datasets/nicolaswattiez/skillbuilder-data-2009-2010)                                   |
+| Childhood Lead          | `nhanes_lead`             | Public                                                                                                       | [National Health and Nutrition Examination Survey (NHANES)](https://www.cdc.gov/nchs/nhanes/index.htm)                 |
+| College Scorecard       | `college_scorecard`       | Public                                                                                                       | [College Scorecard](http://collegescorecard.ed.gov)                                                                    |
+| Diabetes                | `brfss_diabetes`          | Public                                                                                                       | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |
+| Food Stamps             | `acsfoodstamps`           | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |
+| HELOC                   | `heloc`                   | Public Credentialized Access ([source](https://community.fico.com/s/explainable-machine-learning-challenge)) | [FICO](https://community.fico.com/s/explainable-machine-learning-challenge)                                            |
+| Hospital Readmission    | `diabetes_readmission`    | Public                                                                                                       | [UCI](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)                           |
+| Hypertension            | `brfss_blood_pressure`    | Public                                                                                                       | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |
+| ICU Length of Stay      | `mimic_extract_los_3`     | Public Credentialized Access ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | [MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |
+| ICU Mortality           | `mimic_extract_mort_hosp` | Public Credentialized Access ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | [MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |
+| Income                  | `acsincome`               | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |
+| Public Health Insurance | `acspubcov`               | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |
+| Sepsis                  | `physionet`               | Public                                                                                                       | [Physionet](https://physionet.org/content/challenge-2019/)                                                             |
+| Unemployment            | `acsunemployment`         | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |
 
 Note that details on the data source, which files to load, and the feature
 codings are provided in the TableShift source code for each dataset and data

diff --git a/docs/datasets.md b/docs/datasets.md
@@ -0,0 +1,37 @@
+# Accessing Open Credentialized Datasets
+
+This page gives general instructions on using the different types of datasets in TableShift. In particular, it describes the process for configuring a dataset with open credentialized access so that it can be used with TableShift.
+
+*tl;dr: No action is required for public datasets. For credentialized datasets, follow the provided links [here](https://tableshift.org/datasets.html) to obtain access, and download the necessary file(s) to the TableShift cache (`tableshift/tmp` by default).*
+
+### Overview
+
+All TableShift benchmark datasets are available to anyone, but some require action on the user's part to obtain access. The TableShift benchmark contains two types of datasets: public datasets (no usage restrictions) and datasets with open credentialized access. Open credentialized access means that a dataset is available to anyone who can provide certain credentials to the dataset maintainers (for example, by signing a data use agreement or, in the case of sensitive human subjects data, completing the required free human subjects training).
+
+Before beginning experiments with a specific benchmark dataset, verify the access level of the dataset. This can be done by checking the paper, the table in our main README in this repo, or the TableShift website. *If a dataset is marked as "Public", no action is required and the TableShift Python API will fetch the data automatically the first time it is used.* (After the first usage, the data will be fetched from a local cache.)
+
+### Accessing an Open Credentialized Dataset
+
+The instructions here are for accessing open credentialized datasets. For instructions on how to access the data files for each individual dataset, check the [datasets](https://tableshift.org/datasets.html) page on the TableShift website. The links to any data use agreement(s) and the specific files used are described for each dataset on that page under "Availability & Access".
+
+To use an open credentialized dataset:
+1. **Credentialization:** Complete any credentialization required for the dataset (described on the TableShift [datasets](https://tableshift.org/datasets.html) page).
+2. **File Download:** Download the necessary file(s) to the TableShift cache directory. By default, this is located at `tableshift/tmp`, but you can provide another `cache_dir` to the TableShift dataset constructors. No preprocessing or renaming of the files is necessary.
+
+After completing these steps, the dataset should be ready for use in the TableShift benchmark!
+
+### Example: American National Election Studies (ANES)
+
+Here we give a brief example of how to set up an open credentialized dataset, using the American National Election Studies (ANES) data.
+
+1. **Credentialization:** As listed on the TableShift [datasets](https://tableshift.org/datasets.html) page and the README of this repo, accessing the ANES data requires registering on the ANES website. Create an account.
+2. **File Download:** Access the September 16, 2022 Time Series Cumulative Data File (click "Data Center" > "Time Series Cumulative Data File (1948-2020)" > CSV). Download this file and place it at `tableshift/tmp`.
+
+You can verify your installation by running the following in a Python terminal:
+
+```python
+from tableshift import get_dataset
+dset = get_dataset("anes")
+```
+
+To access any other open credentialized dataset in the benchmark, follow the same steps. Links to access datasets are on the [datasets](https://tableshift.org/datasets.html) page and the README at the root of this repo.
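
A note on the setup steps added in `docs/datasets.md`: the check "is this dataset ready to load?" can be sketched as a small pre-flight helper. This is only an illustration and is not part of the commit or of the TableShift API. The helper name `ready_to_load` and the `CREDENTIALIZED` set are hypothetical; the set is copied from the rows marked "Public Credentialized Access" in the README table above. The default cache path is the `tableshift/tmp` location named in the docs. Because the docs do not list exact file names, the sketch only checks that the cache directory contains at least one file.

```python
from pathlib import Path

# Hypothetical list: identifiers marked "Public Credentialized Access"
# in the README table of this commit.
CREDENTIALIZED = {
    "anes",
    "heloc",
    "mimic_extract_los_3",
    "mimic_extract_mort_hosp",
}


def ready_to_load(experiment: str, cache_dir: str = "tableshift/tmp") -> bool:
    """Rough pre-flight check before loading a dataset (illustrative only).

    Public datasets need no manual action, so they always pass.
    Credentialized datasets pass only if the cache directory exists and
    holds at least one file; the specific file is NOT verified.
    """
    if experiment not in CREDENTIALIZED:
        return True
    cache = Path(cache_dir)
    return cache.is_dir() and any(p.is_file() for p in cache.iterdir())
```

If the check passes for a credentialized identifier such as `"anes"`, the `get_dataset("anes")` call shown in the docs is the next step. A failing check suggests the download step has not been completed, or that a different `cache_dir` is in use.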