
Using mkdocs to serve docs from our markdown files #62

Merged 5 commits on Sep 25, 2024
Changes from 4 commits
32 changes: 32 additions & 0 deletions .github/workflows/docs-deploy.yml
@@ -0,0 +1,32 @@
name: Deploy MkDocs to GitHub Pages

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: false

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install Dependencies
        run: |
          pip install mkdocs-material

      - name: Build Site
        run: mkdocs build --clean

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site
6 changes: 3 additions & 3 deletions README.md
@@ -90,7 +90,7 @@ else:
        print(error)
```

For a more in-depth explanation of usage, visit [the usage documentation](/docs/USAGE.md).
For a more in-depth explanation of usage, visit [the usage documentation](https://statisticsnorway.github.io/microdata-tools/usage.md).

### Data format description
A dataset as defined in microdata consists of one data file, and one metadata file.
@@ -102,6 +102,6 @@ The data file is a csv file seperated by semicolons. A valid example would be:
000000000000003;123;2020-01-01;2020-12-31;
000000000000004;123;2020-01-01;2020-12-31;
```
Read more about the data format and columns in [the documentation](/docs).
Read more about the data format and columns in [the documentation](https://statisticsnorway.github.io/microdata-tools/).

The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](/microdata_tools/validation/model/metadata.py), [the examples](/docs/examples), and [the documentation](/docs).
The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](/microdata_tools/validation/model/metadata.py), [the examples](/docs/examples), and [the metadata model](https://statisticsnorway.github.io/microdata-tools/metadata-model/).
19 changes: 10 additions & 9 deletions docs/USAGE.md
@@ -1,10 +1,5 @@
# USAGE

* [Get started](/docs/USAGE.md#get-started)
* [Validate dataset](/docs/USAGE.md#validate-dataset)
* [Validate metadata only](/docs/USAGE.md#validate-metadata)
* [Use metadata references](/docs/USAGE.md#use-metadata-references)

## Get started

Install microdata-tools through pip:
@@ -37,6 +32,7 @@ my-input-directory/


Import microdata-tools in your script and validate your files:

```py
from microdata_tools import validate_dataset

@@ -56,6 +52,7 @@ The input directory is set to the directory of the script by default.
If you wish to use a different directory, you can use the ```input_directory```-parameter:

```py

from microdata_tools import validate_dataset

validation_errors = validate_dataset(
@@ -67,11 +64,13 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

The validate function will temporarily generate some files in order to validate your dataset. To do this, it will create a working directory in the same location as your script and delete it once it is done. Therefore, it is important that you have write permissions in your directory. You can also choose to define the location of this directory yourself using the ```working_directory```-parameter. If you choose to do this, the validate function will only delete the files it generates.


```py

from microdata_tools import validate_dataset

validation_errors = validate_dataset(
@@ -84,8 +83,10 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

If you wish to keep the temporary files after the validation has run, you can do this with the ```keep_temporary_files```-parameter:

```py
from microdata_tools import validate_dataset

@@ -100,7 +101,7 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

## Validate metadata
What if your data is not yet ready, but you want to start generating and validating your metadata? Keep your files in the same directory structure as described above, minus the csv file.
@@ -117,5 +118,5 @@ if not validation_errors:
print("Metadata looks good")
else:
print("Invalid metadata :(")
```
This will only check that all required fields are present and that the metadata follows the correct structure. Since it does not have the data file, it cannot perform the more complex validations, but it can still be a helpful way to discover errors early.
Binary file added docs/assets/favicon.ico
Binary file not shown.
Binary file added docs/assets/microdata.png
106 changes: 106 additions & 0 deletions docs/index.md
@@ -0,0 +1,106 @@
# microdata-tools
Tools for the [microdata.no](https://www.microdata.no/) platform

## Installation
`microdata-tools` can be installed from PyPI using pip:
```
pip install microdata-tools
```

## Usage
Once you have your metadata and data files ready to go, they should be named and stored like this:
```
my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json
```
The CSV file is optional in some cases, for example when only the metadata is being validated or packaged.

### Package dataset
The `package_dataset()` function will encrypt and package your dataset as a tar archive. The process is as follows:

1. Generate the symmetric key for the dataset.
2. Encrypt the dataset's data file (CSV) using the symmetric key and store the encrypted file as `<DATASET_NAME>.csv.encr`.
3. Encrypt the symmetric key using the asymmetric RSA public key `microdata_public_key.pem` and store the encrypted file as `<DATASET_NAME>.symkey.encr`.
4. Gather the encrypted CSV, the encrypted symmetric key and the metadata (JSON) file in one tar file.

### Unpackage dataset
The `unpackage_dataset()` function will untar and decrypt your dataset using the `microdata_private_key.pem`
RSA private key.

The packaged file has to have the `<DATASET_NAME>.tar` extension. Its contents should be as follows:

```<DATASET_NAME>.json``` : Required metadata file.

```<DATASET_NAME>.csv.encr``` : Optional encrypted dataset file.

```<DATASET_NAME>.symkey.encr``` : Optional encrypted file containing the symmetric key used to decrypt the dataset file. Required if the `.csv.encr` file is present.

Decryption uses the RSA private key located at ```RSA_KEY_DIR```.

The packaged file is then stored in `output_dir/archive/unpackaged` after a successful run or `output_dir/archive/failed` after an unsuccessful run.
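
For completeness, here is a minimal unpackaging sketch mirroring the packaging example below. The parameter names (`packaged_file_path`, `rsa_keys_dir`, `output_dir`) and file locations are assumptions, not confirmed by this page, so check the function's signature in the package before relying on them:

```py
from pathlib import Path
from microdata_tools import unpackage_dataset

# Assumed locations; adjust to your setup. The key directory must contain
# the microdata_private_key.pem used for decryption.
RSA_KEYS_DIRECTORY = Path("tests/resources/rsa_keys")
PACKAGED_FILE = Path("tests/resources/input_unpackage/DATASET_1.tar")
OUTPUT_DIRECTORY = Path("tests/resources/output")

# Parameter names below are assumptions based on package_dataset above.
unpackage_dataset(
    packaged_file_path=PACKAGED_FILE,
    rsa_keys_dir=RSA_KEYS_DIRECTORY,
    output_dir=OUTPUT_DIRECTORY,
)
```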

## Example
Python script that uses an RSA public key named `microdata_public_key.pem` and packages a dataset:

```py
from pathlib import Path
from microdata_tools import package_dataset

RSA_KEYS_DIRECTORY = Path("tests/resources/rsa_keys")
DATASET_DIRECTORY = Path("tests/resources/input_package/DATASET_1")
OUTPUT_DIRECTORY = Path("tests/resources/output")

package_dataset(
    rsa_keys_dir=RSA_KEYS_DIRECTORY,
    dataset_dir=DATASET_DIRECTORY,
    output_dir=OUTPUT_DIRECTORY,
)
```

### Validation

Once you have your metadata and data files ready to go, they should be named and stored like this:
```
my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json
```
Note that the filename may only contain upper-case letters A-Z, numbers 0-9 and underscores.

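As a quick sanity check before validating, a small standalone sketch (not part of microdata-tools) that tests a name against this rule could look like this:

```py
import re

# Hypothetical helper, not part of microdata-tools:
# allows only upper-case letters A-Z, digits 0-9 and underscores.
def is_valid_dataset_name(name: str) -> bool:
    return re.fullmatch(r"[A-Z0-9_]+", name) is not None

print(is_valid_dataset_name("MY_DATASET_NAME"))  # True
print(is_valid_dataset_name("my-dataset"))       # False
```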

Import microdata-tools in your script and validate your files:
```py
from microdata_tools import validate_dataset

validation_errors = validate_dataset(
"MY_DATASET_NAME",
input_directory="path/to/my-input-directory"
)

if not validation_errors:
    print("My dataset is valid")
else:
    print("Dataset is invalid :(")
    # You can print your errors like this:
    for error in validation_errors:
        print(error)
```

For a more in-depth explanation of usage, visit [the usage documentation](/microdata-tools/USAGE).

### Data format description
A dataset as defined in microdata consists of one data file, and one metadata file.

The data file is a csv file separated by semicolons. A valid example would be:
```csv
000000000000001;123;2020-01-01;2020-12-31;
000000000000002;123;2020-01-01;2020-12-31;
000000000000003;123;2020-01-01;2020-12-31;
000000000000004;123;2020-01-01;2020-12-31;
```

The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](https://github.com/statisticsnorway/microdata-tools/blob/main/microdata_tools/validation/model/metadata.py) and [the examples](https://github.com/statisticsnorway/microdata-tools/tree/main/docs/examples).
18 changes: 13 additions & 5 deletions docs/README.md → docs/metadata-model.md
@@ -4,27 +4,33 @@ In addition to the examples of metadata json files present in this repository, t

### ROOT LEVEL FIELDS
These fields describe the dataset as a whole.

* **temporalityType**: The temporality type of the dataset. Must be one of FIXED, ACCUMULATED, STATUS or EVENT.
* **sensitivityLevel**: The sensitivity of the data in the dataset. Must be one of: PERSON_GENERAL, PERSON_SPECIAL, PUBLIC or NONPUBLIC.
    * PERSON_GENERAL: general personal data; this category applies to information that is generally handled without further notification and is not especially sensitive. An email address is an example.
    * PERSON_SPECIAL: special category of personal data; this is a category of data that is more sensitive. Health information is an example.
    * PUBLIC: data that is publicly available
    * NONPUBLIC: data that is not publicly available

* **spatialCoverageDescription**: The geographic area relevant to the data.
* **populationDescription**: Description of the dataset's population.


### DATAREVISION
These fields describe the current version of the dataset.

* **description**: Description of this version of the dataset.
* **temporalEnd (Optional)**: Description of why this dataset will not be updated anymore. Successor datasets can be optionally specified.

### IDENTIFIER VARIABLES
Description of the identifier column of the dataset. It is represented as a list in the metadata model, but currently only one identifier is allowed per dataset. The identifiers are always based on a unit. A unit is centrally defined to make joining datasets across datastores easy.

* **unitType**: The unitType for this dataset identifier column. Must be one of: FAMILIE, FORETAK, HUSHOLDNING, JOBB, KJORETOY, KOMMUNE, KURS, PERSON or VIRKSOMHET.

### MEASURE VARIABLES
Description of the measure column of the dataset. It is represented as a list in the metadata model, but currently only one measure is allowed per dataset.

* **name**: Human-readable name (label) of the measure column. This should be similar to your dataset name. Example for PERSON_INNTEKT.json: "Person inntekt".
* **description**: Description of the column contents. Example: "Skattepliktig og skattefritt utbytte i... "
* **dataType**: DataType for the values in the column. One of: ["STRING", "LONG", "DOUBLE", "DATE"]
@@ -35,13 +41,15 @@ Description of the measure column of the dataset. It is represented as a list in

### MEASURE VARIABLES (with unitType)
You might find that some of your datasets contain a unitType in the measure column as well. Let's say you have a dataset PERSON_MOR where the identifier column is a population of unitType "PERSON", and the measure column is a population of unitType "PERSON". The measure here represents the population's mothers. Then you may define it as such:

* **unitType**: The unitType for this dataset measure column. Must be one of: FAMILIE, FORETAK, HUSHOLDNING, JOBB, KJORETOY, KOMMUNE, KURS, PERSON or VIRKSOMHET.
* **name**: Human-readable name (label) of the measure column. This should be similar to your dataset name. Example for PERSON_MOR.json: "Person mor".
* **description**: Description of the column contents. Example: "Personens registrerte biologiske mor..."


### VALUE DOMAIN
Describes the value domain for the relevant variable, either through a codeList (enumerated value domain) or through a description of expected values (described value domain).

* **description**: A description of the domain. Example for the variable "BRUTTO_INNTEKT": "Alle positive tall".
* **measurementUnitDescription**: A description of the unit measured. Example: "Norske Kroner"
* **measurementType**: A machine-readable definition of the unit measured. One of: [CURRENCY, WEIGHT, LENGTH, HEIGHT, GEOGRAPHICAL]
11 changes: 11 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,11 @@
[data-md-color-scheme="mdata"] {
--md-primary-fg-color: #104050;
--md-accent-fg-color: #e94f35;

}

[data-md-color-scheme="slate"] {
--md-primary-fg-color: #104050;
--md-accent-fg-color: #e94f35;

}
53 changes: 53 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,53 @@
site_name: Microdata-tools
site_url: https://statisticsnorway.github.io/microdata-tools/
repo_url: https://github.com/statisticsnorway/microdata-tools

theme:
  name: material
  palette:
    - scheme: mdata
      toggle:
        icon: material/weather-night
        name: Switch to dark mode
    - scheme: slate
      toggle:
        icon: material/weather-sunny
        name: Switch to light mode
  font:
    text: Source Sans Pro
    code: Source Code Pro
  logo: assets/microdata.png
  favicon: assets/favicon.ico
  features:
    - navigation.instant
    - navigation.external
    - content.code.copy
    - content.code.select

pygments_style: default

extra_css:
  - stylesheets/extra.css

nav:
  - Getting Started: index.md
  - The Metadata model: metadata-model.md
  - Usage: USAGE.md
  - Report an Issue:
      - Issue template EN: issue_templates/issue_template_en.md
      - Issue template NO: issue_templates/issue_template_no.md
  - Releases: https://github.com/statisticsnorway/microdata-tools/releases

docs_dir: docs

plugins:
  - search

markdown_extensions:
  - pymdownx.highlight:
      anchor_linenums: true
      line_spans: __span
      pygments_lang_class: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences