
Using mkdocs to serve docs from our markdown files #62

Merged 5 commits on Sep 25, 2024
Changes from 4 commits
32 changes: 32 additions & 0 deletions .github/workflows/docs-deploy.yml
@@ -0,0 +1,32 @@
name: Deploy MkDocs to GitHub Pages

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: false

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install Dependencies
        run: |
          pip install mkdocs-material

      - name: Build Site
        run: mkdocs build --clean

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site
6 changes: 3 additions & 3 deletions README.md
@@ -90,7 +90,7 @@ else:
        print(error)
```

For a more in-depth explanation of usage, visit [the usage documentation](/docs/USAGE.md).
For a more in-depth explanation of usage, visit [the usage documentation](https://statisticsnorway.github.io/microdata-tools/usage.md).

### Data format description
A dataset as defined in microdata consists of one data file, and one metadata file.
@@ -102,6 +102,6 @@ The data file is a csv file seperated by semicolons. A valid example would be:
000000000000003;123;2020-01-01;2020-12-31;
000000000000004;123;2020-01-01;2020-12-31;
```
Read more about the data format and columns in [the documentation](/docs).
Read more about the data format and columns in [the documentation](https://statisticsnorway.github.io/microdata-tools/).

The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](/microdata_tools/validation/model/metadata.py), [the examples](/docs/examples), and [the documentation](/docs).
The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](/microdata_tools/validation/model/metadata.py), [the examples](/docs/examples), and [the metadata model](https://statisticsnorway.github.io/microdata-tools/metadata-model/).
19 changes: 10 additions & 9 deletions docs/USAGE.md
@@ -1,10 +1,5 @@
# USAGE

* [Get started](/docs/USAGE.md#get-started)
* [Validate dataset](/docs/USAGE.md#validate-dataset)
* [Validate metadata only](/docs/USAGE.md#validate-metadata)
* [Use metadata references](/docs/USAGE.md#use-metadata-references)

## Get started

Install microdata-tools through pip:
@@ -37,6 +32,7 @@ my-input-directory/


Import microdata-tools in your script and validate your files:

```py
from microdata_tools import validate_dataset

@@ -56,6 +52,7 @@ The input directory is set to the directory of the script by default.
If you wish to use a different directory, you can use the ```input_directory```-parameter:

```py

from microdata_tools import validate_dataset

validation_errors = validate_dataset(
@@ -67,11 +64,13 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

The validate function will temporarily generate some files in order to validate your dataset. To do this, it will create a working directory in the same location as your script and delete it once it is done. Therefore, it is important that you have write permissions in your directory. You can also choose to define the location of this directory yourself using the ```working_directory```-parameter. If you choose to do this, the validate function will only delete the files it generates.


```py

from microdata_tools import validate_dataset

validation_errors = validate_dataset(
@@ -84,8 +83,10 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

If you wish to keep the temporary files after the validation has run, you can do this with the ```keep_temporary_files```-parameter:

```py
from microdata_tools import validate_dataset

@@ -100,7 +101,7 @@ if not validation_errors:
print("My dataset is valid")
else:
print("Dataset is invalid :(")
```

## Validate metadata
What if your data is not yet ready, but you want to start generating and validating your metadata? Keep your files in the same directory structure as described above, minus the csv file.
@@ -117,5 +118,5 @@ if not validation_errors:
print("Metadata looks good")
else:
print("Invalid metadata :(")
```
This will only check that all required fields are present and that the metadata follows the correct structure. Since it does not have the data file, it cannot perform the more complex validations, but it can still be a helpful way to discover errors early.
Binary file added docs/assets/favicon.ico
Binary file not shown.
Binary file added docs/assets/microdata.png
106 changes: 106 additions & 0 deletions docs/index.md
@@ -0,0 +1,106 @@
# microdata-tools
Tools for the [microdata.no](https://www.microdata.no/) platform

## Installation
`microdata-tools` can be installed from PyPI using pip:
```
pip install microdata-tools
```

## Usage
Once you have your metadata and data files ready to go, they should be named and stored like this:
```
my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json
```
The CSV file is optional in some cases, for example when only the metadata is being validated or packaged.

### Package dataset
The `package_dataset()` function will encrypt and package your dataset as a tar archive. The process is as follows:

1. Generate the symmetric key for the dataset.
2. Encrypt the dataset's data file (CSV) using the symmetric key and store the encrypted file as `<DATASET_NAME>.csv.encr`.
3. Encrypt the symmetric key using the asymmetric RSA public key `microdata_public_key.pem` and store the encrypted file as `<DATASET_NAME>.symkey.encr`.
4. Gather the encrypted CSV, the encrypted symmetric key and the metadata (JSON) file in one tar file.

### Unpackage dataset
The `unpackage_dataset()` function will untar and decrypt your dataset using the `microdata_private_key.pem`
RSA private key.

The packaged file has to have the `<DATASET_NAME>.tar` extension. Its contents should be as follows:

```<DATASET_NAME>.json``` : Required metadata file.

```<DATASET_NAME>.csv.encr``` : Optional encrypted dataset file.

```<DATASET_NAME>.symkey.encr``` : Optional encrypted file containing the symmetric key used to decrypt the dataset file. Required if the `.csv.encr` file is present.

Decryption uses the RSA private key located at ```RSA_KEY_DIR```.

The packaged file is then stored in `output_dir/archive/unpackaged` after a successful run or `output_dir/archive/failed` after an unsuccessful run.
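
For completeness, here is a minimal unpackaging sketch mirroring the packaging example below. The parameter names (`packaged_file_path`, `rsa_keys_dir`, `output_dir`) and file locations are assumptions, not confirmed by this page, so check the function's signature in the package before relying on them:

```py
from pathlib import Path
from microdata_tools import unpackage_dataset

# Assumed locations; adjust to your setup. The key directory must contain
# the microdata_private_key.pem used for decryption.
RSA_KEYS_DIRECTORY = Path("tests/resources/rsa_keys")
PACKAGED_FILE = Path("tests/resources/input_unpackage/DATASET_1.tar")
OUTPUT_DIRECTORY = Path("tests/resources/output")

# Parameter names below are assumptions based on package_dataset above.
unpackage_dataset(
    packaged_file_path=PACKAGED_FILE,
    rsa_keys_dir=RSA_KEYS_DIRECTORY,
    output_dir=OUTPUT_DIRECTORY,
)
```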

## Example
Python script that uses an RSA public key named `microdata_public_key.pem` and packages a dataset:

```py
from pathlib import Path
from microdata_tools import package_dataset

RSA_KEYS_DIRECTORY = Path("tests/resources/rsa_keys")
DATASET_DIRECTORY = Path("tests/resources/input_package/DATASET_1")
OUTPUT_DIRECTORY = Path("tests/resources/output")

package_dataset(
    rsa_keys_dir=RSA_KEYS_DIRECTORY,
    dataset_dir=DATASET_DIRECTORY,
    output_dir=OUTPUT_DIRECTORY,
)
```

### Validation

Once you have your metadata and data files ready to go, they should be named and stored like this:
```
my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json
```
Note that the filename may only contain upper-case letters A-Z, numbers 0-9 and underscores.

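As a quick sanity check before validating, a small standalone sketch (not part of microdata-tools) that tests a name against this rule could look like this:

```py
import re

# Hypothetical helper, not part of microdata-tools:
# allows only upper-case letters A-Z, digits 0-9 and underscores.
def is_valid_dataset_name(name: str) -> bool:
    return re.fullmatch(r"[A-Z0-9_]+", name) is not None

print(is_valid_dataset_name("MY_DATASET_NAME"))  # True
print(is_valid_dataset_name("my-dataset"))       # False
```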

Import microdata-tools in your script and validate your files:
```py
from microdata_tools import validate_dataset

validation_errors = validate_dataset(
"MY_DATASET_NAME",
input_directory="path/to/my-input-directory"
)

if not validation_errors:
    print("My dataset is valid")
else:
    print("Dataset is invalid :(")
    # You can print your errors like this:
    for error in validation_errors:
        print(error)
```

For a more in-depth explanation of usage, visit [the usage documentation](/microdata-tools/USAGE).

### Data format description
A dataset as defined in microdata consists of one data file, and one metadata file.

The data file is a csv file separated by semicolons. A valid example would be:
```csv
000000000000001;123;2020-01-01;2020-12-31;
000000000000002;123;2020-01-01;2020-12-31;
000000000000003;123;2020-01-01;2020-12-31;
000000000000004;123;2020-01-01;2020-12-31;
```

The metadata files should be in JSON format. The requirements for the metadata are best described through the [Pydantic model](https://github.com/statisticsnorway/microdata-tools/blob/main/microdata_tools/validation/model/metadata.py) and [the examples](https://github.com/statisticsnorway/microdata-tools/tree/main/docs/examples).
18 changes: 13 additions & 5 deletions docs/README.md → docs/metadata-model.md
@@ -4,27 +4,33 @@ In addition to the examples of metadata json files present in this repository, t

### ROOT LEVEL FIELDS
These fields describe the dataset as a whole.

* **temporalityType**: The temporality type of the dataset. Must be one of FIXED, ACCUMULATED, STATUS or EVENT.
* **sensitivityLevel**: The sensitivity of the data in the dataset. Must be one of: PERSON_GENERAL, PERSON_SPECIAL, PUBLIC or NONPUBLIC.
    * PERSON_GENERAL: general personal data; this category applies to information that is generally handled without further notification and is not especially sensitive. An email address is an example.
    * PERSON_SPECIAL: special category of personal data; this is a category of data that is more sensitive. Health information is an example.
    * PUBLIC: data that is publicly available
    * NONPUBLIC: data that is not publicly available

* **spatialCoverageDescription**: The geographic area relevant to the data.
* **populationDescription**: Description of the dataset's population.


### DATAREVISION
These fields describe the current version of the dataset.

* **description**: Description of this version of the dataset.
* **temporalEnd (Optional)**: Description of why this dataset will not be updated anymore. Successor datasets can be optionally specified.

### IDENTIFIER VARIABLES
Description of the identifier column of the dataset. It is represented as a list in the metadata model, but currently only one identifier is allowed per dataset. The identifiers are always based on a unit. A unit is centrally defined to make joining datasets across datastores easy.

* **unitType**: The unitType for this dataset identifier column. Must be one of: FAMILIE, FORETAK, HUSHOLDNING, JOBB, KJORETOY, KOMMUNE, KURS, PERSON or VIRKSOMHET.

### MEASURE VARIABLES
Description of the measure column of the dataset. It is represented as a list in the metadata model, but currently only one measure is allowed per dataset.

* **name**: Human-readable name (label) of the measure column. This should be similar to your dataset name. Example for PERSON_INNTEKT.json: "Person inntekt".
* **description**: Description of the column contents. Example: "Skattepliktig og skattefritt utbytte i... "
* **dataType**: DataType for the values in the column. One of: ["STRING", "LONG", "DOUBLE", "DATE"]
@@ -35,13 +41,15 @@ Description of the measure column of the dataset. It is represented as a list in

### MEASURE VARIABLES (with unitType)
You might find that some of your datasets contain a unitType in the measure column as well. Let's say you have a dataset PERSON_MOR where the identifier column is a population of unitType "PERSON", and the measure column is a population of unitType "PERSON". The measure here represents the population's mothers. Then you may define it as such:

* **unitType**: The unitType for this dataset measure column. Must be one of: FAMILIE, FORETAK, HUSHOLDNING, JOBB, KJORETOY, KOMMUNE, KURS, PERSON or VIRKSOMHET.
* **name**: Human-readable name (label) of the measure column. This should be similar to your dataset name. Example for PERSON_MOR.json: "Person mor".
* **description**: Description of the column contents. Example: "Personens registrerte biologiske mor..."


### VALUE DOMAIN
Describes the value domain for the relevant variable, either through a codeList (enumerated value domain) or through a description of expected values (described value domain).

* **description**: A description of the domain. Example for the variable "BRUTTO_INNTEKT": "Alle positive tall".
* **measurementUnitDescription**: A description of the unit measured. Example: "Norske Kroner"
* **measurementType**: A machine-readable definition of the unit measured. One of: [CURRENCY, WEIGHT, LENGTH, HEIGHT, GEOGRAPHICAL]
11 changes: 11 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,11 @@
[data-md-color-scheme="mdata"] {
--md-primary-fg-color: #104050;
--md-accent-fg-color: #e94f35;

}

[data-md-color-scheme="slate"] {
--md-primary-fg-color: #104050;
--md-accent-fg-color: #e94f35;

}
53 changes: 53 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,53 @@
site_name: Microdata-tools
site_url: https://statisticsnorway.github.io/microdata-tools/
repo_url: https://github.com/statisticsnorway/microdata-tools

theme:
  name: material
  palette:
    - scheme: mdata
      toggle:
        icon: material/weather-night
        name: Switch to dark mode
    - scheme: slate
      toggle:
        icon: material/weather-sunny
        name: Switch to light mode
  font:
    text: Source Sans Pro
    code: Source Code Pro
  logo: assets/microdata.png
  favicon: assets/favicon.ico
  features:
    - navigation.instant
    - navigation.external
    - content.code.copy
    - content.code.select

pygments_style: default

extra_css:
  - stylesheets/extra.css

nav:
  - Getting Started: index.md
  - The Metadata model: metadata-model.md
  - Usage: USAGE.md
  - Report an Issue:
      - Issue template EN: issue_templates/issue_template_en.md
      - Issue template NO: issue_templates/issue_template_no.md
  - Releases: https://github.com/statisticsnorway/microdata-tools/releases

docs_dir: docs

plugins:
  - search

markdown_extensions:
  - pymdownx.highlight:
      anchor_linenums: true
      line_spans: __span
      pygments_lang_class: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences