Fix schema validation (#146)
* Fix schema validation

* Change default project name

* Add catalog docs

* Update READMEs to include Databricks docs
arpitjasa-db authored Feb 7, 2024
1 parent 3d93857 commit 27921e0
Showing 6 changed files with 36 additions and 8 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -6,7 +6,7 @@ This repo provides a customizable stack for starting new ML projects
on Databricks that follow production best-practices out of the box.

Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML resources
management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured.
management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured. More information can be found at https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html.

The default stack in this repo includes three modular components:

6 changes: 3 additions & 3 deletions databricks_template_schema.json
@@ -12,7 +12,7 @@
"input_project_name": {
"order": 2,
"type": "string",
"default": "my-mlops-project",
"default": "my_mlops_project",
"description": "\nProject Name. Default",
"pattern": "^[^ .\\\\/]{3,}$",
"pattern_match_failure_message": "Project name must be at least 3 characters long and cannot contain the following characters: \"\\\", \"/\", \" \" and \".\".",
@@ -131,7 +131,7 @@
"order": 11,
"type": "string",
"description": "\nWhether to use the Model Registry with Unity Catalog",
"default": "yes",
"default": "no",
"enum": ["yes", "no"],
"skip_prompt_if": {
"properties": {
@@ -145,7 +145,7 @@
"order": 12,
"type": "string",
"description": "\nName of schema to use when registering a model in Unity Catalog. \nNote that this schema must already exist, and we recommend keeping the name the same as the project name as well as giving the service principals the right access. Default",
"default": "{{ .input_project_name }}",
"default": "{{if (eq .input_include_models_in_unity_catalog `no`)}}schema{{else}}{{ .input_project_name }}{{end}}",
"pattern": "^[^ .\\-\\/]*$",
"pattern_match_failure_message": "Valid schema names cannot contain any of the following characters: \" \", \".\", \"-\", \"\\\", \"/\"",
"skip_prompt_if": {
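Side note on the two `pattern` fields above: the project-name pattern (`^[^ .\\/]{3,}$` once JSON escaping is resolved) allows hyphens, while the schema-name pattern (`^[^ .\-\/]*$`) does not. Because the default schema name echoes the project name, the old hyphenated default `my-mlops-project` could produce an invalid default schema name, which appears to be what this schema-validation fix addresses. Here is a minimal Python sketch, not part of the template, that checks both patterns with `re` under that reading:

```python
import re

# Patterns as they appear after JSON string escaping is resolved.
PROJECT_NAME_PATTERN = r"^[^ .\\/]{3,}$"  # forbids space, '.', '\', '/'; 3+ chars
SCHEMA_NAME_PATTERN = r"^[^ .\-\/]*$"     # additionally forbids '-'

def is_valid(name: str, pattern: str) -> bool:
    """True if the whole name matches the given template pattern."""
    return re.fullmatch(pattern, name) is not None

# The old default project name is a valid project name but an invalid schema name.
assert is_valid("my-mlops-project", PROJECT_NAME_PATTERN)
assert not is_valid("my-mlops-project", SCHEMA_NAME_PATTERN)

# The new default passes both checks, so the generated defaults validate cleanly.
assert is_valid("my_mlops_project", PROJECT_NAME_PATTERN)
assert is_valid("my_mlops_project", SCHEMA_NAME_PATTERN)
```

Relatedly, the schema-name default now falls back to the literal `schema` when Unity Catalog models are disabled, instead of always echoing the project name.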
1 change: 1 addition & 0 deletions template/{{.input_root_dir}}/README.md.tmpl
@@ -3,6 +3,7 @@
This directory contains an ML project based on the default
[Databricks MLOps Stacks](https://github.com/databricks/mlops-stacks),
defining a production-grade ML pipeline for automated retraining and batch inference of an ML model on tabular data.
The "Getting Started" docs can be found at {{ template `generate_doc_link` (map (pair "cloud" .input_cloud) (pair "path" "dev-tools/bundles/mlops-stacks.html")) }}.

See the full pipeline structure below. The [MLOps Stacks README](https://github.com/databricks/mlops-stacks/blob/main/Pipeline.md)
contains additional details on how ML pipelines are tested and deployed across each of the dev, staging, prod environments below.
7 changes: 4 additions & 3 deletions template/{{.input_root_dir}}/docs/mlops-setup.md.tmpl
@@ -82,9 +82,10 @@ For your convenience, we also have a [Terraform module](https://registry.terrafo
#### Configure Service Principal (SP) permissions
If the created project uses **Unity Catalog**, we expect a catalog to exist with the name of the deployment target by default.
For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace.
If you want to use different catalog names, please update the targets declared in the {{ if (eq .input_setup_cicd_and_project `CICD_and_Project`)}}[{{ .input_project_name }}/databricks.yml](../{{template `project_name_alphanumeric_underscore` .}}/databricks.yml)
and [{{ .input_project_name }}/resources/ml-artifacts-resource.yml](../{{template `project_name_alphanumeric_underscore` .}}/resources/ml-artifacts-resource.yml) {{ else }} `databricks.yml` and `resources/ml-artifacts-resource.yml` {{ end }} files.
If changing the staging, prod, or test deployment targets, you'll need to update the workflows located in the .github/workflows directory.
If you want to use different catalog names, please update the target names declared in the
{{- if (eq .input_setup_cicd_and_project `CICD_and_Project`)}}[{{ .input_project_name }}/databricks.yml](../{{template `project_name_alphanumeric_underscore` .}}/databricks.yml)
{{- else }} `databricks.yml` {{ end }} file.
If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory.

The SP must have the proper permissions in each respective environment and on the catalog for those environments.

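Since the setup docs above expect one catalog per deployment target (e.g. a catalog named `dev` for the dev target), a quick sanity check is to list the workspace's catalogs before deploying. The sketch below is illustrative only and assumes the `databricks-sdk` Python package with default authentication; the target names are the stack's defaults, not something this commit adds.

```python
from databricks.sdk import WorkspaceClient

# Assumes `pip install databricks-sdk` and credentials from the environment
# or ~/.databrickscfg (separate workspaces would need their own clients).
w = WorkspaceClient()

existing = {c.name for c in w.catalogs.list()}
for target in ("dev", "staging", "prod", "test"):  # illustrative target names
    if target in existing:
        print(f"catalog '{target}': found")
    else:
        print(f"catalog '{target}': missing - create it or rename the target in databricks.yml")
```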
@@ -4,6 +4,8 @@ If you're a data scientist just getting started with this repo for a brand new M
adapting the provided example code to your ML problem. Then making and
testing ML code changes on Databricks or your local machine.

The "Getting Started" docs can be found at {{ template `generate_doc_link` (map (pair "cloud" .input_cloud) (pair "path" "dev-tools/bundles/mlops-stacks.html")) }}.

## Table of contents
* [Code structure](#code-structure): structure of this project.
{{ if (eq .input_include_feature_store `yes`) }}
@@ -207,6 +209,30 @@ logic in `features` and run the feature engineering pipeline in the `GenerateAnd
* Python 3.8+
* Install feature engineering code and test dependencies via `pip install -I -r requirements.txt` from project root directory.
* The features transform code uses PySpark and brings up a local Spark instance for testing, so [Java (version 8 and later) is required](https://spark.apache.org/docs/latest/#downloading).
{{- if (eq .input_include_models_in_unity_catalog `yes`) }}
* Access to UC catalog and schema
We expect a catalog to exist with the name of the deployment target by default.
For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace.
If you want to use different catalog names, please update the target names declared in the [databricks.yml](./databricks.yml) file.
{{- if (eq .input_setup_cicd_and_project `CICD_and_Project`) }}
If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory.
{{- end }}

For the ML training job, you must have permissions to read the input Delta table and to create experiments and models,
i.e. for each environment:
- USE_CATALOG
- USE_SCHEMA
- MODIFY
- CREATE_MODEL
- CREATE_TABLE

For the batch inference job, you must have permissions to read the input Delta table and to modify the output Delta table,
i.e. for each environment:
- USAGE permissions for the catalog and schema of the input and output table.
- SELECT permission for the input table.
- MODIFY permission for the output table if it pre-dates your job.
{{- end }}
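For reference, the underscore privilege names in the lists above correspond to Unity Catalog SQL grants (`USE CATALOG`, `USE SCHEMA`, `CREATE MODEL`, and so on). A hedged sketch of granting them from a notebook follows; the catalog, schema, table, and principal names are placeholders, and nothing like this is generated by the template:

```python
# Illustrative only: run in a Databricks notebook where `spark` is available.
# Replace the placeholders with your catalog, schema, tables, and the service
# principal's application ID.
principal = "`<sp-application-id>`"
statements = [
    f"GRANT USE CATALOG ON CATALOG dev TO {principal}",
    f"GRANT USE SCHEMA ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT CREATE MODEL ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT CREATE TABLE ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT SELECT ON TABLE dev.my_mlops_project.input_table TO {principal}",
    f"GRANT MODIFY ON TABLE dev.my_mlops_project.predictions TO {principal}",
]
for stmt in statements:
    spark.sql(stmt)
```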

#### Run unit tests
You can run unit tests for your ML code via `pytest tests`.

@@ -19,7 +19,7 @@ include:

# Deployment Target specific values for workspace
targets:
dev:
dev: {{ if (eq .input_include_models_in_unity_catalog `yes`)}} # UC Catalog Name {{ end }}
default: true
workspace:
# TODO: add dev workspace URL
