Fix schema validation (#146)
* Fix schema validation

* Change default project name

* Add catalog docs

* Update READMEs to include Databricks docs
arpitjasa-db authored Feb 7, 2024
1 parent 3d93857 commit 27921e0
Showing 6 changed files with 36 additions and 8 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -6,7 +6,7 @@ This repo provides a customizable stack for starting new ML projects
on Databricks that follow production best-practices out of the box.

Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML resources
management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured.
management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured. More information can be found at https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html.

The default stack in this repo includes three modular components:

6 changes: 3 additions & 3 deletions databricks_template_schema.json
@@ -12,7 +12,7 @@
"input_project_name": {
"order": 2,
"type": "string",
"default": "my-mlops-project",
"default": "my_mlops_project",
"description": "\nProject Name. Default",
"pattern": "^[^ .\\\\/]{3,}$",
"pattern_match_failure_message": "Project name must be at least 3 characters long and cannot contain the following characters: \"\\\", \"/\", \" \" and \".\".",
@@ -131,7 +131,7 @@
"order": 11,
"type": "string",
"description": "\nWhether to use the Model Registry with Unity Catalog",
"default": "yes",
"default": "no",
"enum": ["yes", "no"],
"skip_prompt_if": {
"properties": {
@@ -145,7 +145,7 @@
"order": 12,
"type": "string",
"description": "\nName of schema to use when registering a model in Unity Catalog. \nNote that this schema must already exist, and we recommend keeping the name the same as the project name as well as giving the service principals the right access. Default",
"default": "{{ .input_project_name }}",
"default": "{{if (eq .input_include_models_in_unity_catalog `no`)}}schema{{else}}{{ .input_project_name }}{{end}}",
"pattern": "^[^ .\\-\\/]*$",
"pattern_match_failure_message": "Valid schema names cannot contain any of the following characters: \" \", \".\", \"-\", \"\\\", \"/\"",
"skip_prompt_if": {
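Side note on the two `pattern` fields above: the project-name pattern (`^[^ .\\/]{3,}$` once JSON escaping is resolved) allows hyphens, while the schema-name pattern (`^[^ .\-\/]*$`) does not. Because the default schema name echoes the project name, the old hyphenated default `my-mlops-project` could produce an invalid default schema name, which appears to be what this schema-validation fix addresses. Here is a minimal Python sketch, not part of the template, that checks both patterns with `re` under that reading:

```python
import re

# Patterns as they appear after JSON string escaping is resolved.
PROJECT_NAME_PATTERN = r"^[^ .\\/]{3,}$"  # forbids space, '.', '\', '/'; 3+ chars
SCHEMA_NAME_PATTERN = r"^[^ .\-\/]*$"     # additionally forbids '-'

def is_valid(name: str, pattern: str) -> bool:
    """True if the whole name matches the given template pattern."""
    return re.fullmatch(pattern, name) is not None

# The old default project name is a valid project name but an invalid schema name.
assert is_valid("my-mlops-project", PROJECT_NAME_PATTERN)
assert not is_valid("my-mlops-project", SCHEMA_NAME_PATTERN)

# The new default passes both checks, so the generated defaults validate cleanly.
assert is_valid("my_mlops_project", PROJECT_NAME_PATTERN)
assert is_valid("my_mlops_project", SCHEMA_NAME_PATTERN)
```

Relatedly, the schema-name default now falls back to the literal `schema` when Unity Catalog models are disabled, instead of always echoing the project name.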
1 change: 1 addition & 0 deletions template/{{.input_root_dir}}/README.md.tmpl
@@ -3,6 +3,7 @@
This directory contains an ML project based on the default
[Databricks MLOps Stacks](https://github.com/databricks/mlops-stacks),
defining a production-grade ML pipeline for automated retraining and batch inference of an ML model on tabular data.
The "Getting Started" docs can be found at {{ template `generate_doc_link` (map (pair "cloud" .input_cloud) (pair "path" "dev-tools/bundles/mlops-stacks.html")) }}.

See the full pipeline structure below. The [MLOps Stacks README](https://github.com/databricks/mlops-stacks/blob/main/Pipeline.md)
contains additional details on how ML pipelines are tested and deployed across each of the dev, staging, prod environments below.
7 changes: 4 additions & 3 deletions template/{{.input_root_dir}}/docs/mlops-setup.md.tmpl
@@ -82,9 +82,10 @@ For your convenience, we also have a [Terraform module](https://registry.terrafo
#### Configure Service Principal (SP) permissions
If the created project uses **Unity Catalog**, we expect a catalog to exist with the name of the deployment target by default.
For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace.
If you want to use different catalog names, please update the targets declared in the {{ if (eq .input_setup_cicd_and_project `CICD_and_Project`)}}[{{ .input_project_name }}/databricks.yml](../{{template `project_name_alphanumeric_underscore` .}}/databricks.yml)
and [{{ .input_project_name }}/resources/ml-artifacts-resource.yml](../{{template `project_name_alphanumeric_underscore` .}}/resources/ml-artifacts-resource.yml) {{ else }} `databricks.yml` and `resources/ml-artifacts-resource.yml` {{ end }} files.
If changing the staging, prod, or test deployment targets, you'll need to update the workflows located in the .github/workflows directory.
If you want to use different catalog names, please update the target names declared in the
{{- if (eq .input_setup_cicd_and_project `CICD_and_Project`)}}[{{ .input_project_name }}/databricks.yml](../{{template `project_name_alphanumeric_underscore` .}}/databricks.yml)
{{- else }} `databricks.yml` {{ end }} file.
If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory.

The SP must have the proper permissions in each respective environment and on the catalog for those environments.

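Since the setup docs above expect one catalog per deployment target (e.g. a catalog named `dev` for the dev target), a quick sanity check is to list the workspace's catalogs before deploying. The sketch below is illustrative only and assumes the `databricks-sdk` Python package with default authentication; the target names are the stack's defaults, not something this commit adds.

```python
from databricks.sdk import WorkspaceClient

# Assumes `pip install databricks-sdk` and credentials from the environment
# or ~/.databrickscfg (separate workspaces would need their own clients).
w = WorkspaceClient()

existing = {c.name for c in w.catalogs.list()}
for target in ("dev", "staging", "prod", "test"):  # illustrative target names
    if target in existing:
        print(f"catalog '{target}': found")
    else:
        print(f"catalog '{target}': missing - create it or rename the target in databricks.yml")
```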
@@ -4,6 +4,8 @@ If you're a data scientist just getting started with this repo for a brand new M
adapting the provided example code to your ML problem. Then making and
testing ML code changes on Databricks or your local machine.

The "Getting Started" docs can be found at {{ template `generate_doc_link` (map (pair "cloud" .input_cloud) (pair "path" "dev-tools/bundles/mlops-stacks.html")) }}.

## Table of contents
* [Code structure](#code-structure): structure of this project.
{{ if (eq .input_include_feature_store `yes`) }}
@@ -207,6 +209,30 @@ logic in `features` and run the feature engineering pipeline in the `GenerateAnd
* Python 3.8+
* Install feature engineering code and test dependencies via `pip install -I -r requirements.txt` from project root directory.
* The features transform code uses PySpark and brings up a local Spark instance for testing, so [Java (version 8 and later) is required](https://spark.apache.org/docs/latest/#downloading).
{{- if (eq .input_include_models_in_unity_catalog `yes`) }}
* Access to UC catalog and schema
We expect a catalog to exist with the name of the deployment target by default.
For example, if the deployment target is dev, we expect a catalog named dev to exist in the workspace.
If you want to use different catalog names, please update the target names declared in the [databricks.yml](./databricks.yml) file.
{{- if (eq .input_setup_cicd_and_project `CICD_and_Project`) }}
If changing the staging, prod, or test deployment targets, you'll also need to update the workflows located in the .github/workflows directory.
{{- end }}

For the ML training job, you must have permissions to read the input Delta table and to create experiments and models,
i.e. for each environment:
- USE_CATALOG
- USE_SCHEMA
- MODIFY
- CREATE_MODEL
- CREATE_TABLE

For the batch inference job, you must have permissions to read the input Delta table and to modify the output Delta table,
i.e. for each environment:
- USAGE permissions for the catalog and schema of the input and output table.
- SELECT permission for the input table.
- MODIFY permission for the output table if it pre-dates your job.
{{- end }}
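For reference, the underscore privilege names in the lists above correspond to Unity Catalog SQL grants (`USE CATALOG`, `USE SCHEMA`, `CREATE MODEL`, and so on). A hedged sketch of granting them from a notebook follows; the catalog, schema, table, and principal names are placeholders, and nothing like this is generated by the template:

```python
# Illustrative only: run in a Databricks notebook where `spark` is available.
# Replace the placeholders with your catalog, schema, tables, and the service
# principal's application ID.
principal = "`<sp-application-id>`"
statements = [
    f"GRANT USE CATALOG ON CATALOG dev TO {principal}",
    f"GRANT USE SCHEMA ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT CREATE MODEL ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT CREATE TABLE ON SCHEMA dev.my_mlops_project TO {principal}",
    f"GRANT SELECT ON TABLE dev.my_mlops_project.input_table TO {principal}",
    f"GRANT MODIFY ON TABLE dev.my_mlops_project.predictions TO {principal}",
]
for stmt in statements:
    spark.sql(stmt)
```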

#### Run unit tests
You can run unit tests for your ML code via `pytest tests`.

@@ -19,7 +19,7 @@ include:

# Deployment Target specific values for workspace
targets:
dev:
dev: {{ if (eq .input_include_models_in_unity_catalog `yes`)}} # UC Catalog Name {{ end }}
default: true
workspace:
# TODO: add dev workspace URL
