This Cookiecutter template repository quickstarts the deployment of data processing pipelines (DPP), sometimes also referred to as data transformation jobs, on cloud-based infrastructure. Data processing pipelines are typically powered by Kubernetes (e.g., through managed services like AWS EKS) or by dedicated compute engines like Apache Spark. Such engines are in turn operated either on managed services offered by general-purpose cloud providers (e.g., AWS EMR for Spark and Flink) or on all-in-one platforms (e.g., Databricks).
Data processing jobs:
- Take the shape of so-called data processing modules, which:
  - Capitalize the business knowledge through business-oriented software source code, in any of the supported programming languages (i.e., Scala, Python, SQL and potentially R)
  - Deliver software artifacts to artifact repositories (e.g., AWS CodeArtifact)
  - Mainly rely on Apache Spark, but may alternatively use simpler engines (e.g., DuckDB or Polars), which may be deployed on Kubernetes
- Are deployed on compute engines (e.g., Databricks for exploration and development purposes, AWS EMR for industrialization, Kubernetes for non-Spark or non-distributed workloads). The deployment relies on (Docker-like) containers, which:
  - Install the data processing software artifacts through the programming-language-native package systems (e.g., `pip` for Python and `sbt` for Scala)
  - Are deployed on Databricks, AWS EMR or AWS EKS
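As a minimal sketch of what such a data processing module may contain (the record type, field names and business rule below are hypothetical, not prescribed by this template), a business-oriented transformation can be expressed as a plain Python function; a real module would typically apply the same logic on a Spark, DuckDB or Polars DataFrame, but the standard library keeps the example self-contained:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Booking:
    """A single booking record, as ingested from an upstream source (hypothetical schema)."""
    origin: str
    destination: str
    passengers: int


def passengers_per_route(bookings: list[Booking]) -> dict[tuple[str, str], int]:
    """Business rule: aggregate passenger counts per (origin, destination) route.

    In a production module, this aggregation would usually be delegated to
    Spark (or DuckDB/Polars for lighter workloads); plain Python is used
    here only to keep the sketch runnable anywhere.
    """
    totals: dict[tuple[str, str], int] = {}
    for booking in bookings:
        route = (booking.origin, booking.destination)
        totals[route] = totals.get(route, 0) + booking.passengers
    return totals


if __name__ == "__main__":
    sample = [
        Booking("NCE", "JFK", 2),
        Booking("NCE", "JFK", 1),
        Booking("CDG", "SFO", 4),
    ]
    print(passengers_per_route(sample))
```

Such a function would then be packaged as a versioned artifact, published to the artifact repository (e.g., AWS CodeArtifact), and installed by the deployment containers through the language-native package system.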
Even though the members of the GitHub organization may be employed by some companies, they speak in a personal capacity and do not represent those companies.