# Cookiecutter template for the deployment of data processing pipelines (dpp)

## Table of Contents (ToC)

* [Overview](#overview)

## Overview

This Cookiecutter template repository quickstarts the deployment of data processing pipelines (dpp), sometimes also referred to as data transformation jobs, on cloud-based infrastructure. Typically, data processing pipelines may be powered by Kubernetes (e.g., with managed services like AWS EKS) or by dedicated compute engines like Apache Spark. Such engines are in turn operated either on services offered by general-purpose cloud providers (e.g., AWS EMR for Spark and Flink) or on all-in-one platforms (e.g., Databricks).
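As a quick orientation, a project may be generated from this template with the standard Cookiecutter tooling. The sketch below uses the Cookiecutter Python API; the template URL and the context keys are placeholders, since the actual prompts depend on this repository's location and its `cookiecutter.json`.

```python
# Minimal sketch: generate a dpp project from this Cookiecutter template.
# The template URL and context keys below are hypothetical placeholders;
# check this repository's cookiecutter.json for the real prompts.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/<org>/cookiecutter-dpp",  # hypothetical template URL
    no_input=True,  # take all values from extra_context instead of prompting
    extra_context={
        "project_name": "my-dpp",           # hypothetical context key
        "programming_language": "python",   # hypothetical context key
    },
)
```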

Data processing jobs:

1. Take the shape of so-called data processing modules, which:
   - Capture the business knowledge in business-oriented source code, in any of the supported programming languages (i.e., Scala, Python, SQL and potentially R)
   - Deliver software artifacts to artifact repositories (e.g., AWS CodeArtifact)
   - Mainly rely on Apache Spark, but may alternatively use simpler engines (e.g., DuckDB or Polars), which may be deployed on Kubernetes (a minimal module sketch follows this list)
2. Are deployed on compute engines (e.g., Databricks for exploration and development purposes, AWS EMR for industrialization, Kubernetes for non-Spark or non-distributed workloads). The deployment relies on (Docker-like) containers, which:
   - Install the data processing software artifacts through the programming languages' native package managers (e.g., pip for Python and sbt for Scala)
   - Are deployed on Databricks, AWS EMR or AWS EKS
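To make the notion of a data processing module concrete, here is a minimal, hypothetical sketch of such a module in Python on Apache Spark. The paths, table layout and business rule are illustrative only; a module generated from this template would follow its own layout and conventions.

```python
# Minimal, hypothetical sketch of a data processing module relying on Apache Spark.
# The input/output paths and the business rule below are illustrative placeholders.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def enrich_bookings(bookings: DataFrame) -> DataFrame:
    """Business-oriented transformation: flag long-haul bookings."""
    return bookings.withColumn(
        "is_long_haul", F.col("distance_km") > F.lit(3000)
    )


def main() -> None:
    spark = SparkSession.builder.appName("dpp-example-module").getOrCreate()
    bookings = spark.read.parquet("s3://example-bucket/bookings/")  # hypothetical path
    enrich_bookings(bookings).write.mode("overwrite").parquet(
        "s3://example-bucket/bookings-enriched/"  # hypothetical path
    )
    spark.stop()


if __name__ == "__main__":
    main()
```

Packaged as a wheel and pushed to an artifact repository such as AWS CodeArtifact, a module like this can then be installed with pip inside the container image that the target compute engine runs.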

Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent those companies.