
data-engineering-helpers/dpp-tmpl


Cookiecutter template for the deployment of data processing pipelines (dpp)


Overview

This Cookiecutter template repository quickstarts the deployment of data processing pipelines (dpp), sometimes also referred to as data transformation jobs, on cloud-based infrastructure. Data processing pipelines are typically powered either by Kubernetes (e.g., through managed services like AWS EKS) or by dedicated compute engines like Apache Spark. Those engines are in turn operated either on services offered by general-purpose cloud providers (e.g., AWS EMR for Spark and Flink) or on all-in-one platforms (e.g., Databricks).
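Assuming the template is consumed with the standard `cookiecutter` command-line tool (which accepts GitHub repository URLs), bootstrapping a new pipeline project could look like the following sketch:

```shell
# Install Cookiecutter (any recent version should do)
pip install -U cookiecutter

# Generate a new data processing pipeline project from this template;
# Cookiecutter prompts interactively for the template variables
cookiecutter https://github.com/data-engineering-helpers/dpp-tmpl
```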

Data processing jobs:

  1. Take the shape of so-called data processing modules, which:
  • Capture the business knowledge in business-oriented software source code, in any of the supported programming languages (i.e., Scala, Python, SQL and potentially R)
  • Deliver software artifacts to artifact repositories (e.g., AWS CodeArtifact)
  • Mainly rely on Apache Spark, but may alternatively use simpler engines (e.g., DuckDB or Polars), which may be deployed on Kubernetes
  2. Are deployed on compute engines (e.g., Databricks for exploration and development purposes, AWS EMR for industrialization, Kubernetes for non-Spark or non-distributed workloads). The deployment relies on (Docker-like) containers, which:
    • Install the data processing software artifacts through the programming languages' native package systems (e.g., pip for Python and sbt for Scala)
    • Are deployed on Databricks, AWS EMR or AWS EKS
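As a hypothetical illustration of such a data processing module, the sketch below keeps the business rule engine-agnostic, so the same pip-installable package could later be wired to Spark, DuckDB or Polars. All names (`Booking`, `aggregate_by_city`) are invented for the example and are not part of the template:

```python
# Sketch of a data processing module: business knowledge lives in plain,
# engine-agnostic functions that can be packaged and shipped through pip
# to the target container (Databricks, AWS EMR or AWS EKS).
from dataclasses import dataclass


@dataclass
class Booking:
    """A single booking record (hypothetical business entity)."""
    city: str
    amount: float


def aggregate_by_city(bookings: list[Booking]) -> dict[str, float]:
    """Business rule: total booked amount per city."""
    totals: dict[str, float] = {}
    for booking in bookings:
        totals[booking.city] = totals.get(booking.city, 0.0) + booking.amount
    return totals


if __name__ == "__main__":
    sample = [Booking("NCE", 100.0), Booking("PAR", 50.0), Booking("NCE", 25.0)]
    print(aggregate_by_city(sample))
```

In an actual module generated from the template, a function like this would be the unit-tested core, while thin adapters translate to and from the chosen compute engine's data structures.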

Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
