# Cookiecutter template for the deployment of data processing pipelines (dpp)

## Table of Contents (ToC)

* [Overview](#overview)

## Overview

This Cookiecutter template repository quickstarts the deployment of data processing pipelines (dpp), sometimes also referred to as data transformation jobs, on cloud-based infrastructure. Typically, data processing pipelines may be powered by Kubernetes (e.g., with managed services like AWS EKS) or by dedicated compute engines like Apache Spark. Such engines are in turn operated either on services offered by general-purpose cloud providers (e.g., AWS EMR for Spark and Flink) or on all-in-one platforms (e.g., Databricks).
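As a quick orientation, a project may be generated from this template with the standard Cookiecutter tooling. The sketch below uses the Cookiecutter Python API; the template URL and the context keys are placeholders, since the actual prompts depend on this repository's location and its `cookiecutter.json`.

```python
# Minimal sketch: generate a dpp project from this Cookiecutter template.
# The template URL and context keys below are hypothetical placeholders;
# check this repository's cookiecutter.json for the real prompts.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/<org>/cookiecutter-dpp",  # hypothetical template URL
    no_input=True,  # take all values from extra_context instead of prompting
    extra_context={
        "project_name": "my-dpp",           # hypothetical context key
        "programming_language": "python",   # hypothetical context key
    },
)
```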

Data processing jobs:

1. Take the shape of so-called data processing modules, which:
   - Capture the business knowledge in business-oriented source code, in any of the supported programming languages (i.e., Scala, Python, SQL and potentially R)
   - Deliver software artifacts to artifact repositories (e.g., AWS CodeArtifact)
   - Mainly rely on Apache Spark, but may alternatively use simpler engines (e.g., DuckDB or Polars), which may be deployed on Kubernetes (a minimal module sketch follows this list)
2. Are deployed on compute engines (e.g., Databricks for exploration and development purposes, AWS EMR for industrialization, Kubernetes for non-Spark or non-distributed workloads). The deployment relies on (Docker-like) containers, which:
   - Install the data processing software artifacts through the programming languages' native package managers (e.g., pip for Python and sbt for Scala)
   - Are deployed on Databricks, AWS EMR or AWS EKS
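To make the notion of a data processing module concrete, here is a minimal, hypothetical sketch of such a module in Python on Apache Spark. The paths, table layout and business rule are illustrative only; a module generated from this template would follow its own layout and conventions.

```python
# Minimal, hypothetical sketch of a data processing module relying on Apache Spark.
# The input/output paths and the business rule below are illustrative placeholders.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def enrich_bookings(bookings: DataFrame) -> DataFrame:
    """Business-oriented transformation: flag long-haul bookings."""
    return bookings.withColumn(
        "is_long_haul", F.col("distance_km") > F.lit(3000)
    )


def main() -> None:
    spark = SparkSession.builder.appName("dpp-example-module").getOrCreate()
    bookings = spark.read.parquet("s3://example-bucket/bookings/")  # hypothetical path
    enrich_bookings(bookings).write.mode("overwrite").parquet(
        "s3://example-bucket/bookings-enriched/"  # hypothetical path
    )
    spark.stop()


if __name__ == "__main__":
    main()
```

Packaged as a wheel and pushed to an artifact repository such as AWS CodeArtifact, a module like this can then be installed with pip inside the container image that the target compute engine runs.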

Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent those companies.