This Cookiecutter template repository quickstarts the deployment of data processing pipelines (DPP), sometimes also referred to as data transformation jobs, on cloud-based infrastructure. Data processing pipelines are typically powered by Kubernetes (e.g., through managed services like AWS EKS) or by dedicated compute engines like Apache Spark. Such engines are in turn operated either on managed services offered by general-purpose cloud providers (e.g., AWS EMR for Spark and Flink) or on all-in-one platforms (e.g., Databricks).
Data processing jobs:
- Take the shape of so-called data processing modules, which:
  - Capitalize the business knowledge through business-oriented software source code, in any of the supported programming languages (i.e., Scala, Python, SQL and potentially R)
  - Deliver software artifacts to artifact repositories (e.g., AWS CodeArtifact)
  - Mainly rely on Apache Spark, but may alternatively use simpler engines (e.g., DuckDB or Polars), which may be deployed on Kubernetes
- Are deployed on compute engines (e.g., Databricks for exploration and development purposes, AWS EMR for industrialization, Kubernetes for non-Spark or non-distributed workloads). The deployment relies on (Docker-like) containers, which:
  - Install the data processing software artifacts through the programming-language-native package systems (e.g., `pip` for Python and `sbt` for Scala)
  - Are deployed on Databricks, AWS EMR or AWS EKS
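As a minimal sketch of what such a data processing module may contain (the record type, field names and business rule below are hypothetical, not prescribed by this template), a business-oriented transformation can be expressed as a plain Python function; a real module would typically apply the same logic on a Spark, DuckDB or Polars DataFrame, but the standard library keeps the example self-contained:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Booking:
    """A single booking record, as ingested from an upstream source (hypothetical schema)."""
    origin: str
    destination: str
    passengers: int


def passengers_per_route(bookings: list[Booking]) -> dict[tuple[str, str], int]:
    """Business rule: aggregate passenger counts per (origin, destination) route.

    In a production module, this aggregation would usually be delegated to
    Spark (or DuckDB/Polars for lighter workloads); plain Python is used
    here only to keep the sketch runnable anywhere.
    """
    totals: dict[tuple[str, str], int] = {}
    for booking in bookings:
        route = (booking.origin, booking.destination)
        totals[route] = totals.get(route, 0) + booking.passengers
    return totals


if __name__ == "__main__":
    sample = [
        Booking("NCE", "JFK", 2),
        Booking("NCE", "JFK", 1),
        Booking("CDG", "SFO", 4),
    ]
    print(passengers_per_route(sample))
```

Such a function would then be packaged as a versioned artifact, published to the artifact repository (e.g., AWS CodeArtifact), and installed by the deployment containers through the language-native package system.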
Even though the members of the GitHub organization may be employed by some companies, they speak in a personal capacity and do not represent those companies.