This project aims to collaboratively specify architecture principles and diagrams for a typical data platform built on the so-called Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by some companies, they speak in their personal capacity and do not represent those companies.
- Material for the Data platform - Architecture principles
- Specifications/principles for a data engineering pipeline deployment tool, dpcctl, the Data Processing Pipeline (DPP) CLI utility, a Minimal Viable Product (MVP) in Go
- Material for the Data platform - Data contracts
- Material for the Data platform - Data quality
- Material for the Data platform - Data-lakes, data warehouses, data lake-houses
- Material for the Data platform - Modern Data Stack (MDS) in a box
- Data engineering Excalidraw diagram online - Data platform principles for data lake ins and outs
- Excalidraw source on GitHub - Data platform principles for data lake ins and outs
- Data engineering Excalidraw diagram online - Data platform principles for Data Engineering
- Excalidraw source on GitHub - Data platform principles for Data Engineering
In summary, production and non-production environments should be separated as strictly as possible, ideally by a kind of "Chinese wall":
- Non-production environments should not have access, by design, to production resources, including production data
- The only allowed flow is the publication of non-sensitive data by the production environment (e.g., Spark processes) to non-production storage (e.g., S3 buckets)
- As data scientists, analysts and engineers have to be able to work on realistic data sets, the above principles mean that teams must invest in ways to create non-sensitive data from production data. Several processes are possible (e.g., anonymisation, obfuscation, aggregation, data generation, simulation), and some specialized companies, such as Statice+Anonos, help in generating non-sensitive, realistic data sets; a minimal obfuscation sketch follows this list
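The publication step could look like the following minimal PySpark sketch. The bucket names, table locations, and column names are assumptions for illustration only, and it presumes a cluster where the Delta Lake package is configured; hashing is just one of the possible obfuscation techniques mentioned above.

```python
# Minimal sketch (hypothetical buckets, paths and columns) of a production job
# publishing a non-sensitive version of a data set to non-production storage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("publish-non-sensitive").getOrCreate()

customers = spark.read.format("delta").load("s3://prod-bucket/bronze/customers")

non_sensitive = (
    customers
    .withColumn("email", F.sha2(F.col("email"), 256))        # obfuscate direct identifiers
    .withColumn("full_name", F.sha2(F.col("full_name"), 256))
    .drop("phone_number")                                     # drop what is not needed at all
)

# Only this publication step is allowed to cross the production boundary
non_sensitive.write.format("delta").mode("append").save(
    "s3://non-prod-bucket/bronze/customers_non_sensitive"
)
```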
Once data files are written to S3, they must never be overwritten. Data files are stored on S3 in a persistent, versioned way (e.g., with Delta) and must be kept on cloud object storage (e.g., AWS S3, Azure ADLS, Google GCS) for as long as legally and technically possible.
That principle is the same as the one underlying the Change Data Capture (CDC) mechanism:
- Regularly, snapshots are taken of a given data set
- Snapshots take the form of Parquet/Delta data files. Functionally, a snapshot is similar to a picture taken with a photo camera: it corresponds to the latest state of the data set, consistent and instantaneous (there is no history in a snapshot)
- Snapshots must be versioned. Usually, it is enough to add the time-stamp of when the snapshot was taken to the file-path/URI of the snapshot data files (see the sketch after this list)
- The succession of the snapshots corresponds to the succession of the versions of the data set
- Snapshot data files must be persistent: they must never be overwritten
- The history may be rebuilt from the succession of the snapshots
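A minimal sketch of that snapshot convention, with a hypothetical source table, bucket and path layout, could look like this: each snapshot lands at a new, time-stamped URI and an existing snapshot is never overwritten.

```python
# Minimal sketch (hypothetical source table and S3 layout) of a time-stamped,
# never-overwritten snapshot of a data set.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-snapshot").getOrCreate()

# Latest consistent state of the source data set (no history in a snapshot)
orders = spark.table("source_db.orders")  # hypothetical source table

snapshot_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
snapshot_uri = f"s3://lake-bucket/bronze/orders/snapshot_ts={snapshot_ts}/"

# "errorifexists" (the default write mode) guarantees an existing snapshot is never overwritten
orders.write.mode("errorifexists").parquet(snapshot_uri)
```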
The Delta format applies that principle of keeping persistent snapshots/versions of a given data set, while abstracting away the need to version data sets and to avoid overwriting them. With Delta, one can store and "overwrite" data sets, while in practice the data files are versioned snapshot data files and the transaction log is kept alongside those snapshots/versions, so that the history can be rebuilt.
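The following minimal sketch (hypothetical paths; it assumes a cluster with the Delta Lake package configured) shows how Delta keeps every version readable even when a job "overwrites" a data set:

```python
# Minimal sketch of Delta versioning: "overwrite" adds a new version to the
# transaction log, and previous versions remain readable via time travel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

path = "s3://lake-bucket/silver/customers"
latest = spark.read.parquet("s3://lake-bucket/bronze/customers/")  # hypothetical input

# Each "overwrite" creates a new table version; the previous data files stay in place
latest.write.format("delta").mode("overwrite").save(path)

current = spark.read.format("delta").load(path)                            # latest version
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)  # time travel
```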
The format of the structured data files must be Delta wherever possible, and Parquet only when Delta is not possible. No other data format is allowed on the data lake for structured data.
Any data processing task:
- Makes use of software artifacts (e.g., Python wheels, Scala JARs, dbt SQL artifacts)
- Takes, as input, data files which are, as mentioned above, persistent and versioned, i.e., never overwritten and uniquely identifiable by their version
- Generates, as output, data files which must likewise be persistent and versioned, i.e., never overwritten and uniquely identifiable by their version (a minimal task sketch follows this list)
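A minimal sketch of that task contract, with hypothetical URIs, version numbers and placeholder business logic, could look like this: the input is pinned to an explicit version of a persistent data set, and the output is written as a new version that is never overwritten.

```python
# Minimal sketch of a data processing task honouring the input/output contract.
from pyspark.sql import DataFrame, SparkSession


def run_task(spark: SparkSession, input_uri: str, input_version: int, output_uri: str) -> None:
    # Input: a persistent, versioned Delta data set, pinned to a known version
    source: DataFrame = (
        spark.read.format("delta")
        .option("versionAsOf", input_version)
        .load(input_uri)
    )

    # The transformation itself ships as a software artifact (wheel, JAR, dbt project)
    result = source.filter("status = 'active'")  # placeholder business logic

    # Output: Delta appends a new version; existing data files are never overwritten
    result.write.format("delta").mode("append").save(output_uri)


# Hypothetical invocation:
# run_task(spark, "s3://lake-bucket/bronze/orders", 42, "s3://lake-bucket/silver/orders")
```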
We capitalize on the (source code of the) software used to process data, rather than on the prepared data sets. The software project is instantiated from a template (e.g., with Cookiecutter/Cruft) and managed through a Git repository. The Git repository may be audited, including for its level of compliance with the (evolutions of the) template.
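Instantiating such a project could look like the following minimal sketch using the Cookiecutter Python API; the template URL and context values are hypothetical, and Cruft can then be used to keep the project aligned with, and auditable against, the evolving template.

```python
# Minimal sketch (hypothetical template and context) of instantiating a data
# processing project from a template with the Cookiecutter Python API.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/example-org/dpp-project-template",  # hypothetical template
    no_input=True,
    extra_context={
        "project_name": "orders-silver-pipeline",
        "data_domain": "sales",
    },
)
```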
Data domain data engineering teams are responsible for the (quality and service level agreements of the) delivered data sets. That responsibility includes checking (and potentially fixing) the quality of the source data sets. As an illustration, if medallion (silver/gold/insight) data sets were manufactured cars, the responsibility would encompass the quality of every single part (e.g., tires, windshield). The data engineering teams cannot deflect responsibility for the quality of the silver/gold/insight data sets onto the quality of the source data sets: they have to fix the quality of the source data sets if needed.
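A minimal sketch of such a source data quality gate, with a hypothetical path, columns and rules, could look like this; the team fixes the source data set (or its ingestion) when the gate fails, rather than passing the defect downstream.

```python
# Minimal sketch of a source data quality gate run before deriving
# silver/gold/insight data sets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("source-quality-gate").getOrCreate()

orders = spark.read.format("delta").load("s3://lake-bucket/bronze/orders")

# Example rules: the key must be present and unique, amounts must be non-negative
null_keys = orders.filter(F.col("order_id").isNull()).count()
duplicate_keys = orders.count() - orders.select("order_id").distinct().count()
negative_amounts = orders.filter(F.col("amount") < 0).count()

if null_keys or duplicate_keys or negative_amounts:
    raise ValueError(
        f"Source quality issues: {null_keys} null keys, "
        f"{duplicate_keys} duplicate keys, {negative_amounts} negative amounts"
    )
```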
- The purpose of the data lake is to serve as a centralized and scalable repository for storing data from various sources
- Data sets are materialized as both:
- Data files with an open standard of storage, namely Delta whenever possible, or Parquet when Delta is not possible
- Tables in databases. These tables are served through a standard and open API, namely the Hive Metastore. AWS Glue and GCP Dataproc both implement the Hive Metastore API; in the documentation, Glue/Dataproc databases and tables may be used interchangeably with Hive Metastore databases and tables. The databases and tables are actually metadata (i.e., data about the data itself, such as table name and description, and column names, types and descriptions); the underlying data are stored in Parquet/Delta, as explained in the point just above (see the registration sketch after this list)
- The modern data lake is structured around the so-called medallion architecture, representing different levels of data "refinement": Bronze, Silver, Gold and Insight. Each level has its own rules and conventions that should be applied systematically, and this page serves as the reference for these rules (and should therefore be kept constantly up to date)
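A minimal sketch of that dual materialization, with a hypothetical database, table and location (and assuming a cluster with Hive support and the Delta Lake package), could look like this: the data lives as Delta files on object storage, and a table pointing at those files is registered in the Hive-Metastore-compatible catalog (Glue, Dataproc) so that engines can query it by name.

```python
# Minimal sketch of registering a Delta data set as a table in the
# Hive-Metastore-compatible catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("register-silver-table")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS silver")

# The catalog entry is metadata only; the underlying data remain the Delta files on S3
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.customers
    USING DELTA
    LOCATION 's3://lake-bucket/silver/customers'
""")

spark.sql("DESCRIBE TABLE EXTENDED silver.customers").show(truncate=False)
```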