This project aims to collaboratively specify architecture principles and diagrams for a typical data platform built on the so-called Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by some companies, they speak in their personal capacity and do not represent those companies.
- Material for the Data platform - Architecture principles
- Specifications/principles for a data engineering pipeline deployment tool, dpcctl, the Data Processing Pipeline (DPP) CLI utility, a Minimal Viable Product (MVP) in Go
- Material for the Data platform - Data contracts
- Material for the Data platform - Data quality
- Material for the Data platform - Data-lakes, data warehouses, data lake-houses
- Material for the Data platform - Modern Data Stack (MDS) in a box
- Data engineering Excalidraw diagram online - Data platform principles for data lake ins and outs
- Excalidraw source on GitHub - Data platform principles for data lake ins and outs
- Data engineering Excalidraw diagram online - Data platform principles for Data Engineering
- Excalidraw source on GitHub - Data platform principles for Data Engineering
In summary, production and non-production environments should be separated as strictly as possible, ideally by a kind of "Chinese wall":
- Non-production environments should not have access, by design, to production resources, including production data
- The only allowed flow is the publication of non-sensitive data by the production environment (e.g., Spark processes) to non-production storage (e.g., S3 buckets)
- As data scientists, analysts and engineers have to be able to work on realistic data sets, the above principles mean that teams must invest in ways to create non-sensitive data from production data. Several processes are possible (e.g., anonymisation, obfuscation, aggregation, data generation, simulation), and some specialized companies, such as Statice+Anonos, help in generating non-sensitive, realistic data sets; a minimal obfuscation sketch follows this list
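The publication step could look like the following minimal PySpark sketch. The bucket names, table locations, and column names are assumptions for illustration only, and it presumes a cluster where the Delta Lake package is configured; hashing is just one of the possible obfuscation techniques mentioned above.

```python
# Minimal sketch (hypothetical buckets, paths and columns) of a production job
# publishing a non-sensitive version of a data set to non-production storage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("publish-non-sensitive").getOrCreate()

customers = spark.read.format("delta").load("s3://prod-bucket/bronze/customers")

non_sensitive = (
    customers
    .withColumn("email", F.sha2(F.col("email"), 256))        # obfuscate direct identifiers
    .withColumn("full_name", F.sha2(F.col("full_name"), 256))
    .drop("phone_number")                                     # drop what is not needed at all
)

# Only this publication step is allowed to cross the production boundary
non_sensitive.write.format("delta").mode("append").save(
    "s3://non-prod-bucket/bronze/customers_non_sensitive"
)
```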
Once data files are written to S3, they must never be overwritten. Data files are stored on S3 in a persistent, versioned way (e.g., with Delta) and must be kept on cloud object storage (e.g., AWS S3, Azure ADLS, Google GCS) for as long as legally and technically possible.
That principle is the same as the one underlying the Change Data Capture (CDC) mechanism:
- Regularly, snapshots are taken of a given data set
- Snapshots take the form of Parquet/Delta data files. Functionally, a snapshot is similar to a picture taken with a photo camera: it corresponds to the latest state of the data set, consistent and instantaneous (there is no history in a snapshot)
- Snapshots must be versioned. Usually, it is enough to add the time-stamp of when the snapshot was taken to the file-path/URI of the snapshot data files (see the sketch after this list)
- The succession of the snapshots corresponds to the succession of the versions of the data set
- Snapshot data files must be persistent: they must never be overwritten
- The history may be rebuilt from the succession of the snapshots
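A minimal sketch of that snapshot convention, with a hypothetical source table, bucket and path layout, could look like this: each snapshot lands at a new, time-stamped URI and an existing snapshot is never overwritten.

```python
# Minimal sketch (hypothetical source table and S3 layout) of a time-stamped,
# never-overwritten snapshot of a data set.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-snapshot").getOrCreate()

# Latest consistent state of the source data set (no history in a snapshot)
orders = spark.table("source_db.orders")  # hypothetical source table

snapshot_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
snapshot_uri = f"s3://lake-bucket/bronze/orders/snapshot_ts={snapshot_ts}/"

# "errorifexists" (the default write mode) guarantees an existing snapshot is never overwritten
orders.write.mode("errorifexists").parquet(snapshot_uri)
```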
The Delta format applies that principle of keeping persistent snapshots/versions of a given data set, while abstracting away the need to version data sets and to avoid overwriting them. With Delta, one can store and "overwrite" data sets, while in practice the data files are versioned snapshot data files and the transaction log is kept alongside those snapshots/versions, so that the history can be rebuilt.
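The following minimal sketch (hypothetical paths; it assumes a cluster with the Delta Lake package configured) shows how Delta keeps every version readable even when a job "overwrites" a data set:

```python
# Minimal sketch of Delta versioning: "overwrite" adds a new version to the
# transaction log, and previous versions remain readable via time travel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

path = "s3://lake-bucket/silver/customers"
latest = spark.read.parquet("s3://lake-bucket/bronze/customers/")  # hypothetical input

# Each "overwrite" creates a new table version; the previous data files stay in place
latest.write.format("delta").mode("overwrite").save(path)

current = spark.read.format("delta").load(path)                            # latest version
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)  # time travel
```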
The format of the structured data files must be Delta wherever possible, and Parquet only when Delta is not possible. No other data format is allowed on the data lake for structured data.
Any data processing task:
- Makes use of software artifacts (e.g., Python wheels, Scala JARs, dbt SQL artifacts)
- Takes, as input, data files which are, as mentioned above, persistent and versioned, i.e., never overwritten and uniquely identifiable by their version
- Generates, as output, data files which must likewise be persistent and versioned, i.e., never overwritten and uniquely identifiable by their version (a minimal task sketch follows this list)
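A minimal sketch of that task contract, with hypothetical URIs, version numbers and placeholder business logic, could look like this: the input is pinned to an explicit version of a persistent data set, and the output is written as a new version that is never overwritten.

```python
# Minimal sketch of a data processing task honouring the input/output contract.
from pyspark.sql import DataFrame, SparkSession


def run_task(spark: SparkSession, input_uri: str, input_version: int, output_uri: str) -> None:
    # Input: a persistent, versioned Delta data set, pinned to a known version
    source: DataFrame = (
        spark.read.format("delta")
        .option("versionAsOf", input_version)
        .load(input_uri)
    )

    # The transformation itself ships as a software artifact (wheel, JAR, dbt project)
    result = source.filter("status = 'active'")  # placeholder business logic

    # Output: Delta appends a new version; existing data files are never overwritten
    result.write.format("delta").mode("append").save(output_uri)


# Hypothetical invocation:
# run_task(spark, "s3://lake-bucket/bronze/orders", 42, "s3://lake-bucket/silver/orders")
```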
We capitalize on the (source code of the) software used to process data, rather than on the prepared data sets. The software project is instantiated from a template (e.g., with Cookiecutter/Cruft) and managed through a Git repository. The Git repository may be audited, including for its level of compliance with the (evolutions of the) template.
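Instantiating such a project could look like the following minimal sketch using the Cookiecutter Python API; the template URL and context values are hypothetical, and Cruft can then be used to keep the project aligned with, and auditable against, the evolving template.

```python
# Minimal sketch (hypothetical template and context) of instantiating a data
# processing project from a template with the Cookiecutter Python API.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/example-org/dpp-project-template",  # hypothetical template
    no_input=True,
    extra_context={
        "project_name": "orders-silver-pipeline",
        "data_domain": "sales",
    },
)
```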
Data domain data engineering teams are responsible for the (quality and service level agreements of the) delivered data sets. That responsibility includes checking (and potentially fixing) the quality of the source data sets. As an illustration, if medallion (silver/gold/insight) data sets were manufactured cars, the responsibility would encompass the quality of every single part (e.g., tires, windshield). The data engineering teams cannot deflect responsibility for the quality of the silver/gold/insight data sets onto the quality of the source data sets: they have to fix the quality of the source data sets if needed.
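A minimal sketch of such a source data quality gate, with a hypothetical path, columns and rules, could look like this; the team fixes the source data set (or its ingestion) when the gate fails, rather than passing the defect downstream.

```python
# Minimal sketch of a source data quality gate run before deriving
# silver/gold/insight data sets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("source-quality-gate").getOrCreate()

orders = spark.read.format("delta").load("s3://lake-bucket/bronze/orders")

# Example rules: the key must be present and unique, amounts must be non-negative
null_keys = orders.filter(F.col("order_id").isNull()).count()
duplicate_keys = orders.count() - orders.select("order_id").distinct().count()
negative_amounts = orders.filter(F.col("amount") < 0).count()

if null_keys or duplicate_keys or negative_amounts:
    raise ValueError(
        f"Source quality issues: {null_keys} null keys, "
        f"{duplicate_keys} duplicate keys, {negative_amounts} negative amounts"
    )
```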
- The purpose of the data lake is to serve as a centralized and scalable repository for storing data from various sources
- Data sets are materialized as both:
- Data files with an open standard of storage, namely Delta whenever possible, or Parquet when Delta is not possible
- Tables in databases. These tables are served through a standard and open API, namely the Hive Metastore. AWS Glue and GCP Dataproc both implement the Hive Metastore API; in the documentation, Glue/Dataproc databases and tables may be used interchangeably with Hive Metastore databases and tables. The databases and tables are actually metadata (i.e., data about the data itself, such as table name and description, and column names, types and descriptions); the underlying data are stored in Parquet/Delta, as explained in the point just above (see the registration sketch after this list)
- The modern data lake is structured around the so-called medallion architecture, representing different levels of data "refinement": Bronze, Silver, Gold and Insight. Each level has its own rules and conventions that should be applied systematically, and this page serves as the reference for these rules (and should therefore be kept constantly up to date)
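A minimal sketch of that dual materialization, with a hypothetical database, table and location (and assuming a cluster with Hive support and the Delta Lake package), could look like this: the data lives as Delta files on object storage, and a table pointing at those files is registered in the Hive-Metastore-compatible catalog (Glue, Dataproc) so that engines can query it by name.

```python
# Minimal sketch of registering a Delta data set as a table in the
# Hive-Metastore-compatible catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("register-silver-table")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS silver")

# The catalog entry is metadata only; the underlying data remain the Delta files on S3
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.customers
    USING DELTA
    LOCATION 's3://lake-bucket/silver/customers'
""")

spark.sql("DESCRIBE TABLE EXTENDED silver.customers").show(truncate=False)
```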