This project aims to develop and maintain a command-line (CLI) utility that helps deploy data engineering pipelines on the modern data stack (MDS).
Even though members of the GitHub organization may be employed by companies, they speak in a personal capacity and do not represent those companies.
- GitHub - Material/knowledge sharing (KS) - Deployment of Data processing pipelines (DPP)
- GitHub - Architecture principles for data engineering pipelines on the Modern Data Stack (MDS)
- GitHub - `dpcctl`, the Data Processing Pipeline (DPP) CLI utility, a Minimal Viable Product (MVP) in Go
- GitHub - Material for the Data platform - Data contracts
- GitHub - Material for the Data quality - Data contracts
- Atlassian - Evolve your data platform with a Deployment Capability, May 2024, by Joop Van De Ven
- tobiko blog - Why Data Teams Are Adopting Declarative Pipelines, Nov. 2023, by Iaroslav Zeigerman
- Substack - We Need a Data Engineering-Specific Language, Feb. 2024, by Julien Hurault
- Input: software artifacts (e.g., Python wheels, Scala JARs, R packages, SQL dbt packages, Airflow DAGs) corresponding to libraries of business models (like the BOM4V libraries). Let us call that business-oriented software the payload/workload. The libraries rely on lower-level data processing engines such as Spark, Pandas, R or dbt.
- Expected delivery: deployment of that business-oriented payload onto a modern data stack (MDS) infrastructure, e.g., a Spark cluster, dbt Core/Cloud, Kubernetes pods or Lambda/serverless functions.
- The data pipelines may optionally be orchestrated, e.g., by Airflow; the Airflow DAGs are then themselves packaged as Python wheels.
- The various deployment environments are specified in a specification-friendly language such as YAML or JSON. The specification will typically state the payload (which version of which library has to be deployed) and where to deploy it (e.g., a specific Databricks development Spark job cluster, a Kubernetes pre-production pod in a specific namespace, production dbt Cloud). A minimal sketch of such a specification file is given after this list.
- The specification files (for the deployment of data engineering tasks) are to be maintained by the data engineers themselves, not by DevOps/DataOps. In that respect, the Pachyderm model (with specification files in JSON) is a much better fit than Chef recipes.
- Compared to Apache Beam, we would like something that does not force us to abstract away from Spark or Pandas.
- Compared to Apache Calcite, we would like something more flexible than SQL alone.
- Compared to Pachyderm (`pachctl`), where the specification is written in JSON and the execution engine is Kubernetes, we would like to support more execution engines (such as Spark clusters or dbt Cloud) and to allow orchestration by Airflow.
- Compared to Flux (`fluxctl`), we would like to target more frameworks than just Kubernetes. Of course, most of the targeted frameworks (e.g., Airflow, Spark) may be operated on top of Kubernetes. But when we use managed services (e.g., Databricks, AWS MWAA), running them on top of a Kubernetes cluster that we control is often not an option.
- Infrastructure-as-Code (IaC) is similar in spirit to what we need. The main difference is that, in our case, the infrastructure already exists (and is itself managed with IaC): Databricks is already there, as are all the data-related AWS services (e.g., S3, Glue, EMR (for Spark), CodeArtifact, ECR, OpenSearch, RDS, Redshift). So Chef, as a provisioning tool complementing Terraform, seems a good reference point; we were wondering whether Rust would be a more modern fit.
- Rust or Go could be useful if we have to write such a deployment tool ourselves. For instance, `pachctl`, `fluxctl` and `kubectl` are written in Go, but Rust currently seems to us slightly better suited to the purpose. A minimal CLI skeleton in Go (the language of the current MVP) is sketched after this list.
- In spirit, it should be something like Chef or Puppet: we specify a deployment target, and the tool tries to reach that target (a reconciliation sketch illustrating this behaviour is given after this list). Compared to Chef, we would like something simpler to use, limited to the deployment of data engineering jobs/tasks on modern data stacks.
- That seems to be a task for a tool/utility rather than for a template.
- If no such tool exists today, we may assemble a combination of simpler tools that achieves the same result. The goal is to avoid writing thousands of lines of code and reinventing the wheel.
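
To make the specification format above more concrete, here is a minimal sketch, in Go, of how such a YAML deployment specification could be modelled and parsed. The field names, the example payload (a hypothetical BOM4V wheel) and the use of `gopkg.in/yaml.v3` are illustrative assumptions, not the actual `dpcctl` format.

```go
// Minimal sketch of a hypothetical deployment specification for a data
// processing pipeline payload. Field names and values are illustrative only.
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

// DeploymentSpec describes what to deploy (the payload) and where (the target).
type DeploymentSpec struct {
	Payload struct {
		Kind    string `yaml:"kind"`    // e.g., "python-wheel", "dbt-package"
		Name    string `yaml:"name"`    // artifact name
		Version string `yaml:"version"` // artifact version
	} `yaml:"payload"`
	Target struct {
		Engine      string `yaml:"engine"`      // e.g., "databricks-job-cluster", "dbt-cloud", "kubernetes"
		Environment string `yaml:"environment"` // e.g., "dev", "preprod", "prod"
		Namespace   string `yaml:"namespace,omitempty"`
	} `yaml:"target"`
	Orchestration struct {
		Airflow bool   `yaml:"airflow"`           // whether an Airflow DAG wheel is part of the payload
		DagName string `yaml:"dagName,omitempty"` // hypothetical DAG identifier
	} `yaml:"orchestration"`
}

// exampleSpec is a hypothetical specification; names and versions are made up.
const exampleSpec = `
payload:
  kind: python-wheel
  name: bom4v-models
  version: 1.4.2
target:
  engine: databricks-job-cluster
  environment: dev
orchestration:
  airflow: true
  dagName: bom4v_daily
`

func main() {
	var spec DeploymentSpec
	if err := yaml.Unmarshal([]byte(exampleSpec), &spec); err != nil {
		log.Fatalf("cannot parse specification: %v", err)
	}
	fmt.Printf("deploy %s %s/%s onto %s (%s)\n",
		spec.Payload.Kind, spec.Payload.Name, spec.Payload.Version,
		spec.Target.Engine, spec.Target.Environment)
}
```

Keeping the parsed specification in a plain struct like this would make it straightforward to validate the payload/target combination before any deployment action is attempted.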
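Likewise, a first `dpcctl` command-line skeleton could follow the sub-command pattern of `kubectl`, `pachctl` and `fluxctl`. The sketch below uses only the Go standard library; the `deploy` and `status` sub-commands and their flags are assumptions, not a committed interface.

```go
// Hypothetical dpcctl command-line skeleton, standard library only.
// The sub-commands and flags are illustrative, not a committed interface.
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		usage()
		os.Exit(2)
	}

	switch os.Args[1] {
	case "deploy":
		deployCmd := flag.NewFlagSet("deploy", flag.ExitOnError)
		specPath := deployCmd.String("f", "deployment.yaml", "path to the deployment specification")
		dryRun := deployCmd.Bool("dry-run", false, "print the planned actions without executing them")
		deployCmd.Parse(os.Args[2:])
		fmt.Printf("would deploy the payload described in %s (dry-run=%v)\n", *specPath, *dryRun)

	case "status":
		statusCmd := flag.NewFlagSet("status", flag.ExitOnError)
		specPath := statusCmd.String("f", "deployment.yaml", "path to the deployment specification")
		statusCmd.Parse(os.Args[2:])
		fmt.Printf("would report the deployment status for %s\n", *specPath)

	default:
		usage()
		os.Exit(2)
	}
}

func usage() {
	fmt.Fprintln(os.Stderr, "usage: dpcctl <deploy|status> [flags]")
}
```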
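Finally, the Chef/Puppet-like behaviour (specify a target, let the tool converge towards it) could be sketched as an idempotent reconciliation loop over pluggable execution engines. The `Engine` interface and the engine/artifact names below are assumptions for illustration only.

```go
// Sketch of a declarative reconciliation loop: compare the desired state
// (from the specification file) with the observed state on the target
// execution engine, and act only on the difference. Interfaces and names
// are illustrative assumptions.
package main

import "fmt"

// DesiredState is what the specification file asks for.
type DesiredState struct {
	Artifact string // e.g., "bom4v-models"
	Version  string // e.g., "1.4.2"
}

// Engine abstracts a deployment target (Spark cluster, dbt Cloud, Kubernetes...).
type Engine interface {
	Name() string
	ObservedVersion(artifact string) (string, error) // "" if not deployed
	Deploy(artifact, version string) error
}

// Reconcile brings one engine to the desired state, doing nothing if it
// already matches (idempotent, in the spirit of Chef/Puppet).
func Reconcile(e Engine, want DesiredState) error {
	got, err := e.ObservedVersion(want.Artifact)
	if err != nil {
		return fmt.Errorf("%s: cannot observe state: %w", e.Name(), err)
	}
	if got == want.Version {
		fmt.Printf("%s: %s %s already deployed, nothing to do\n", e.Name(), want.Artifact, want.Version)
		return nil
	}
	fmt.Printf("%s: deploying %s %s (observed: %q)\n", e.Name(), want.Artifact, want.Version, got)
	return e.Deploy(want.Artifact, want.Version)
}

// fakeEngine is an in-memory stand-in used to exercise the loop.
type fakeEngine struct {
	name     string
	deployed map[string]string
}

func (f *fakeEngine) Name() string { return f.name }
func (f *fakeEngine) ObservedVersion(artifact string) (string, error) {
	return f.deployed[artifact], nil
}
func (f *fakeEngine) Deploy(artifact, version string) error {
	f.deployed[artifact] = version
	return nil
}

func main() {
	engine := &fakeEngine{name: "databricks-dev-job-cluster", deployed: map[string]string{}}
	want := DesiredState{Artifact: "bom4v-models", Version: "1.4.2"}

	// First run deploys, second run is a no-op: the tool converges on the target.
	_ = Reconcile(engine, want)
	_ = Reconcile(engine, want)
}
```

Running the loop twice shows the intended semantics: the first pass deploys the missing artifact, while the second finds the observed state already matching the specification and does nothing.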