Deployment of data engineering pipelines

Overview

This project intends to develop and maintain a command-line (CLI) utility to help deploy data engineering pipelines on a modern data stack (MDS).

Even though the members of the GitHub organization may be employed by some companies, they speak in a personal capacity and do not represent those companies.

References

Specifications

Needs

  • Input: software artifacts (e.g., Python wheels, Scala JARs, R packages, dbt SQL packages, Airflow DAGs) corresponding to libraries of business models (like the BOM4V libraries). Let us call that business-oriented software the payload (or workload). These libraries rely on lower-level data-processing engines such as Spark, Pandas, R or dbt

  • Expected delivery: deployment of that business-oriented payload onto a modern data stack (MDS) infrastructure, e.g., a Spark cluster, dbt Core/Cloud, Kubernetes pods or Lambda/serverless functions

  • The data pipelines may optionally be orchestrated, e.g., by Airflow; the Airflow DAGs are then themselves packaged as Python wheels

  • The various deployment environments are specified in a specification-friendly language such as YAML or JSON. A specification will typically state the payload (which version of which library has to be deployed) and where to deploy it (e.g., a specific Databricks development Spark job cluster, a Kubernetes pre-production pod in a specific namespace, production dbt Cloud)

  • The specification files (for the deployment of data engineering tasks) are to be maintained by the data engineers themselves, not by DevOps/DataOps. In that respect, the Pachyderm model (with specification files in JSON) is much better suited than Chef recipes
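As an illustration of what such a data-engineer-maintained specification could look like, here is a minimal sketch in YAML. All keys, names and target values are hypothetical assumptions for this document, not an existing format:

```yaml
# Hypothetical deployment specification (illustrative sketch only)
payload:
  name: bom4v-models          # business-model library to deploy
  kind: python-wheel          # e.g., python-wheel, scala-jar, dbt-package
  version: "1.4.2"
target:
  environment: preproduction
  engine: databricks-spark    # e.g., databricks-spark, dbt-cloud, kubernetes
  cluster: dev-job-cluster
orchestration:                # optional section
  orchestrator: airflow       # the DAGs are themselves packaged as a wheel
  dag_package: bom4v-dags
```

The point of the sketch is that a data engineer only states the payload, the target and the optional orchestrator; the tool is responsible for reconciling the target environment with that declaration.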

Inspiration - Similar tools

  • Compared to Apache Beam, we would like something where we do not have to abstract away from Spark or Pandas

  • Compared to Apache Calcite, something more flexible than just SQL

  • Compared to Pachyderm (pachctl), where the specification is written in JSON and the execution engine is Kubernetes, we would like to accept more execution engines (like Spark clusters or dbt Cloud) and to allow orchestration by Airflow

  • Compared to Flux (fluxctl), we would like to support more frameworks than just Kubernetes. Of course, most of the targeted frameworks (e.g., Airflow, Spark) may be operated on top of Kubernetes. But when we use managed services (e.g., Databricks, AWS MWAA), it is often not an option to have them operated on top of Kubernetes in a way that allows controlling them

  • Infrastructure-as-Code (IaC) is similar in spirit to what we need. The main difference is that the infrastructure already exists in our case (and is, by the way, managed through IaC): Databricks is already there, as well as all the data-related services of AWS (e.g., S3, Glue, EMR (for Spark), CodeArtifact, ECR, OpenSearch, RDS, Redshift). So Chef, as a provisioning tool complementing Terraform, seems a good fit. We were wondering whether Rust would not be a more modern fit

  • Rust or Go could be useful if we have to write such a deployment tool. For instance, pachctl, fluxctl and kubectl are written in Go; but Rust currently seems to us slightly better suited to the purpose

  • In spirit, it should be something like Chef or Puppet: we specify a deployment target, and the tool tries to reach that target. Compared to Chef, we would like something simpler to use, limited to deploying data engineering jobs/tasks on modern data stacks

  • This seems to be a task for a tool/utility rather than for a template

  • If no such tool exists today, we may find some combination of simpler tools that can achieve it. The goal is to avoid writing thousands of lines of code and reinventing the wheel
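To make the intended "specify a target, let the tool reach it" workflow concrete, here is a minimal Python sketch of how such a CLI might read a specification and dispatch to an execution engine. The spec keys, engine names and deployment functions are assumptions made for illustration; they do not describe an existing tool:

```python
# Sketch of a spec-driven deployment dispatcher (hypothetical design).
# The JSON spec layout and engine names are illustrative assumptions.
import json


def deploy_to_spark(payload: dict, target: dict) -> str:
    # Placeholder: a real tool would submit the wheel/JAR to the job cluster.
    return f"submitted {payload['name']}=={payload['version']} to {target['cluster']}"


def deploy_to_dbt_cloud(payload: dict, target: dict) -> str:
    # Placeholder: a real tool would push the dbt package and trigger a job.
    return f"pushed dbt package {payload['name']}=={payload['version']}"


# Mapping from declared execution engine to a deployment strategy;
# new engines (Kubernetes, serverless, ...) would register here.
ENGINES = {
    "databricks-spark": deploy_to_spark,
    "dbt-cloud": deploy_to_dbt_cloud,
}


def deploy(spec_text: str) -> str:
    """Parse a JSON deployment spec and dispatch to the right engine."""
    spec = json.loads(spec_text)
    engine = spec["target"]["engine"]
    if engine not in ENGINES:
        raise ValueError(f"unsupported execution engine: {engine}")
    return ENGINES[engine](spec["payload"], spec["target"])


if __name__ == "__main__":
    example = (
        '{"payload": {"name": "bom4v-models", "version": "1.4.2"},'
        ' "target": {"engine": "databricks-spark", "cluster": "dev-job-cluster"}}'
    )
    print(deploy(example))
```

The table-of-strategies design keeps the data engineer's specification declarative while confining engine-specific logic (Spark submission, dbt Cloud API calls, kubectl, etc.) to pluggable functions, which matches the "combination of simpler tools" goal above.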

Architecture - Principles

Data Platform - Principles - Data Engineering
