
Containerized batch pipeline using DockerHub, Pachyderm, Google Cloud Storage, Postgres Cloud Database

About this Repository

This repository contains the pipeline specifications and code for the batch pipelines that serve a larger question-answering model built on Hugging Face Transformers.

Purpose

This repository contains two pipelines.

Pipeline 1: Pulls the data uploaded by a user to Google Cloud Storage in the form of CSVs.

These CSVs contain a list of questions and contexts. The pipeline processes each question and fetches answers using a default BERT model for question answering. The answers are then written to an output directory.

Pipeline 2: Pushes the answers created by the first pipeline to a Postgres database on Google Cloud.

The goal of these batch pipelines is to act as data pipelines between Cloud Storage and the Cloud SQL database where the generated answers are stored. The pipelines automate the data storage process by acting as a non-real-time data update link between the model and the database.

Below is a high-level diagram to show how the pipelines fit in the overall framework:

(high-level architecture diagram)

REST API URL

Upload CSV route

You may use Postman to send a request to this route to upload the CSV file.

RequestType = POST

Provide the CSV under a key in form-data while submitting the request.

https://assignmentsservice-sueoei3pla-uc.a.run.app/upload
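Alternatively, the same request can be sent from the command line with curl. A minimal sketch follows; the form-data key file and the local file name questions.csv are assumptions, since the exact key name is not documented here:

# POST a CSV as multipart/form-data to the upload route
curl -X POST \
     -F "file=@questions.csv" \
     https://assignmentsservice-sueoei3pla-uc.a.run.app/upload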

Upload CSV sample

(screenshot: sample CSV upload request in Postman)

Pipeline 1 Operation:

(insert pipeline 1 data flow chart here)

As the diagram shows, this pipeline takes multiple .csv files as input, processes them using the available Hugging Face models, and generates the respective answers as output.

Pipeline 2 Operation:

(insert pipeline 2 data flow chart here)

Deploying these pipelines

Prerequisites

Below are the steps to deploy your pipelines via Pachyderm:

Once the above prerequisites are confirmed, check out this repository and proceed as below.

Build and deploy the docker images to DockerHub

  • Add secret DOCKERHUB_USERNAME = <your DockerHub username> in GitHub secrets

  • Add secret DOCKERHUB_TOKEN = <your DockerHub access token> in GitHub secrets

  • Change <your-dockerhub-username>/<docker-image-name> in .github/workflows/main.yml

      - name: Build and push
        run: |-
          cd pachyderm_01 && docker build -t <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} .
          docker push <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} && cd ../
          cd pachyderm_02 && docker build -t <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} .
          docker push <your-dockerhub-username>/<docker-image-name>:${{ github.sha }}
    

TODO: provide a reference link on how to fetch a DockerHub access token and add it to GitHub secrets.
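In the meantime, the two secrets can also be added from the terminal with the GitHub CLI. A minimal sketch, assuming gh is installed and authenticated against this repository:

# Generate/verify your DockerHub credentials locally first
docker login -u <your-dockerhub-username>

# Store the credentials as GitHub Actions secrets for this repository
gh secret set DOCKERHUB_USERNAME --body "<your-dockerhub-username>"
gh secret set DOCKERHUB_TOKEN --body "<your-dockerhub-access-token>"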

Create pachyderm workspace

Sign in to Pachyderm Hub

Create a new workspace as shown below

(screenshot: creating a new workspace on Pachyderm Hub)

Refer: https://docs.pachyderm.com/latest/hub/hub_getting_started/

Log in to Pachyderm Cluster

TODO

Install pachctl on your machine

The "pachctl" or pach control is a command line tool that you can use to interact with a Pachyderm cluster in your terminal. For a Debian based Linux 64-bit or Windows 10 or later running on WSL run the following code:

curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.13.2/pachctl_1.13.2_amd64.deb && sudo dpkg -i /tmp/pachctl.deb

For macOS, use the command below:

brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.13

For all other Linux systems, use the command below:

curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v1.13.2/pachctl_1.13.2_linux_amd64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_1.13.2_linux_amd64/pachctl /usr/local/bin

Refer: https://docs.pachyderm.com/latest/getting_started/local_installation/#install-pachctl

Connect to your Pachyderm workspace

Click "Connect" on your Pachyderm workspace and follow the below listed steps to connect to your workspace via the machine:

(screenshot: workspace connection steps)
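The Connect dialog shows workspace-specific commands. A hedged outline of what they typically look like for a Pachyderm Hub workspace is below; the pachd address and context name are placeholders following the Hub pattern, not values from this repository:

# Point pachctl at the workspace's pachd endpoint (copy the address from the Connect dialog)
echo '{"pachd_address": "grpcs://<your-workspace>.clusters.pachyderm.io:31400"}' | pachctl config set context <your-workspace> --overwrite

# Make the new context active, then authenticate with the command shown in the dialog
pachctl config set active-context <your-workspace>

After authenticating, the version check in the next step should report a pachd version in addition to the local pachctl version, which confirms the connection.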

Verify the installation

 pachctl version

Copy the files below to your local system (where pachctl is installed):

  • create_secret.sh
  • secret_template.json
  • secret_template_db.json
  • credentials.json
  • server-ca.pem
  • client-cert.pem
  • client-key.pem

Export the required environment variables to your OS environment. For example, if you are using Linux, you can use the commands below:

export GOOGLE_APPLICATION_CREDENTIALS=<YOUR-CLOUD-SERVICE_ACCOUNT_KEY>
export PG_HOST=<POSTGRES-DATABASE-HOST>
export PG_PASSWORD=<POSTGRES-DATABASE-PASSWORD>
export PG_DBNAME=<POSTGRES-DATABASE-DBNAME>
export PG_USER=<POSTGRES-DATABASE-USER>
export PG_SSLROOTCERT=<PATH-TO-server-ca.pem>
export PG_SSLCLIENT_CERT=<PATH-TO-client-cert.pem>
export PG_SSL_CLIENT_KEY=<PATH-TO-client-key.pem>

Now execute the script to create the secrets for pachctl:

   ./create_secret.sh
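For reference, a script like create_secret.sh typically substitutes the exported variables into the JSON templates and registers the resulting secret manifests with the cluster. The sketch below is only an illustration of that pattern (the generated file names are hypothetical), not the exact contents of the repository's script:

#!/usr/bin/env bash
set -euo pipefail

# Fill the secret templates from the environment (envsubst ships with gettext)
envsubst < secret_template.json > secret_gcs.json
envsubst < secret_template_db.json > secret_db.json

# Register the generated secret manifests with the Pachyderm cluster
pachctl create secret -f secret_gcs.json
pachctl create secret -f secret_db.json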

Verify whether the secrets were created:

  pachctl list secret

Create the pipelines

On your local system, create two JSON files (a hedged example spec follows the list):

  • pipeline1.json
  • pipeline2.json
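For illustration, a minimal Pachyderm 1.x pipeline spec has the shape sketched below. The pipeline name matches the one referenced later in this README, but the image, command, secret name, and cron input are placeholders rather than this repository's actual spec:

cat > pipeline1.json <<'EOF'
{
  "pipeline": { "name": "pull_files" },
  "transform": {
    "image": "<your-dockerhub-username>/<docker-image-name>:<tag>",
    "cmd": ["python3", "/app/main.py"],
    "secrets": [
      { "name": "<your-gcs-secret-name>", "mount_path": "/secrets" }
    ]
  },
  "input": {
    "cron": { "name": "tick", "spec": "@every 60m" }
  }
}
EOF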

Then run the commands below to create the Pachyderm pipelines:

  pachctl create pipeline -f pipeline1.json
  pachctl create pipeline -f pipeline2.json

You may use the commands below to create, update, or delete the pipelines:

-- Create the pipeline 
pachctl create pipeline -f pipeline1.json
pachctl create pipeline -f pipeline2.json

-- Delete Pipeline
pachctl delete pipeline push-answers
pachctl delete pipeline pull_files

-- Update pipeline
pachctl update pipeline -f pipeline1.json
pachctl update pipeline -f pipeline2.json

Once the pipelines are created, you can verify whether the proper flow was created using either the Pachyderm Hub workspace or the Linux command line:

https://hub.pachyderm.com/

Using Pachyderm Hub Workspace

(screenshot: pipelines DAG in the Pachyderm Hub workspace)

The created pipelines should look like this:

(screenshot: list of created pipelines)

An output repo of the first pipeline should also be visible:

(screenshot: output repo of the first pipeline)

Using the Linux command line

-- Check the list of pipelines created
   pachctl list pipeline

(screenshot: pachctl list pipeline output)

-- Check the list of jobs
   pachctl list job

(screenshot: pachctl list job output)

-- View the logs for a job (use a job ID from pachctl list job)
    pachctl logs -j <job-id>

(screenshot: pachctl logs output)
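Optionally, after both pipelines have run, you can confirm that the answers reached the Cloud SQL database. A hedged sketch using psql and the environment variables exported earlier; the table name answers is an assumption, not taken from this repository:

# Reuses the PG_* variables exported earlier; requires the psql client
psql "host=$PG_HOST dbname=$PG_DBNAME user=$PG_USER password=$PG_PASSWORD sslmode=verify-ca sslrootcert=$PG_SSLROOTCERT sslcert=$PG_SSLCLIENT_CERT sslkey=$PG_SSL_CLIENT_KEY" \
  -c "SELECT COUNT(*) FROM answers;"   # 'answers' is a hypothetical table name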
