# Containerized batch pipeline using Docker Hub, Pachyderm, Google Cloud Storage, and a Postgres Cloud Database
This repository contains the pipeline specifications and code for building batch pipelines around a larger Hugging Face Transformers based question-answering model.

This repository contains two pipelines. The input CSVs contain a list of questions and contexts. The pipeline processes the questions asked and fetches answers using a default BERT model for question answering. The fetched answers are then written to an output directory.

The goal of these batch pipelines is to act as data pipelines between Cloud Storage and the Cloud SQL database, where the answers generated by the pipeline are stored. The pipelines automate the data storage process by acting as a non-real-time data update link between the model and the database.
Below is a high-level diagram to show how the pipelines fit in the overall framework:
You may use Postman to push a request to this route to upload the CSV file:

- Request type: POST
- Provide the CSV in a key in form-data while submitting the request
- URL: https://assignmentsservice-sueoei3pla-uc.a.run.app/upload
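Alternatively, the same upload can be made from the command line with curl. Note that the form-data key name `file` below is an assumption for illustration; match it to whatever key the service actually expects:

```shell
# Upload a local CSV as multipart form-data to the upload route
curl -X POST \
  -F "file=@questions.csv" \
  https://assignmentsservice-sueoei3pla-uc.a.run.app/upload
```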
## (insert pipeline 1 data flow chart here)
As the diagram shows, this pipeline takes multiple .csv files as input, processes them using the available Hugging Face models, and generates the respective answers as output.
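The per-file processing step can be sketched as follows. This is an illustrative outline, not the repository's actual code: the column names `question`/`context` and all function names are assumptions, and the `qa` callable stands in for a Hugging Face Transformers question-answering pipeline (the default BERT-based model).

```python
import csv
from typing import Callable, Dict, Iterable, List


def answer_rows(rows: Iterable[Dict[str, str]],
                qa: Callable[..., Dict[str, str]]) -> List[Dict[str, str]]:
    """Run a question-answering callable over rows with 'question'/'context' keys."""
    answers = []
    for row in rows:
        result = qa(question=row["question"], context=row["context"])
        answers.append({"question": row["question"], "answer": result["answer"]})
    return answers


def process_csv(in_path: str, out_path: str, qa) -> None:
    """Read questions and contexts from a CSV; write question/answer pairs out."""
    with open(in_path, newline="") as f_in:
        rows = list(csv.DictReader(f_in))
    with open(out_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=["question", "answer"])
        writer.writeheader()
        writer.writerows(answer_rows(rows, qa))
```

In the real pipeline, `qa` would be created with `transformers.pipeline("question-answering")` and `process_csv` would be invoked for each CSV landing in the Pachyderm input repo.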
(insert pipeline 2 data flow chart here)
- Must have a valid account on Docker Hub
- Must have a Google Cloud Storage bucket already set up
- Must have access to Pachyderm - https://hub.pachyderm.com/landing
Once the above prerequisites are confirmed, check out this repository and proceed as below:

- Add the secret DOCKERHUB_USERNAME in GitHub secrets
- Add the secret DOCKERHUB_TOKEN in GitHub secrets
- Change your <your-dockerhub-username>/<docker-image-name> in .github/workflows/main.yml
```yaml
- name: Build and push
  run: |-
    cd pachyderm_01 && docker build -t <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} .
    docker push <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} && cd ../
    cd pachyderm_02 && docker build -t <your-dockerhub-username>/<docker-image-name>:${{ github.sha }} .
    docker push <your-dockerhub-username>/<docker-image-name>:${{ github.sha }}
```
Refer : https://docs.pachyderm.com/latest/hub/hub_getting_started/
The "pachctl" or pach control is a command line tool that you can use to interact with a Pachyderm cluster in your terminal. For a Debian based Linux 64-bit or Windows 10 or later running on WSL run the following code:
curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.13.2/pachctl_1.13.2_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
For macOS, use the below command:

```shell
brew tap pachyderm/tap && brew install pachyderm/tap/[email protected]
```
For all other Linux systems, use the below command:

```shell
curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v1.13.2/pachctl_1.13.2_linux_amd64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_1.13.2_linux_amd64/pachctl /usr/local/bin
```
Refer : https://docs.pachyderm.com/latest/getting_started/local_installation/#install-pachctl
Click "Connect" on your Pachyderm workspace and follow the below listed steps to connect to your workspace via the machine:
pachctl version
- create_secret.sh
- secret_template.json
- secret_template_db.json
- credentials.json
- server-ca.pem
- client-cert.pem
- client-key.pem
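For reference, a Pachyderm secret is created from a Kubernetes-style secret manifest, so the database template could plausibly look like the sketch below. The metadata name, key names, and overall schema here are assumptions for illustration; consult create_secret.sh and secret_template_db.json for the actual format this repository expects.

```json
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": { "name": "postgres-credentials" },
  "type": "Opaque",
  "stringData": {
    "PG_HOST": "<POSTGRES-DATABASE-HOST>",
    "PG_PASSWORD": "<POSTGRES-DATABASE-PASSWORD>",
    "PG_DBNAME": "<POSTGRES-DATABASE-DBNAME>",
    "PG_USER": "<POSTGRES-DATABASE-USER>"
  }
}
```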
Export the required environment variables to your OS environment. For example, on Linux you can use the below commands (the last three variables should point at the corresponding certificate and key files, i.e. server-ca.pem, client-cert.pem, and client-key.pem):

```shell
export GOOGLE_APPLICATION_CREDENTIALS=<YOUR-CLOUD-SERVICE_ACCOUNT_KEY>
export PG_HOST=<POSTGRES-DATABASE-HOST>
export PG_PASSWORD=<POSTGRES-DATABASE-PASSWORD>
export PG_DBNAME=<POSTGRES-DATABASE-DBNAME>
export PG_USER=<POSTGRES-DATABASE-USER>
export PG_SSLROOTCERT=<PATH-TO-SERVER-CA-PEM>
export PG_SSLCLIENT_CERT=<PATH-TO-CLIENT-CERT-PEM>
export PG_SSL_CLIENT_KEY=<PATH-TO-CLIENT-KEY-PEM>
```
```shell
./create_secret.sh
pachctl list secret
```
On your local system, create two JSON files:

- pipeline1.json
- pipeline2.json

Then run the below commands to create the Pachyderm pipelines:

```shell
pachctl create pipeline -f pipeline1.json
pachctl create pipeline -f pipeline2.json
```
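If helpful, a minimal Pachyderm 1.x pipeline specification has the shape sketched below. Everything except the pipeline name `pull_files` (which appears in the delete commands further down) is a placeholder assumption; adapt the image, command, input repo, and glob pattern to this repository's actual setup.

```json
{
  "pipeline": { "name": "pull_files" },
  "transform": {
    "image": "<your-dockerhub-username>/<docker-image-name>:<tag>",
    "cmd": ["python3", "/app/main.py"]
  },
  "input": {
    "pfs": { "repo": "<input-repo>", "glob": "/*" }
  }
}
```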
```shell
# Create the pipelines
pachctl create pipeline -f pipeline1.json
pachctl create pipeline -f pipeline2.json

# Delete the pipelines
pachctl delete pipeline push-answers
pachctl delete pipeline pull_files

# Update the pipelines
pachctl update pipeline -f pipeline1.json
pachctl update pipeline -f pipeline2.json
```
Once the pipelines are created, you may verify that the proper flow is in place using either Pachyderm Hub or the pachctl CLI:
```shell
# Check the list of pipelines created
pachctl list pipeline

# Check the list of jobs
pachctl list job

# View logs for a specific job
pachctl logs -j <job-id>
```