- Overview
- Architecture
- Components
- Setup and Installation
- Installation
- Usage
- Project Structure
- Additional Features to Implement
- Security Notes
- License
This project implements a sophisticated data pipeline that processes Pinterest-like data through various AWS services. It combines batch processing (Apache Kafka) and stream processing (Amazon Kinesis) capabilities, with Apache Airflow orchestrating the workflow and Databricks handling data transformation. The pipeline demonstrates modern data engineering practices by integrating multiple AWS services and open-source tools to create a robust, scalable data processing solution.
The pipeline consists of several components working together:
- Data Ingestion
  - Extracts data from a MySQL database (local/RDS)
  - Processes three main data types:
    - Pinterest post data
    - Geolocation data
    - User data
- Data Processing Paths
  - Batch Processing Path (see the sketch after this list):
    - Kafka REST Proxy on EC2 for data ingestion
    - S3 storage via Kafka Connect
    - Databricks for data transformation
  - Streaming Path:
    - Amazon Kinesis for real-time data streaming
    - LocalStack Kinesis for local development
    - Direct integration with Databricks
- Orchestration
  - Apache Airflow (local/MWAA) for workflow management
  - Integration between Airflow and Databricks
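
To make the batch path concrete, here is a minimal sketch of sending one record to a Confluent-style Kafka REST Proxy behind API Gateway. The invoke URL and topic name are placeholders, not the project's actual values; the real logic lives in `user_posting_emulation.py`.

```python
import json

import requests

# Placeholder invoke URL for the API Gateway stage fronting the Kafka REST
# Proxy on EC2 -- replace with your own deployment's URL and topic.
INVOKE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"
TOPIC = "pin_data_pin"  # illustrative topic name


def send_record(record: dict) -> None:
    """POST one record to the REST proxy using the Kafka JSON v2 envelope."""
    payload = json.dumps({"records": [{"value": record}]})
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    response = requests.post(f"{INVOKE_URL}/topics/{TOPIC}", headers=headers, data=payload)
    response.raise_for_status()


if __name__ == "__main__":
    send_record({"index": 1, "title": "example pin", "follower_count": 100})
```
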
Please note there are two architecture diagrams:
- Cloud architecture diagram: `img/Pinterest_Data_Pipeline_cloud.drawio.svg`
- Hybrid setup diagram: `img/Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg`

This project deploys the hybrid setup.
- `user_posting_emulation.py`: Handles batch processing via Kafka
- `user_posting_emulation_streaming.py`: Manages real-time data streaming via Kinesis
- `pinterest_data_pipeline.ipynb`: Jupyter notebook containing pipeline development and testing
- `localstack-kinesis/kinesis-implementation.py`: Implementation for local Kinesis development using LocalStack
- `localstack-kinesis/kinesis-consumer.py`: Standalone consumer for testing LocalStack Kinesis streams
- `localstack-kinesis/localstack-setup.sh`: Setup script for LocalStack and Kinesis streams
- Amazon RDS/Local MySQL for data storage
- Amazon EC2 for Kafka broker hosting
- Amazon API Gateway for REST endpoints
- Amazon S3 for data lake storage
- Amazon Kinesis for stream processing
- LocalStack for local AWS service emulation
- Amazon MWAA (Managed Workflows for Apache Airflow)/local Apache Airflow
- Databricks for data transformation and analysis
- AWS Account with appropriate permissions
- Python 3.10.13
- MySQL database
- Apache Kafka
- Databricks workspace
- LocalStack (for local development)
- Required Python packages (see `airflow/requirements.txt`)

Credential files (kept out of version control):
- `api_creds.yaml`: API Gateway credentials
- `aws_db_creds.yaml`: AWS RDS credentials
- `local_db_creds.yaml`: Local database credentials
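
The credential files are plain YAML, so they can be read with PyYAML; a minimal sketch is shown below. The key names in the comment are assumptions, not the project's actual schema.

```python
import yaml


def load_creds(path: str) -> dict:
    """Read a YAML credentials file and return its contents as a dict."""
    with open(path, "r") as stream:
        return yaml.safe_load(stream)


# Key names are illustrative -- use whatever keys your credential files define,
# e.g. creds["HOST"], creds["USER"], creds["PASSWORD"], creds["DATABASE"].
creds = load_creds("local_db_creds.yaml")
```
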
- Environment Setup
  - Set up a virtual environment (conda)
- Database Configuration
  - Configure MySQL database using `pinterest_data_db.sql`
  - Set up credentials in appropriate YAML files
- AWS Services Setup
  - Configure EC2 instance for Kafka
  - Set up API Gateway endpoints
  - Configure S3 buckets
  - Set up Kinesis streams (see the boto3 sketch after this list)
  - Configure MWAA environment
- LocalStack Setup (for local development)
  - Install and configure LocalStack using `localstack-kinesis/localstack-setup.sh`
  - Create Kinesis streams for local development
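
For the Kinesis stream steps above, a minimal boto3 sketch is shown below. The stream names match the ones passed to the LocalStack consumer later in this README; the LocalStack endpoint (default edge port) and dummy credentials are assumptions rather than values taken from `localstack-setup.sh`. Drop `endpoint_url` and the dummy credentials to target real AWS instead.

```python
import boto3

# Dummy credentials and the default LocalStack edge endpoint (local use only).
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

# Stream names as used by the standalone LocalStack consumer in this repository.
for stream_name in ("pin_data_pin", "pin_data_geo", "pin_data_user"):
    kinesis.create_stream(StreamName=stream_name, ShardCount=1)
    # Block until the stream transitions to ACTIVE before using it.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
```
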
Clone the repository: `git clone https://github.com/luke-who/pinterest-data-pipeline542.git`

Batch processing:
- Start the Kafka services on EC2
- Run the batch processing script: `python user_posting_emulation.py`

Stream processing:
- Ensure AWS Kinesis streams are configured
- Run the streaming script: `python user_posting_emulation_streaming.py`
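
Conceptually, the streaming script pushes each record into a Kinesis stream. The sketch below shows the equivalent direct boto3 call; the actual script may route writes through API Gateway instead, and the stream name and partition key are illustrative.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def put_record(record: dict, stream_name: str = "pin_data_pin") -> None:
    """Write a single JSON-encoded record to a Kinesis stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record.get("index", "default")),
    )


put_record({"index": 1, "title": "example pin"})
```
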

LocalStack development:
- Set up LocalStack and create Kinesis streams:
  ```bash
  cd localstack-kinesis
  chmod +x localstack-setup.sh
  ./localstack-setup.sh
  ```
- Run the LocalStack Kinesis implementation: `python kinesis-implementation.py`
- For testing and monitoring streams, use the standalone consumer in a separate terminal: `python kinesis-consumer.py pin_data_geo` (or `pin_data_pin` / `pin_data_user`)
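
For reference, a stripped-down version of what such a consumer does is sketched below; the LocalStack endpoint, dummy credentials, and polling loop are assumptions rather than the actual contents of `kinesis-consumer.py`.

```python
import json
import sys
import time

import boto3

# Dummy credentials and the default LocalStack edge endpoint (local use only).
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)


def consume(stream_name: str) -> None:
    """Poll every shard of the stream and print each decoded record."""
    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    iterators = [
        kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
        )["ShardIterator"]
        for shard in shards
    ]
    while True:
        next_iterators = []
        for iterator in iterators:
            result = kinesis.get_records(ShardIterator=iterator, Limit=25)
            for record in result["Records"]:
                print(json.loads(record["Data"]))
            next_iterators.append(result["NextShardIterator"])
        iterators = next_iterators
        time.sleep(1)


if __name__ == "__main__":
    consume(sys.argv[1] if len(sys.argv) > 1 else "pin_data_pin")
```
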
The `airflow/dags` directory contains the workflow definitions for orchestrating the pipeline tasks.
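
As a rough illustration of how a DAG in that directory might trigger the Databricks notebook, see the sketch below. The connection ID, cluster ID, notebook path, and schedule are placeholders, not values taken from `808492447622.py`.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="pinterest_databricks_dag",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_pinterest_notebook",
        databricks_conn_id="databricks_default",  # Airflow connection to Databricks
        existing_cluster_id="<cluster-id>",       # placeholder cluster id
        notebook_task={"notebook_path": "/Users/<user>/pinterest_data_pipeline"},
    )
```
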
```
├── .databricks/ # Databricks integration files
│ └── commit_outputs # Output logs from Databricks commits
├── airflow/ # Apache Airflow configuration
│ ├── dags/ # Directory for Airflow DAG definitions
│ │ └── 808492447622.py # Main DAG file for workflow orchestration
│ └── requirements.txt # Python dependencies for Airflow
├── databricks_notebook_output/ # Exported Databricks notebooks
│ ├── pinterest_data_pipeline.html # HTML version of the notebook
│ └── pinterest_data_pipeline.ipynb # Jupyter notebook version
├── img/ # Project documentation images
│ ├── Pinterest_Data_Pipeline_cloud.drawio.svg # Cloud architecture diagram
│ └── Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg # Hybrid setup diagram
├── localstack-kinesis/ # LocalStack implementation for Kinesis
│ ├── kinesis-consumer.py # Standalone consumer for testing
│ ├── kinesis-implementation.py # Main implementation with LocalStack
│ └── localstack-setup.sh # Setup script for LocalStack
├── user_posting_emulation.py # Batch processing implementation
├── user_posting_emulation_streaming.py # Streaming processing implementation
├── pinterest_data_pipeline.ipynb # Main development notebook
├── pinterest_data_db.sql # Database initialization script
├── LICENSE # MIT License file
├── README.md # Project documentation
└── .gitignore # Git ignore rules
```
- Real-time Data Visualization:
  - Integrate Power BI for interactive dashboard creation
  - Implement Tableau for advanced data visualization
  - Create real-time monitoring dashboards for Kinesis streams
- Enhanced Analytics:
  - Develop predictive analytics models using streamed data
  - Implement anomaly detection for data quality monitoring
- Performance Optimization:
  - Fine-tune Kafka and Kinesis configurations for better throughput
  - Optimize Databricks processing jobs
- API keys and credentials should be stored securely
- Use appropriate IAM roles and permissions
- Never commit sensitive credentials to version control (use `.gitignore`)
- For LocalStack development, use dummy credentials only
This project is licensed under the MIT License - see the LICENSE file for details.