Pinterest Data Pipeline

Pinterest Data Pipeline Architecture


Tech stack: Python, MySQL, Apache Kafka, Apache Airflow, AWS, Databricks, LocalStack

Table of Contents

  • Overview
  • Architecture
  • Components
  • Setup and Installation
  • Usage
  • Airflow DAGs
  • Project Structure
  • Additional Features to Implement
  • Security Notes
  • License

Overview

This project implements a sophisticated data pipeline that processes Pinterest-like data through various AWS services. It combines batch processing (Apache Kafka) and stream processing (Amazon Kinesis) capabilities, with Apache Airflow orchestrating the workflow and Databricks handling data transformation. The pipeline demonstrates modern data engineering practices by integrating multiple AWS services and open-source tools to create a robust, scalable data processing solution.

Architecture

The pipeline consists of several components working together:

  1. Data Ingestion

    • Extracts data from a MySQL database (local/RDS)
    • Processes three main data types:
      • Pinterest post data
      • Geolocation data
      • User data
  2. Data Processing Paths

    • Batch Processing Path:
      • Kafka REST Proxy on EC2 for data ingestion
      • S3 storage via Kafka Connect
      • Databricks for data transformation
    • Streaming Path:
      • Amazon Kinesis for real-time data streaming
      • LocalStack Kinesis for local development
      • Direct integration with Databricks (a streaming-read sketch follows this list)
  3. Orchestration

    • Apache Airflow (local/MWAA) for workflow management
    • Integration between Airflow and Databricks
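
To make the streaming path concrete, the sketch below shows how a Databricks notebook can consume a Kinesis stream directly with Spark Structured Streaming. The stream name, AWS region, and record schema are illustrative assumptions, not values taken from this repository; authentication is assumed to come from the cluster's instance profile.

    # Sketch: reading a Kinesis stream from Databricks with Structured Streaming.
    # Stream name, region, and schema are assumptions; `spark` is the session
    # Databricks provides in every notebook.
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    pin_schema = StructType([
        StructField("index", IntegerType()),
        StructField("title", StringType()),
        StructField("category", StringType()),
    ])

    raw = (spark.readStream
           .format("kinesis")                           # Databricks Kinesis connector
           .option("streamName", "streaming-pin-data")  # placeholder stream name
           .option("region", "us-east-1")               # placeholder region
           .option("initialPosition", "latest")
           .load())

    # Kinesis delivers the payload as binary; decode it and parse the JSON
    pin_stream = (raw.selectExpr("CAST(data AS STRING) AS json")
                  .select(from_json(col("json"), pin_schema).alias("pin"))
                  .select("pin.*"))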

Please note there are two architecture diagrams:

  1. Cloud architecture diagram: Pinterest_Data_Pipeline_cloud.drawio.svg
  2. Hybrid setup diagram: Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg

This repository deploys the hybrid setup (local database and Airflow working with the AWS services).

Components

Core Scripts

  • user_posting_emulation.py: Handles batch processing via Kafka
  • user_posting_emulation_streaming.py: Manages real-time data streaming via Kinesis (the loop both emulation scripts share is sketched after this list)
  • pinterest_data_pipeline.ipynb: Jupyter notebook containing pipeline development and testing
  • localstack-kinesis/kinesis-implementation.py: Implementation for local Kinesis development using LocalStack
  • localstack-kinesis/kinesis-consumer.py: Standalone consumer for testing LocalStack Kinesis streams
  • localstack-kinesis/localstack-setup.sh: Setup script for LocalStack and Kinesis streams
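
Both emulation scripts follow the same basic pattern: select a pseudo-random row from the source database and dispatch it to the batch or streaming endpoint. A minimal sketch of that loop is shown below; the table name and connection string are placeholders rather than the repository's actual values.

    # Illustrative sketch of the posting-emulation loop (table name and
    # connection details are placeholders).
    import random
    import time
    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "mysql+pymysql://user:password@localhost:3306/pinterest_data"
    )

    def fetch_random_row(table: str) -> dict:
        """Select one pseudo-random row from the given table."""
        offset = random.randint(0, 10000)
        with engine.connect() as conn:
            result = conn.execute(
                sqlalchemy.text(f"SELECT * FROM {table} LIMIT {offset}, 1")
            )
            row = result.mappings().first()
        return dict(row) if row else {}

    while True:
        record = fetch_random_row("pinterest_data")
        # user_posting_emulation.py POSTs the record to the Kafka REST Proxy;
        # user_posting_emulation_streaming.py sends it to Kinesis instead.
        print(record)
        time.sleep(random.uniform(0, 2))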

Infrastructure

  • Amazon RDS/Local MySQL for data storage
  • Amazon EC2 for Kafka broker hosting
  • Amazon API Gateway for REST endpoints
  • Amazon S3 for data lake storage
  • Amazon Kinesis for stream processing
  • LocalStack for local AWS service emulation
  • Amazon MWAA (Managed Workflows for Apache Airflow)/local Apache Airflow
  • Databricks for data transformation and analysis

Setup and Installation

Prerequisites

  • AWS Account with appropriate permissions
  • Python 3.10.13
  • MySQL database
  • Apache Kafka
  • Databricks workspace
  • LocalStack (for local development)
  • Required Python packages (see airflow/requirements.txt)

Configuration Files (ignored via .gitignore)

  1. api_creds.yaml: API Gateway credentials
  2. aws_db_creds.yaml: AWS RDS credentials
  3. local_db_creds.yaml: Local database credentials
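
As a hedged example, a script might load one of these files like this; the key names (HOST, USER, PASSWORD, DATABASE, PORT) are assumptions, so match them to whatever your own YAML files actually contain.

    # Example of reading database credentials from local_db_creds.yaml.
    # The key names below are assumptions; adjust them to your own file.
    import yaml
    import sqlalchemy

    with open("local_db_creds.yaml") as f:
        creds = yaml.safe_load(f)

    engine = sqlalchemy.create_engine(
        f"mysql+pymysql://{creds['USER']}:{creds['PASSWORD']}"
        f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
    )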

Setup Steps

  1. Environment Setup

    • Set up a virtual environment (conda) with Python 3.10.13 and install the dependencies from airflow/requirements.txt
  2. Database Configuration

    • Configure MySQL database using pinterest_data_db.sql
    • Set up credentials in appropriate YAML files
  3. AWS Services Setup

    • Configure EC2 instance for Kafka
    • Set up API Gateway endpoints
    • Configure S3 buckets
    • Set up Kinesis streams
    • Configure MWAA environment
  4. LocalStack Setup (for local development)

    • Install and configure LocalStack using localstack-kinesis/localstack-setup.sh
    • Create Kinesis streams for local development
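
For reference, the same streams can be created or inspected from Python with boto3 pointed at LocalStack's default edge endpoint; the dummy credentials and region below are placeholders, and the stream names match the ones used later in the Usage section.

    # Sketch: managing local Kinesis streams with boto3 against LocalStack
    # (default edge endpoint, dummy credentials, placeholder region).
    import boto3

    kinesis = boto3.client(
        "kinesis",
        endpoint_url="http://localhost:4566",  # LocalStack default edge port
        region_name="us-east-1",
        aws_access_key_id="test",              # dummy credentials only
        aws_secret_access_key="test",
    )

    for stream in ("pin_data_pin", "pin_data_geo", "pin_data_user"):
        # create_stream raises ResourceInUseException if the stream already exists
        kinesis.create_stream(StreamName=stream, ShardCount=1)

    print(kinesis.list_streams()["StreamNames"])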

Installation

Clone the Repository:

git clone https://github.com/luke-who/pinterest-data-pipeline542.git

Usage

Batch Processing

  1. Start the Kafka services on EC2
  2. Run the batch processing script:
    python user_posting_emulation.py
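
Under the hood, the batch script forwards each record to the Kafka REST Proxy over HTTP. A minimal sketch of such a request is shown below; the invoke URL and topic name are placeholders for your own API Gateway endpoint and topics.

    # Sketch: posting one record to a Kafka topic via the Kafka REST Proxy
    # (invoke URL and topic name are placeholders).
    import json
    import requests

    record = {"index": 1, "title": "example pin", "category": "diy-and-crafts"}

    response = requests.post(
        "https://<api-gateway-invoke-url>/topics/<topic-name>",
        data=json.dumps({"records": [{"value": record}]}),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    )
    response.raise_for_status()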

Stream Processing

AWS Kinesis

  1. Ensure AWS Kinesis streams are configured
  2. Run the streaming script:
    python user_posting_emulation_streaming.py
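
The streaming script pushes the same kind of records into Kinesis instead. Below is a hedged sketch using boto3's put_record; the repository's script may route through API Gateway rather than calling Kinesis directly, and the stream name and region are placeholders.

    # Sketch: writing one record to a Kinesis stream with boto3
    # (stream name and region are placeholders; the real script may go
    # through API Gateway instead of calling Kinesis directly).
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"index": 1, "title": "example pin", "category": "diy-and-crafts"}

    kinesis.put_record(
        StreamName="streaming-pin-data",
        Data=json.dumps(record),
        PartitionKey="pin",
    )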

LocalStack Kinesis (Local Development)

  1. Set up LocalStack and create Kinesis streams:

    cd localstack-kinesis
    chmod +x localstack-setup.sh
    ./localstack-setup.sh
  2. Run the LocalStack Kinesis implementation:

    python kinesis-implementation.py
  3. For testing and monitoring streams, use the standalone consumer in a separate terminal:

    python kinesis-consumer.py pin_data_geo  # or pin_data_pin or pin_data_user

For the remainder of the pipeline and its development, see pinterest_data_pipeline.ipynb.
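
For orientation before opening the notebook: the batch side typically reads the JSON that Kafka Connect writes to S3 and cleans it with Spark. The sketch below illustrates that step; the bucket path and column name are placeholders, not the notebook's actual code.

    # Sketch: reading Kafka Connect output from S3 in Databricks and applying
    # a simple clean-up (bucket path and column name are placeholders).
    from pyspark.sql.functions import col

    df_pin = spark.read.json("s3a://<your-bucket>/topics/<topic-name>/partition=*/")

    df_pin_clean = (
        df_pin
        .dropDuplicates()
        .withColumn("follower_count", col("follower_count").cast("int"))
    )

    df_pin_clean.printSchema()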

Airflow DAGs

The airflow/dags directory contains the workflow definitions for orchestrating the pipeline tasks.
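
As a hedged illustration of what such a DAG might look like, the sketch below triggers the Databricks notebook on a daily schedule using the Databricks provider; the connection ID, cluster ID, notebook path, and schedule are assumptions, and the actual definition lives in airflow/dags/808492447622.py.

    # Hypothetical sketch of a DAG that runs the Databricks notebook daily
    # (connection ID, cluster ID, notebook path, and schedule are assumptions).
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import (
        DatabricksSubmitRunOperator,
    )

    with DAG(
        dag_id="pinterest_pipeline_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_notebook = DatabricksSubmitRunOperator(
            task_id="run_pinterest_notebook",
            databricks_conn_id="databricks_default",
            existing_cluster_id="<cluster-id>",  # placeholder
            notebook_task={"notebook_path": "/Workspace/pinterest_data_pipeline"},
        )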

Project Structure

├── .databricks/                                        # Databricks integration files
│   └── commit_outputs                                  # Output logs from Databricks commits
├── airflow/                                            # Apache Airflow configuration
│   ├── dags/                                           # Directory for Airflow DAG definitions
│   │   └── 808492447622.py                             # Main DAG file for workflow orchestration
│   └── requirements.txt                                # Python dependencies for Airflow
├── databricks_notebook_output/                         # Exported Databricks notebooks
│   ├── pinterest_data_pipeline.html                    # HTML version of the notebook
│   └── pinterest_data_pipeline.ipynb                   # Jupyter notebook version
├── img/                                                # Project documentation images
│   ├── Pinterest_Data_Pipeline_cloud.drawio.svg             # Cloud architecture diagram
│   └── Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg  # Hybrid setup diagram
├── localstack-kinesis/                                 # LocalStack implementation for Kinesis
│   ├── kinesis-consumer.py                             # Standalone consumer for testing
│   ├── kinesis-implementation.py                       # Main implementation with LocalStack
│   └── localstack-setup.sh                             # Setup script for LocalStack
├── user_posting_emulation.py                           # Batch processing implementation
├── user_posting_emulation_streaming.py                 # Streaming processing implementation
├── pinterest_data_pipeline.ipynb                       # Main development notebook
├── pinterest_data_db.sql                               # Database initialization script
├── LICENSE                                             # MIT License file
├── README.md                                           # Project documentation
└── .gitignore                                          # Git ignore rules

Additional Features to Implement

  • Real-time Data Visualization:
    • Integrate Power BI for interactive dashboard creation
    • Implement Tableau for advanced data visualization
    • Create real-time monitoring dashboards for Kinesis streams
  • Enhanced Analytics:
    • Develop predictive analytics models using streamed data
    • Implement anomaly detection for data quality monitoring
  • Performance Optimization:
    • Fine-tune Kafka and Kinesis configurations for better throughput
    • Optimize Databricks processing jobs

Security Notes

  • API keys and credentials should be stored securely
  • Use appropriate IAM roles and permissions
  • Never commit sensitive credentials to version control (use .gitignore)
  • For LocalStack development, use dummy credentials only

License

This project is licensed under the MIT License - see the LICENSE file for details.
