- Overview
- Architecture
- Components
- Setup and Installation
- Installation
- Usage
- Project Structure
- Additional Features to Implement
- Security Notes
- License
This project implements a sophisticated data pipeline that processes Pinterest-like data through various AWS services. It combines batch processing (Apache Kafka) and stream processing (Amazon Kinesis) capabilities, with Apache Airflow orchestrating the workflow and Databricks handling data transformation. The pipeline demonstrates modern data engineering practices by integrating multiple AWS services and open-source tools to create a robust, scalable data processing solution.
The pipeline consists of several components working together:
- Data Ingestion
  - Extracts data from a MySQL database (local/RDS)
  - Processes three main data types:
    - Pinterest post data
    - Geolocation data
    - User data
- Data Processing Paths
  - Batch Processing Path (see the sketch after this list):
    - Kafka REST Proxy on EC2 for data ingestion
    - S3 storage via Kafka Connect
    - Databricks for data transformation
  - Streaming Path:
    - Amazon Kinesis for real-time data streaming
    - LocalStack Kinesis for local development
    - Direct integration with Databricks
- Orchestration
  - Apache Airflow (local/MWAA) for workflow management
  - Integration between Airflow and Databricks
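
To make the batch path concrete, here is a minimal sketch of sending one record to a Confluent-style Kafka REST Proxy behind API Gateway. The invoke URL and topic name are placeholders, not the project's actual values; the real logic lives in `user_posting_emulation.py`.

```python
import json

import requests

# Placeholder invoke URL for the API Gateway stage fronting the Kafka REST
# Proxy on EC2 -- replace with your own deployment's URL and topic.
INVOKE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"
TOPIC = "pin_data_pin"  # illustrative topic name


def send_record(record: dict) -> None:
    """POST one record to the REST proxy using the Kafka JSON v2 envelope."""
    payload = json.dumps({"records": [{"value": record}]})
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    response = requests.post(f"{INVOKE_URL}/topics/{TOPIC}", headers=headers, data=payload)
    response.raise_for_status()


if __name__ == "__main__":
    send_record({"index": 1, "title": "example pin", "follower_count": 100})
```
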
Please note there are two architecture diagrams:
- Cloud architecture diagram: `img/Pinterest_Data_Pipeline_cloud.drawio.svg`
- Hybrid setup diagram: `img/Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg`

This project deploys the hybrid setup.
- `user_posting_emulation.py`: Handles batch processing via Kafka
- `user_posting_emulation_streaming.py`: Manages real-time data streaming via Kinesis
- `pinterest_data_pipeline.ipynb`: Jupyter notebook containing pipeline development and testing
- `localstack-kinesis/kinesis-implementation.py`: Implementation for local Kinesis development using LocalStack
- `localstack-kinesis/kinesis-consumer.py`: Standalone consumer for testing LocalStack Kinesis streams
- `localstack-kinesis/localstack-setup.sh`: Setup script for LocalStack and Kinesis streams
- Amazon RDS/Local MySQL for data storage
- Amazon EC2 for Kafka broker hosting
- Amazon API Gateway for REST endpoints
- Amazon S3 for data lake storage
- Amazon Kinesis for stream processing
- LocalStack for local AWS service emulation
- Amazon MWAA (Managed Workflows for Apache Airflow)/local Apache Airflow
- Databricks for data transformation and analysis
- AWS Account with appropriate permissions
- Python 3.10.13
- MySQL database
- Apache Kafka
- Databricks workspace
- LocalStack (for local development)
- Required Python packages (see `airflow/requirements.txt`)

Credential files (kept out of version control):
- `api_creds.yaml`: API Gateway credentials
- `aws_db_creds.yaml`: AWS RDS credentials
- `local_db_creds.yaml`: Local database credentials
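
The credential files are plain YAML, so they can be read with PyYAML; a minimal sketch is shown below. The key names in the comment are assumptions, not the project's actual schema.

```python
import yaml


def load_creds(path: str) -> dict:
    """Read a YAML credentials file and return its contents as a dict."""
    with open(path, "r") as stream:
        return yaml.safe_load(stream)


# Key names are illustrative -- use whatever keys your credential files define,
# e.g. creds["HOST"], creds["USER"], creds["PASSWORD"], creds["DATABASE"].
creds = load_creds("local_db_creds.yaml")
```
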
- Environment Setup
  - Set up a virtual environment (conda)
- Database Configuration
  - Configure MySQL database using `pinterest_data_db.sql`
  - Set up credentials in appropriate YAML files
- AWS Services Setup
  - Configure EC2 instance for Kafka
  - Set up API Gateway endpoints
  - Configure S3 buckets
  - Set up Kinesis streams (see the boto3 sketch after this list)
  - Configure MWAA environment
- LocalStack Setup (for local development)
  - Install and configure LocalStack using `localstack-kinesis/localstack-setup.sh`
  - Create Kinesis streams for local development
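
For the Kinesis stream steps above, a minimal boto3 sketch is shown below. The stream names match the ones passed to the LocalStack consumer later in this README; the LocalStack endpoint (default edge port) and dummy credentials are assumptions rather than values taken from `localstack-setup.sh`. Drop `endpoint_url` and the dummy credentials to target real AWS instead.

```python
import boto3

# Dummy credentials and the default LocalStack edge endpoint (local use only).
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

# Stream names as used by the standalone LocalStack consumer in this repository.
for stream_name in ("pin_data_pin", "pin_data_geo", "pin_data_user"):
    kinesis.create_stream(StreamName=stream_name, ShardCount=1)
    # Block until the stream transitions to ACTIVE before using it.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
```
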
Clone the repository: `git clone https://github.com/luke-who/pinterest-data-pipeline542.git`

Batch processing:
- Start the Kafka services on EC2
- Run the batch processing script: `python user_posting_emulation.py`

Stream processing:
- Ensure AWS Kinesis streams are configured
- Run the streaming script: `python user_posting_emulation_streaming.py`
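
Conceptually, the streaming script pushes each record into a Kinesis stream. The sketch below shows the equivalent direct boto3 call; the actual script may route writes through API Gateway instead, and the stream name and partition key are illustrative.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def put_record(record: dict, stream_name: str = "pin_data_pin") -> None:
    """Write a single JSON-encoded record to a Kinesis stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record.get("index", "default")),
    )


put_record({"index": 1, "title": "example pin"})
```
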

LocalStack development:
- Set up LocalStack and create Kinesis streams:
  ```bash
  cd localstack-kinesis
  chmod +x localstack-setup.sh
  ./localstack-setup.sh
  ```
- Run the LocalStack Kinesis implementation: `python kinesis-implementation.py`
- For testing and monitoring streams, use the standalone consumer in a separate terminal: `python kinesis-consumer.py pin_data_geo` (or `pin_data_pin` / `pin_data_user`)
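
For reference, a stripped-down version of what such a consumer does is sketched below; the LocalStack endpoint, dummy credentials, and polling loop are assumptions rather than the actual contents of `kinesis-consumer.py`.

```python
import json
import sys
import time

import boto3

# Dummy credentials and the default LocalStack edge endpoint (local use only).
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)


def consume(stream_name: str) -> None:
    """Poll every shard of the stream and print each decoded record."""
    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    iterators = [
        kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
        )["ShardIterator"]
        for shard in shards
    ]
    while True:
        next_iterators = []
        for iterator in iterators:
            result = kinesis.get_records(ShardIterator=iterator, Limit=25)
            for record in result["Records"]:
                print(json.loads(record["Data"]))
            next_iterators.append(result["NextShardIterator"])
        iterators = next_iterators
        time.sleep(1)


if __name__ == "__main__":
    consume(sys.argv[1] if len(sys.argv) > 1 else "pin_data_pin")
```
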
The `airflow/dags` directory contains the workflow definitions for orchestrating the pipeline tasks.
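
As a rough illustration of how a DAG in that directory might trigger the Databricks notebook, see the sketch below. The connection ID, cluster ID, notebook path, and schedule are placeholders, not values taken from `808492447622.py`.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="pinterest_databricks_dag",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_pinterest_notebook",
        databricks_conn_id="databricks_default",  # Airflow connection to Databricks
        existing_cluster_id="<cluster-id>",       # placeholder cluster id
        notebook_task={"notebook_path": "/Users/<user>/pinterest_data_pipeline"},
    )
```
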
```
├── .databricks/ # Databricks integration files
│ └── commit_outputs # Output logs from Databricks commits
├── airflow/ # Apache Airflow configuration
│ ├── dags/ # Directory for Airflow DAG definitions
│ │ └── 808492447622.py # Main DAG file for workflow orchestration
│ └── requirements.txt # Python dependencies for Airflow
├── databricks_notebook_output/ # Exported Databricks notebooks
│ ├── pinterest_data_pipeline.html # HTML version of the notebook
│ └── pinterest_data_pipeline.ipynb # Jupyter notebook version
├── img/ # Project documentation images
│ ├── Pinterest_Data_Pipeline_cloud.drawio.svg # Cloud architecture diagram
│ └── Pinterest_Data_Pipeline_local_DB_Airflow.drawio.svg # Hybrid setup diagram
├── localstack-kinesis/ # LocalStack implementation for Kinesis
│ ├── kinesis-consumer.py # Standalone consumer for testing
│ ├── kinesis-implementation.py # Main implementation with LocalStack
│ └── localstack-setup.sh # Setup script for LocalStack
├── user_posting_emulation.py # Batch processing implementation
├── user_posting_emulation_streaming.py # Streaming processing implementation
├── pinterest_data_pipeline.ipynb # Main development notebook
├── pinterest_data_db.sql # Database initialization script
├── LICENSE # MIT License file
├── README.md # Project documentation
└── .gitignore # Git ignore rules
```
- Real-time Data Visualization:
  - Integrate Power BI for interactive dashboard creation
  - Implement Tableau for advanced data visualization
  - Create real-time monitoring dashboards for Kinesis streams
- Enhanced Analytics:
  - Develop predictive analytics models using streamed data
  - Implement anomaly detection for data quality monitoring
- Performance Optimization:
  - Fine-tune Kafka and Kinesis configurations for better throughput
  - Optimize Databricks processing jobs
- API keys and credentials should be stored securely
- Use appropriate IAM roles and permissions
- Never commit sensitive credentials to version control (use `.gitignore`)
- For LocalStack development, use dummy credentials only
This project is licensed under the MIT License - see the LICENSE file for details.