A production-grade ETL pipeline for processing financial market data using Apache Airflow, dbt, and PostgreSQL. This project demonstrates modern data engineering practices with a focus on reliability, scalability, and performance.
This project implements a robust data pipeline that processes financial market data from the Alpha Vantage API. It showcases industry best practices in data engineering, including data validation, testing, documentation, and monitoring.
```mermaid
graph TD
    A[Alpha Vantage API] -->|Extract| B[Raw Data Layer]
    B -->|Transform| C[Staging Layer]
    C -->|Model| D[Marts Layer]
    D -->|Visualize| E[Dashboards]

    style A fill:#f9a825,stroke:#f57f17,stroke-width:2px
    style B fill:#42a5f5,stroke:#1976d2,stroke-width:2px
    style C fill:#66bb6a,stroke:#388e3c,stroke-width:2px
    style D fill:#ab47bc,stroke:#7b1fa2,stroke-width:2px
    style E fill:#ec407a,stroke:#c2185b,stroke-width:2px
```
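The extraction step pulls daily price series from the Alpha Vantage REST API. Below is a minimal sketch of that call using `requests`; the function name, environment variable, and return handling are illustrative assumptions, not the project's actual extractor (which lives under `src/extractors/`).

```python
import os
import requests

ALPHA_VANTAGE_URL = "https://www.alphavantage.co/query"

def fetch_daily_prices(symbol: str) -> dict:
    """Fetch daily OHLCV data for one symbol from Alpha Vantage.

    Illustrative only: the real extractor may use a different function name,
    error handling, and output format.
    """
    params = {
        "function": "TIME_SERIES_DAILY",  # standard Alpha Vantage daily endpoint
        "symbol": symbol,
        "outputsize": "compact",          # roughly the last 100 trading days
        "apikey": os.environ["ALPHA_VANTAGE_API_KEY"],  # assumed env var name
    }
    response = requests.get(ALPHA_VANTAGE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = fetch_daily_prices("AAPL")
    # Print a few rows of the time series to confirm the call worked.
    print(list(data.get("Time Series (Daily)", {}).items())[:3])
```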
- **Real-time Market Data**: Automated extraction of stock market data from Alpha Vantage
- **Data Quality**: Comprehensive data testing and validation using dbt
- **Scalable Architecture**: Containerized services with proper health checks and dependency management
- **Visualization**: Interactive Streamlit dashboard and Metabase BI platform
- **Technical Analysis**: Built-in indicators and market metrics
- **Monitoring**: Built-in logging and health monitoring for all services
- **Documentation**: Extensive documentation of models, tests, and best practices
- **Resource Optimization**: Support for older hardware with minimal resource requirements
| Category | Technology |
|---|---|
| Orchestration | Apache Airflow 2.7.3 |
| Data Warehouse | PostgreSQL 13 |
| Transformation | dbt 1.7.3 |
| Containerization | Docker & Docker Compose |
| Programming | Python 3.9 |
| Data Source | Alpha Vantage API |
| Visualization | Streamlit & Metabase |
| Testing | pytest, dbt tests |
Our data pipeline follows a modern layered architecture:
```mermaid
flowchart LR
    subgraph Extraction
        A[Alpha Vantage API] --> B[Airflow DAGs]
    end
    subgraph Storage
        B --> C[Raw Layer]
        C --> D[Staging Layer]
        D --> E[Marts Layer]
    end
    subgraph Visualization
        E --> F[Streamlit Dashboard]
        E --> G[Metabase]
    end

    style A fill:#f9a825,stroke:#f57f17,stroke-width:2px
    style B fill:#42a5f5,stroke:#1976d2,stroke-width:2px
    style C fill:#90caf9,stroke:#42a5f5,stroke-width:2px
    style D fill:#66bb6a,stroke:#388e3c,stroke-width:2px
    style E fill:#ab47bc,stroke:#7b1fa2,stroke-width:2px
    style F fill:#ec407a,stroke:#c2185b,stroke-width:2px
    style G fill:#7e57c2,stroke:#512da8,stroke-width:2px
```
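To give a flavour of the orchestration layer, here is a hedged sketch of a daily extract-and-load DAG. The DAG id, task names, and the `extract_prices` / `load_to_raw` callables are hypothetical; the real DAG definitions live in `airflow/dags/`.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical helpers; the real modules live under src/extractors/ and src/loaders/.
def extract_prices(**context):
    """Pull daily prices from Alpha Vantage and hand the payload to the next task."""
    ...

def load_to_raw(**context):
    """Insert the extracted payload into the raw layer in PostgreSQL."""
    ...

default_args = {
    "owner": "datapipe",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="market_data_daily",      # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_prices", python_callable=extract_prices)
    load = PythonOperator(task_id="load_to_raw", python_callable=load_to_raw)

    # The raw load runs only after a successful extraction.
    extract >> load
```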
Our data model follows a star schema design for analytics:
```mermaid
erDiagram
    fact_market_metrics }o--|| dim_company : references
    dim_company {
        string symbol PK
        string company_name
        string sector
        decimal market_cap
        decimal pe_ratio
    }
    fact_market_metrics {
        date trading_date
        string symbol FK
        decimal close_price
        bigint volume
        decimal price_change_pct
    }
```
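For ad-hoc analysis the marts can be queried directly from Python. The sketch below joins `fact_market_metrics` to `dim_company` using the columns from the diagram above; the connection string is an assumption based on a default local PostgreSQL setup and should be adjusted to your `.env` values.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection details are assumptions for a local docker-compose setup;
# adjust user, password, and database name to match your configuration.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

query = """
    SELECT f.trading_date,
           f.symbol,
           d.company_name,
           d.sector,
           f.close_price,
           f.volume,
           f.price_change_pct
    FROM public_marts.fact_market_metrics AS f
    JOIN public_marts.dim_company AS d USING (symbol)
    ORDER BY f.trading_date DESC
    LIMIT 10
"""

# Most recent market metrics joined to company attributes via the star schema.
df = pd.read_sql(query, engine)
print(df.head())
```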
- Docker and Docker Compose
- Python 3.9+
- Make (optional, for using Makefile commands)
- Alpha Vantage API key
- Clone the repository:

  ```bash
  git clone https://github.com/javid912/datapipe-analytics.git
  cd datapipe-analytics
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: .\venv\Scripts\activate
  ```

- Copy the example environment file and configure your API key:

  ```bash
  cp .env.example .env
  # Edit .env and add your Alpha Vantage API key
  ```

- Start the services:

  ```bash
  # For standard hardware:
  docker-compose up -d

  # For older or resource-constrained hardware:
  docker-compose -f docker-compose-minimal.yml up -d
  ```

- Access the services:
  - Streamlit Dashboard: http://localhost:8501
  - Metabase: http://localhost:3000 (username: [email protected], password: metabase123)
  - Airflow UI: http://localhost:8080 (username: admin, password: admin)
  - PostgreSQL: localhost:5432
For older or resource-constrained hardware, we provide a minimal Docker Compose configuration:
```bash
docker-compose -f docker-compose-minimal.yml up -d
```
This configuration:
- Reduces memory usage for all containers
- Limits CPU usage
- Starts only essential services
- Optimizes database connections
- Implements selective computation of technical indicators
```
datapipe-analytics/
├── airflow/                  # Airflow DAGs and configurations
│   └── dags/                 # DAG definitions
├── dbt/                      # Data transformation
│   ├── models/               # dbt models
│   │   ├── staging/          # Staging models
│   │   └── marts/            # Mart models
│   ├── seeds/                # Seed data files
│   └── tests/                # Data tests
├── docker/                   # Dockerfile definitions
├── src/                      # Source code
│   ├── dashboard/            # Streamlit dashboard
│   ├── extractors/           # Data extraction modules
│   └── loaders/              # Database loading modules
├── tests/                    # Python tests
├── docs/                     # Documentation
└── DEVELOPMENT_JOURNAL.md    # Development history
```
Our dbt models follow a layered architecture:
| Layer | Purpose | Examples |
|---|---|---|
| Raw (`public_raw`) | Original data from external sources | `raw_stock_prices`, `raw_company_info` |
| Staging (`public_staging`) | Clean, typed data from raw sources | `stg_daily_prices`, `stg_company_info` |
| Marts (`public_marts`) | Business logic transformations for analytics | `dim_company`, `fact_market_metrics` |
The project includes comprehensive testing at multiple levels:
- **dbt tests**: Data quality and business logic validation
- **Python unit tests**: Code functionality verification (see the sketch after this list)
- **Integration tests**: End-to-end pipeline validation
- **Container health checks**: Service availability monitoring
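As an illustration of the Python unit-test layer, the pytest sketch below exercises a hypothetical payload-parsing helper; `parse_daily_prices` is a stand-in, not the project's actual extractor code.

```python
# Illustrative test module; not the project's actual test file.
import pytest

def parse_daily_prices(payload: dict) -> list[dict]:
    """Hypothetical helper that flattens an Alpha Vantage daily payload."""
    series = payload.get("Time Series (Daily)", {})
    return [
        {"trading_date": day, "close_price": float(values["4. close"])}
        for day, values in series.items()
    ]

def test_parse_daily_prices_returns_rows():
    # Minimal fixture shaped like the Alpha Vantage daily response.
    payload = {"Time Series (Daily)": {"2024-01-02": {"4. close": "185.64"}}}
    rows = parse_daily_prices(payload)
    assert rows == [{"trading_date": "2024-01-02", "close_price": 185.64}]

def test_parse_daily_prices_handles_empty_payload():
    assert parse_daily_prices({}) == []
```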
Our Streamlit dashboard provides:
- Market overview with key metrics
- Technical analysis with indicators (see the sketch after this list)
- Company-specific deep dives
- Historical price analysis
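As a flavour of how an indicator could be rendered, here is a minimal hedged Streamlit sketch that computes a 20-day simple moving average over the marts layer; the real dashboard under `src/dashboard/` is richer, and the connection string here is an assumption.

```python
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Connection string is an assumption; the real dashboard reads it from the environment.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

st.title("Market overview (sketch)")
symbol = st.text_input("Symbol", value="AAPL")

# Pull prices from the marts layer and filter to the selected symbol.
prices = pd.read_sql(
    "SELECT symbol, trading_date, close_price "
    "FROM public_marts.fact_market_metrics ORDER BY trading_date",
    engine,
)
prices = prices[prices["symbol"] == symbol]

# 20-day simple moving average as a basic technical indicator.
prices["sma_20"] = prices["close_price"].rolling(window=20).mean()

st.line_chart(prices.set_index("trading_date")[["close_price", "sma_20"]])
```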
Metabase offers:
- Custom SQL queries and visualizations
- Scheduled reports and alerts
- Interactive filtering and exploration
- Shareable dashboards and insights
Access Metabase at:
- URL: http://localhost:3000
- Default credentials:
  - Email: [email protected]
  - Password: metabase123
Our monitoring approach includes:
- Service health monitoring via Docker health checks
- Airflow task monitoring and alerting
- dbt test coverage and data quality metrics
- Comprehensive logging for all components (see the sketch after this list)
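On the logging side, each Python component can configure a standard-library logger along these lines; this is a hedged sketch rather than the project's exact configuration.

```python
import logging

# Illustrative logging setup; the real components may configure handlers differently.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
)

logger = logging.getLogger("datapipe.extractors")  # hypothetical logger name
logger.info("Starting daily extraction run")
```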
We welcome contributions! Please read our CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
Check out our Issues page to see current tasks, bugs, and feature requests. Feel free to pick up any issue labeled "good first issue" to get started!
This project is licensed under the MIT License - see the LICENSE file for details.
- [x] Add Streamlit dashboard for data visualization
- [x] Implement resource optimization for older hardware
- [x] Add Metabase integration
- [ ] Implement real-time data processing
- [ ] Add more technical indicators
- [ ] Enhance monitoring and alerting
- [ ] Add support for more data sources
- Alpha Vantage for providing financial market data
- Apache Airflow for workflow orchestration
- dbt for data transformation
- Streamlit for dashboard creation
- Metabase for business intelligence