Databox: Modern Bird Observation Analytics Platform

A production-ready data platform for analyzing bird observation data from the eBird API, featuring real-time data ingestion, transformation pipelines, and interactive dashboards.

🚀 Features

Complete Data Pipeline

eBird API Integration: Real-time bird observation data from multiple US states
Multi-State Support: Arizona and California out of the box, expandable to all US states
Automated Ingestion: Scheduled data collection with 30-day lookback
Data Transformation: Clean, standardized models using SQLMesh
Interactive Dashboard: Rich Streamlit app with maps, charts, and filters
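
As a rough sketch of what a 30-day lookback request could look like, the snippet below builds (but does not send) a call for recent observations in one region. The endpoint path, the `back` query parameter, and the `X-eBirdApiToken` header follow the public eBird API 2.0 documentation. The function name `build_recent_observations_request` is hypothetical and is not taken from pipelines/sources/ebird_api.py, which may be organized differently.

```python
from urllib.parse import quote, urlencode

EBIRD_BASE_URL = "https://api.ebird.org/v2"
MAX_LOOKBACK_DAYS = 30  # the eBird API accepts back=1..30


def build_recent_observations_request(region: str, token: str, back_days: int = MAX_LOOKBACK_DAYS):
    """Return (url, headers) for a recent-observations call; nothing is sent."""
    if not 1 <= back_days <= MAX_LOOKBACK_DAYS:
        raise ValueError(f"back_days must be between 1 and {MAX_LOOKBACK_DAYS}")
    query = urlencode({"back": back_days})
    url = f"{EBIRD_BASE_URL}/data/obs/{quote(region)}/recent?{query}"
    headers = {"X-eBirdApiToken": token}
    return url, headers


# Hypothetical usage with a dummy token.
url, headers = build_recent_observations_request("US-AZ", token="example-token")
print(url)
```

Keeping URL construction separate from the HTTP call makes the lookback limit easy to unit test without network access.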

Modern Data Stack

Database: DuckDB (fast, embedded analytical database)
Ingestion: dlt (data load tool) with robust error handling
Transformations: SQLMesh with version control and testing
Orchestration: Dagster for workflow management
Visualization: Streamlit with Plotly for interactive charts and maps
Task Management: Task (go-task) for streamlined development

🎯 Quick Start

1. Environment Setup

# Install Task (macOS)
brew install go-task/tap/go-task

# Clone and setup
git clone <your-repo>
cd databox
task setup

# Configure your eBird API token
cp .env.example .env
# Edit .env and add your EBIRD_API_TOKEN

2. Run the Data Pipeline

# Ingest Arizona bird data (default)
task pipeline:ebird

# Or specify a different state
task pipeline:ebird -- --region US-CA  # California
task pipeline:ebird -- --region US-NY  # New York

# Transform raw data into analytics-ready tables
task transform:run

3. Launch the Dashboard

# Start interactive bird observation dashboard
task streamlit

# Access at http://localhost:8501

🏗️ Project Architecture

databox/
├── apps/                    # Applications and dashboards
│   └── ebird_streamlit/    # Streamlit bird observation dashboard
│       ├── main.py         # Main dashboard application
│       ├── README.md       # Dashboard documentation
│       └── .streamlit/     # Streamlit configuration
├── pipelines/              # Data ingestion pipelines
│   └── sources/
│       └── ebird_api.py    # eBird API integration with multi-state support
├── transformations/        # SQLMesh data transformation project
│   └── home_team/         # Main transformation project
│       ├── models/        # SQL transformation models
│       │   ├── staging/   # Clean, standardized data (stg_*)
│       │   ├── intermediate/ # Business logic (int_*)
│       │   └── marts/     # Final analytics tables (fct_*, dim_*)
│       ├── tests/         # Model tests and data quality checks
│       └── config.yaml    # SQLMesh configuration
├── orchestration/          # Dagster workflow orchestration
│   └── dagster_project.py # Asset definitions and jobs
├── data/                  # Data storage (gitignored)
│   └── databox.db        # DuckDB database file
├── scripts/               # Utility scripts
└── .dagster/             # Dagster state and configuration

📊 Dashboard Features

Interactive Filters

State/Region: Select Arizona, California, or multiple states
Date Range: Filter observations by date with smart defaults
Species: Multi-select from 700+ observed bird species
Time of Day: Hour-based filtering (0-23)
Notable Observations: Show only rare/unusual sightings
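
The filters above combine with AND semantics: a row is shown only if it passes every active filter. The following is a minimal sketch of that logic in plain Python, not the dashboard's actual code. The `Observation` class, its field names, and `apply_filters` are hypothetical, and the sample rows are made up for illustration. A date-range filter would follow the same pattern.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    species: str
    region: str
    observed_at: datetime
    is_notable: bool


def apply_filters(observations, regions=None, species=None, hour_range=(0, 23), notable_only=False):
    """Keep observations matching every active filter; None means no restriction."""
    first_hour, last_hour = hour_range
    result = []
    for obs in observations:
        if regions is not None and obs.region not in regions:
            continue
        if species is not None and obs.species not in species:
            continue
        if not first_hour <= obs.observed_at.hour <= last_hour:
            continue
        if notable_only and not obs.is_notable:
            continue
        result.append(obs)
    return result


# Made-up sample rows, for illustration only.
sample = [
    Observation("Gila Woodpecker", "US-AZ", datetime(2024, 5, 1, 6, 30), False),
    Observation("California Condor", "US-AZ", datetime(2024, 5, 1, 14, 0), True),
    Observation("Acorn Woodpecker", "US-CA", datetime(2024, 5, 2, 7, 15), False),
]
morning_az = apply_filters(sample, regions={"US-AZ"}, hour_range=(5, 9))
print([obs.species for obs in morning_az])  # ['Gila Woodpecker']
```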

Visualization Tabs

🗺️ Map Tab (Default)

Interactive map of all observation locations
Color-coded by bird species
Semi-transparent red star markers for birding hotspots
Hover details with location and observation info

📊 Overview Tab

Top 15 most frequently observed species (horizontal bar chart)
Hourly bird activity patterns (line chart)
Daily observation timeline showing trends over time
Key metrics: total observations, unique species, locations, notable sightings

📈 Trends Tab

Species diversity over time
Daily observation count trends
Aggregated analytics from daily fact tables
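
Species diversity over time can be read as the number of distinct species recorded per day. The sketch below shows one way to compute it, assuming the input is a sequence of (date, species name) pairs. The function name and the sample rows are hypothetical, and the dashboard may derive this from the daily fact table instead.

```python
from collections import defaultdict
from datetime import date


def species_diversity_by_day(rows):
    """Map each observation date to the number of distinct species seen that day."""
    species_per_day = defaultdict(set)
    for observed_on, species in rows:
        species_per_day[observed_on].add(species)
    # Sort by date so the result can be plotted directly as a time series.
    return {day: len(names) for day, names in sorted(species_per_day.items())}


# Made-up sample rows; the duplicate sighting on May 1 counts once.
rows = [
    (date(2024, 5, 1), "Gila Woodpecker"),
    (date(2024, 5, 1), "Gila Woodpecker"),
    (date(2024, 5, 1), "Cactus Wren"),
    (date(2024, 5, 2), "Cactus Wren"),
]
print(species_diversity_by_day(rows))
```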

📋 Data Tab

Raw data exploration with search functionality
Searchable by species name, scientific name, or location
CSV export functionality
Data overview metrics

🐦 Data Model

Raw Data Sources

Recent Observations: Current bird sightings (33+ records from Arizona)
Notable Observations: Rare/unusual birds (81+ records)
Hotspots: Popular birding locations (477+ locations)
Species List: State-specific bird species (700+ species)
Taxonomy: Global eBird taxonomy (17,415+ species)

Transformed Models

stg_ebird_observations: Cleaned observation data with standardized columns
stg_ebird_hotspots: Processed hotspot locations with coordinates
stg_ebird_taxonomy: Normalized species taxonomy data
int_ebird_enriched_observations: Observations enriched with business logic
fct_daily_bird_observations: Daily aggregated metrics by species and location
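
To illustrate the kind of aggregation behind fct_daily_bird_observations, here is a small sketch that groups staged observations by day, species, and location. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The column names and sample rows are assumptions, and the real SQLMesh model may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical column names; the real staging model may use different ones.
conn.execute(
    """
    CREATE TABLE stg_ebird_observations (
        observation_date TEXT,
        species_code     TEXT,
        location_id      TEXT,
        bird_count       INTEGER
    )
    """
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO stg_ebird_observations VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01", "gilwoo", "L100", 2),
        ("2024-05-01", "gilwoo", "L100", 3),
        ("2024-05-01", "cacwre", "L200", 1),
        ("2024-05-02", "gilwoo", "L100", 4),
    ],
)

# One output row per (day, species, location), as in a daily fact table.
daily = conn.execute(
    """
    SELECT observation_date, species_code, location_id,
           COUNT(*)        AS observation_count,
           SUM(bird_count) AS total_birds
    FROM stg_ebird_observations
    GROUP BY observation_date, species_code, location_id
    ORDER BY observation_date, species_code, location_id
    """
).fetchall()

for row in daily:
    print(row)
```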

🛠️ Development Workflows

Data Pipeline Operations

# List available pipelines
task pipeline:list

# Run full data refresh
task full-refresh

# Plan transformation changes
task transform:plan

# Apply transformations
task transform:run

# Run transformation tests
task transform:test

# Open SQLMesh UI for model development
task transform:ui

Dashboard Development

# Start development server with hot reload
task streamlit

# Or run directly
cd apps/ebird_streamlit
streamlit run main.py

Orchestration

# Start Dagster development server
task dagster:dev

# Execute specific job
task dagster:job daily_ebird_pipeline

# Materialize specific assets
task dagster:materialize ebird_raw_data

Code Quality

# Format and lint code
task format
task lint

# Type checking
task typecheck

# Run all CI checks
task ci

# Security scanning
task check-secrets

🌍 Multi-State Support

The platform is designed for easy expansion to additional US states:

# Add new states by running pipeline with different regions
task pipeline:ebird -- --region US-TX  # Texas
task pipeline:ebird -- --region US-FL  # Florida
task pipeline:ebird -- --region US-NY  # New York

# The dashboard automatically detects and includes new states
# No code changes required!

Supported Region Codes:

US-AZ - Arizona
US-CA - California
US-NY - New York
US-TX - Texas
US-FL - Florida
US - All of the United States (if the API supports it)
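
A pipeline that accepts arbitrary region codes benefits from validating them before calling the API. The sketch below checks the format of a code and looks up the names from the list above. `normalize_region` and `describe_region` are hypothetical helpers, and the pattern only covers the US codes documented here.

```python
import re

# Region codes and names taken from the list above.
KNOWN_REGIONS = {
    "US-AZ": "Arizona",
    "US-CA": "California",
    "US-NY": "New York",
    "US-TX": "Texas",
    "US-FL": "Florida",
    "US": "All United States",
}

# "US" alone, or "US-" followed by a two-letter state abbreviation.
REGION_PATTERN = re.compile(r"US(-[A-Z]{2})?")


def normalize_region(raw: str) -> str:
    """Trim and upper-case a region code, rejecting anything that is not US or US-XX."""
    code = raw.strip().upper()
    if not REGION_PATTERN.fullmatch(code):
        raise ValueError(f"not a US region code: {raw!r}")
    return code


def describe_region(code: str) -> str:
    """Return the documented name, or flag codes that are valid in form but not listed."""
    return KNOWN_REGIONS.get(code, f"{code} (not in the documented list)")


print(describe_region(normalize_region(" us-ca ")))  # California
```

The format check is deliberately loose: it accepts any two-letter suffix, so a state not listed above still passes and is reported as undocumented rather than rejected.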

📈 Performance & Scalability

Current Scale

10,000+ observations per state per pipeline run
30-day lookback for historical data
Sub-second dashboard response with caching
Real-time filtering across multiple dimensions

Optimizations

Streamlit @st.cache_data for query performance
DuckDB columnar storage for analytical queries
SQLMesh incremental models for efficient transformations
Connection pooling and proper resource management
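
The caching idea can be illustrated without Streamlit. The sketch below is a plain-Python time-to-live cache that mimics what a decorator such as `@st.cache_data(ttl=...)` provides conceptually: repeated calls with the same arguments reuse the stored result until it expires. It is a stand-in for explanation, not Streamlit's implementation, and `ttl_cache` and `load_observations` are hypothetical names.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds: float, clock=time.monotonic):
    """Cache results per positional-argument tuple and recompute after ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value

        return wrapper
    return decorator


calls = []


@ttl_cache(ttl_seconds=300)
def load_observations(region: str):
    calls.append(region)  # stands in for an expensive DuckDB query
    return f"rows for {region}"


load_observations("US-AZ")
load_observations("US-AZ")  # served from the cache
print(len(calls))  # 1
```

The `clock` parameter exists only so that expiry can be tested with a fake clock instead of sleeping.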

🔒 Security & Best Practices

API Key Management

Environment variables for sensitive configuration
Pre-commit hooks prevent accidental secret commits
Placeholder values in example files

Data Quality

SQLMesh built-in testing framework
Data type validation and constraints
Error handling and graceful degradation
Monitoring and alerting through Dagster

🚀 Deployment

Development

task prod  # Shows available production commands

Production Considerations

Replace DuckDB with PostgreSQL/Snowflake for multi-user access
Add Airflow/Prefect for production orchestration
Implement proper logging and monitoring
Add CI/CD pipeline for automated deployments
Set up data backup and disaster recovery

📚 Documentation

Main README: You're reading it!
Dashboard Docs: apps/ebird_streamlit/README.md
Pipeline Docs: Inline documentation in pipelines/sources/ebird_api.py
SQLMesh Models: SQL comments and model descriptions
Development Guide: CLAUDE.md for development workflows

🤝 Contributing

Code Quality: All PRs must pass linting, type checking, and tests
Documentation: Update relevant README files for new features
Testing: Add tests for new models and pipeline components
Security: Run task check-secrets before committing

📄 License

MIT License - see LICENSE file for details.

Built with ❤️ for bird enthusiasts and data engineers

Transform raw eBird data into actionable insights with modern data engineering practices.
