A production-ready data platform for analyzing bird observation data from eBird API, featuring real-time data ingestion, transformation pipelines, and interactive dashboards.
- eBird API Integration: Real-time bird observation data from multiple US states
- Multi-State Support: Arizona, California, and expandable to all US states
- Automated Ingestion: Scheduled data collection with 30-day lookback
- Data Transformation: Clean, standardized models using SQLMesh
- Interactive Dashboard: Rich Streamlit app with maps, charts, and filters
- Database: DuckDB (fast, embedded analytical database)
- Ingestion: dlt (data load tool) with robust error handling
- Transformations: SQLMesh with version control and testing
- Orchestration: Dagster for workflow management
- Visualization: Streamlit with Plotly for interactive charts and maps
- Task Management: Task (go-task) for streamlined development
# Install Task (macOS)
brew install go-task/tap/go-task
# Clone and setup
git clone <your-repo>
cd databox
task setup
# Configure your eBird API token
cp .env.example .env
# Edit .env and add your EBIRD_API_TOKEN
# Ingest Arizona bird data (default)
task pipeline:ebird
# Or specify a different state
task pipeline:ebird -- --region US-CA # California
task pipeline:ebird -- --region US-NY # New York
# Transform raw data into analytics-ready tables
task transform:run
# Start interactive bird observation dashboard
task streamlit
# Access at http://localhost:8501
databox/
├── apps/ # Applications and dashboards
│ └── ebird_streamlit/ # Streamlit bird observation dashboard
│ ├── main.py # Main dashboard application
│ ├── README.md # Dashboard documentation
│ └── .streamlit/ # Streamlit configuration
├── pipelines/ # Data ingestion pipelines
│ └── sources/
│ └── ebird_api.py # eBird API integration with multi-state support
├── transformations/ # SQLMesh data transformation project
│ └── home_team/ # Main transformation project
│ ├── models/ # SQL transformation models
│ │ ├── staging/ # Clean, standardized data (stg_*)
│ │ ├── intermediate/ # Business logic (int_*)
│ │ └── marts/ # Final analytics tables (fct_*, dim_*)
│ ├── tests/ # Model tests and data quality checks
│ └── config.yaml # SQLMesh configuration
├── orchestration/ # Dagster workflow orchestration
│ └── dagster_project.py # Asset definitions and jobs
├── data/ # Data storage (gitignored)
│ └── databox.db # DuckDB database file
├── scripts/ # Utility scripts
└── .dagster/ # Dagster state and configuration
- State/Region: Select Arizona, California, or multiple states
- Date Range: Filter observations by date with smart defaults
- Species: Multi-select from 700+ observed bird species
- Time of Day: Hour-based filtering (0-23)
- Notable Observations: Show only rare/unusual sightings
- Interactive map of all observation locations
- Color-coded by bird species
- Semi-transparent red star markers for birding hotspots
- Hover details with location and observation info
- Top 15 most frequently observed species (horizontal bar chart)
- Hourly bird activity patterns (line chart)
- Daily observation timeline showing trends over time
- Key metrics: total observations, unique species, locations, notable sightings
- Species diversity over time
- Daily observation count trends
- Aggregated analytics from daily fact tables
- Raw data exploration with search functionality
- Searchable by species name, scientific name, or location
- CSV export functionality
- Data overview metrics
- Recent Observations: Current bird sightings (33+ records from Arizona)
- Notable Observations: Rare/unusual birds (81+ records)
- Hotspots: Popular birding locations (477+ locations)
- Species List: State-specific bird species (700+ species)
- Taxonomy: Global eBird taxonomy (17,415+ species)
stg_ebird_observations
: Cleaned observation data with standardized columnsstg_ebird_hotspots
: Processed hotspot locations with coordinatesstg_ebird_taxonomy
: Normalized species taxonomy dataint_ebird_enriched_observations
: Business logic applied observationsfct_daily_bird_observations
: Daily aggregated metrics by species and location
# List available pipelines
task pipeline:list
# Run full data refresh
task full-refresh
# Plan transformation changes
task transform:plan
# Apply transformations
task transform:run
# Run transformation tests
task transform:test
# Open SQLMesh UI for model development
task transform:ui
# Start development server with hot reload
task streamlit
# Or run directly
cd apps/ebird_streamlit
streamlit run main.py
# Start Dagster development server
task dagster:dev
# Execute specific job
task dagster:job daily_ebird_pipeline
# Materialize specific assets
task dagster:materialize ebird_raw_data
# Format and lint code
task format
task lint
# Type checking
task typecheck
# Run all CI checks
task ci
# Security scanning
task check-secrets
The platform is designed for easy expansion to additional US states:
# Add new states by running pipeline with different regions
task pipeline:ebird -- --region US-TX # Texas
task pipeline:ebird -- --region US-FL # Florida
task pipeline:ebird -- --region US-NY # New York
# The dashboard automatically detects and includes new states
# No code changes required!
Supported Region Codes:
US-AZ
- ArizonaUS-CA
- CaliforniaUS-NY
- New YorkUS-TX
- TexasUS-FL
- FloridaUS
- All United States (if API supports)
- 10,000+ observations per state per pipeline run
- 30-day lookback for historical data
- Sub-second dashboard response with caching
- Real-time filtering across multiple dimensions
- Streamlit
@st.cache_data
for query performance - DuckDB columnar storage for analytical queries
- SQLMesh incremental models for efficient transformations
- Connection pooling and proper resource management
- Environment variables for sensitive configuration
- Pre-commit hooks prevent accidental secret commits
- Placeholder values in example files
- SQLMesh built-in testing framework
- Data type validation and constraints
- Error handling and graceful degradation
- Monitoring and alerting through Dagster
task prod # Shows available production commands
- Replace DuckDB with PostgreSQL/Snowflake for multi-user access
- Add Airflow/Prefect for production orchestration
- Implement proper logging and monitoring
- Add CI/CD pipeline for automated deployments
- Set up data backup and disaster recovery
- Main README: You're reading it!
- Dashboard Docs:
apps/ebird_streamlit/README.md
- Pipeline Docs: Inline documentation in
pipelines/sources/ebird_api.py
- SQLMesh Models: SQL comments and model descriptions
- Development Guide:
CLAUDE.md
for development workflows
- Code Quality: All PRs must pass linting, type checking, and tests
- Documentation: Update relevant README files for new features
- Testing: Add tests for new models and pipeline components
- Security: Run
task check-secrets
before committing
MIT License - see LICENSE file for details.
Built with ❤️ for bird enthusiasts and data engineers
Transform raw eBird data into actionable insights with modern data engineering practices.