A blockchain indexer using Cryo for data extraction and ClickHouse for storage.
- Quick Start
- Operations Overview
- Repository Structure
- Key Features
- Architecture
- Operation Modes
- Indexing Modes
- Configuration
- Database Schema
- Running with Docker
- Running with Makefile
- State Management
- Monitoring & Validation
- Performance Tuning
- Troubleshooting
- Use Case Examples
- Development
```bash
# 1. Clone and build
git clone <repository>
cd cryo-indexer
make build

# 2. Configure
cp .env.example .env
# Edit .env with your RPC URL and ClickHouse settings

# 3. Run migrations
make run-migrations

# 4. Start indexing
make start
```
The simplified indexer has 3 core operations designed for clarity and reliability:
| Operation | Purpose | When to Use | Key Features |
|---|---|---|---|
| `continuous` | Real-time blockchain following | Production systems, live data | Polls chain tip, handles reorgs, automatic recovery |
| `historical` | Fast bulk indexing of specific ranges | Initial sync, catching up, research | Parallel processing, progress tracking, efficient batching |
| `maintain` | Process failed/pending ranges | After failures, fixing incomplete data | Retry failed ranges, process pending work |
What do you need to do?
```text
🔄 Real-time blockchain following?
└─ Use: continuous

📥 Download specific block range?
├─ Fresh/empty database?
│  └─ Use: historical (most efficient)
└─ Know exact range needed?
   └─ Use: historical

🔧 Fix failed or incomplete data?
├─ Ranges marked as 'failed'?
├─ Ranges stuck as 'pending'?
└─ Use: maintain
```
| Feature | Status | Notes |
|---|---|---|
| `continuous` operation | ✅ Implemented | Real-time blockchain following |
| `historical` operation | ✅ Implemented | Parallel bulk indexing |
| `maintain` operation | ✅ Implemented | Processes failed/pending ranges from state table |
| Automatic gap detection | ❌ Not implemented | Roadmap feature |
| Automatic timestamp fixing | ❌ Not implemented | Roadmap feature |
| Data validation | ✅ Basic validation | Checks for valid timestamps during processing |
```text
cryo-indexer/
├── Dockerfile                  # Container build configuration
├── LICENSE                     # MIT License
├── Makefile                    # Simplified build and run commands
├── README.md                   # This file
├── data/                       # Local data directory (mounted as volume)
├── docker-compose.yml          # Simplified Docker Compose configuration
├── img/
│   └── header-cryo-indexer.png # Header image
├── migrations/                 # Database schema migrations
│   ├── 001_create_database.sql
│   ├── 002_create_blocks.sql
│   ├── 003_create_transactions.sql
│   ├── 004_create_logs.sql
│   ├── 005_create_contracts.sql
│   ├── 006_create_native_transfers.sql
│   ├── 007_create_traces.sql
│   ├── 008_create_balance_diffs.sql
│   ├── 009_create_code_diffs.sql
│   ├── 010_create_nonce_diffs.sql
│   ├── 011_create_storage_diffs.sql
│   ├── 012_create_indexing_state.sql  # Simplified state management
│   └── 013_create_withdrawals.sql     # Withdrawals table (auto-populated)
├── requirements.txt            # Python dependencies
├── scripts/
│   └── entrypoint.sh           # Simplified container entrypoint
└── src/                        # Main application code
    ├── __init__.py
    ├── __main__.py             # Simplified application entry point
    ├── config.py               # Streamlined configuration (15 settings vs 40+)
    ├── indexer.py              # Main indexer with 3 operations
    ├── worker.py               # Simplified worker with strict timestamp requirements
    ├── core/                   # Core functionality
    │   ├── __init__.py
    │   ├── blockchain.py       # Blockchain client for RPC calls
    │   ├── state_manager.py    # Simplified state management
    │   └── utils.py            # Utility functions
    └── db/                     # Database components
        ├── __init__.py
        ├── clickhouse_manager.py  # Simplified ClickHouse operations
        ├── clickhouse_pool.py     # Connection pooling
        └── migrations.py          # Migration runner
```
- 3 Operations: Down from 8+ complex operations
- 15 Settings: Down from 40+ configuration options
- Single State Table: Simplified from multiple state tracking tables
- Fail-Fast: Clear error messages, immediate failure on issues
- Strict Timestamps: Blocks must have valid timestamps before processing other datasets
- Atomic Processing: Complete ranges or fail entirely
- Automatic Recovery: Self-healing from crashes and network issues
- Parallel Processing: Efficient multi-worker historical indexing
- Optimized Batching: Smart batch sizes for different operations
- Reduced Overhead: Simpler code paths, less verification complexity
- Better Resource Usage: Eliminated redundant operations
- Faster Startup: Simpler initialization and state checking
```text
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Blockchain │─────▶│    Cryo     │─────▶│  ClickHouse │
│     RPC     │      │  Extractor  │      │   Database  │
└─────────────┘      └─────────────┘      └─────────────┘
       │                    ▲
       ▼                    │
┌─────────────┐             │
│ Simplified  │─────────────┘
│   Worker    │
└─────────────┘
```
- Blocks First: Always process blocks before other datasets (see the coverage check below)
- Strict Validation: Fail immediately if timestamps are missing
- Single Source of Truth: Only the `indexing_state` table holds state
- Clear Separation: Operations don't overlap in functionality
- Atomic Operations: Complete ranges or rollback entirely
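Because blocks gate every other dataset, it can be worth checking block coverage for a range before launching other datasets. A minimal sketch, assuming the default `blockchain` database and a `block_number` column on the `blocks` table; the range bounds are example values:

```sql
-- Verify block coverage for a target range before indexing other datasets.
-- If blocks_indexed is less than the range width, some blocks are missing.
SELECT
    min(block_number) AS first_block,
    max(block_number) AS last_block,
    count() AS blocks_indexed
FROM blockchain.blocks
WHERE block_number BETWEEN 18000000 AND 18100000;
```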
The main components:

- Main Indexer (`indexer.py`) orchestrates one of the 3 operations
- Simplified Worker (`worker.py`) processes ranges with strict timestamp requirements
- State Manager (`state_manager.py`) uses a single table for all state tracking
- ClickHouse Manager handles database operations with fail-fast validation
Real-time blockchain following and indexing
Use Case: Production deployment for real-time data
Behavior:
- Polls for new blocks every `POLL_INTERVAL` seconds (default: 10s)
- Waits `CONFIRMATION_BLOCKS` (default: 12) before indexing to avoid reorgs
- Processes in small batches (default: 100 blocks) for reliability
- Automatically resumes from the last indexed block on restart
- Self-healing: resets stale jobs on startup
When to Use:
- ✅ Production systems requiring up-to-date blockchain data
- ✅ Real-time analytics and monitoring
- ✅ DeFi applications needing fresh transaction data
- ✅ After completing historical sync
Reliability Features:
- Single-threaded for stability
- Small batch sizes prevent memory issues
- Automatic stale job cleanup
- Graceful shutdown handling
```bash
# Basic continuous indexing
make continuous

# Continuous with specific mode
make continuous MODE=full

# Start from specific block
make continuous START_BLOCK=18000000
```
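To confirm the continuous loop is keeping up, compare the highest indexed block against the chain tip; healthy operation trails the tip by roughly `CONFIRMATION_BLOCKS`. A minimal sketch, assuming the default `blockchain` database and a `block_number` column on `blocks`:

```sql
-- Highest block the indexer has written so far.
-- Healthy continuous operation trails the chain tip by ~CONFIRMATION_BLOCKS.
SELECT max(block_number) AS latest_indexed
FROM blockchain.blocks;
```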
Fast bulk indexing of specific block ranges
Use Case: Initial data loading, catching up, or selective range processing
Behavior:
- Downloads exactly what you specify (start to end block)
- Supports parallel processing with multiple workers
- Automatically divides work into optimal batch sizes
- Built-in progress tracking and ETA calculations
- Strict timestamp validation at each step
When to Use:
- ✅ Initial sync of blockchain data
- ✅ Indexing specific periods (e.g., "DeFi Summer 2020")
- ✅ Catching up after downtime
- ✅ Selective data extraction for research
- ✅ Fresh start with empty database
- ✅ You know exactly what range you need
Performance Features:
- Parallel workers for speed
- Optimized batch sizes
- Built-in load balancing
- Progress monitoring
```bash
# Basic historical range
make historical START_BLOCK=1000000 END_BLOCK=2000000

# Parallel processing (recommended for large ranges)
make historical START_BLOCK=1000000 END_BLOCK=2000000 WORKERS=8

# Conservative settings for rate-limited RPCs
make historical START_BLOCK=1000000 END_BLOCK=1100000 \
  WORKERS=2 BATCH_SIZE=100 REQUESTS_PER_SECOND=10
```
Process failed and pending ranges from state table
Use Case: Fix incomplete or failed indexing work
Behavior:
- Scans State Table: Looks for ranges marked as 'failed' or 'pending'
- Retry Processing: Re-attempts failed ranges with proper error handling
- Clear Pending: Processes ranges that were never attempted
- Progress Reporting: Shows what was fixed and any remaining issues
- State-Driven: Only processes what the state table indicates needs work
When to Use:
- ✅ After system failures or crashes
- ✅ Network interruptions during indexing
- ✅ RPC failures that left work incomplete
- ✅ Ranges stuck in 'pending' state
- ✅ Periodic maintenance to clear failed work
What It Fixes:
- Ranges marked as 'failed' in indexing_state
- Ranges stuck as 'pending' that were never processed
- Worker crashes that left ranges incomplete
What It Does NOT Do (Roadmap Features):
- Automatic gap detection between completed ranges
- Timestamp correction from invalid dates
- Data validation across all tables
- State reconstruction from existing data
```bash
# Process all failed/pending ranges
make maintain

# Process issues for specific range
make maintain START_BLOCK=1000000 END_BLOCK=2000000

# Parallel maintenance
make maintain WORKERS=4
```
- Scan Phase: Queries the `indexing_state` table for failed/pending ranges (see the query sketch below)
- Report Phase: Shows what issues were found
- Retry Phase: Re-processes each failed/pending range individually
- Complete Phase: Marks successfully processed ranges as completed
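The scan phase boils down to a state-table query you can also run by hand. A sketch against the `indexing_state` fields documented under State Management, assuming the default `blockchain` database:

```sql
-- List ranges that maintain would pick up, with their last error.
SELECT dataset, start_block, end_block, attempt_count, error_message
FROM blockchain.indexing_state
WHERE status IN ('failed', 'pending')
ORDER BY dataset, start_block;
```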
Check indexing progress and data integrity
```bash
# Check overall progress
make status

# Validate specific range
make status START_BLOCK=1000000 END_BLOCK=2000000
```
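If you prefer raw SQL over `make status`, a similar summary can be pulled straight from the state table. A sketch using the documented `indexing_state` columns:

```sql
-- Per-dataset progress summary, equivalent in spirit to `make status`.
SELECT
    dataset,
    status,
    count() AS ranges,
    sum(rows_indexed) AS total_rows
FROM blockchain.indexing_state
GROUP BY dataset, status
ORDER BY dataset, status;
```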
Choose what blockchain data to extract and index:
| Mode | Datasets | Use Case | Performance | Storage (per 1M blocks) |
|---|---|---|---|---|
| Minimal | `blocks`, `transactions`, `logs` | Standard DeFi/DApp analysis | Fast | ~50GB |
| Extra | `contracts`, `native_transfers`, `traces` | Additional contract & trace data | Moderate | ~100GB |
| Diffs | `balance_diffs`, `code_diffs`, `nonce_diffs`, `storage_diffs` | State change analysis | Slow | ~200GB |
| Full | All 10 datasets | Complete blockchain analysis | Slowest | ~500GB |
| Custom | User-defined | Tailored to specific needs | Variable | Variable |
Standard DeFi/DApp analysis setup

Datasets: `blocks`, `transactions`, `logs`

Perfect for:
- DeFi protocol analysis
- Transaction monitoring
- Event log processing
- Token transfer tracking

```bash
make continuous MODE=minimal
```
Additional blockchain data

Datasets: `contracts`, `native_transfers`, `traces`

Perfect for:
- Contract deployment tracking
- Native ETH transfers
- Internal transaction analysis
- MEV research

```bash
make continuous MODE=extra
```
State change tracking

Datasets: `balance_diffs`, `code_diffs`, `nonce_diffs`, `storage_diffs`

Perfect for:
- State analysis
- Account balance tracking
- Storage slot monitoring
- Contract code changes

```bash
make continuous MODE=diffs
```
Complete blockchain analysis

Datasets: All 10 datasets

Perfect for:
- Complete audit trails
- Academic research
- Forensic analysis
- Comprehensive blockchain indexing

```bash
make continuous MODE=full
```
Tailored dataset selection

```bash
# MEV analysis
make continuous MODE=custom DATASETS=blocks,transactions,logs,traces,native_transfers

# Contract focus
make continuous MODE=custom DATASETS=blocks,transactions,contracts

# State tracking
make continuous MODE=custom DATASETS=blocks,balance_diffs,storage_diffs
```
**Required:**

| Variable | Description | Default | Required |
|---|---|---|---|
| `ETH_RPC_URL` | Blockchain RPC endpoint | - | ✅ |
| `CLICKHOUSE_HOST` | ClickHouse host | - | ✅ |
| `CLICKHOUSE_PASSWORD` | ClickHouse password | - | ✅ |

**Connection:**

| Variable | Description | Default | Notes |
|---|---|---|---|
| `NETWORK_NAME` | Network name for Cryo | ethereum | Most networks supported |
| `CLICKHOUSE_USER` | ClickHouse username | default | Usually default |
| `CLICKHOUSE_DATABASE` | Database name | blockchain | Auto-created |
| `CLICKHOUSE_PORT` | ClickHouse port | 8443 | Standard for ClickHouse Cloud |
| `CLICKHOUSE_SECURE` | Use HTTPS | true | Recommended |

**Operation:**

| Variable | Description | Default | Options |
|---|---|---|---|
| `OPERATION` | Operation mode | continuous | continuous, historical, maintain, validate |
| `MODE` | Indexing mode | minimal | minimal, extra, diffs, full, custom |
| `START_BLOCK` | Starting block number | 0 | For historical/maintain |
| `END_BLOCK` | Ending block number | 0 | For historical/maintain |

**Processing:**

| Variable | Description | Default | Notes |
|---|---|---|---|
| `WORKERS` | Number of parallel workers | 1 | Use 4-16 for historical |
| `BATCH_SIZE` | Blocks per batch | 100 | Smaller = more reliable |
| `MAX_RETRIES` | Max retry attempts | 3 | Exponential backoff |

**Rate Limiting:**

| Variable | Description | Default | Notes |
|---|---|---|---|
| `REQUESTS_PER_SECOND` | RPC requests per second | 20 | Conservative default |
| `MAX_CONCURRENT_REQUESTS` | Concurrent RPC requests | 2 | Prevents overload |
| `CRYO_TIMEOUT` | Cryo command timeout (seconds) | 600 | Increase for slow networks |

**Continuous Mode:**

| Variable | Description | Default | Notes |
|---|---|---|---|
| `CONFIRMATION_BLOCKS` | Blocks to wait for confirmation | 12 | Reorg protection |
| `POLL_INTERVAL` | Polling interval (seconds) | 10 | How often to check for new blocks |
The indexer automatically creates these tables in ClickHouse:
- `blocks` - Block headers and metadata with strict timestamp requirements
- `transactions` - Transaction data including gas, value, status
- `logs` - Event logs from smart contracts
- `contracts` - Contract creation data
- `native_transfers` - ETH/native token transfers
- `traces` - Detailed transaction execution traces
- `balance_diffs` - Account balance changes
- `code_diffs` - Smart contract code changes
- `nonce_diffs` - Account nonce changes
- `storage_diffs` - Contract storage changes
- `withdrawals` - Validator withdrawals (automatically populated when processing blocks)
- `indexing_state` - Single source of truth for all indexing state
- `indexing_progress` - Real-time progress view (materialized view)
- `migrations` - Database migration tracking
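For a quick row-count sanity check across these tables, ClickHouse's built-in `system.parts` table can be queried directly; this is standard ClickHouse, independent of the indexer, and assumes the default `blockchain` database name:

```sql
-- Approximate row counts per table (active parts only).
SELECT table, sum(rows) AS row_count
FROM system.parts
WHERE database = 'blockchain' AND active
GROUP BY table
ORDER BY table;
```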
The `withdrawals` table is automatically populated whenever blocks are processed. It contains:

- Block Context: `block_number`, `block_hash`, `withdrawals_root`, `chain_id`, `block_timestamp`
- Withdrawal Data: `withdrawal_index`, `validator_index`, `address`, `amount`
- System Fields: `insert_version` (for deduplication)
Key Features:
- Automatic Population: No separate dataset needed - filled when processing blocks
- Normalized Structure: Individual rows for each withdrawal (not JSON arrays)
- Easy Querying: Standard SQL queries work naturally
- Consistent Timestamps: Same timestamp validation as other tables
Example Queries:
```sql
-- Get all withdrawals for a specific validator
SELECT * FROM withdrawals
WHERE validator_index = '0x2a696'
ORDER BY block_number;

-- Total withdrawal amounts by address
SELECT address, SUM(hexToUInt256(amount)) AS total_amount
FROM withdrawals
WHERE block_number >= 17000000
GROUP BY address
ORDER BY total_amount DESC;

-- Daily withdrawal summary
SELECT
    toDate(block_timestamp) AS date,
    COUNT(*) AS withdrawal_count,
    COUNT(DISTINCT address) AS unique_addresses,
    SUM(hexToUInt256(amount)) AS total_amount
FROM withdrawals
GROUP BY date
ORDER BY date;
```
- Docker and Docker Compose installed
- ClickHouse instance (local or cloud)
- Blockchain RPC endpoint
Create a `.env` file:
```bash
# Required settings
ETH_RPC_URL=https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY
CLICKHOUSE_HOST=your-clickhouse-host.com
CLICKHOUSE_PASSWORD=your-password

# Optional settings (smart defaults)
NETWORK_NAME=ethereum
CLICKHOUSE_DATABASE=blockchain
WORKERS=4
BATCH_SIZE=100
MODE=minimal
```
```bash
# Build the image
docker-compose build

# Run migrations (includes withdrawals table creation)
docker-compose --profile migrations up migrations

# Start continuous indexing
docker-compose up cryo-indexer-minimal

# Historical indexing (one-shot job)
OPERATION=historical START_BLOCK=18000000 END_BLOCK=18100000 \
  docker-compose --profile historical up historical-job

# Maintenance (fix failed/pending ranges)
docker-compose --profile maintain up maintain-job
```
Each indexing mode has its own service:

```bash
# Minimal mode (default) - includes automatic withdrawals processing
docker-compose up cryo-indexer-minimal

# Extra mode (contracts, transfers, traces) - includes automatic withdrawals processing
docker-compose up cryo-indexer-extra

# Diffs mode (state changes) - includes automatic withdrawals processing
docker-compose up cryo-indexer-diffs

# Full mode (everything) - includes automatic withdrawals processing
docker-compose up cryo-indexer-full

# Custom mode
DATASETS=blocks,transactions,logs,traces \
  docker-compose up cryo-indexer-custom
```
```bash
make build           # Build Docker image
make run-migrations  # Run database migrations (includes withdrawals table)

# Continuous indexing (automatically processes withdrawals when processing blocks)
make continuous      # Start real-time indexing
make start           # Alias for continuous

# Historical indexing (automatically processes withdrawals when processing blocks)
make historical START_BLOCK=1000000 END_BLOCK=2000000
make historical START_BLOCK=1000000 END_BLOCK=2000000 WORKERS=8

# Maintenance (process failed/pending ranges)
make maintain        # Fix failed/pending ranges from state table

# Different modes (all automatically process withdrawals when processing blocks)
make minimal         # Start minimal mode
make full            # Start full mode
make custom DATASETS=blocks,transactions,traces

# Testing
make test-range START_BLOCK=18000000  # Test with 1000 blocks

make logs    # View logs
make status  # Check indexing status (uses validate operation)
make ps      # Container status
make stop    # Stop indexer
make clean   # Remove containers and volumes
make shell   # Open container shell
```
- Single Source of Truth: Only the `indexing_state` table is used
- Fixed Range Size: 1000-block ranges for predictable processing
- Simple Statuses: `pending` → `processing` → `completed` or `failed`
- Atomic Operations: Ranges are completed entirely or not at all
- Automatic Cleanup: Stale jobs are reset on startup
The `indexing_state` table tracks:
- mode, dataset, start_block, end_block (composite key)
- status: pending | processing | completed | failed
- worker_id, attempt_count, created_at, completed_at
- rows_indexed, error_message
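The authoritative DDL lives in `migrations/012_create_indexing_state.sql`; the sketch below only illustrates the shape implied by the fields above. The engine choice and column types are assumptions, not the actual migration:

```sql
-- Hypothetical sketch only; see migrations/012_create_indexing_state.sql
-- for the real definition. Engine and column types are assumptions.
CREATE TABLE IF NOT EXISTS blockchain.indexing_state
(
    mode          String,
    dataset       String,
    start_block   UInt64,
    end_block     UInt64,
    status        String,          -- pending | processing | completed | failed
    worker_id     String,
    attempt_count UInt32,
    rows_indexed  UInt64,
    error_message String,
    created_at    DateTime,
    completed_at  Nullable(DateTime)
)
ENGINE = ReplacingMergeTree
ORDER BY (mode, dataset, start_block, end_block);
```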
- On Startup: All 'processing' jobs are reset to 'pending'
- No Complex Monitoring: Simple state-based detection
- Self-Healing: System automatically recovers from crashes
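The indexer performs this reset itself on startup, but the manual equivalent is a single ClickHouse mutation. A sketch, assuming the schema fields above and the default `blockchain` database:

```sql
-- Manually release jobs orphaned by a crashed worker.
-- The indexer performs the same reset automatically on startup.
-- Note: ClickHouse mutations run asynchronously.
ALTER TABLE blockchain.indexing_state
    UPDATE status = 'pending', worker_id = ''
    WHERE status = 'processing';
```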
```bash
# Check overall progress (uses validate operation)
make status

# View detailed logs
make logs

# Monitor container status
make ps
```
The validate operation provides progress reports:
```text
=== INDEXING PROGRESS ===
blocks:
  Completed ranges: 1250
  Processing ranges: 0
  Failed ranges: 0
  Pending ranges: 0
  Highest attempted: 18125000
  Total rows: 125,000,000

transactions:
  Completed ranges: 1248
  Processing ranges: 0
  Failed ranges: 1
  Pending ranges: 1
  Highest attempted: 18125000
  Total rows: 45,230,123

=== MAINTENANCE NEEDED ===
transactions: 2 ranges need attention
  Failed range: blocks 18124000-18125000 (1000 blocks)
  Pending range: blocks 18120000-18121000 (1000 blocks)

💡 Run 'make maintain' to process these ranges
```
During operations, the system provides clear progress updates:
```text
Historical Progress: 45/116 ranges (38.8%) | ✓ 43 | ✗ 2 | Rate: 12.5 ranges/min | ETA: 5.7 min
Maintain Progress: Fixed 12/15 ranges | ✓ 10 | ✗ 2 | Processed: 80%
```
- Start with Defaults: The defaults are optimized for reliability
- Scale Workers for Historical: Use 4-16 workers for large historical jobs
- Keep Batches Small: 100-block batches prevent memory issues
- Respect RPC Limits: Conservative defaults prevent rate limiting
```bash
# Continuous (production) settings
WORKERS=1               # Single worker for stability
BATCH_SIZE=100          # Small batches for low latency
REQUESTS_PER_SECOND=20  # Conservative rate
CONFIRMATION_BLOCKS=12  # Standard reorg protection
POLL_INTERVAL=10        # Regular polling

# Fast historical (good RPC)
WORKERS=16              # High parallelism
BATCH_SIZE=500          # Larger batches for efficiency
REQUESTS_PER_SECOND=50  # Higher rate

# Conservative historical (rate-limited RPC)
WORKERS=4               # Moderate parallelism
BATCH_SIZE=100          # Reliable batch size
REQUESTS_PER_SECOND=20  # Respect limits

# Maintain (defaults are usually optimal)
WORKERS=4               # Parallel issue fixing
BATCH_SIZE=100          # Reliable processing
```
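To check whether a tuning change actually moved the needle, throughput can be read back from the state table. A sketch using the `indexing_state` columns documented under State Management, assuming the default `blockchain` database:

```sql
-- Ranges completed per hour, per dataset: useful for before/after comparisons.
SELECT
    dataset,
    toStartOfHour(completed_at) AS hour,
    count() AS ranges_completed,
    sum(rows_indexed) AS rows_ingested
FROM blockchain.indexing_state
WHERE status = 'completed'
GROUP BY dataset, hour
ORDER BY hour, dataset;
```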
**Other datasets fail because blocks aren't indexed yet**

Cause: Trying to process other datasets before blocks are indexed.

Solution:
```bash
# Process blocks first
make historical START_BLOCK=X END_BLOCK=Y MODE=custom DATASETS=blocks

# Then process other datasets
make maintain
```
**Ranges marked as 'failed' or stuck as 'pending'**

Cause: Worker crashes, network issues, or processing failures.

Solution:
```bash
# Use maintain operation to retry these ranges
make maintain
```
**Historical indexing is slow**

Solutions:
```bash
# Increase workers and batch size
make historical START_BLOCK=X END_BLOCK=Y WORKERS=8 BATCH_SIZE=200

# For rate-limited RPCs
make historical START_BLOCK=X END_BLOCK=Y WORKERS=2 REQUESTS_PER_SECOND=10
```
**RPC rate limiting**

Solution: Reduce requests per second and workers.
```bash
REQUESTS_PER_SECOND=10 WORKERS=2 make historical
```
**Out-of-memory errors**

Solution: Reduce batch size and workers.
```bash
BATCH_SIZE=50 WORKERS=2 make historical
```
General debugging steps:

1. Check logs: `make logs`
2. Check status: `make status`
3. Process failed/pending ranges: `make maintain`
4. If issues persist: reduce batch size/workers and retry
**New project setup**

```bash
# Setup
make build
make run-migrations  # Creates all tables including withdrawals

# Historical sync
make historical START_BLOCK=18000000 END_BLOCK=latest WORKERS=8

# Switch to continuous
make continuous
```
**Recovering an existing deployment**

```bash
# Setup
make build
make run-migrations  # Adds withdrawals table if not exists

# Process any failed/pending ranges
make maintain

# Verify everything is working
make status

# Start continuous indexing
make continuous
```
**Research on a specific period**

```bash
# Target specific historical period (includes withdrawals automatically)
make historical START_BLOCK=12000000 END_BLOCK=13000000 MODE=full

# Check for any incomplete ranges
make maintain

# Analyze specific events
make historical START_BLOCK=12500000 END_BLOCK=12600000 \
  MODE=custom DATASETS=blocks,transactions,logs
```
**Routine maintenance**

```bash
# Regular health check and fix failed/pending ranges
make maintain

# After system issues - this processes incomplete work
make maintain

# Verify all issues resolved
make status
```
**Debugging incomplete data**

```bash
# Check what's wrong
make status

# Fix incomplete ranges in specific area
make maintain START_BLOCK=1000000 END_BLOCK=2000000
```
**Testing and benchmarking**

```bash
# Test with small range (includes withdrawals processing)
make test-range START_BLOCK=18000000

# Test different modes
make historical START_BLOCK=18000000 END_BLOCK=18001000 MODE=minimal
make historical START_BLOCK=18000000 END_BLOCK=18001000 MODE=full

# Test maintenance
make maintain START_BLOCK=18000000 END_BLOCK=18001000

# Benchmark different settings
time make historical START_BLOCK=18000000 END_BLOCK=18010000 WORKERS=4
time make historical START_BLOCK=18010000 END_BLOCK=18020000 WORKERS=8
```
To run the indexer locally:

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set environment variables:

   ```bash
   export ETH_RPC_URL=your_rpc_url
   export CLICKHOUSE_HOST=localhost
   export CLICKHOUSE_PASSWORD=password
   ```

3. Run locally:

   ```bash
   python -m src
   ```
```bash
# Test with small range (automatically processes withdrawals)
make test-range START_BLOCK=18000000

# Test maintenance
make maintain START_BLOCK=18000000 END_BLOCK=18001000

# Validate results
make status
```
The following features are mentioned in the codebase but not yet implemented:
**Automatic Gap Detection**
- Detect missing ranges between completed ranges (a manual SQL approximation is sketched below)
- Smart gap identification beyond simple state table scanning
- Cross-table validation to find data inconsistencies
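Until this lands, a rough manual approximation is possible in SQL. A sketch assuming the fixed, back-to-back 1000-block ranges described under State Management; the highest completed range always shows up as a false positive, since it has no successor yet, and you may also want to filter by `mode` if you run several:

```sql
-- Completed 'blocks' ranges whose successor range was never recorded.
SELECT end_block AS possible_gap_start
FROM blockchain.indexing_state
WHERE dataset = 'blocks'
  AND status = 'completed'
  AND end_block NOT IN (
      SELECT start_block
      FROM blockchain.indexing_state
      WHERE dataset = 'blocks'
  )
ORDER BY possible_gap_start;
```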
**Automatic Timestamp Fixing**
- Detect and fix invalid '1970-01-01' timestamps (detection is already possible by hand; see below)
- Join with the blocks table to correct timestamp issues
- Batch timestamp correction operations
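Detection (though not correction) is already easy to do by hand. A sketch against the `withdrawals` table, which is documented to carry a `block_timestamp` column; other tables can be checked the same way if they expose one:

```sql
-- Count rows that fell back to the Unix epoch timestamp.
SELECT count() AS epoch_timestamp_rows
FROM blockchain.withdrawals
WHERE toDate(block_timestamp) = toDate('1970-01-01');
```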
**Deep Data Validation**
- Deep data validation across all tables
- Completeness checks beyond the state table
- Data consistency verification
**State Reconstruction**
- Rebuild `indexing_state` from existing data tables
- Recover from corrupted state tracking
- Automatic state healing
This project is licensed under the MIT License.