A modular toolkit for ClickHouse data ingestion from various sources, particularly useful for running ETL jobs via Docker.
click-runner is designed for flexible query execution and data ingestion, with support for different data formats and sources:
- CSV Ingestion: Load data from CSV files using ClickHouse's URL engine
- Parquet Ingestion: Load data from Parquet files in S3 buckets
- SQL Execution: Run arbitrary SQL files against a ClickHouse database
Prerequisites:

- Docker
- Docker Compose (for using the provided docker-compose.yml)
- ClickHouse server
Project structure:

```
.
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── run_queries.py              # Main CLI entry point
├── ingestors/                  # Ingestor modules
│   ├── __init__.py
│   ├── base.py                 # Abstract base ingestor
│   ├── csv_ingestor.py         # CSV ingestion (e.g., Ember data)
│   └── parquet_ingestor.py     # Parquet ingestion (e.g., ProbeLab data)
├── utils/                      # Utility modules
│   ├── __init__.py
│   ├── s3.py                   # S3 utilities
│   ├── db.py                   # Database utilities
│   └── date.py                 # Date utilities
└── queries/                    # SQL query files
    ├── ember/                  # Ember electricity data
    │   ├── create_ember_table.sql
    │   ├── insert_ember_data.sql
    │   └── optimize_ember_data.sql
    └── probelab/               # ProbeLab data
        ├── probelab_agent_semvers_avg_1d.up.sql
        ├── probelab_agent_types_avg_1d.up.sql
        └── ... (other create table queries)
```
Set the following environment variables to configure the ClickHouse connection:
- `CH_HOST`: ClickHouse host
- `CH_PORT`: Native ClickHouse port (default: 9000)
- `CH_USER`: Username for authentication
- `CH_PASSWORD`: Password for authentication
- `CH_DB`: Database to use
- `CH_SECURE`: Use TLS connection (`True` or `False`)
- `CH_VERIFY`: Verify TLS certificate (`True` or `False`)
For S3 integration:
- `S3_ACCESS_KEY`: AWS access key ID
- `S3_SECRET_KEY`: AWS secret access key
- `S3_BUCKET`: S3 bucket name (default: prod-use1-gnosis)
- `S3_REGION`: AWS region (default: us-east-1)
For Ember data:
- `EMBER_DATA_URL`: URL to the Ember CSV data
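These connection settings are read when the runner opens its native ClickHouse session. A minimal sketch of that wiring, assuming the `clickhouse-driver` package (the actual helper in `utils/db.py` may be structured differently):

```python
# Illustrative only: turning the CH_* variables into a native ClickHouse client.
import os
from clickhouse_driver import Client

def client_from_env() -> Client:
    return Client(
        host=os.getenv("CH_HOST", "localhost"),
        port=int(os.getenv("CH_PORT", "9000")),
        user=os.getenv("CH_USER", "default"),
        password=os.getenv("CH_PASSWORD", ""),
        database=os.getenv("CH_DB", "default"),
        secure=os.getenv("CH_SECURE", "False").lower() == "true",
        verify=os.getenv("CH_VERIFY", "True").lower() == "true",
    )
```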
The system supports three primary running modes, controlled by the `--ingestor` parameter:
Query mode (`--ingestor=query`): execute arbitrary SQL queries directly against ClickHouse.
Usage:

```bash
# CLI
python run_queries.py --ingestor=query --queries=queries/file1.sql,queries/file2.sql

# Docker
docker-compose run click-runner --ingestor=query --queries=queries/file1.sql,queries/file2.sql
```
Environment variable alternative:

```bash
CH_QUERIES=queries/file1.sql,queries/file2.sql
```
Use cases:
- Running administrative queries
- Database maintenance
- Schema updates
- Custom data transformations
CSV mode (`--ingestor=csv`): import data from CSV files using ClickHouse's URL engine. Typically used for Ember electricity data.
Usage:

```bash
# CLI
python run_queries.py --ingestor=csv \
  --create-table-sql=queries/ember/create_ember_table.sql \
  --insert-sql=queries/ember/insert_ember_data.sql \
  --optimize-sql=queries/ember/optimize_ember_data.sql

# Docker
docker-compose run ember-ingestor
```
Use cases:
- Importing public datasets available as CSV files
- Scheduled updates from static CSV URLs
- When the data source is a REST API that returns CSV
Parquet mode (`--ingestor=parquet`): import data from Parquet files in S3 buckets, with three ingestion strategies:

- Latest (`--mode=latest`): import only the most recent file
- Date (`--mode=date`): import a file for a specific date
- All (`--mode=all`): import all available files
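Under the hood, the `latest` strategy needs to resolve the newest object key under the configured S3 prefix. A rough sketch of how that lookup could be done with `boto3` (the function name and the project's actual `utils/s3.py` interface are assumptions):

```python
# Hypothetical helper: resolve the most recent object under an S3 prefix.
import os
import boto3

def find_latest_key(prefix: str) -> str:
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
        region_name=os.getenv("S3_REGION", "us-east-1"),
    )
    bucket = os.getenv("S3_BUCKET", "prod-use1-gnosis")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        raise FileNotFoundError(f"No objects found under s3://{bucket}/{prefix}")
    # Newest upload wins; date-stamped keys sort the same way.
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]
```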
Usage:
Latest File:
```bash
python run_queries.py --ingestor=parquet \
  --create-table-sql=queries/probelab/probelab_agent_semvers_avg_1d.up.sql \
  --s3-path=assets/agent_semvers_avg_1d_data/{{DATE}}.parquet \
  --table-name=crawlers_data.probelab_agent_semvers_avg_1d \
  --mode=latest
```
Specific Date:
```bash
python run_queries.py --ingestor=parquet \
  --create-table-sql=queries/probelab/probelab_agent_semvers_avg_1d.up.sql \
  --s3-path=assets/agent_semvers_avg_1d_data/{{DATE}}.parquet \
  --table-name=crawlers_data.probelab_agent_semvers_avg_1d \
  --mode=date \
  --date=2025-04-13
```
All Files:
```bash
python run_queries.py --ingestor=parquet \
  --create-table-sql=queries/probelab/probelab_agent_semvers_avg_1d.up.sql \
  --s3-path=assets/agent_semvers_avg_1d_data/{{DATE}}.parquet \
  --table-name=crawlers_data.probelab_agent_semvers_avg_1d \
  --mode=all
```
Using Docker:
```bash
docker-compose run probelab-agent-semvers-ingestor
```
Use cases:
- Daily ingestion of time-series data stored as Parquet
- Backfilling historical Parquet data
- Importing structured data from data lakes
All modes share these common parameters:
- `--host`: ClickHouse host (default: from `CH_HOST` env var)
- `--port`: ClickHouse port (default: from `CH_PORT` env var)
- `--user`: ClickHouse user (default: from `CH_USER` env var)
- `--password`: ClickHouse password (default: from `CH_PASSWORD` env var)
- `--db`: ClickHouse database (default: from `CH_DB` env var)
- `--secure`: Use TLS connection (default: from `CH_SECURE` env var)
- `--verify`: Verify TLS certificate (default: from `CH_VERIFY` env var)
- `--skip-table-creation`: Skip table creation steps (optional flag)
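These flags presumably fall back to the matching `CH_*` environment variables when omitted. A rough sketch of that pattern with `argparse` (illustrative only; `run_queries.py` may parse its arguments differently):

```python
# Sketch: CLI flags that default to CH_* environment variables.
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="click-runner")
    parser.add_argument("--host", default=os.getenv("CH_HOST", "localhost"))
    parser.add_argument("--port", type=int, default=int(os.getenv("CH_PORT", "9000")))
    parser.add_argument("--user", default=os.getenv("CH_USER", "default"))
    parser.add_argument("--password", default=os.getenv("CH_PASSWORD", ""))
    parser.add_argument("--db", default=os.getenv("CH_DB", "default"))
    parser.add_argument("--secure", default=os.getenv("CH_SECURE", "False"))
    parser.add_argument("--verify", default=os.getenv("CH_VERIFY", "True"))
    parser.add_argument("--skip-table-creation", action="store_true")
    return parser
```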
The `docker-compose.yml` file includes several predefined services:
- click-runner: Generic service that can run in any mode
- ember-ingestor: Specialized for Ember CSV data
- probelab-agent-semvers-ingestor: Example for one ProbeLab Parquet dataset
To run data ingestion as a daily cron job, you can use the provided Docker containers:
```bash
# Example crontab entry for daily Ember data update at 2 AM
0 2 * * * cd /path/to/click-runner && docker-compose run --rm ember-ingestor

# Example crontab entry for daily ProbeLab data update at 3 AM
0 3 * * * cd /path/to/click-runner && docker-compose run --rm probelab-agent-semvers-ingestor
```
For convenience, you can use the included `cron_setup.sh` script to automatically create these cron jobs:
```bash
chmod +x cron_setup.sh
sudo ./cron_setup.sh
```
For a new CSV data source:

- Create table definition SQL file: `queries/new_source/create_table.sql`
- Create insert SQL file: `queries/new_source/insert_data.sql`
- (Optional) Create optimization SQL file: `queries/new_source/optimize.sql`
- Add environment variable for data URL: `NEW_SOURCE_URL`
- Run:

  ```bash
  python run_queries.py --ingestor=csv \
    --create-table-sql=queries/new_source/create_table.sql \
    --insert-sql=queries/new_source/insert_data.sql
  ```
For a new Parquet data source:

- Create table definition SQL file: `queries/new_source/new_source_table.up.sql`
- Run:

  ```bash
  python run_queries.py --ingestor=parquet \
    --create-table-sql=queries/new_source/new_source_table.up.sql \
    --s3-path=assets/new_source_data/{{DATE}}.parquet \
    --table-name=database.new_source_table \
    --mode=latest
  ```
For ease of use, add a new service to `docker-compose.yml`:
```yaml
new-source-ingestor:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: new-source-ingestor
  volumes:
    - ./queries:/app/queries
  environment:
    CH_HOST: ${CH_DB_HOST}
    CH_PORT: ${CH_NATIVE_PORT}
    CH_USER: ${CH_USER}
    CH_PASSWORD: ${CH_PASSWORD}
    CH_DB: ${CH_DB}
    CH_SECURE: ${CH_SECURE}
    CH_VERIFY: "False"
    CH_QUERY_VAR_NEW_SOURCE_URL: ${NEW_SOURCE_URL:-}
    CH_QUERY_VAR_S3_ACCESS_KEY: ${S3_ACCESS_KEY:-}
    CH_QUERY_VAR_S3_SECRET_KEY: ${S3_SECRET_KEY:-}
    CH_QUERY_VAR_S3_BUCKET: ${S3_BUCKET:-}
    CH_QUERY_VAR_S3_REGION: ${S3_REGION:-}
  command: >
    --ingestor=parquet
    --create-table-sql=queries/new_source/new_source_table.up.sql
    --s3-path=assets/new_source_data/{{DATE}}.parquet
    --table-name=database.new_source_table
    --mode=latest
```
If you need to support a new file format beyond CSV and Parquet:
- Create a new ingestor class that extends `BaseIngestor` in `ingestors/new_format_ingestor.py`
- Implement the `ingest()` method and any format-specific methods
- Update `run_queries.py` to recognize the new ingestor type
If you wanted to add support for Avro files, you would:
- Update `requirements.txt` to include Avro-related packages
- Create `ingestors/avro_ingestor.py` extending `BaseIngestor`
- Implement the specialized logic for Avro ingestion
- Update `run_queries.py` to support `--ingestor=avro`
- Create sample Avro ingestion Docker Compose services
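A skeleton of what such an ingestor could look like. The attributes and helpers used here (`create_table()`, `table_name`, `s3_url`, credentials, `client`) are assumptions about `BaseIngestor`'s interface; only the `ingest()` hook is documented above:

```python
# Hypothetical ingestors/avro_ingestor.py -- illustrative sketch only.
from ingestors.base import BaseIngestor


class AvroIngestor(BaseIngestor):
    """Loads Avro files into ClickHouse (format-specific details are assumed)."""

    def ingest(self):
        # Create the destination table from the configured .up.sql file,
        # mirroring the CSV/Parquet ingestors (assumed helper on the base class).
        self.create_table()

        # ClickHouse understands the 'Avro' input format, so the insert can
        # reuse the same s3() pattern the Parquet ingestor relies on.
        self.client.execute(
            f"INSERT INTO {self.table_name} "
            f"SELECT * FROM s3('{self.s3_url}', "
            f"'{self.access_key}', '{self.secret_key}', 'Avro')"
        )
```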
SQL files can use variable placeholders with the `{{VARIABLE_NAME}}` syntax. These are replaced with values from environment variables prefixed with `CH_QUERY_VAR_`.
For example:
- Environment variable: `CH_QUERY_VAR_EMBER_DATA_URL=https://example.com/data.csv`
- In SQL: `FROM url('{{EMBER_DATA_URL}}', 'CSV')`
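A minimal sketch of how this substitution could be implemented (the repo's actual helper may handle edge cases such as missing variables differently):

```python
# Replace {{NAME}} placeholders with CH_QUERY_VAR_NAME environment variables.
import os
import re


def render_sql(sql: str) -> str:
    def replace(match: re.Match) -> str:
        return os.environ[f"CH_QUERY_VAR_{match.group(1)}"]

    return re.sub(r"\{\{(\w+)\}\}", replace, sql)


# render_sql("FROM url('{{EMBER_DATA_URL}}', 'CSV')")
# -> "FROM url('https://example.com/data.csv', 'CSV')"
```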
If tables already exist, you can skip the table creation step:
```bash
python run_queries.py --ingestor=csv \
  --create-table-sql=queries/ember/create_ember_table.sql \
  --insert-sql=queries/ember/insert_ember_data.sql \
  --skip-table-creation
```
For complex workflows, you can chain multiple ingestors:
```bash
# First run the Ember ingestor
docker-compose run --rm ember-ingestor

# Then run the ProbeLab ingestor
docker-compose run --rm probelab-agent-semvers-ingestor
```
- **S3 Access Denied**:
  - Verify S3 credentials
  - Check bucket permissions
- **ClickHouse Connection Failure**:
  - Verify ClickHouse connection details
  - Check network connectivity
- **Invalid SQL Syntax**:
  - Inspect SQL files for errors
  - Use the ClickHouse client directly to test queries
By default, logs are output to stdout/stderr. Docker Compose captures these logs.
To view logs:
```bash
docker-compose logs click-runner
```
When using cron jobs, logs are saved to the `logs/` directory with dated filenames.
This project is licensed under the MIT License.