Update tpch, clickbench, sort_tpch to mark failed queries #16182

ding-young · 2025-05-24T16:23:50Z

Which issue does this PR close?

Closes Update tpch, clickbench, sort_tpch to mark failed queries #16160 .

Rationale for this change

To track whether certain queries fail on limited memory, we need to continue executing benchmark queries even when previous one fails. Also, it would be better to mark & visualize the failure with compare.py.

What changes are included in this PR?

tpch, clickbench, sort_tpch executes next query when current query fails on any iteration. (It will not execute the next iteration of failed query)
In terminal, Error message is printed as [], and it will print the ids of failed queries.
New field success added to output json, updated compare.py to compare the elapsed time of successful results, not failed ones.

Are these changes tested?

Since it's benchmark suite, I tested it manually.

// Sample output in terminal
Query 18 failed: Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
  HashJoinInput[1]#1239(can spill: false) consumed 9.9 MB,
  HashJoinInput[3]#1246(can spill: false) consumed 9.7 MB,
  HashJoinInput[2]#1251(can spill: false) consumed 9.6 MB,
  HashJoinInput[0]#1235(can spill: false) consumed 9.2 MB,
  HashJoinInput[0]#1266(can spill: false) consumed 2.6 MB.
Error: Failed to allocate additional 665.3 KB for HashJoinInput[0] with 9.2 MB already allocated for this reservation - 278.5 KB remain available for the total pool
Query 19 iteration 0 took 584.6 ms and returned 1 rows
Query 19 iteration 1 took 546.7 ms and returned 1 rows
Query 19 avg time: 565.66 ms
Query 20 iteration 0 took 587.3 ms and returned 186 rows
Query 20 iteration 1 took 551.7 ms and returned 186 rows
Query 20 avg time: 569.48 ms
Query 21 iteration 0 took 784.4 ms and returned 411 rows
Query 21 iteration 1 took 773.0 ms and returned 411 rows
Query 21 avg time: 778.71 ms
Query 22 iteration 0 took 295.4 ms and returned 7 rows
Query 22 iteration 1 took 299.7 ms and returned 7 rows
Query 22 avg time: 297.55 ms
Failed Queries: 4, 7, 9, 10, 13, 16, 18

// compare.py 
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  results ┃  results ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 409.33ms │ 433.54ms │  1.06x slower │
│ QQuery 2     │ 343.53ms │ 323.37ms │ +1.06x faster │
│ QQuery 3     │ 499.94ms │ 491.93ms │     no change │
│ QQuery 4     │     FAIL │ 350.74ms │  incomparable │
│ QQuery 5     │ 529.90ms │ 542.79ms │     no change │
│ QQuery 6     │ 242.10ms │ 253.38ms │     no change │
│ QQuery 7     │     FAIL │ 528.23ms │  incomparable │
│ QQuery 8     │ 634.55ms │ 638.16ms │     no change │
│ QQuery 9     │     FAIL │ 689.83ms │  incomparable │
│ QQuery 10    │     FAIL │ 601.86ms │  incomparable │
│ QQuery 11    │ 205.05ms │ 205.76ms │     no change │
│ QQuery 12    │ 421.38ms │ 418.29ms │     no change │
│ QQuery 13    │     FAIL │ 418.92ms │  incomparable │
│ QQuery 14    │ 346.10ms │ 345.50ms │     no change │
│ QQuery 15    │ 452.58ms │ 448.31ms │     no change │
│ QQuery 16    │     FAIL │ 256.87ms │  incomparable │
│ QQuery 17    │ 762.77ms │ 627.49ms │ +1.22x faster │
│ QQuery 18    │     FAIL │ 830.38ms │  incomparable │
│ QQuery 19    │ 546.70ms │ 534.35ms │     no change │
│ QQuery 20    │ 551.65ms │ 465.91ms │ +1.18x faster │
│ QQuery 21    │ 772.99ms │ 789.56ms │     no change │
│ QQuery 22    │ 295.41ms │ 301.68ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (results)   │ 7013.98ms │
│ Total Time (results)   │ 6820.03ms │
│ Average Time (results) │  467.60ms │
│ Average Time (results) │  454.67ms │
│ Queries Faster         │         3 │
│ Queries Slower         │         1 │
│ Queries with No Change │        11 │
│ Queries with Failure   │         7 │
└────────────────────────┴───────────┘

Are there any user-facing changes?

Copilot

Pull Request Overview

The PR updates several benchmark modules to mark failed queries and continue executing subsequent queries when a failure occurs. Key changes include:

Adding a new "success" field to track query failure status.
Updating benchmark execution in tpch, sort_tpch, imdb, and clickbench to mark failures instead of aborting.
Modifying compare.py to exclude failed queries from average time calculations and reporting the list of failed queries.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
benchmarks/src/util/run.rs	Added "success" field and mark_failed method to track query results.
benchmarks/src/util/mod.rs	Re-exported QueryResult to support new benchmark result structure.
benchmarks/src/tpch/run.rs	Updated query execution loop to handle failures and record failed query IDs.
benchmarks/src/sort_tpch.rs	Adjusted benchmark loop to mark failed queries and print failure details.
benchmarks/src/imdb/run.rs	Applied similar failure handling as in tpch and sort_tpch modules.
benchmarks/src/clickbench.rs	Refactored benchmark_query logic to include failure handling and iterations abstraction.
benchmarks/compare.py	Updated average time calculations and display to exclude failed queries.

benchmarks/compare.py

2010YOUY01

Thank you, this is quite helpful for us to work towards robust memory-limited execution. We need this feature since now many queries are failing under memory limit, if they can stably finish in the future we can consider making it a CI test.

I tested tpch/clickbench/sort_tpch, and first two can run as expected, sort_tpch benchmark seems need a fix in the runner 🤔

Benchmark commands:

cargo run --profile release-nonlto --bin tpch -- benchmark datafusion --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 --prefer_hash_join true --format parquet -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/tpch_sf1.json --memory-limit 1G

cargo run --profile release-nonlto --bin dfbench -- clickbench --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/hits.parquet --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/clickbench/queries.sql -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/clickbench_1.json --memory-limit 1G

 cargo run --profile release-nonlto --bin dfbench -- sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G

sort_tpch's error message

ongting@Mac ~/C/d/benchmarks (pr-16182 *)> cargo run --profile release-nonlto --bin dfbench -- sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G
    Finished `release-nonlto` profile [optimized] target(s) in 0.12s
     Running `/Users/yongting/Code/datafusion/target/release-nonlto/dfbench sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G`
Q1 iteration 0 took 1645.0 ms and returned 59986052 rows
Q1 iteration 1 took 1674.9 ms and returned 59986052 rows
Q1 iteration 2 took 1635.0 ms and returned 59986052 rows
Q1 iteration 3 took 1672.7 ms and returned 59986052 rows
Q1 iteration 4 took 1604.2 ms and returned 59986052 rows
Q1 avg time: 1646.32 ms
Q2 iteration 0 took 1282.9 ms and returned 59986052 rows
Q2 iteration 1 took 1286.9 ms and returned 59986052 rows
Q2 iteration 2 took 1313.8 ms and returned 59986052 rows
Q2 iteration 3 took 1300.1 ms and returned 59986052 rows
Q2 iteration 4 took 1290.9 ms and returned 59986052 rows
Q2 avg time: 1294.90 ms
Q3 iteration 0 took 6326.5 ms and returned 59986052 rows
Q3 iteration 1 took 6289.8 ms and returned 59986052 rows
Q3 iteration 2 took 6223.8 ms and returned 59986052 rows
Q3 iteration 3 took 6329.6 ms and returned 59986052 rows
Q3 iteration 4 took 6333.3 ms and returned 59986052 rows
Q3 avg time: 6300.60 ms
Q4 iteration 0 took 1574.2 ms and returned 59986052 rows
Q4 iteration 1 took 1578.4 ms and returned 59986052 rows
Q4 iteration 2 took 1586.4 ms and returned 59986052 rows
Q4 iteration 3 took 1608.0 ms and returned 59986052 rows
Q4 iteration 4 took 1578.5 ms and returned 59986052 rows
Q4 avg time: 1585.13 ms

thread 'main' panicked at /Users/yongting/Code/datafusion/benchmarks/src/sort_tpch.rs:312:32:
called `Result::unwrap()` on an `Err` value: ResourcesExhausted("Additional allocation failed with top memory consumers (across reservations) as:\n  ExternalSorterMerge[2]#586(can spill: false) consumed 182.5 MB,\n  ExternalSorterMerge[10]#602(can spill: false) consumed 179.3 MB,\n  ExternalSorterMerge[5]#592(can spill: false) consumed 176.1 MB,\n  ExternalSorterMerge[4]#590(can spill: false) consumed 175.3 MB,\n  ExternalSorterMerge[0]#582(can spill: false) consumed 173.0 MB.\nError: Failed to allocate additional 248.1 KB for ExternalSorterMerge[8] with 0.0 B already allocated for this reservation - 37.8 KB remain available for the total pool")

Other than that I left some minor suggestions, looking forward to your feedback!

benchmarks/src/clickbench.rs

benchmarks/src/util/run.rs

ding-young · 2025-05-31T08:30:41Z

benchmarks/src/sort_tpch.rs

@@ -294,7 +297,7 @@ impl RunOpt {

        let mut stream = execute_stream(physical_plan.clone(), state.task_ctx())?;
        while let Some(batch) = stream.next().await {
-            row_count += batch.unwrap().num_rows();
+            row_count += batch?.num_rows();


I replaced unwrap() with ? so that sort_tpch does not terminate early. Thanks for catching that !

ding-young · 2025-05-31T08:34:40Z

Thank you for review. I’ve applied your suggestions, and I’ll run it a few more times and check the output display once again before requesting a final review. If there are any other benchmarks I didn’t touch that would benefit from similar changes, please let me know.

ding-young force-pushed the memory-limit-bench branch 2 times, most recently from 2a9d7a8 to e484c52 Compare May 29, 2025 14:39

ding-young marked this pull request as ready for review May 29, 2025 14:51

2010YOUY01 requested a review from Copilot May 31, 2025 04:49

Copilot AI reviewed May 31, 2025

View reviewed changes

benchmarks/compare.py Outdated Show resolved Hide resolved

benchmarks/compare.py Outdated Show resolved Hide resolved

2010YOUY01 reviewed May 31, 2025

View reviewed changes

benchmarks/src/clickbench.rs Outdated Show resolved Hide resolved

benchmarks/src/util/run.rs Show resolved Hide resolved

ding-young added 3 commits May 31, 2025 08:28

Move struct QueryResult to util/run.rs

cc6258a

Modify benches to continue query execution even on failure

806fe2c

Mark benchmark query success on output json

3d5ab62

ding-young force-pushed the memory-limit-bench branch from e484c52 to 3d5ab62 Compare May 31, 2025 08:28

ding-young commented May 31, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update tpch, clickbench, sort_tpch to mark failed queries #16182

Update tpch, clickbench, sort_tpch to mark failed queries #16182

ding-young commented May 24, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

2010YOUY01 left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

ding-young May 31, 2025

Uh oh!

ding-young commented May 31, 2025

Uh oh!

Uh oh!

Update tpch, clickbench, sort_tpch to mark failed queries #16182

Are you sure you want to change the base?

Update tpch, clickbench, sort_tpch to mark failed queries #16182

Conversation

ding-young commented May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

2010YOUY01 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ding-young May 31, 2025

Choose a reason for hiding this comment

Uh oh!

ding-young commented May 31, 2025

Uh oh!

Uh oh!

ding-young commented May 24, 2025 •

edited

Loading

2010YOUY01 left a comment •

edited

Loading