retry fetching block data with decreasing batch size #204

Merged

Conversation

@iuwqyir iuwqyir commented May 19, 2025

TL;DR

Implemented a retry mechanism for failed block fetches with automatic chunk splitting and added result sorting.

What changed?

  • Added a new processChunkWithRetry function (a sketch follows this list) that:
    • Processes blocks in chunks and detects failed requests
    • Splits failed blocks into smaller chunks and retries them recursively
    • Continues splitting until successful or reaching single block chunks
    • Sends successful results to the results channel as they're processed
  • Modified the Run method to use the new retry mechanism
  • Added sorting of results by block number before returning
  • Removed the conditional sleep and made it consistent after each chunk processing
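
A minimal sketch of that flow, for orientation; GetFullBlocks, BatchDelay, and the Error and BlockNumber fields mirror names that come up in the review below, but their exact signatures here are assumptions, not the repository's code:

// Assumed imports: "math/big", "sync", "time", and the repo's internal rpc package.
func (w *Worker) processChunkWithRetry(chunk []*big.Int, resultsCh chan<- rpc.GetFullBlockResult) {
	// The delay is applied consistently after every chunk (see "What changed?").
	defer time.Sleep(batchDelay) // batchDelay stands in for the configured BatchDelay

	results := w.rpc.GetFullBlocks(chunk) // assumed: one result per requested block

	// Base case: a single block is never split further; forward whatever came back.
	if len(chunk) == 1 {
		resultsCh <- results[0]
		return
	}

	// Forward successes immediately; collect failures for a retry.
	var failedBlocks []*big.Int
	for _, result := range results {
		if result.Error == nil {
			resultsCh <- result
		} else {
			failedBlocks = append(failedBlocks, result.BlockNumber)
		}
	}
	if len(failedBlocks) == 0 {
		return
	}
	if len(failedBlocks) == 1 {
		w.processChunkWithRetry(failedBlocks, resultsCh) // retry the lone block as a size-1 chunk
		return
	}

	// Split the failed blocks in half and retry both halves concurrently.
	mid := len(failedBlocks) / 2
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); w.processChunkWithRetry(failedBlocks[:mid], resultsCh) }()
	go func() { defer wg.Done(); w.processChunkWithRetry(failedBlocks[mid:], resultsCh) }()
	wg.Wait()
}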

How to test?

  1. Run the application with a mix of valid and invalid block numbers
  2. Verify that all valid blocks are fetched successfully, even when part of a batch with invalid blocks
  3. Confirm that the results are properly sorted by block number
  4. Check logs for "Splitting failed blocks" messages to verify the retry mechanism is working

Why make this change?

This change improves the reliability of block fetching by automatically retrying failed requests with smaller batch sizes. The previous implementation would fail an entire batch if any block in it failed, leading to missing data. The new approach maximizes data retrieval by isolating problematic blocks and ensuring the results are returned in a consistent order regardless of how they were processed.

Summary by CodeRabbit

  • New Features
    • Improved reliability of block fetching with automatic retries for failed attempts.
  • Enhancements
    • Results are now delivered more promptly for successfully fetched blocks, even if others fail.
    • Improved sorting of fetched blocks for more consistent output.

@iuwqyir iuwqyir requested a review from catalyst17 May 19, 2025 14:15
@iuwqyir iuwqyir marked this pull request as ready for review May 19, 2025 14:15
Contributor

@catalyst17 catalyst17 left a comment

nice one!

@iuwqyir iuwqyir force-pushed the 05-19-retry_fetching_block_data_with_decreasing_batch_size branch from 1ac61b8 to 4f2ca9a Compare May 20, 2025 09:08
@iuwqyir iuwqyir force-pushed the 05-19-change_sql_script_position_so_that_it_is_run_first branch 2 times, most recently from e226f96 to 3d6fb55 Compare May 21, 2025 06:54
@iuwqyir iuwqyir force-pushed the 05-19-retry_fetching_block_data_with_decreasing_batch_size branch from 4f2ca9a to c692c5e Compare May 21, 2025 06:54

coderabbitai bot commented May 21, 2025

Walkthrough

A new method, processChunkWithRetry, was introduced on the Worker struct to implement a retry mechanism for block fetching. The Run method now uses it for chunk processing, and the results channel buffer size was adjusted. Results are sorted by block number before returning; metrics updates are unchanged.

Changes

File(s): internal/worker/worker.go
Change summary: Added processChunkWithRetry method for block fetch retries; updated Run to use it, changed results channel buffering, and sorted results by block number.

Sequence Diagram(s)

sequenceDiagram
    participant Worker
    participant RPC
    participant ResultsChannel

    Worker->>Worker: processChunkWithRetry(chunk, resultsCh)
    alt chunk size == 1
        Worker->>RPC: Fetch block via RPC
        RPC-->>Worker: Block result
        Worker->>ResultsChannel: Send result
    else chunk size > 1
        Worker->>RPC: Fetch all blocks in chunk
        RPC-->>Worker: Block results (some may fail)
        alt All succeeded
            Worker->>ResultsChannel: Send all results
        else Some failed
            Worker->>Worker: Retry failed blocks (possibly split & concurrent)
            Worker->>ResultsChannel: Send successful results as available
        end
    end
sequenceDiagram
    participant Worker
    participant ResultsChannel

    Worker->>Worker: Run()
    loop For each chunk
        Worker->>Worker: processChunkWithRetry(chunk, resultsCh)
    end
    Worker->>Worker: Collect all results from resultsCh
    Worker->>Worker: Sort results by block number
    Worker-->>Caller: Return sorted results
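
Since block numbers are big integers, the final sort presumably compares them with Cmp rather than casting to a native type; a minimal sketch, assuming rpc.GetFullBlockResult exposes a *big.Int BlockNumber field (assumed name) and using the standard sort package:

// Sort collected results ascending by block number (big.Int-safe comparison).
sort.Slice(results, func(i, j int) bool {
	return results[i].BlockNumber.Cmp(results[j].BlockNumber) < 0
})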


@iuwqyir iuwqyir changed the base branch from 05-19-change_sql_script_position_so_that_it_is_run_first to graphite-base/204 May 21, 2025 06:55
@iuwqyir iuwqyir force-pushed the 05-19-retry_fetching_block_data_with_decreasing_batch_size branch from c692c5e to afbb7e0 Compare May 21, 2025 06:56
@graphite-app graphite-app bot changed the base branch from graphite-base/204 to main May 21, 2025 06:56
@iuwqyir iuwqyir force-pushed the 05-19-retry_fetching_block_data_with_decreasing_batch_size branch from afbb7e0 to 78e2b76 Compare May 21, 2025 06:56
@iuwqyir iuwqyir merged commit c8cf165 into main May 21, 2025
5 of 6 checks passed
@iuwqyir iuwqyir deleted the 05-19-retry_fetching_block_data_with_decreasing_batch_size branch May 21, 2025 06:58

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/worker/worker.go (2)

26-33: Consider moving the delay outside the recursive call stack

Placing the time.Sleep in a defer means every recursive invocation – including those created solely for retries on a single block – will incur the full BatchDelay.
With deep splitting, this can quickly multiply the total wall-clock time by O(log n) (or even O(n) in the worst case where every block fails), although the actual RPC fan-out already throttles the provider.

If the intent is simply to rate-limit requests to the provider, consider moving the delay:

  1. Just before the top-level GetFullBlocks call, instead of in defer, so it executes once per network round-trip.
  2. Or guard it with if len(chunk) == originalChunkSize { … } to avoid stacking delays on leaf retries.

This keeps the throttle while preventing unnecessary cumulative sleep on recursive splits.
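
Option 2 could be as small as the guard below, where originalChunkSize is a hypothetical parameter recording the size of the chunk Run originally dispatched:

// Sleep only after top-level chunks; recursive retry splits skip the delay.
if len(chunk) == originalChunkSize {
	defer time.Sleep(batchDelay) // batchDelay stands in for the configured BatchDelay
}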


118-121: Sorting logic looks correct, but big.Int → float64 lossiness remains downstream

The added sort by BlockNumber is spot-on.
Note that a few lines below (unchanged) the largest block is converted to float64, which silently loses precision after ~2⁵³. Consider using an integer gauge or exposing the value as a label/string instead.
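
The cutoff is easy to demonstrate: float64 has a 53-bit significand, so adjacent integers above 2⁵³ collapse to the same value. For example:

// Both conversions yield exactly 9007199254740992; the +1 is silently lost.
a := float64(uint64(1) << 53)
b := float64(uint64(1)<<53 + 1)
fmt.Println(a == b) // prints true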

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between ee125fc and 78e2b76.

📒 Files selected for processing (1)
  • internal/worker/worker.go (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
internal/worker/worker.go (3)
internal/rpc/rpc.go (1)
  • GetFullBlockResult (18-22)
configs/config.go (1)
  • Cfg (170-170)
internal/common/utils.go (1)
  • SliceToChunks (10-23)
🔇 Additional comments (1)
internal/worker/worker.go (1)

96-97: Good catch on channel buffering

Switching the buffer from chunk count to the total block count prevents deadlocks when retries emit more result batches than the original chunking.
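
Concretely, the sizing amounts to something like the following (the blockNumbers name is assumed):

// One buffered slot per requested block, so no send can block a retry
// goroutine even if every block is eventually emitted individually.
resultsCh := make(chan rpc.GetFullBlockResult, len(blockNumbers))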

Comment on lines +68 to +89
// Split failed blocks in half and retry
mid := len(failedBlocks) / 2
leftChunk := failedBlocks[:mid]
rightChunk := failedBlocks[mid:]

log.Debug().Msgf("Splitting %d failed blocks into chunks of %d and %d", len(failedBlocks), len(leftChunk), len(rightChunk))

var wg sync.WaitGroup
wg.Add(2)

go func() {
	defer wg.Done()
	w.processChunkWithRetry(leftChunk, resultsCh)
}()

go func() {
	defer wg.Done()
	w.processChunkWithRetry(rightChunk, resultsCh)
}()

wg.Wait()
}

🛠️ Refactor suggestion

Potential goroutine explosion when many blocks fail

Every failure split spawns two new goroutines and waits (wg.Wait()).
For k consecutively failing blocks this creates 2k − 1 goroutines (a full binary tree over the failures; 1,024 failing blocks alone spawn 2,047 goroutines), which can reach thousands under a bad RPC outage and may:

  • Exceed the Go scheduler’s default thread limits.
  • Create unnecessary network pressure once the RPC endpoint comes back.

Recommend introducing a simple concurrency limiter (token/semaphore) or using a worker-pool instead of spawning unbounded goroutines, e.g.:

 type Worker struct {
     rpc rpc.IRPCClient
+    sem  chan struct{} // capacity == max concurrent RPCs
 }
 
 func (w *Worker) acquire() { w.sem <- struct{}{} }
 func (w *Worker) release() { <-w.sem }

 // before launching a goroutine
-wg.Add(2)
-go func() { … }()
+wg.Add(2)
+go func() {
+    w.acquire()
+    defer w.release()
+
+}()

This keeps memory & CPU predictable even under sustained failure scenarios.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In internal/worker/worker.go around lines 68 to 89, the current approach spawns
two new goroutines for each failure split without limit, causing exponential
growth in goroutines under repeated failures. To fix this, implement a
concurrency limiter such as a semaphore or token bucket to restrict the number
of concurrent goroutines processing chunks. Alternatively, refactor to use a
fixed-size worker pool that processes failed blocks from a queue, ensuring
controlled concurrency and preventing resource exhaustion during sustained
failure scenarios.
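
One caveat with holding a token across the recursive call: a parent blocked in wg.Wait() keeps its slot while its children queue for slots of their own, which can deadlock at small capacities. Scoping the token to the RPC round-trip alone avoids that. A sketch with illustrative names (not the PR's code):

// Assumed imports: "math/big" and the repo's internal rpc package.
type Worker struct {
	rpc rpc.IRPCClient
	sem chan struct{} // counting semaphore, e.g. make(chan struct{}, 16)
}

// fetchWithLimit caps concurrent RPC calls without holding a token across
// wg.Wait(), so recursive splits cannot starve each other of slots.
func (w *Worker) fetchWithLimit(chunk []*big.Int) []rpc.GetFullBlockResult {
	w.sem <- struct{}{}        // acquire a slot; blocks when at capacity
	defer func() { <-w.sem }() // release the slot once the fetch returns
	return w.rpc.GetFullBlocks(chunk)
}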
