Skip to content

[Sorta-breaking] Update core to fix low wft slots/pollers issue #877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 28, 2025

Conversation

Sushisource
Copy link
Member

Updated core to obtain the fix for low (<2) WFT slot/poller counts causing workers to potentially get stuck.

This is a sorta-breaking change in that it makes a previously invalid configuration into a hard error at worker startup time.

@Sushisource Sushisource requested a review from a team as a code owner May 28, 2025 01:19
@Sushisource
Copy link
Member Author

Sushisource commented May 28, 2025

Unclear why this test_workflow_deadlock_fill_up_slots is failing now, but it seems related given it failed everywhere. Will need to look into it.

Comment on lines +7214 to +7215
# Start the worker with CPU count + 11 task slots
max_concurrent_workflow_tasks=cpu_count + 11,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit I don't actually understand how starting cpu_count + 5 workflows somehow eats up 100% of cpu_count + 10 task slots.

However, it does, and before my change the pollers would (erroneously) make sure permits were attempted to be acquired evenly between sticky/nonsticky. Now that that's fixed, 100% of the permits can be eaten up by the deadlockers, meaning the non-deadlocking workflow never gets a chance to start.

This is, IMO, seemingly an natural consequence of deadlocking every single task slot. From what I understand, we all accepted that we could never free slots of deadlocked workflows until they actually became released.

So, seemingly this test change is acceptable, but please LMK if I'm wrong.

Copy link
Member

@cretz cretz May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit I don't actually understand how starting cpu_count + 5 workflows somehow eats up 100% of cpu_count + 10 task slots.

I think this is worth understanding. I would assume "max slots used" = "max concurrent tasks" + "max pollers", is that not correct here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that is it. I forgot the default number of pollers here is 5.

poll workflow task requests we will perform at a time on this
worker's task queue.
poll workflow task requests we will perform at a time on this worker's task queue.
Must be set to at least two if ``max_cached_workflows`` is nonzero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it is not? Is there some validation error coming from Core?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 If there is an actual exception thrown, that works for me. I assume it's a clear exception? I do think we should make sure to call this out in the release notes as technically a breaking change to those that have (improperly) set this to 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be quite clear

Comment on lines +7214 to +7215
# Start the worker with CPU count + 11 task slots
max_concurrent_workflow_tasks=cpu_count + 11,
Copy link
Member

@cretz cretz May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit I don't actually understand how starting cpu_count + 5 workflows somehow eats up 100% of cpu_count + 10 task slots.

I think this is worth understanding. I would assume "max slots used" = "max concurrent tasks" + "max pollers", is that not correct here?

@Sushisource Sushisource merged commit b24326c into main May 28, 2025
18 checks passed
@Sushisource Sushisource deleted the update-core branch May 28, 2025 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants