[Sorta-breaking] Update core to fix low wft slots/pollers issue #877
Conversation
Unclear why this:

    # Start the worker with CPU count + 11 task slots
    max_concurrent_workflow_tasks=cpu_count + 11,
I'll admit I don't actually understand how starting cpu_count + 5 workflows somehow eats up 100% of cpu_count + 10 task slots.
However, it does, and before my change the pollers would (erroneously) try to acquire permits evenly between the sticky and non-sticky queues. Now that that's fixed, 100% of the permits can be eaten up by the deadlockers, meaning the non-deadlocking workflow never gets a chance to start.
This is, IMO, a natural consequence of deadlocking every single task slot. From what I understand, we all accepted that we can never free the slots of deadlocked workflows until those workflows are actually released.
So this test change seems acceptable to me, but please LMK if I'm wrong. A minimal sketch of the failure mode is below.
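To illustrate, here is a minimal sketch in plain Python (not the actual Core slot implementation; names and numbers are illustrative): once every slot permit is held by a deadlocked workflow task, and permits are only released when the holding task finishes, a healthy workflow can never acquire one.

```python
import threading
import time

# Minimal sketch of slot exhaustion; names and numbers are illustrative,
# not the real Core slot types or defaults.
MAX_SLOTS = 4
slots = threading.Semaphore(MAX_SLOTS)

def deadlocked_workflow_task() -> None:
    slots.acquire()           # take a slot permit...
    threading.Event().wait()  # ...and never finish, so the permit is never released

# Every permit ends up held by a deadlocked task.
for _ in range(MAX_SLOTS):
    threading.Thread(target=deadlocked_workflow_task, daemon=True).start()
time.sleep(0.5)  # let the deadlocked tasks grab their permits

# A healthy workflow task now has nothing left to acquire. Before the fix the
# forced even sticky/non-sticky split left some permits unclaimed; after it,
# all of them can be consumed by the deadlockers.
print("healthy task got a slot:", slots.acquire(timeout=1.0))  # False
```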
> I'll admit I don't actually understand how starting cpu_count + 5 workflows somehow eats up 100% of cpu_count + 10 task slots.

I think this is worth understanding. I would assume "max slots used" = "max concurrent tasks" + "max pollers", is that not correct here?
Ah, yes, that is it. I forgot the default number of pollers here is 5.
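For concreteness, here is the arithmetic as I read this thread (assuming the default of 5 workflow task pollers, and that each outstanding poll also occupies a slot; the exact accounting inside Core may differ):

```python
import os

cpu_count = os.cpu_count() or 1

max_concurrent_workflow_tasks = cpu_count + 11  # slots configured by the test
deadlocking_workflows = cpu_count + 5           # workflows that never complete a task
default_pollers = 5                             # assumed default workflow task poller count

# "max slots used" = "max concurrent tasks" + "max pollers": the deadlocking
# workflows pin cpu_count + 5 slots and the (default) 5 pollers can each hold
# one more, so cpu_count + 10 slots can be occupied, leaving exactly one slot
# free for the single non-deadlocking workflow.
slots_used = deadlocking_workflows + default_pollers
assert slots_used == cpu_count + 10
assert max_concurrent_workflow_tasks - slots_used == 1
```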
    - poll workflow task requests we will perform at a time on this
    - worker's task queue.
    + poll workflow task requests we will perform at a time on this worker's task queue.
    + Must be set to at least two if ``max_cached_workflows`` is nonzero.
What happens if it is not? Is there some validation error coming from Core?
Yes
👍 If there is an actual exception thrown, that works for me. I assume it's a clear exception? I do think we should make sure to call this out in the release notes as technically a breaking change for those who have (improperly) set this to 1.
Yes, it should be quite clear.
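For the release notes, the configuration that turns into a hard error looks roughly like this (a sketch; the workflow, the task queue name, and the exact exception surfaced from Core are placeholders, not taken from this PR):

```python
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

@workflow.defn
class MyWorkflow:  # illustrative workflow
    @workflow.run
    async def run(self) -> str:
        return "done"

async def main() -> None:
    client = await Client.connect("localhost:7233")

    # Caching is on (max_cached_workflows > 0) but only one workflow task poll
    # is allowed. Previously this could start and then get stuck; after this
    # change it should instead fail fast with a validation error at worker
    # startup (exact raise point depends on where Core validates the config).
    worker = Worker(
        client,
        task_queue="example-task-queue",
        workflows=[MyWorkflow],
        max_cached_workflows=1000,
        max_concurrent_workflow_task_polls=1,
    )
    await worker.run()
```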
Updated core to obtain the fix for low (<2) WFT slot/poller counts causing workers to potentially get stuck.
This is a sorta-breaking change in that it makes a previously invalid configuration into a hard error at worker startup time.
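As a hedged migration note (my reading of the new constraint, not official guidance): workers that previously set a single workflow task poller either need a second poller or need the workflow cache disabled. Reusing `client` and `MyWorkflow` from the sketch above:

```python
from temporalio.worker import Worker

# Option 1: keep the sticky cache and allow at least two workflow task polls.
worker_with_two_pollers = Worker(
    client,
    task_queue="example-task-queue",
    workflows=[MyWorkflow],
    max_cached_workflows=1000,
    max_concurrent_workflow_task_polls=2,  # minimum allowed when caching is on
)

# Option 2: disable the cache, in which case a single poller remains valid.
worker_without_cache = Worker(
    client,
    task_queue="example-task-queue",
    workflows=[MyWorkflow],
    max_cached_workflows=0,
    max_concurrent_workflow_task_polls=1,
)
```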