Scalable async RPC #8048
Conversation
Codecov Report

@@             Coverage Diff              @@
##                dev    #8048      +/-  ##
============================================
- Coverage     38.42%   37.95%    -0.47%
  Complexity     2163     2163
============================================
  Files          1242     1243        +1
  Lines        113599   115008     +1409
  Branches       3131     3145       +14
============================================
+ Hits          43648    43652        +4
- Misses        68107    69512     +1405
  Partials       1844     1844
This PR is still in draft because we have some rough edges and various fixes pending. I've opened it early to facilitate early reviews.
I haven't finished my first review, but here are a few comments.
We don't have
In general, great work. Here is a first iteration of general remarks.
It seems that even though we have an abstract worker driver API, each implementation provides a different set of features (I can only have one core worker per infra with Docker, but autoscaling allows more than one in k8s; I can set the machine capacity for k8s, but cannot set any limit for Docker). It's as if implementation details leak out of this API. Maybe the number of workers per infra and the capacities of these nodes should be part of the API?
I disagree on this: we considered integrating the scaling of cores into osrdyne but rejected it. It was clearly not easy to determine when to scale up or down, and it requires taking into account metrics that are not easy to handle properly. From my perspective, it makes much more sense for a Kubernetes deployment that scaling happens at the orchestrator level rather than the application level. In the near future we will probably add an option to leverage Keda instead of the HPA. The Kubernetes driver exposes features of the orchestrator, and for a Docker deployment we can only work with a single host, so horizontally scaling the core makes little sense. We could add a driver for Docker Swarm clusters, but they are marginal (and not what we use). Finally, I don't feel the API is leaking; it just allows passing options to the driver.
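To make the shape being discussed concrete, here is a minimal sketch of such a driver trait. All names and signatures are assumptions for illustration only, not the actual osrdyne code: the point is that the core only asks a driver to make worker groups exist or disappear, while replica counts and scaling live in each driver's own configuration (a single container for Docker, the HPA or Keda for Kubernetes).

```rust
// Hypothetical sketch; trait, type, and method names are illustrative
// assumptions, not the actual osrdyne API.
use std::collections::HashSet;

#[derive(Debug)]
pub struct DriverError(pub String);

/// Identifies one worker group, e.g. the workers bound to one infra.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct WorkerKey(pub String);

pub trait WorkerDriver {
    /// Ensure the worker group for this key exists. Driver-specific
    /// knobs (replicas, machine capacity) come from the driver's own
    /// configuration, not from this shared API.
    fn get_or_create_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError>;
    /// Tear the worker group down.
    fn destroy_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError>;
    /// List the worker groups the driver currently knows about.
    fn list_worker_groups(&self) -> Result<Vec<WorkerKey>, DriverError>;
}

/// Toy in-memory driver standing in for the Docker/Kubernetes ones.
pub struct InMemoryDriver {
    groups: HashSet<WorkerKey>,
}

impl WorkerDriver for InMemoryDriver {
    fn get_or_create_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError> {
        self.groups.insert(key.clone());
        Ok(())
    }
    fn destroy_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError> {
        self.groups.remove(key);
        Ok(())
    }
    fn list_worker_groups(&self) -> Result<Vec<WorkerKey>, DriverError> {
        Ok(self.groups.iter().cloned().collect())
    }
}
```

Under this reading, the Docker and Kubernetes drivers can expose different feature sets without the shared trait leaking them: the trait stays about group lifecycle, and everything else is driver configuration.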
That sentence made me understand your point of view, and indeed I think I agree. It also made me realize that
To be clear, we should use worker_group, not worker_pool (https://osrd.fr/en/docs/reference/design-docs/scalable-async-rpc/#conceps). Currently osrdyne manages one worker pool (yes, fields referring to core are inherited from a previous design, core_controller, and should disappear).
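For readers landing here, a minimal vocabulary sketch of the distinction (hypothetical names, not the actual osrdyne types): the pool is the singleton osrdyne manages, and the groups are the per-key units inside it, which is why worker_group is the right term for the per-infra fields.

```rust
// Vocabulary sketch only; struct and field names are hypothetical,
// following the design doc's terms.
use std::collections::HashMap;

/// One group of interchangeable workers, all serving the same key
/// (e.g. all workers bound to one infra).
struct WorkerGroup {
    /// Queue the group's workers consume requests from (assumed layout).
    request_queue: String,
}

/// The single pool osrdyne manages today: one group per worker key.
struct WorkerPool {
    groups: HashMap<String, WorkerGroup>,
}
```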
Yes, that's a really good point that @flomonster also raised during their review, so I think we're going to modify this. I'll commit something about this later today.
Force-pushed from 8f9e29b to 17ddec8.
LGTM for the most part. Haven't tested yet (might be useful to check the dev setup on macOS where no host networking is available).
Thanks for guiding me through the review and answering my questions :)
Force-pushed from b6b8366 to 1487784.
LGTM, not tested ✅
LGTM (not tested)
LGTM. ApiServerCommand to be deleted at a later stage (+ hopefully at some point we'll drop the v1 endpoints).
Force-pushed from 3edc6ba to 65a12a6.
Force-pushed from de64f2f to 63e5bfd.
Great work!! Tested on the dev environment, it works 🎉!!!
Co-authored-by: ElysaSrc <[email protected]>
Co-authored-by: Younes Khoudli <[email protected]>
LGTM
Fixes #6117.
(This PR replaces the closed PR #7103; check the previous PR's log for the reason.)
This PR implements the Scalable Async RPC design.
Changes
Tests