Scalable async RPC #8048
Conversation
Codecov Report

@@             Coverage Diff              @@
##                dev    #8048      +/-  ##
============================================
- Coverage     38.42%   37.95%    -0.47%
  Complexity     2163     2163
============================================
  Files          1242     1243        +1
  Lines        113599   115008     +1409
  Branches       3131     3145       +14
============================================
+ Hits          43648    43652        +4
- Misses        68107    69512     +1405
  Partials       1844     1844
This PR is still in draft because we have some rough edges and various fixes pending. I've opened it early to facilitate early reviews.
I haven't finished my first review, but here are a few comments.
We don't have
In general, great work. Here is a first iteration of general remarks.
It seems that even though we have an abstract worker driver API, each implementation provides a different set of features (I can only have one core worker per infra with Docker, but autoscaling allows more than one in k8s; I can set the machine capacity for k8s, but cannot set any limit for Docker). It's as if implementation details leak out of this API. Maybe the number of workers per infra and the capacities of these nodes should be part of the API?
I disagree on this: we considered integrating the scaling of cores into osrdyne but rejected it. It was clearly not easy to determine when to scale up or down, and it requires taking into account metrics that are not easy to handle properly. From my perspective, it makes much more sense for a Kubernetes deployment that scaling happens at the orchestrator level rather than the application level. In the near future we will probably add an option to leverage Keda instead of the HPA. The Kubernetes driver exposes features of the orchestrator, and for a Docker deployment we can only work with a single host, so horizontally scaling the core makes little sense. We could add a driver for Docker Swarm clusters, but they are marginal (and not what we use). Finally, I don't feel the API is leaking; it just allows passing options to the driver.
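To make the shape being discussed concrete, here is a minimal sketch of such a driver trait. All names and signatures are assumptions for illustration only, not the actual osrdyne code: the point is that the core only asks a driver to make worker groups exist or disappear, while replica counts and scaling live in each driver's own configuration (a single container for Docker, the HPA or Keda for Kubernetes).

```rust
// Hypothetical sketch; trait, type, and method names are illustrative
// assumptions, not the actual osrdyne API.
use std::collections::HashSet;

#[derive(Debug)]
pub struct DriverError(pub String);

/// Identifies one worker group, e.g. the workers bound to one infra.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct WorkerKey(pub String);

pub trait WorkerDriver {
    /// Ensure the worker group for this key exists. Driver-specific
    /// knobs (replicas, machine capacity) come from the driver's own
    /// configuration, not from this shared API.
    fn get_or_create_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError>;
    /// Tear the worker group down.
    fn destroy_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError>;
    /// List the worker groups the driver currently knows about.
    fn list_worker_groups(&self) -> Result<Vec<WorkerKey>, DriverError>;
}

/// Toy in-memory driver standing in for the Docker/Kubernetes ones.
pub struct InMemoryDriver {
    groups: HashSet<WorkerKey>,
}

impl WorkerDriver for InMemoryDriver {
    fn get_or_create_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError> {
        self.groups.insert(key.clone());
        Ok(())
    }
    fn destroy_worker_group(&mut self, key: &WorkerKey) -> Result<(), DriverError> {
        self.groups.remove(key);
        Ok(())
    }
    fn list_worker_groups(&self) -> Result<Vec<WorkerKey>, DriverError> {
        Ok(self.groups.iter().cloned().collect())
    }
}
```

Under this reading, the Docker and Kubernetes drivers can expose different feature sets without the shared trait leaking them: the trait stays about group lifecycle, and everything else is driver configuration.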
That sentence made me understand your point of view, and indeed I think I agree. It also made me realize that
To be clear, we should use worker_group, not worker_pool (https://osrd.fr/en/docs/reference/design-docs/scalable-async-rpc/#conceps). Currently osrdyne manages one worker pool (yes, fields referring to core are inherited from a previous design, core_controller, and should disappear).
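For readers landing here, a minimal vocabulary sketch of the distinction (hypothetical names, not the actual osrdyne types): the pool is the singleton osrdyne manages, and the groups are the per-key units inside it, which is why worker_group is the right term for the per-infra fields.

```rust
// Vocabulary sketch only; struct and field names are hypothetical,
// following the design doc's terms.
use std::collections::HashMap;

/// One group of interchangeable workers, all serving the same key
/// (e.g. all workers bound to one infra).
struct WorkerGroup {
    /// Queue the group's workers consume requests from (assumed layout).
    request_queue: String,
}

/// The single pool osrdyne manages today: one group per worker key.
struct WorkerPool {
    groups: HashMap<String, WorkerGroup>,
}
```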
Yes, that's a really good point that @flomonster also raised during their review, so I think we're going to modify this. I'll commit something about this later today.
Force-pushed from 8f9e29b to 17ddec8.
LGTM for the most part. Haven't tested yet (might be useful to check the dev setup on macOS where no host networking is available).
Thanks for guiding me through the review and answering my questions :)
Force-pushed from b6b8366 to 1487784.
LGTM, not tested ✅
LGTM (not tested)
LGTM. ApiServerCommand to be deleted at a later stage (+ hopefully at some point we'll drop the v1 endpoints).
Force-pushed from 3edc6ba to 65a12a6.
Force-pushed from de64f2f to 63e5bfd.
Great work!! Tested on the dev environment, it works 🎉!!!
Co-authored-by: ElysaSrc <[email protected]>
Co-authored-by: Younes Khoudli <[email protected]>
LGTM
Fixes #6117.
(This PR replaces the closed PR #7103; check the previous PR's log for the reason.)
This PR implements the Scalable Async RPC design.
Changes
Tests