[serve] Log rejected requests at router side #51346

Merged: 6 commits, Mar 14, 2025

@zcin (Contributor) commented Mar 13, 2025

Why are these changes needed?

Router-side logs (less alarming, and clear that the request will be retried):

INFO 2025-03-13 13:42:35,298 serve 40047 -- Replica(id='7mqhdb0d', deployment='Model', app='default') rejected request because it is at max capacity of 1 ongoing requests. Retrying request 4a843e03-e1c7-47a2-be9d-6c0224108f42.
INFO 2025-03-13 13:42:35,298 serve 40047 -- Replica(id='7mqhdb0d', deployment='Model', app='default') rejected request because it is at max capacity of 1 ongoing requests. Retrying request 57d94c8a-13b4-4ea2-a628-75d566ef29e5.
INFO 2025-03-13 13:42:35,301 serve 40047 -- Replica(id='7mqhdb0d', deployment='Model', app='default') rejected request because it is at max capacity of 1 ongoing requests. Retrying request 4a843e03-e1c7-47a2-be9d-6c0224108f42.

Replica-side logs about rejected requests are now DEBUG logs only.

This makes the logs less alarming for users who are not familiar with the request lifecycle. As the logs stand today, a user reading the replica-side logs can be confused into thinking that requests were dropped.
https://anyscale1.atlassian.net/browse/SERVE-659
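
To make the request lifecycle described above concrete, here is a minimal standalone sketch of a router that retries a request when a replica rejects it for being at capacity, logging each rejection at INFO. It uses only the standard library and is not Ray Serve code: `ToyReplica`, `route`, `main`, the logger name, and the backoff and work durations are all hypothetical, and the log message only loosely mirrors the format shown in this PR.

```python
import asyncio
import logging
import uuid

logger = logging.getLogger("serve_sketch.router")  # hypothetical logger name


class ToyReplica:
    """Hypothetical stand-in for a replica with a max_ongoing_requests limit."""

    def __init__(self, replica_id: str, max_ongoing_requests: int):
        self.replica_id = replica_id
        self.max_ongoing_requests = max_ongoing_requests
        self.num_ongoing = 0

    def try_accept(self) -> bool:
        # Reject when the replica is already at capacity.
        if self.num_ongoing >= self.max_ongoing_requests:
            return False
        self.num_ongoing += 1
        return True

    def finish(self) -> None:
        self.num_ongoing -= 1


async def route(replica: ToyReplica, request_id: str, backoff_s: float = 0.01) -> int:
    """Retry until the replica accepts; return how many rejections were seen."""
    rejections = 0
    while not replica.try_accept():
        rejections += 1
        # A rejection is not a failure: the request is retried, so INFO is enough.
        logger.info(
            "Replica(id=%r) rejected request because it is at max capacity of "
            "%d ongoing requests. Retrying request %s.",
            replica.replica_id,
            replica.max_ongoing_requests,
            request_id,
        )
        await asyncio.sleep(backoff_s)
    try:
        await asyncio.sleep(0.02)  # simulated request handling
    finally:
        replica.finish()
    return rejections


async def main() -> list:
    replica = ToyReplica("7mqhdb0d", max_ongoing_requests=1)
    request_ids = [str(uuid.uuid4()) for _ in range(3)]
    return await asyncio.gather(*(route(replica, rid) for rid in request_ids))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(asyncio.run(main()))
```

Every request in the sketch eventually completes, which is the point the new router-side message is meant to convey: a rejection triggers a retry, not a drop.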

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

zcin added 2 commits March 13, 2025 13:39
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@zcin added the "go" label (add ONLY when ready to merge, run all tests) on Mar 13, 2025
Signed-off-by: Cindy Zhang <[email protected]>
@@ -624,7 +624,7 @@ async def handle_request_with_rejection(
         limit = self._deployment_config.max_ongoing_requests
         num_ongoing_requests = self.get_num_ongoing_requests()
         if num_ongoing_requests >= limit:
-            logger.warning(
+            logger.info(
Contributor (review comment):

is this supposed to be logger.debug?

Contributor Author (@zcin):

oops yes thanks
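
For readers following this thread, a minimal standalone sketch of what logging the replica-side rejection at DEBUG means in practice: the message is suppressed at the usual INFO level and only appears when DEBUG is enabled. This is not the Ray Serve implementation; `should_reject`, `ListHandler`, `rejection_messages_at`, the logger name, and the message text are hypothetical.

```python
import logging

logger = logging.getLogger("serve_sketch.replica")  # hypothetical logger name


def should_reject(num_ongoing_requests: int, limit: int) -> bool:
    """Return True when a new request should be rejected for capacity reasons."""
    if num_ongoing_requests >= limit:
        # DEBUG keeps this out of default logs; the router reports the retry.
        logger.debug(
            "Replica at capacity of max_ongoing_requests=%d, rejecting request.",
            limit,
        )
        return True
    return False


class ListHandler(logging.Handler):
    """Collects emitted messages so the effect of the log level is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def rejection_messages_at(level: int) -> list:
    """Trigger one rejection with the logger set to `level`; return what was emitted."""
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        should_reject(num_ongoing_requests=1, limit=1)
    finally:
        logger.removeHandler(handler)
    return handler.messages
```

Calling `rejection_messages_at(logging.INFO)` yields no messages, while `rejection_messages_at(logging.DEBUG)` yields the single rejection message.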

Signed-off-by: Cindy Zhang <[email protected]>
@GeneDer (Contributor) left a comment

:shipit:

zcin added 2 commits March 13, 2025 16:29
Signed-off-by: Cindy Zhang <[email protected]>
@zcin zcin merged commit 65514ea into ray-project:master Mar 14, 2025
5 checks passed
@zcin zcin deleted the rej-logs branch March 14, 2025 19:07
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
akyang-anyscale added a commit to akyang-anyscale/ray that referenced this pull request Mar 26, 2025
akyang-anyscale added a commit to akyang-anyscale/ray that referenced this pull request Mar 26, 2025
zcin pushed a commit that referenced this pull request Mar 26, 2025
This reverts commit 65514ea.

## Why are these changes needed?

Logging the rejected requests lowers Serve throughput. The
regression was originally flagged by the microbenchmark test that runs
in the nightly release tests.
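
As an illustration of why a log call on every rejection can be expensive on a hot retry path, the sketch below caps emission at one record per interval and counts the rest. This is only one possible mitigation and is an assumption, not what this revert does (the revert simply undoes commit 65514ea). `ThrottledLogger`, its parameters, and the injectable `clock` are hypothetical names.

```python
import logging
import time
from typing import Callable, Optional


class ThrottledLogger:
    """Emit at most one INFO record per `min_interval_s`; count the rest."""

    def __init__(
        self,
        logger: logging.Logger,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._logger = logger
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behavior can be tested deterministically
        self._last_emit: Optional[float] = None
        self.suppressed = 0

    def info(self, msg: str, *args) -> bool:
        """Log `msg` unless a record was emitted too recently; return True if logged."""
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._min_interval_s:
            self.suppressed += 1
            return False
        self._last_emit = now
        self._logger.info(msg, *args)
        return True
```

With a one-second interval, a burst of rejections inside the same second produces a single record, and `suppressed` records how many were skipped.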

## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: akyang-anyscale <[email protected]>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
Signed-off-by: Dhakshin Suriakannu <[email protected]>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…)" (ray-project#51698)

Signed-off-by: Dhakshin Suriakannu <[email protected]>
srinathk10 pushed a commit that referenced this pull request Mar 28, 2025
Signed-off-by: Srinath Krishnamachari <[email protected]>
Labels: community-backlog; go (add ONLY when ready to merge, run all tests)
4 participants