Skip to content

Testing CI in JAX example #2385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 17, 2025
Merged

Conversation

saileshd1402
Copy link
Contributor

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

sandipanpanda and others added 6 commits December 26, 2024 01:30
Illustrate how to use JAX's `pmap` to express and execute
single-program multiple-data (SPMD) programs for data parallelism
along a batch dimension

Signed-off-by: Sandipan Panda <[email protected]>
Use -- server-side to install the latest local changes of Training
Operator control plane

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
@coveralls
Copy link

coveralls commented Jan 9, 2025

Pull Request Test Coverage Report for Build 12815665472

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12718184264: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 9, 2025
@andreyvelich
Copy link
Member

Hi @saileshd1402 @sandipanpanda @mahdikhashan, any updates with the JAX example ?
Since we want to release 1.9 soon, do we want to include this PR as part of the release ?

@andreyvelich
Copy link
Member

cc @Electronic-Waste @kubeflow/wg-training-leads

@Electronic-Waste
Copy link
Member

@andreyvelich It would be great if we could catch up with the final feature freeze day on Jan 20th. However, we could implement the CI for JAX example with notebooks like #2246 if the integration test cannot be ready before Jan 20th.

@saileshd1402
Copy link
Contributor Author

I'll try to make the job use lesser resources by the end of today. But if it's not working, then I agree with @Electronic-Waste

@mahdikhashan
Copy link
Member

mahdikhashan commented Jan 16, 2025

Hi @saileshd1402 @sandipanpanda @mahdikhashan, any updates with the JAX example ? Since we want to release 1.9 soon, do we want to include this PR as part of the release ?

hi, i couldn't figure out what could be changed, not yet, regarding the first epoch. i'll spend more time tomorrow.


for epoch in range(num_epochs):
start_time = time.time()
num_batches = 5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich @mahdikhashan The tests are working when I just hard coded this num_batches test to check CI, it's passing so if we make some params smaller specific to CI, the tests should pass. Is there anywhere we can compromise for CI?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep this number of batches small.
Was it the main issue why the first epoch takes ~ 1 hour ?

Copy link
Contributor Author

@saileshd1402 saileshd1402 Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, from my understanding this was the reason. Resources was not enough to handle more batches/takes way too long.

Should I edit the script so that if a certain CI env var is set, then we set num_batches to lower value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use this number of batches by default.
You can just add the comment that increasing number of batches requires more resources.

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
@saileshd1402 saileshd1402 changed the title [WIP] Testing CI in JAX example Testing CI in JAX example Jan 16, 2025
@saileshd1402
Copy link
Contributor Author

@andreyvelich I have addressed your comment. Seems like CI looks good now, please review when you're available.

cc: @sandipanpanda @mahdikhashan @Electronic-Waste

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this effort @saileshd1402!
/lgtm
/assign @Electronic-Waste @kubeflow/wg-training-leads @sandipanpanda

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your great contribution! @saileshd1402

/lgtm

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @saileshd1402!
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 6d58ea9 into kubeflow:master Jan 17, 2025
56 checks passed
@andreyvelich
Copy link
Member

/cherry-pick release-1.9

@google-oss-robot
Copy link

@andreyvelich: new pull request created: #2390

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants