-
Notifications
You must be signed in to change notification settings - Fork 780
Testing CI in JAX example #2385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Illustrate how to use JAX's `pmap` to express and execute single-program multiple-data (SPMD) programs for data parallelism along a batch dimension Signed-off-by: Sandipan Panda <[email protected]>
Use -- server-side to install the latest local changes of Training Operator control plane Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Pull Request Test Coverage Report for Build 12815665472Details
💛 - Coveralls |
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Hi @saileshd1402 @sandipanpanda @mahdikhashan, any updates with the JAX example ? |
cc @Electronic-Waste @kubeflow/wg-training-leads |
@andreyvelich It would be great if we could catch up with the final feature freeze day on Jan 20th. However, we could implement the CI for JAX example with notebooks like #2246 if the integration test cannot be ready before Jan 20th. |
I'll try to make the job use lesser resources by the end of today. But if it's not working, then I agree with @Electronic-Waste |
hi, i couldn't figure out what could be changed, not yet, regarding the first epoch. i'll spend more time tomorrow. |
|
||
for epoch in range(num_epochs): | ||
start_time = time.time() | ||
num_batches = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich @mahdikhashan The tests are working when I just hard coded this num_batches test to check CI, it's passing so if we make some params smaller specific to CI, the tests should pass. Is there anywhere we can compromise for CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep this number of batches small.
Was it the main issue why the first epoch takes ~ 1 hour ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, from my understanding this was the reason. Resources was not enough to handle more batches/takes way too long.
Should I edit the script so that if a certain CI env var is set, then we set num_batches to lower value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use this number of batches by default.
You can just add the comment that increasing number of batches requires more resources.
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
bb9222a
to
343090c
Compare
@andreyvelich I have addressed your comment. Seems like CI looks good now, please review when you're available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this effort @saileshd1402!
/lgtm
/assign @Electronic-Waste @kubeflow/wg-training-leads @sandipanpanda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your great contribution! @saileshd1402
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @saileshd1402!
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, Electronic-Waste The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherry-pick release-1.9 |
@andreyvelich: new pull request created: #2390 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: