[Slurm scheduler] Add better support for specifying resources in slurm #359

Closed
9 tasks
aivanou opened this issue Nov 22, 2021 · 0 comments
Labels: bug, module: runner, slurm

aivanou commented Nov 22, 2021

🐛 Bug

According to aws/aws-parallelcluster#2198, PCluster has problems running jobs that specify explicit memory requirements.

We need to modify our Slurm scheduler to address this.
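
One possible direction, sketched below, is to make the memory request optional when building the sbatch resource flags. This is only an illustration, not the torchx implementation: the helper `sbatch_resource_opts` and its `omit_mem` parameter are hypothetical names; only the `--cpus-per-task`, `--mem`, and `--gpus-per-task` flags are standard sbatch options.

```python
from typing import List, Optional


def sbatch_resource_opts(
    cpu: Optional[int] = None,
    mem_mb: Optional[int] = None,
    gpu: Optional[int] = None,
    omit_mem: bool = False,
) -> List[str]:
    """Build sbatch resource flags, optionally skipping the memory request.

    Hypothetical helper for illustration; not part of torchx.
    """
    opts: List[str] = []
    if cpu is not None and cpu > 0:
        opts.append(f"--cpus-per-task={cpu}")
    # Skip --mem entirely when the cluster cannot handle explicit memory requests.
    if mem_mb is not None and mem_mb > 0 and not omit_mem:
        # sbatch treats a bare number as megabytes.
        opts.append(f"--mem={mem_mb}")
    if gpu is not None and gpu > 0:
        opts.append(f"--gpus-per-task={gpu}")
    return opts
```

With `omit_mem=True`, the returned list carries the CPU and GPU flags but no `--mem` entry, so the job is submitted without an explicit memory requirement.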

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • [x] torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. SSH into the Slurm cluster
  2. Create main.py that prints hello world
  3. Run: torchx run -s slurm --scheduler_args partition=compute,time=10 dist.ddp --script main.py
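
For step 2, a minimal main.py could look like the following. The issue does not include the original script, so this is an assumed equivalent that only prints the message.

```python
# main.py: minimal script for reproducing the issue (assumed contents).
def main() -> None:
    print("hello world")


if __name__ == "__main__":
    main()
```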

Expected behavior

The job executes successfully.

d4l3k added the bug, module: runner, and slurm labels on Jan 22, 2022