
[torchx/ray] Is elastic training on ray clusters supported? #520

Open
@ntlm1686

Description

🐛 Bug

Hi, I would like to know the current state of running elastic training on ray clusters.

I tried to repeat some of the experiments (notebook) from this blog on my ray cluster, but I got unexpected behavior.

  • I EXPECT that when I use the custom component and the cluster has fewer available nodes than the job requested, the submitted job continues running with the current nodes, and when new nodes become available, they can join the training process (see the elastic-launch sketch after this list). What I OBSERVED is that the job failed with the error below:
    TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
    
  • When I use the built-in dist.ddp component, even when there are enough compute resources, the ray job status always shows succeeded, but the expected output never appears in the ray job logs; the only information in the log is
    Waiting for placement group to start.
    
  • When I use the custom component and the cluster has the required resources, the submitted job produces the expected output in the log file, but the job never stops. When I check the ray job status, it always shows
    Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
    Status message: Job is currently running.
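
To be concrete about what I mean by "elastic": it is the behavior torch.distributed.elastic provides when a min/max node range is given, as in the rough sketch below (the rendezvous endpoint, restart count, and entrypoint are illustrative placeholders, not my exact setup):

# Rough sketch of the elasticity I expect: training starts once min_nodes
# are up, and new nodes can join up to max_nodes while the job keeps running.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    ...  # placeholder for the actual training entrypoint

config = LaunchConfig(
    min_nodes=1,            # start with as few as 1 node
    max_nodes=4,            # allow up to 4 nodes to join mid-run
    nproc_per_node=1,
    rdzv_backend="c10d",
    rdzv_endpoint="addr-of-rendezvous:29400",  # illustrative endpoint
    max_restarts=3,         # fault tolerance: restart failed workers
)
elastic_launch(config, train)()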
    

Question

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

I tried two ways to launch a TorchX job on ray:

# Use custom component (a sketch of component.py is shown below)
# Required resources are defined in the component.py file
# -s ray selects the ray scheduler; -cfg passes the ray scheduler arguments
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    component.py:trainer

# Use built-in dist.ddp component
# -j 4x1 requests nnodes=4 and nproc_per_node=1; compute_world_size.py is a distributed script
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    dist.ddp \
    -j 4x1 \
    --script ./compute_world_size.py

A detailed description of the command is here.
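
For reference, component.py:trainer is roughly shaped like the sketch below; the image name, script, and resource numbers are illustrative rather than my exact file:

# component.py -- rough sketch of the custom component (illustrative values)
import torchx.specs as specs

def trainer(
    image: str = "my-training-image:latest",  # illustrative image name
    nnodes: int = 4,
    nproc_per_node: int = 1,
) -> specs.AppDef:
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="worker",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    "--nnodes", str(nnodes),
                    "--nproc_per_node", str(nproc_per_node),
                    "compute_world_size.py",
                ],
                num_replicas=nnodes,
                # each replica asks for 2 CPUs, cf. the {'CPU': 2.0}
                # placement group bundles in the error above
                resource=specs.Resource(cpu=2, gpu=0, memMB=4096),
            )
        ],
    )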

The provisioned ray cluster:

"headCPU": "4",
"headGPU": "0",
"headMemory": "12Gi",
"headMaxMemory": "24Gi", 
"workerMinCount": 1, 
"workerMaxCount": 4,
"workerCPU": "4",
"workerGPU": "0",
"workerMemory": "12Gi",
"workerMaxMemory": "24Gi"

I performed the following experiments:

  • (Autoscaling) To test whether torchx triggers the ray autoscaler to provision more nodes than the minimum worker count, I launched a job that requires 4 nodes.
    The results are listed below:

    • Custom component:

      • Ray job status:

        Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
        Status message: Job is currently running.
      • Ray job logs:

        Waiting for placement group to start.
        (scheduler +1s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
        (scheduler +1s) Adding 3 nodes of type worker_node.
        (scheduler +21s) Resized to 20 CPUs, 4 GPUs.
        (CommandActor pid=223, ip=10.130.6.73) initializing `gloo` process group
        (CommandActor pid=223, ip=10.130.6.73) successfully initialized process group
        (CommandActor pid=223, ip=10.130.6.73) rank: 3, actual world_size: 4, computed world_size: 4
        (CommandActor pid=221, ip=10.131.6.32) initializing `gloo` process group
        (CommandActor pid=221, ip=10.131.6.32) successfully initialized process group
        (CommandActor pid=221, ip=10.131.6.32) rank: 1, actual world_size: 4, computed world_size: 4
        (CommandActor pid=222, ip=10.130.6.74) initializing `gloo` process group
        (CommandActor pid=222, ip=10.130.6.74) successfully initialized process group
        (CommandActor pid=222, ip=10.130.6.74) rank: 0, actual world_size: 4, computed world_size: 4
        (CommandActor pid=225, ip=10.131.6.30) initializing `gloo` process group
        (CommandActor pid=225, ip=10.131.6.30) successfully initialized process group
        (CommandActor pid=225, ip=10.131.6.30) rank: 2, actual world_size: 4, computed world_size: 4
      • Comment:
        The job executes correctly and the expected output appears in the log; however, the job status is stuck at RUNNING even after the job has finished, and the computing resources were NOT released either. I had to restart the cluster to submit new jobs.

    • Built-in dist.ddp:

      • Ray job status:

        ------------------------------------------
        Job 'raysubmit_EtGmNBAYVKrATdUj' succeeded
        ------------------------------------------
      • Ray job logs:

        Waiting for placement group to start.
      • Comment:
        Ray job status shows the job has succeeded, but the log is stuck at waiting for the placement group to start. Checking ray.nodes() (see the sketch below) shows the autoscaler did not scale up; there is still only 1 active worker node.
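
For completeness, this is roughly how I check the job status and count the alive worker nodes (a sketch using the Ray job submission and runtime APIs; the addresses and job id are illustrative):

# Sketch of how I inspect cluster and job state (illustrative address / job id)
import ray
from ray.job_submission import JobSubmissionClient

# Count alive nodes to see whether the autoscaler actually added workers
ray.init(address="ray://addr-of-cluster:10001")
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"alive nodes: {len(alive)}")
print(ray.available_resources())

# Poll the job status reported by the Ray dashboard
client = JobSubmissionClient("http://addr-of-cluster:8265")
print(client.get_job_status("raysubmit_kqtEAYVSmx4c1XgD"))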

Expected behavior

torchx supports elastic training on ray clusters with the following features:

  1. elastic scaling (the job runs with the currently available nodes, and new nodes can join)
  2. autoscaling via the ray autoscaler
  3. fault tolerance

Environment

  • torchx version (e.g. 0.1.0rc1): 0.1.2
  • Python version: 3.9
  • OS (e.g., Linux): Linux
  • How you installed torchx (conda, pip, source, docker): pip
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

Labels

question (User questions), ray (Related to the ray scheduler)
