
[torchx/ray] Is elastic training on ray clusters supported? #520

Open
@ntlm1686

Description

🐛 Bug

Hi, I would like to know the current state of running elastic training on ray clusters.

I tried to repeat some of the experiments (notebook) from this blog on my ray cluster, but I got unexpected behavior.

  • I EXPECT that when I use the custom component and the cluster has fewer available nodes than the job requested, the submitted job continues running with the current nodes, and when new nodes become available, they can join the training process (see the elastic-launch sketch after this list). What I OBSERVED is that the job failed with the error below:
    TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
    
  • When I use the built-in dist.ddp component, even when there are enough compute resources, the ray job status always shows succeeded, but the expected output never appears in the ray job logs; the only information in the log is
    Waiting for placement group to start.
    
  • When I use the custom component and the cluster has the required resources, the submitted job produces the expected output in the log file, but the job never stops. When I check the ray job status, it always shows
    Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
    Status message: Job is currently running.
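
To be concrete about what I mean by "elastic": it is the behavior torch.distributed.elastic provides when a min/max node range is given, as in the rough sketch below (the rendezvous endpoint, restart count, and entrypoint are illustrative placeholders, not my exact setup):

# Rough sketch of the elasticity I expect: training starts once min_nodes
# are up, and new nodes can join up to max_nodes while the job keeps running.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    ...  # placeholder for the actual training entrypoint

config = LaunchConfig(
    min_nodes=1,            # start with as few as 1 node
    max_nodes=4,            # allow up to 4 nodes to join mid-run
    nproc_per_node=1,
    rdzv_backend="c10d",
    rdzv_endpoint="addr-of-rendezvous:29400",  # illustrative endpoint
    max_restarts=3,         # fault tolerance: restart failed workers
)
elastic_launch(config, train)()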
    

Question

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

I tried two ways to launch a TorchX job on ray:

# Use custom component (a sketch of component.py is shown below)
# Required resources are defined in the component.py file
# -s ray selects the ray scheduler; -cfg passes the ray scheduler arguments
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    component.py:trainer

# Use built-in dist.ddp component
# -j 4x1 requests nnodes=4 and nproc_per_node=1; compute_world_size.py is a distributed script
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    dist.ddp \
    -j 4x1 \
    --script ./compute_world_size.py

A detailed description of the command is here.
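
For reference, component.py:trainer is roughly shaped like the sketch below; the image name, script, and resource numbers are illustrative rather than my exact file:

# component.py -- rough sketch of the custom component (illustrative values)
import torchx.specs as specs

def trainer(
    image: str = "my-training-image:latest",  # illustrative image name
    nnodes: int = 4,
    nproc_per_node: int = 1,
) -> specs.AppDef:
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="worker",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    "--nnodes", str(nnodes),
                    "--nproc_per_node", str(nproc_per_node),
                    "compute_world_size.py",
                ],
                num_replicas=nnodes,
                # each replica asks for 2 CPUs, cf. the {'CPU': 2.0}
                # placement group bundles in the error above
                resource=specs.Resource(cpu=2, gpu=0, memMB=4096),
            )
        ],
    )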

The provisioned ray cluster:

"headCPU": "4",
"headGPU": "0",
"headMemory": "12Gi",
"headMaxMemory": "24Gi", 
"workerMinCount": 1, 
"workerMaxCount": 4,
"workerCPU": "4",
"workerGPU": "0",
"workerMemory": "12Gi",
"workerMaxMemory": "24Gi"

I performed the following experiments:

  • (Autoscaling) To test whether torchx triggers the ray autoscaler to provision more nodes than the minimum worker count, I launched a job that requires 4 nodes.
    The results are listed below:

    • Custom component:

      • Ray job status:

        Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
        Status message: Job is currently running.
      • Ray job logs:

        Waiting for placement group to start.
        (scheduler +1s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
        (scheduler +1s) Adding 3 nodes of type worker_node.
        (scheduler +21s) Resized to 20 CPUs, 4 GPUs.
        (CommandActor pid=223, ip=10.130.6.73) initializing `gloo` process group
        (CommandActor pid=223, ip=10.130.6.73) successfully initialized process group
        (CommandActor pid=223, ip=10.130.6.73) rank: 3, actual world_size: 4, computed world_size: 4
        (CommandActor pid=221, ip=10.131.6.32) initializing `gloo` process group
        (CommandActor pid=221, ip=10.131.6.32) successfully initialized process group
        (CommandActor pid=221, ip=10.131.6.32) rank: 1, actual world_size: 4, computed world_size: 4
        (CommandActor pid=222, ip=10.130.6.74) initializing `gloo` process group
        (CommandActor pid=222, ip=10.130.6.74) successfully initialized process group
        (CommandActor pid=222, ip=10.130.6.74) rank: 0, actual world_size: 4, computed world_size: 4
        (CommandActor pid=225, ip=10.131.6.30) initializing `gloo` process group
        (CommandActor pid=225, ip=10.131.6.30) successfully initialized process group
        (CommandActor pid=225, ip=10.131.6.30) rank: 2, actual world_size: 4, computed world_size: 4
      • Comment:
        The job executes correctly and the expected output appears in the log; however, the job status is stuck at RUNNING even after the job has finished, and the computing resources were NOT released either. I had to restart the cluster to submit new jobs.

    • Built-in dist.ddp:

      • Ray job status:

        ------------------------------------------
        Job 'raysubmit_EtGmNBAYVKrATdUj' succeeded
        ------------------------------------------
      • Ray job logs:

        Waiting for placement group to start.
      • Comment:
        Ray job status shows the job has succeeded, but the log is stuck at waiting for the placement group to start. Checking ray.nodes() (see the sketch below) shows the autoscaler did not scale up; there is still only 1 active worker node.
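
For completeness, this is roughly how I check the job status and count the alive worker nodes (a sketch using the Ray job submission and runtime APIs; the addresses and job id are illustrative):

# Sketch of how I inspect cluster and job state (illustrative address / job id)
import ray
from ray.job_submission import JobSubmissionClient

# Count alive nodes to see whether the autoscaler actually added workers
ray.init(address="ray://addr-of-cluster:10001")
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"alive nodes: {len(alive)}")
print(ray.available_resources())

# Poll the job status reported by the Ray dashboard
client = JobSubmissionClient("http://addr-of-cluster:8265")
print(client.get_job_status("raysubmit_kqtEAYVSmx4c1XgD"))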

Expected behavior

torchx supports elastic training on ray clusters with the following features:

  1. elastic scaling (the job runs with the currently available nodes, and new nodes can join)
  2. autoscaling via the ray autoscaler
  3. fault tolerance

Environment

  • torchx version (e.g. 0.1.0rc1): 0.1.2
  • Python version: 3.9
  • OS (e.g., Linux): Linux
  • How you installed torchx (conda, pip, source, docker): pip
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

Labels

question (User questions), ray (Related to the ray scheduler)
