Description
🐛 Bug
Hi, I would like to know the current state of running elastic training on Ray clusters.
I tried to repeat some experiments (notebook) from this blog on my Ray cluster, but I got unexpected behavior.
- I EXPECT that when I use a custom component and the cluster has fewer available nodes than the job requested, the submitted job keeps running on the current nodes, and when new nodes become available they can join the training process. What I OBSERVED is that the job failed with the error below:
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
- When I use the built-in `dist.ddp` component, even when there are enough compute resources, the Ray job status always shows succeeded, but the expected output never appears in the Ray job logs; the only information in the log is `Waiting for placement group to start.`
- When I use a custom component and the cluster has the required resources, the submitted job writes the expected output to the log file, but the job never stops. When I check the Ray job status, it always shows:
  Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING Status message: Job is currently running.
Question
Module (check all that apply):
- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
To Reproduce
I tried two ways to launch a TorchX job on Ray:
# Use custom component
# Required resources are defined in the component.py file
#   -s ray : use the Ray scheduler
#   -cfg   : Ray scheduler arguments
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    component.py:trainer
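For context, component.py defines the trainer roughly along these lines. This is a simplified sketch; the actual entrypoint, image, and resource values in my file may differ:

```python
# component.py -- simplified sketch of the custom component
# (entrypoint, image, and resource values here are illustrative)
import torchx.specs as specs


def trainer(image: str = "my-train-image", num_replicas: int = 4) -> specs.AppDef:
    """A TorchX component: returns an AppDef describing the training job."""
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="worker",
                image=image,
                entrypoint="python",
                args=["-m", "compute_world_size"],
                num_replicas=num_replicas,
                # Each replica asks Ray for a bundle of this size.
                resource=specs.Resource(cpu=2, gpu=0, memMB=4096),
            )
        ],
    )
```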
# Use the built-in dist.ddp component
#   -s ray   : use the Ray scheduler
#   -cfg     : Ray scheduler arguments
#   -j 4x1   : nnodes x nproc_per_node
#   --script : the distributed script to run
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    dist.ddp \
    -j 4x1 \
    --script ./compute_world_size.py
A detailed description of the command is here.
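For context, compute_world_size.py is a small distributed smoke test. A sketch of the pattern it follows (the actual example script may differ in details):

```python
# compute_world_size.py -- sketch of the distributed test script: it joins a
# gloo process group and all-reduces a tensor of ones so each rank can
# recompute the world size and compare it with the launcher-provided one.
import os

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    print("initializing `gloo` process group")
    dist.init_process_group(backend="gloo")  # MASTER_ADDR/PORT set by the launcher
    print("successfully initialized process group")

    # Each rank contributes 1; the all-reduced sum equals the world size.
    t = torch.ones(1)
    dist.all_reduce(t)
    print(f"rank: {rank}, actual world_size: {world_size}, "
          f"computed world_size: {int(t.item())}")


if __name__ == "__main__":
    main()
```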
The provisioned Ray cluster:
"headCPU": "4",
"headGPU": "0",
"headMemory": "12Gi",
"headMaxMemory": "24Gi",
"workerMinCount": 1,
"workerMaxCount": 4,
"workerCPU": "4",
"workerGPU": "0",
"workerMemory": "12Gi",
"workerMaxMemory": "24Gi"
I performed the following experiments:

- (Autoscaling, custom component) To test whether TorchX triggers the Ray autoscaler to provision more nodes than the configured minimum, I launched a job that requires 4 nodes. The results are listed below:
  - Ray job status:
    Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING Status message: Job is currently running.
  - Ray job logs:
    Waiting for placement group to start.
    (scheduler +1s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
    (scheduler +1s) Adding 3 nodes of type worker_node.
    (scheduler +21s) Resized to 20 CPUs, 4 GPUs.
    (CommandActor pid=223, ip=10.130.6.73) initializing `gloo` process group
    (CommandActor pid=223, ip=10.130.6.73) successfully initialized process group
    (CommandActor pid=223, ip=10.130.6.73) rank: 3, actual world_size: 4, computed world_size: 4
    (CommandActor pid=221, ip=10.131.6.32) initializing `gloo` process group
    (CommandActor pid=221, ip=10.131.6.32) successfully initialized process group
    (CommandActor pid=221, ip=10.131.6.32) rank: 1, actual world_size: 4, computed world_size: 4
    (CommandActor pid=222, ip=10.130.6.74) initializing `gloo` process group
    (CommandActor pid=222, ip=10.130.6.74) successfully initialized process group
    (CommandActor pid=222, ip=10.130.6.74) rank: 0, actual world_size: 4, computed world_size: 4
    (CommandActor pid=225, ip=10.131.6.30) initializing `gloo` process group
    (CommandActor pid=225, ip=10.131.6.30) successfully initialized process group
    (CommandActor pid=225, ip=10.131.6.30) rank: 2, actual world_size: 4, computed world_size: 4
  - Comment:
    The job executed correctly and the expected output is shown in the log; however, the job status is stuck at RUNNING even after the job has finished, and the compute resources are NOT released either. I have to restart the cluster to submit new jobs.
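To confirm that only the status reporting is stuck while training itself has finished, I poll the job through the Ray job submission SDK. A rough sketch of that check, assuming the standard JobSubmissionClient API (the dashboard address is a placeholder):

```python
# Sketch: inspect the submitted job through the Ray Jobs SDK (assumed API);
# the dashboard address below is a placeholder.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://addr-of-cluster:8265")
submission_id = "raysubmit_kqtEAYVSmx4c1XgD"

print(client.get_job_status(submission_id))  # remains RUNNING after training finishes
print(client.get_job_logs(submission_id))    # the full expected output is already here

# stop_job() can be used to terminate the submission by hand.
client.stop_job(submission_id)
```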
- Built-in `dist.ddp`:
  - Ray job status:
    ------------------------------------------ Job 'raysubmit_EtGmNBAYVKrATdUj' succeeded ------------------------------------------
  - Ray job logs:
    Waiting for placement group to start.
  - Comment:
    The Ray job status shows the job succeeded, but the log is stuck at "Waiting for placement group to start." Checking `ray.nodes()` shows the autoscaler did not kick in; there is still only 1 active worker node.
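    The node check is roughly the following (sketch; the head-node address is a placeholder):

```python
# Sketch: count alive nodes from a driver connected to the cluster.
# The head-node address is a placeholder.
import ray

ray.init(address="ray://addr-of-cluster:10001")

alive = [n for n in ray.nodes() if n["Alive"]]
print(f"alive nodes (head + workers): {len(alive)}")
print("total cluster resources:", ray.cluster_resources())
```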
Expected behavior
TorchX supports elastic training on Ray clusters with the following features:
- elasticity
- autoscaling via the Ray autoscaler
- fault tolerance
Environment
- torchx version (e.g. 0.1.0rc1): 0.1.2
- Python version: 3.9
- OS (e.g., Linux): Linux
- How you installed torchx (conda, pip, source, docker): pip
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information: