schedulers/kubernetes_scheduler: add support for resource instance-type node selectors #433

Closed
d4l3k wants to merge 1 commit into main from k8sinstancetype

d4l3k
Member

@d4l3k d4l3k commented Mar 25, 2022

This allows specifying instance types when scheduling Kubernetes jobs. It sets a node selector on the node.kubernetes.io/instance-type node label to limit pods to nodes of the requested instance type.
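
For illustration only, a minimal sketch of how such a node selector could be built. The helper name node_selector_for and the constant name are hypothetical and are not taken from this PR; only the label key and the p3.2xlarge value appear in this description.

```python
from typing import Dict, Optional

# Well-known Kubernetes node label that carries the cloud instance type.
LABEL_INSTANCE_TYPE = "node.kubernetes.io/instance-type"


def node_selector_for(instance_type: Optional[str]) -> Dict[str, str]:
    """Build a pod-spec nodeSelector for one instance type.

    Hypothetical helper: returns an empty selector when no instance type
    is given, so the pod can be scheduled on any node.
    """
    if not instance_type:
        return {}
    return {LABEL_INSTANCE_TYPE: instance_type}


# Example: restrict a pod to p3.2xlarge nodes.
selector = node_selector_for("p3.2xlarge")
print(selector)
```

The resulting mapping corresponds to the nodeSelector field of a pod spec.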

Requests sized to an instance type's full CPU and memory would not fit on the node, because the node reserves some CPU and memory for itself. To avoid this, a small amount of CPU and memory is subtracted from the requested resources. Limits remain the same.
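
A rough sketch of that request adjustment, not the PR's actual code. It assumes a reservation of 100 millicores and 1024 MB, values inferred from the pod spec in the test plan, where limits of 2 CPU / 8192M pair with requests of 1900m / 7168M. The constant and function names are hypothetical.

```python
from typing import Dict, Tuple

# Assumed reservation sizes (inferred from the test-plan pod spec, not confirmed).
RESERVED_MILLICPU = 100
RESERVED_MEM_MB = 1024


def resource_requirements(
    cpu: float, mem_mb: int, gpu: int = 0
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (limits, requests) as Kubernetes quantity strings.

    Limits keep the full resource size. Requests are reduced by the assumed
    node reservation and clamped at zero so they never go negative.
    """
    limits: Dict[str, str] = {}
    requests: Dict[str, str] = {}
    if cpu > 0:
        mcpu = int(cpu * 1000)
        # Kubernetes may display "2000m" in its canonical form "2".
        limits["cpu"] = f"{mcpu}m"
        requests["cpu"] = f"{max(mcpu - RESERVED_MILLICPU, 0)}m"
    if mem_mb > 0:
        limits["memory"] = f"{mem_mb}M"
        requests["memory"] = f"{max(mem_mb - RESERVED_MEM_MB, 0)}M"
    if gpu > 0:
        # GPUs are not reduced: request and limit stay equal.
        limits["nvidia.com/gpu"] = str(gpu)
        requests["nvidia.com/gpu"] = str(gpu)
    return limits, requests


limits, requests = resource_requirements(cpu=2, mem_mb=8192, gpu=1)
print(limits)
print(requests)
```

Clamping at zero is a design choice in this sketch so that very small resources never produce negative requests.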

Also adds the g4dn.xlarge resource type.

Test plan:

Unit tests; updated the kube dist integration test to specify an instance type. Resulting pod spec:

spec:
  containers:
  - command:
    - python
    - -m
    - torch.distributed.run
    - --rdzv_backend
    - etcd
    - --rdzv_endpoint
    - etcd-server:2379
    - --rdzv_id
    - cv-trainer-smvh1095z11h5
    - --nnodes
    - "2"
    - --nproc_per_node
    - "1"
    - -m
    - torchx.examples.apps.lightning_classy_vision.train
    - --load_path
    - ""
    - --log_path
    - /tmp/logs
    - --epochs
    - "1"
    - --output_path
    - s3://torchx-test/integration-tests/runner_wh5wn4cz4b7fcd/output
    env:
    - name: TORCHX_RANK0_HOST
      value: localhost
    - name: VC_WORKER_0_HOSTS
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_0_HOSTS
          name: cv-trainer-smvh1095z11h5-svc
    - name: VC_WORKER_0_NUM
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_0_NUM
          name: cv-trainer-smvh1095z11h5-svc
    - name: VC_WORKER_1_HOSTS
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_1_HOSTS
          name: cv-trainer-smvh1095z11h5-svc
    - name: VC_WORKER_1_NUM
      valueFrom:
        configMapKeyRef:
          key: VC_WORKER_1_NUM
          name: cv-trainer-smvh1095z11h5-svc
    - name: VK_TASK_INDEX
      value: "0"
    - name: VC_TASK_INDEX
      value: "0"
    image: 495572122715.dkr.ecr.us-west-2.amazonaws.com/torchx/integration-tests:canary_runner_wh5wn4cz4b7fcd_torchx
    imagePullPolicy: IfNotPresent
    name: worker-0
    resources:
      limits:
        cpu: "2"
        memory: 8192M
        nvidia.com/gpu: "1"
      requests:
        cpu: 1900m
        memory: 7168M
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /etc/volcano
      name: cv-trainer-smvh1095z11h5-svc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-z8vb9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: cv-trainer-smvh1095z11h5-worker-0-0
  nodeName: ip-192-168-16-165.us-west-2.compute.internal
  nodeSelector:
    node.kubernetes.io/instance-type: p3.2xlarge

https://github.com/pytorch/torchx/runs/5698855673?check_suite_focus=true

@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 25, 2022
@d4l3k d4l3k force-pushed the k8sinstancetype branch from 4e01bdf to 97c6f5c Compare March 25, 2022 22:03
@codecov

codecov bot commented Mar 25, 2022

Codecov Report

Merging #433 (984a8c0) into main (434c013) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #433      +/-   ##
==========================================
+ Coverage   94.52%   94.54%   +0.02%     
==========================================
  Files          64       64              
  Lines        3782     3798      +16     
==========================================
+ Hits         3575     3591      +16     
  Misses        207      207              
Impacted Files                               Coverage Δ
torchx/schedulers/kubernetes_scheduler.py    93.08% <100.00%> (+0.40%) ⬆️
torchx/specs/__init__.py                     96.15% <100.00%> (ø)
torchx/specs/named_resources_aws.py          100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 434c013...984a8c0.

@d4l3k d4l3k force-pushed the k8sinstancetype branch from 97c6f5c to 984a8c0 Compare March 25, 2022 22:15
@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k d4l3k deleted the k8sinstancetype branch April 13, 2022 22:28