
Commit 883f6c7

johnugeorge, szaher
authored and committed
HPA support for PyTorch Elastic (kubeflow/trainer#1701)
* Support for k8s v1.25 in CI
* Support for k8s v1.25 in CI
* Change k8s api to v1.25
* Upgrade golangci-lint version
* Fixes for HPA in pytorchjob
* Common changes
1 parent 7d8ff45 commit 883f6c7

20 files changed: +35 -33 lines changed

python/docs/KubeflowOrgV1ElasticPolicy.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **max_replicas** | **int** | upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas, defaults to null. | [optional]
 **max_restarts** | **int** | | [optional]
-**metrics** | [**list[K8sIoApiAutoscalingV2beta2MetricSpec]**](K8sIoApiAutoscalingV2beta2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional]
+**metrics** | [**list[K8sIoApiAutoscalingV2MetricSpec]**](K8sIoApiAutoscalingV2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional]
 **min_replicas** | **int** | minReplicas is the lower limit for the number of replicas to which the training job can scale down. It defaults to null. | [optional]
 **n_proc_per_node** | **int** | Number of workers per node; supported values: [auto, cpu, gpu, int]. | [optional]
 **rdzv_backend** | **str** | | [optional]
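
The `metrics` description in this table compresses the autoscaler's arithmetic into a single sentence. The sketch below spells it out, following the formula given in the Kubernetes HPA documentation (current replicas scaled by current value over target value, rounded up), taking the maximum across metrics, and clamping to the `min_replicas`/`max_replicas` bounds. It is only an illustration: `desired_replicas` and its parameters are hypothetical names, not part of this SDK, and tolerance, pod readiness, and stabilization handling are deliberately left out.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(
    current_replicas: int,
    metrics: Iterable[Tuple[float, float]],
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Sketch of the HPA replica calculation (hypothetical helper).

    `metrics` is an iterable of (current_value, target_value) pairs.
    """
    # One proposal per metric: ceil(currentReplicas * current / target).
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    if not proposals:
        # Assumption for this sketch: with no metrics, keep the current count.
        return current_replicas
    # The largest proposal across all metrics wins, then it is clamped.
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # CPU at 90% against a 60% target proposes 6; the second metric proposes 2.
    print(desired_replicas(4, [(90.0, 60.0), (30.0, 60.0)], 1, 10))  # 6
```

This also shows why a metric has to fall as pods are added: if the current value did not drop after scaling up, each evaluation would propose a larger count until `max_replicas` is reached.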

python/docs/KubeflowOrgV1PaddleElasticPolicy.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **max_replicas** | **int** | upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas, defaults to null. | [optional]
 **max_restarts** | **int** | MaxRestarts is the limit for restart times of pods in elastic mode. | [optional]
-**metrics** | [**list[K8sIoApiAutoscalingV2beta2MetricSpec]**](K8sIoApiAutoscalingV2beta2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional]
+**metrics** | [**list[K8sIoApiAutoscalingV2MetricSpec]**](K8sIoApiAutoscalingV2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional]
 **min_replicas** | **int** | minReplicas is the lower limit for the number of replicas to which the training job can scale down. It defaults to null. | [optional]

 [[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
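
The field descriptions in this table imply two rules: `max_replicas` cannot be smaller than `min_replicas`, and no HPA is created when `metrics` is not set. A minimal sketch of those two checks follows. `ElasticPolicySketch`, `validate_bounds`, and `wants_hpa` are hypothetical names used only for illustration; they are not the SDK model or the operator's actual validation code. The metric dictionary follows the `autoscaling/v2` `MetricSpec` shape for a CPU utilization target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElasticPolicySketch:
    """Hypothetical stand-in for the documented fields; not the SDK model."""

    min_replicas: Optional[int] = None
    max_replicas: Optional[int] = None
    metrics: Optional[List[dict]] = None


def validate_bounds(policy: ElasticPolicySketch) -> None:
    """Reject a policy whose upper bound is below its lower bound."""
    if (
        policy.min_replicas is not None
        and policy.max_replicas is not None
        and policy.max_replicas < policy.min_replicas
    ):
        raise ValueError("max_replicas cannot be smaller than min_replicas")


def wants_hpa(policy: ElasticPolicySketch) -> bool:
    """Assumption: an HPA is requested only when metrics are set."""
    return policy.metrics is not None


# A CPU utilization target expressed in the autoscaling/v2 MetricSpec shape.
cpu_metric = {
    "type": "Resource",
    "resource": {
        "name": "cpu",
        "target": {"type": "Utilization", "averageUtilization": 80},
    },
}

policy = ElasticPolicySketch(min_replicas=1, max_replicas=4, metrics=[cpu_metric])
validate_bounds(policy)  # passes: 4 >= 1
```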

python/docs/V1ReplicaStatus.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **active** | **int** | The number of actively running pods. | [optional]
 **failed** | **int** | The number of pods which reached phase Failed. | [optional]
-**label_selector** | [**V1LabelSelector**](V1LabelSelector.md) | | [optional]
+**label_selector** | **str** | A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects. | [optional]
 **succeeded** | **int** | The number of pods which reached phase Succeeded. | [optional]

 [[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
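
With `label_selector` now typed as `str`, the selector is carried in serialized form rather than as a `V1LabelSelector` object. The sketch below shows one way equality-based `matchLabels` could be rendered into the comma-separated string syntax that Kubernetes uses for selectors, where a comma means AND. `format_label_selector` is a hypothetical helper, the label keys are illustrative, and set-based `matchExpressions` are not handled.

```python
from typing import Mapping


def format_label_selector(match_labels: Mapping[str, str]) -> str:
    """Render equality requirements as 'k1=v1,k2=v2' (hypothetical helper).

    Keys are sorted so the output is deterministic.
    """
    return ",".join(
        f"{key}={value}" for key, value in sorted(match_labels.items())
    )


selector = format_label_selector({"job-name": "demo", "replica-type": "worker"})
print(selector)  # job-name=demo,replica-type=worker
```

An empty mapping yields an empty string, which lines up with the description's note that an empty selector matches all objects.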

python/kubeflow/training/models/kubeflow_org_v1_elastic_policy.py

Lines changed: 3 additions & 3 deletions
@@ -35,7 +35,7 @@ class KubeflowOrgV1ElasticPolicy(object):
     openapi_types = {
         'max_replicas': 'int',
         'max_restarts': 'int',
-        'metrics': 'list[K8sIoApiAutoscalingV2beta2MetricSpec]',
+        'metrics': 'list[K8sIoApiAutoscalingV2MetricSpec]',
         'min_replicas': 'int',
         'n_proc_per_node': 'int',
         'rdzv_backend': 'str',
@@ -153,7 +153,7 @@ def metrics(self):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. # noqa: E501

         :return: The metrics of this KubeflowOrgV1ElasticPolicy. # noqa: E501
-        :rtype: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :rtype: list[K8sIoApiAutoscalingV2MetricSpec]
         """
         return self._metrics

@@ -164,7 +164,7 @@ def metrics(self, metrics):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. # noqa: E501

         :param metrics: The metrics of this KubeflowOrgV1ElasticPolicy. # noqa: E501
-        :type: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :type: list[K8sIoApiAutoscalingV2MetricSpec]
         """

         self._metrics = metrics

python/kubeflow/training/models/kubeflow_org_v1_paddle_elastic_policy.py

Lines changed: 3 additions & 3 deletions
@@ -35,7 +35,7 @@ class KubeflowOrgV1PaddleElasticPolicy(object):
     openapi_types = {
         'max_replicas': 'int',
         'max_restarts': 'int',
-        'metrics': 'list[K8sIoApiAutoscalingV2beta2MetricSpec]',
+        'metrics': 'list[K8sIoApiAutoscalingV2MetricSpec]',
         'min_replicas': 'int'
     }

@@ -120,7 +120,7 @@ def metrics(self):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. # noqa: E501

         :return: The metrics of this KubeflowOrgV1PaddleElasticPolicy. # noqa: E501
-        :rtype: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :rtype: list[K8sIoApiAutoscalingV2MetricSpec]
         """
         return self._metrics

@@ -131,7 +131,7 @@ def metrics(self, metrics):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. # noqa: E501

         :param metrics: The metrics of this KubeflowOrgV1PaddleElasticPolicy. # noqa: E501
-        :type: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :type: list[K8sIoApiAutoscalingV2MetricSpec]
         """

         self._metrics = metrics

python/kubeflow/training/models/v1_replica_status.py

Lines changed: 5 additions & 3 deletions
@@ -35,7 +35,7 @@ class V1ReplicaStatus(object):
     openapi_types = {
         'active': 'int',
         'failed': 'int',
-        'label_selector': 'V1LabelSelector',
+        'label_selector': 'str',
         'succeeded': 'int'
     }

@@ -117,19 +117,21 @@ def failed(self, failed):
     def label_selector(self):
         """Gets the label_selector of this V1ReplicaStatus. # noqa: E501

+        A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects. # noqa: E501

         :return: The label_selector of this V1ReplicaStatus. # noqa: E501
-        :rtype: V1LabelSelector
+        :rtype: str
         """
         return self._label_selector

     @label_selector.setter
     def label_selector(self, label_selector):
         """Sets the label_selector of this V1ReplicaStatus.

+        A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects. # noqa: E501

         :param label_selector: The label_selector of this V1ReplicaStatus. # noqa: E501
-        :type: V1LabelSelector
+        :type: str
         """

         self._label_selector = label_selector

python/test/test_kubeflow_org_v1_mpi_job.py

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_mpi_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -80,7 +80,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -133,7 +133,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_kubeflow_org_v1_mx_job.py

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_mx_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -129,7 +129,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_kubeflow_org_v1_paddle_job.py

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_paddle_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -84,7 +84,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -141,7 +141,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_kubeflow_org_v1_py_torch_job.py

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_py_torch_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -95,7 +95,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -163,7 +163,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_kubeflow_org_v1_tf_job.py

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_tf_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -79,7 +79,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -131,7 +131,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_kubeflow_org_v1_xg_boost_job.py

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, )

python/test/test_kubeflow_org_v1_xg_boost_job_list.py

Lines changed: 2 additions & 2 deletions
@@ -77,7 +77,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )
@@ -127,7 +127,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None, ), )

python/test/test_v1_job_status.py

Lines changed: 2 additions & 2 deletions
@@ -51,7 +51,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
     start_time = None
@@ -71,7 +71,7 @@ def make_instance(self, include_optional):
         'key' : V1ReplicaStatus(
             active = 56,
             failed = 56,
-            label_selector = None,
+            label_selector = '0',
             succeeded = 56, )
         },
 )

python/test/test_v1_replica_status.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ def make_instance(self, include_optional):
     return V1ReplicaStatus(
         active = 56,
         failed = 56,
-        label_selector = None,
+        label_selector = '0',
         succeeded = 56
     )
 else :
