HPA support for PyTorch Elastic (kubeflow/trainer#1701)

johnugeorge · szaher · commit 883f6c7bb1f4 · 2025-06-04T23:48:34.000+01:00
* Support for k8s v1.25 in CI

* Change k8s api to v1.25

* Upgrade golangci-lint version

* Fixes for HPA in pytorchjob

* Common changes
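
Note on the `metrics` field touched below: its description summarizes how the autoscaler derives a replica count. As a rough illustration, the Kubernetes HPA documentation gives the rule as `ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with the result kept between the minimum and maximum replica counts. The sketch below is not code from this PR; the function name and arguments are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are not modeled here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=None):
    """Hypothetical helper: approximate the HPA replica calculation.

    Scales the current replica count by current_value / target_value,
    rounds up, then clamps to [min_replicas, max_replicas].
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Core rule from the Kubernetes HPA documentation.
    desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the bounds, analogous to min_replicas / max_replicas
    # in the elastic policy.
    desired = max(desired, min_replicas)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return desired


# 4 workers at 90% utilization against a 60% target.
print(desired_replicas(4, 90, 60))
# Same load, but capped by max_replicas.
print(desired_replicas(4, 90, 60, max_replicas=5))
```

With several metrics, the autoscaler evaluates each one this way and uses the largest result, as the field description says.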
diff --git a/python/docs/KubeflowOrgV1ElasticPolicy.md b/python/docs/KubeflowOrgV1ElasticPolicy.md
@@ -5,7 +5,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **max_replicas** | **int** | upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas, defaults to null. | [optional] 
 **max_restarts** | **int** |  | [optional] 
-**metrics** | [**list[K8sIoApiAutoscalingV2beta2MetricSpec]**](K8sIoApiAutoscalingV2beta2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional] 
+**metrics** | [**list[K8sIoApiAutoscalingV2MetricSpec]**](K8sIoApiAutoscalingV2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional] 
 **min_replicas** | **int** | minReplicas is the lower limit for the number of replicas to which the training job can scale down.  It defaults to null. | [optional] 
 **n_proc_per_node** | **int** | Number of workers per node; supported values: [auto, cpu, gpu, int]. | [optional] 
 **rdzv_backend** | **str** |  | [optional] 
diff --git a/python/docs/KubeflowOrgV1PaddleElasticPolicy.md b/python/docs/KubeflowOrgV1PaddleElasticPolicy.md
@@ -5,7 +5,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **max_replicas** | **int** | upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas, defaults to null. | [optional] 
 **max_restarts** | **int** | MaxRestarts is the limit for restart times of pods in elastic mode. | [optional] 
-**metrics** | [**list[K8sIoApiAutoscalingV2beta2MetricSpec]**](K8sIoApiAutoscalingV2beta2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional] 
+**metrics** | [**list[K8sIoApiAutoscalingV2MetricSpec]**](K8sIoApiAutoscalingV2MetricSpec.md) | Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created. | [optional] 
 **min_replicas** | **int** | minReplicas is the lower limit for the number of replicas to which the training job can scale down.  It defaults to null. | [optional] 
 
 [[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
diff --git a/python/docs/V1ReplicaStatus.md b/python/docs/V1ReplicaStatus.md
@@ -6,7 +6,7 @@ Name | Type | Description | Notes
 ------------ | ------------- | ------------- | -------------
 **active** | **int** | The number of actively running pods. | [optional] 
 **failed** | **int** | The number of pods which reached phase Failed. | [optional] 
-**label_selector** | [**V1LabelSelector**](V1LabelSelector.md) |  | [optional] 
+**label_selector** | **str** | A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects. | [optional] 
 **succeeded** | **int** | The number of pods which reached phase Succeeded. | [optional] 
 
 [[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
diff --git a/python/kubeflow/training/models/kubeflow_org_v1_elastic_policy.py b/python/kubeflow/training/models/kubeflow_org_v1_elastic_policy.py
@@ -35,7 +35,7 @@ class KubeflowOrgV1ElasticPolicy(object):
     openapi_types = {
         'max_replicas': 'int',
         'max_restarts': 'int',
-        'metrics': 'list[K8sIoApiAutoscalingV2beta2MetricSpec]',
+        'metrics': 'list[K8sIoApiAutoscalingV2MetricSpec]',
         'min_replicas': 'int',
         'n_proc_per_node': 'int',
         'rdzv_backend': 'str',
@@ -153,7 +153,7 @@ def metrics(self):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created.  # noqa: E501
 
         :return: The metrics of this KubeflowOrgV1ElasticPolicy.  # noqa: E501
-        :rtype: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :rtype: list[K8sIoApiAutoscalingV2MetricSpec]
         """
         return self._metrics
 
@@ -164,7 +164,7 @@ def metrics(self, metrics):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created.  # noqa: E501
 
         :param metrics: The metrics of this KubeflowOrgV1ElasticPolicy.  # noqa: E501
-        :type: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :type: list[K8sIoApiAutoscalingV2MetricSpec]
         """
 
         self._metrics = metrics
diff --git a/python/kubeflow/training/models/kubeflow_org_v1_paddle_elastic_policy.py b/python/kubeflow/training/models/kubeflow_org_v1_paddle_elastic_policy.py
@@ -35,7 +35,7 @@ class KubeflowOrgV1PaddleElasticPolicy(object):
     openapi_types = {
         'max_replicas': 'int',
         'max_restarts': 'int',
-        'metrics': 'list[K8sIoApiAutoscalingV2beta2MetricSpec]',
+        'metrics': 'list[K8sIoApiAutoscalingV2MetricSpec]',
         'min_replicas': 'int'
     }
 
@@ -120,7 +120,7 @@ def metrics(self):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created.  # noqa: E501
 
         :return: The metrics of this KubeflowOrgV1PaddleElasticPolicy.  # noqa: E501
-        :rtype: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :rtype: list[K8sIoApiAutoscalingV2MetricSpec]
         """
         return self._metrics
 
@@ -131,7 +131,7 @@ def metrics(self, metrics):
         Metrics contains the specifications which are used to calculate the desired replica count (the maximum replica count across all metrics will be used).  The desired replica count is calculated with multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice-versa.  See the individual metric source types for more information about how each type of metric must respond. If not set, the HPA will not be created.  # noqa: E501
 
         :param metrics: The metrics of this KubeflowOrgV1PaddleElasticPolicy.  # noqa: E501
-        :type: list[K8sIoApiAutoscalingV2beta2MetricSpec]
+        :type: list[K8sIoApiAutoscalingV2MetricSpec]
         """
 
         self._metrics = metrics
diff --git a/python/kubeflow/training/models/v1_replica_status.py b/python/kubeflow/training/models/v1_replica_status.py
@@ -35,7 +35,7 @@ class V1ReplicaStatus(object):
     openapi_types = {
         'active': 'int',
         'failed': 'int',
-        'label_selector': 'V1LabelSelector',
+        'label_selector': 'str',
         'succeeded': 'int'
     }
 
@@ -117,19 +117,21 @@ def failed(self, failed):
     def label_selector(self):
         """Gets the label_selector of this V1ReplicaStatus.  # noqa: E501
 
+        A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects.  # noqa: E501
 
         :return: The label_selector of this V1ReplicaStatus.  # noqa: E501
-        :rtype: V1LabelSelector
+        :rtype: str
         """
         return self._label_selector
 
     @label_selector.setter
     def label_selector(self, label_selector):
         """Sets the label_selector of this V1ReplicaStatus.
 
+        A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects.  # noqa: E501
 
         :param label_selector: The label_selector of this V1ReplicaStatus.  # noqa: E501
-        :type: V1LabelSelector
+        :type: str
         """
 
         self._label_selector = label_selector
diff --git a/python/test/test_kubeflow_org_v1_mpi_job.py b/python/test/test_kubeflow_org_v1_mpi_job.py
@@ -77,7 +77,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_mpi_job_list.py b/python/test/test_kubeflow_org_v1_mpi_job_list.py
@@ -80,7 +80,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -133,7 +133,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_kubeflow_org_v1_mx_job.py b/python/test/test_kubeflow_org_v1_mx_job.py
@@ -75,7 +75,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_mx_job_list.py b/python/test/test_kubeflow_org_v1_mx_job_list.py
@@ -78,7 +78,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -129,7 +129,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_kubeflow_org_v1_paddle_job.py b/python/test/test_kubeflow_org_v1_paddle_job.py
@@ -81,7 +81,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_paddle_job_list.py b/python/test/test_kubeflow_org_v1_paddle_job_list.py
@@ -84,7 +84,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -141,7 +141,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_kubeflow_org_v1_py_torch_job.py b/python/test/test_kubeflow_org_v1_py_torch_job.py
@@ -92,7 +92,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_py_torch_job_list.py b/python/test/test_kubeflow_org_v1_py_torch_job_list.py
@@ -95,7 +95,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -163,7 +163,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_kubeflow_org_v1_tf_job.py b/python/test/test_kubeflow_org_v1_tf_job.py
@@ -76,7 +76,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_tf_job_list.py b/python/test/test_kubeflow_org_v1_tf_job_list.py
@@ -79,7 +79,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -131,7 +131,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_kubeflow_org_v1_xg_boost_job.py b/python/test/test_kubeflow_org_v1_xg_boost_job.py
@@ -74,7 +74,7 @@ def make_instance(self, include_optional):
                         'key' : V1ReplicaStatus(
                             active = 56, 
                             failed = 56, 
-                            label_selector = None, 
+                            label_selector = '0', 
                             succeeded = 56, )
                         }, 
                     start_time = None, )
diff --git a/python/test/test_kubeflow_org_v1_xg_boost_job_list.py b/python/test/test_kubeflow_org_v1_xg_boost_job_list.py
@@ -77,7 +77,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
@@ -127,7 +127,7 @@ def make_instance(self, include_optional):
                                 'key' : V1ReplicaStatus(
                                     active = 56, 
                                     failed = 56, 
-                                    label_selector = None, 
+                                    label_selector = '0', 
                                     succeeded = 56, )
                                 }, 
                             start_time = None, ), )
diff --git a/python/test/test_v1_job_status.py b/python/test/test_v1_job_status.py
@@ -51,7 +51,7 @@ def make_instance(self, include_optional):
                     'key' : V1ReplicaStatus(
                         active = 56, 
                         failed = 56, 
-                        label_selector = None, 
+                        label_selector = '0', 
                         succeeded = 56, )
                     }, 
                 start_time = None
@@ -71,7 +71,7 @@ def make_instance(self, include_optional):
                     'key' : V1ReplicaStatus(
                         active = 56, 
                         failed = 56, 
-                        label_selector = None, 
+                        label_selector = '0', 
                         succeeded = 56, )
                     },
         )
diff --git a/python/test/test_v1_replica_status.py b/python/test/test_v1_replica_status.py
@@ -38,7 +38,7 @@ def make_instance(self, include_optional):
             return V1ReplicaStatus(
                 active = 56, 
                 failed = 56, 
-                label_selector = None, 
+                label_selector = '0', 
                 succeeded = 56
             )
         else :
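
Note on the `label_selector` change above: `V1ReplicaStatus.label_selector` is now a plain `str` instead of a `V1LabelSelector` object. A minimal sketch of what such a string might look like, assuming it follows the standard Kubernetes equality-based selector syntax (`key=value` pairs joined by commas). The helper name and the label values are hypothetical and not part of the SDK.

```python
from typing import Mapping


def format_label_selector(match_labels: Mapping[str, str]) -> str:
    """Hypothetical helper: serialize match labels to a selector string.

    Keys are sorted so the output is deterministic.
    """
    return ",".join(
        f"{key}={value}" for key, value in sorted(match_labels.items())
    )


# Illustrative labels only; the real values are set by the controller.
selector = format_label_selector({
    "training.kubeflow.org/job-name": "elastic-example",
    "training.kubeflow.org/replica-type": "worker",
})
print(selector)
```

The generated tests in this diff pass the placeholder `'0'`, which only exercises the type, not a realistic selector.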
