wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics. #807


Merged
1 commit merged into kubernetes-sigs:master on Sep 14, 2021

Conversation

yangjunmyfm192085
Contributor

@yangjunmyfm192085 yangjunmyfm192085 commented Aug 5, 2021

Signed-off-by: JunYang [email protected]

What this PR does / why we need it:
There have been many CI failures recently.
Based on previous research, within two cycles of resource reporting, a container of the pod reported duplicate data. We skip the containers for which we can't compute the resource usage instead of skipping the whole PodMetrics.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 5, 2021
@yangjunmyfm192085
Contributor Author

/cc @serathius @dgrisonnet

@yangjunmyfm192085
Contributor Author

/hold
Let me do more tests

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 5, 2021
@yangjunmyfm192085
Contributor Author

yangjunmyfm192085 commented Aug 5, 2021

Through more tests, I found that when we fetch PodMetrics, we sometimes only get information for one of the containers.

For example:
&PodMetrics{ObjectMeta:{sidecarpod-consumer default 0 2021-08-05 17:06:01 +0800 CST <nil> <nil> map[] map[] [] [] []},Timestamp:2021-08-05 17:05:53 +0800 CST,Window:{10.393s},Containers:[]ContainerMetrics{ContainerMetrics{Name:sidecarpod-consumer,Usage:k8s_io_api_core_v1.ResourceList{cpu: {{49530274 -9} {<nil>} 49530274n DecimalSI},memory: {{8704000 0} {<nil>} 8500Ki BinarySI},},},},}

Now I have fixed it; we need to check the length of ms.Containers here: if (err == nil && len(ms.Containers) == 2) || time.Now().After(deadline) {

The correct result, for example:
&PodMetrics{ObjectMeta:{sidecarpod-consumer default 0 2021-08-05 17:09:57 +0800 CST <nil> <nil> map[] map[] [] [] []},Timestamp:2021-08-05 17:09:50 +0800 CST,Window:{25.523s},Containers:[]ContainerMetrics{ContainerMetrics{Name:sidecar-container,Usage:k8s_io_api_core_v1.ResourceList{cpu: {{49919787 -9} {<nil>} 49919787n DecimalSI},memory: {{8429568 0} {<nil>} BinarySI},},},ContainerMetrics{Name:sidecarpod-consumer,Usage:k8s_io_api_core_v1.ResourceList{cpu: {{49619887 -9} {<nil>} 49619887n DecimalSI},memory: {{9089024 0} {<nil>} 8876Ki BinarySI},},},},}

@yangjunmyfm192085
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2021
Member

@dgrisonnet dgrisonnet left a comment


I wonder if we should really check for len(ms.Containers) == 2 in the tests, in my opinion, MS should always return complete PodMetrics containing metrics about all the containers of a pod. The fact that we only get metrics from one container out of the two from the pods sounds like an issue to me.
In your recent changes, you added a check to only return PodMetrics whenever we have metrics for all the containers of the pods: https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/scraper/client/resource/decode.go#L148-L151. So it means that kubelet is only exposing metrics about one container of the Pod. At the very least, I would expect to have metrics about both containers initialized and exposed by kubelet.

test/e2e_test.go Outdated
@@ -561,24 +561,24 @@ func consumeWithSideCarContainer(client clientset.Interface, podName string) err
{
Name: podName,
Command: []string{"./consume-cpu/consume-cpu"},
Args: []string{"--duration-sec=60", "--millicores=50"},
Args: []string{"--duration-sec=600", "--millicores=50"},
Member


this shouldn't be needed since the test is only running for 60 seconds: https://github.com/kubernetes-sigs/metrics-server/blob/master/test/e2e_test.go#L173

test/e2e_test.go Outdated
},
},
},
{
Name: "sidecar-container",
Command: []string{"./consume-cpu/consume-cpu"},
Args: []string{"--duration-sec=60", "--millicores=50"},
Args: []string{"--duration-sec=600", "--millicores=50"},
Member


ditto

@yangjunmyfm192085
Contributor Author

yangjunmyfm192085 commented Aug 6, 2021

I wonder if we should really check for len(ms.Containers) == 2 in the tests, in my opinion, MS should always return complete PodMetrics containing metrics about all the containers of a pod. The fact that we only get metrics from one container out of the two from the pods sounds like an issue to me.
In your recent changes, you added a check to only return PodMetrics whenever we have metrics for all the containers of the pods: https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/scraper/client/resource/decode.go#L148-L151. So it means that kubelet is only exposing metrics about one container of the Pod. At the very least, I would expect to have metrics about both containers initialized and exposed by kubelet.

This seems to be a bug; before storing PodMetrics, we should check that the number of containers matches between the PodMetrics and the apiserver.
I will do some research and then modify it.

@yangjunmyfm192085
Contributor Author

I wonder if we should really check for len(ms.Containers) == 2 in the tests, in my opinion, MS should always return complete PodMetrics containing metrics about all the containers of a pod. The fact that we only get metrics from one container out of the two from the pods sounds like an issue to me.
In your recent changes, you added a check to only return PodMetrics whenever we have metrics for all the containers of the pods: https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/scraper/client/resource/decode.go#L148-L151. So it means that kubelet is only exposing metrics about one container of the Pod. At the very least, I would expect to have metrics about both containers initialized and exposed by kubelet.

This seems to be a bug; before storing PodMetrics, we should check that the number of containers matches between the PodMetrics and the apiserver.
I will do some research and then modify it.

I have researched this issue recently; the results:

  • Within two cycles of resource reporting, a container of the pod reported duplicate data.
  • This causes only one cycle of data to be retained for this container, while both cycles of data are retained for the other container.
  • So we only get metrics for one container out of the two in the pod at the beginning.
  • This does not seem to be an issue in metrics-server itself.

Contributor Author

@yangjunmyfm192085 yangjunmyfm192085 left a comment


/cc @serathius @dgrisonnet Do we need to handle this scenario specially?

@dgrisonnet
Member

Great investigation @yangjunmyfm192085! I think you found a bug in metrics-server. IMO we should handle this scenario, since decoding a metrics batch and getting metrics from the storage have different behaviors.
When decoding, we are preventing MS from storing PodMetrics until we can get the metrics for all the containers of the pod, whereas when we get the metrics from the storage, we skip the containers for which we can't compute the resource usage instead of skipping the whole PodMetrics:

prevContainer, found := prevPod.Containers[container]
if !found {
continue
}

I think we should update this part of the code and wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics.
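The suggested fix can be sketched as a whole-pod readiness check: instead of silently skipping containers without a previous sample (the continue in the quoted snippet), drop the whole PodMetrics until every container can be computed. The types and names below (PodSamples, podMetricsReady) are illustrative stand-ins, not metrics-server's actual storage code.

```go
package main

import "fmt"

// ContainerUsage is an illustrative per-container counter sample.
type ContainerUsage struct{ CPUNanos int64 }

// PodSamples is an illustrative pod-level sample keyed by container name.
type PodSamples struct {
	Containers map[string]ContainerUsage
}

// podMetricsReady reports whether every container in the current sample
// also has a previous sample, i.e. whether resource usage can be
// computed for all of them. If any container is missing, the whole
// PodMetrics is withheld rather than exposed partially.
func podMetricsReady(prev, curr PodSamples) bool {
	for name := range curr.Containers {
		if _, found := prev.Containers[name]; !found {
			return false // one container missing => whole pod not ready
		}
	}
	return true
}

func main() {
	prev := PodSamples{Containers: map[string]ContainerUsage{
		"sidecarpod-consumer": {100},
	}}
	curr := PodSamples{Containers: map[string]ContainerUsage{
		"sidecarpod-consumer": {150},
		"sidecar-container":   {120},
	}}
	// sidecar-container has no previous sample, so the pod is not ready.
	fmt.Println(podMetricsReady(prev, curr))
}
```

This makes the storage path consistent with the decode path: both now wait for complete per-container data before a PodMetrics is visible to consumers.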

@yangjunmyfm192085
Contributor Author

Great investigation @yangjunmyfm192085! I think you found a bug in metrics-server. IMO we should handle this scenario, since decoding a metrics batch and getting metrics from the storage have different behaviors.
When decoding, we are preventing MS from storing PodMetrics until we can get the metrics for all the containers of the pod, whereas when we get the metrics from the storage, we skip the containers for which we can't compute the resource usage instead of skipping the whole PodMetrics:

prevContainer, found := prevPod.Containers[container]
if !found {
continue
}

I think we should update this part of the code and wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics.

Yeah, you are right. Let me update this part of the code

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 10, 2021
@yangjunmyfm192085
Contributor Author

/retitle wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics.

@k8s-ci-robot k8s-ci-robot changed the title Modify e2e-test, avoid failure wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics. Aug 10, 2021
@dgrisonnet
Member

Looks good from my side, but do we really need the changes made to test/e2e_test.go?

@yangjunmyfm192085
Contributor Author

Looks good from my side, but do we really need the changes made to test/e2e_test.go?

I am not sure whether it is necessary to set ResourceCPU to 100m and ResourceMemory to 100Mi for each container, so I optimized it, but it is also fine to make no changes.

@dgrisonnet
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2021
@yangjunmyfm192085
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 14, 2021
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2021
…f a pod before exposing its PodMetrics.

Signed-off-by: JunYang <[email protected]>
@yangjunmyfm192085
Contributor Author

/retest

@yangjunmyfm192085
Contributor Author

/cc @dgrisonnet pull-metrics-server-test-e2e-ha has failed.

The error is: no matches for kind "PodDisruptionBudget" in version "policy/v1". Does using kindest/node:v1.20.7 not match?

@dgrisonnet
Member

Indeed, it seems that I missed a couple of errors with Kubernetes v1.19 and v1.20. I'll have a look.

@dgrisonnet
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2021
@dgrisonnet
Member

/retest

@yangjunmyfm192085
Contributor Author

I think we should merge this PR now.
/assign @serathius
/cc @serathius @dgrisonnet

@serathius
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, serathius, yangjunmyfm192085

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2021
@k8s-ci-robot k8s-ci-robot merged commit 8b2b941 into kubernetes-sigs:master Sep 14, 2021
yangjunmyfm192085 added a commit to yangjunmyfm192085/metrics-server that referenced this pull request Sep 24, 2021
…source usage of all the containers of a pod before exposing its PodMetrics.

Signed-off-by: JunYang <[email protected]>
yangjunmyfm192085 added a commit to yangjunmyfm192085/metrics-server that referenced this pull request Sep 24, 2021
…source usage of all the containers of a pod before exposing its PodMetrics.

Signed-off-by: JunYang <[email protected]>
k8s-ci-robot added a commit that referenced this pull request Sep 24, 2021
 cherry pick of #807 wait to be able to compute the resource usage of…