wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics. #807
Conversation
/hold
Force-pushed from 035c7a4 to 65f3ad9.
Through more tests, I found a problem in what we get. Now I have fixed it up: we need to check the length of the correct result.
/unhold
I wonder if we should really check for len(ms.Containers) == 2 in the tests. In my opinion, MS should always return complete PodMetrics containing metrics about all the containers of a pod; the fact that we only get metrics for one container out of the two in the pod sounds like an issue to me.
In your recent changes, you added a check to only return PodMetrics when we have metrics for all the containers of the pod: https://github.com/kubernetes-sigs/metrics-server/blob/master/pkg/scraper/client/resource/decode.go#L148-L151. So it means that kubelet is only exposing metrics about one container of the Pod. At the very least, I would expect metrics about both containers to be initialized and exposed by kubelet.
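For readers following along, here is a minimal Go sketch of the kind of completeness check being discussed, using simplified, hypothetical types rather than the actual metrics-server structs: a decoded pod is kept only when every container declared in its spec has a sample.

```go
// Illustrative sketch only: simplified stand-ins for the real metrics-server
// types, showing the "all containers present" idea discussed above.
package main

import "fmt"

// containerSample is a hypothetical, simplified container measurement.
type containerSample struct {
	Name         string
	CPUNanoCores uint64
	MemoryBytes  uint64
}

// podSample is a hypothetical, simplified pod measurement decoded from kubelet.
type podSample struct {
	Name       string
	Containers []containerSample
}

// completePodSamples keeps only pods for which every container declared in the
// pod spec (expectedContainers) has a decoded sample, mirroring the idea of
// dropping incomplete PodMetrics instead of exposing partial data.
func completePodSamples(pods []podSample, expectedContainers map[string][]string) []podSample {
	var out []podSample
	for _, p := range pods {
		seen := make(map[string]bool, len(p.Containers))
		for _, c := range p.Containers {
			seen[c.Name] = true
		}
		complete := true
		for _, name := range expectedContainers[p.Name] {
			if !seen[name] {
				complete = false
				break
			}
		}
		if complete {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []podSample{
		{Name: "consumer", Containers: []containerSample{{Name: "consumer", CPUNanoCores: 50000000}}},
	}
	expected := map[string][]string{"consumer": {"consumer", "sidecar-container"}}
	// The sidecar sample is missing, so the whole pod is held back.
	fmt.Println(len(completePodSamples(pods, expected))) // 0
}
```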
test/e2e_test.go (Outdated)
@@ -561,24 +561,24 @@ func consumeWithSideCarContainer(client clientset.Interface, podName string) err
 {
     Name:    podName,
     Command: []string{"./consume-cpu/consume-cpu"},
-    Args:    []string{"--duration-sec=60", "--millicores=50"},
+    Args:    []string{"--duration-sec=600", "--millicores=50"},
this shouldn't be needed since the test is only running for 60 seconds: https://github.com/kubernetes-sigs/metrics-server/blob/master/test/e2e_test.go#L173
test/e2e_test.go (Outdated)
         },
       },
     },
     {
       Name:    "sidecar-container",
       Command: []string{"./consume-cpu/consume-cpu"},
-      Args:    []string{"--duration-sec=60", "--millicores=50"},
+      Args:    []string{"--duration-sec=600", "--millicores=50"},
ditto
Force-pushed from 65f3ad9 to 7243aa5.
This seems to be a bug: before storing PodMetrics, we should check the number of containers from both the PodMetrics and the apiserver.
I have researched this issue recently; the result:
/cc @serathius @dgrisonnet, do we need to handle this scenario specially?
Great investigation @yangjunmyfm192085! I think you found a bug in metrics-server. IMO we should handle this scenario, since decoding the metrics batch and getting metrics from the storage have different behaviors (see metrics-server/pkg/storage/pod.go, lines 66 to 69 at d4def03).
I think we should update this part of the code and wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics.
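As an illustration of the change being proposed, here is a hedged Go sketch (hypothetical types and function names, not the project's actual code) in which a pod's metrics are held back unless usage could be computed for every one of its containers:

```go
// Illustrative sketch with hypothetical types: hold back a pod's metrics until
// usage could be computed for every one of its containers.
package main

import "fmt"

type containerUsage struct {
	Name        string
	CPUMilli    int64
	MemoryBytes int64
}

type podUsage struct {
	Name       string
	Containers []containerUsage
}

// podUsageIfComplete computes per-container usage and returns (zero, false) if
// usage could not be computed for any container, instead of returning a
// partial result that silently skips that container.
func podUsageIfComplete(name string, containerNames []string, usage func(container string) (containerUsage, bool)) (podUsage, bool) {
	p := podUsage{Name: name}
	for _, c := range containerNames {
		u, ok := usage(c)
		if !ok {
			// One container has no usable measurement: expose nothing for
			// this pod rather than an incomplete PodMetrics.
			return podUsage{}, false
		}
		p.Containers = append(p.Containers, u)
	}
	return p, true
}

func main() {
	usage := func(c string) (containerUsage, bool) {
		if c == "sidecar-container" {
			return containerUsage{}, false // e.g. duplicate samples, no usable window yet
		}
		return containerUsage{Name: c, CPUMilli: 50, MemoryBytes: 100 << 20}, true
	}
	_, ok := podUsageIfComplete("consumer", []string{"consumer", "sidecar-container"}, usage)
	fmt.Println(ok) // false: wait for the next scrape instead of exposing partial data
}
```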
Yeah, you are right. Let me update this part of the code.
Force-pushed from 7243aa5 to 96ad63f.
/retitle wait to be able to compute the resource usage of all the containers of a pod before exposing its PodMetrics.
Looks good from my side, but do we really need the changes made to … ?
I am not sure if it is necessary to set ResourceCPU 100m and ResourceMemory 100Mi for each container, so I optimized it, but it is also ok to make no changes.
/lgtm
/retest
Force-pushed from f484ea4 to ed1a3dc.
…f a pod before exposing its PodMetrics. Signed-off-by: JunYang <[email protected]>
Force-pushed from ed1a3dc to eed970e.
/retest
/cc @dgrisonnet pull-metrics-server-test-e2e-ha failed.
Indeed, it seems that I missed a couple of errors with Kubernetes v1.19 and v1.20. I'll have a look.
/lgtm
/retest
I think we should merge this PR now.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dgrisonnet, serathius, yangjunmyfm192085
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…source usage of all the containers of a pod before exposing its PodMetrics. Signed-off-by: JunYang <[email protected]>
cherry pick of #807 wait to be able to compute the resource usage of…
Signed-off-by: JunYang [email protected]
What this PR does / why we need it:
There have been many CI failures recently.
Based on previous research: within two cycles of resource reporting, a container of the pod reported duplicate data, and we currently skip the containers for which we can't compute the resource usage instead of skipping the whole PodMetrics, so an incomplete PodMetrics can be exposed.
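To illustrate why duplicate data prevents computing usage, here is a small Go sketch that assumes, as an approximation of how usage is derived, that CPU usage is a rate between two cumulative samples; the types and function here are illustrative, not the project's actual code:

```go
// Illustrative sketch, assuming CPU usage is derived as a rate between two
// cumulative samples: when the kubelet reports a duplicate sample (same
// timestamp and counter), the window is zero and no usage can be computed.
package main

import (
	"errors"
	"fmt"
	"time"
)

type cpuSample struct {
	Timestamp           time.Time
	CumulativeUsageNano uint64
}

// cpuUsageNanoCores returns the average CPU usage between two samples, or an
// error when the samples do not form a valid window (duplicate or reordered).
func cpuUsageNanoCores(prev, last cpuSample) (float64, error) {
	window := last.Timestamp.Sub(prev.Timestamp)
	if window <= 0 || last.CumulativeUsageNano < prev.CumulativeUsageNano {
		return 0, errors.New("cannot compute usage: duplicate or inconsistent samples")
	}
	return float64(last.CumulativeUsageNano-prev.CumulativeUsageNano) / window.Seconds(), nil
}

func main() {
	t := time.Now()
	prev := cpuSample{Timestamp: t, CumulativeUsageNano: 1_000_000_000}
	dup := prev // the kubelet reported the same data again in the next cycle
	if _, err := cpuUsageNanoCores(prev, dup); err != nil {
		fmt.Println(err) // the container's usage is not computable this cycle
	}
}
```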
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #