Description
We run Cosign on our AWS EKS clusters to verify the signatures of our in-house images. Over a couple of days, the memory used by these pods increases significantly.
When memory usage reaches the memory allocated to the pod, signature verifications start failing and the pod restarts continuously. The pods continuously generate the following log entries:
[INFO] Webhook ServeHTTP request=&http.Request{Method:"POST", URL:(*url.URL)(0xc0182243f0), Proto:"HTTP/1.1", ProtoMajor:1, ProtoMinor:1, Header:http.Header{"Accept":[]string{"application/json, */*"}, "Accept-Encoding":[]string{"gzip"}, "Content-Length":[]string{"21818"}, "Content-Type":[]string{"application/json"}, "User-Agent":[]string{"kube-apiserver-admission"}}, Body:(*http.body)(0xc03e010980), GetBody:(func() (io.ReadCloser, error))(nil), ContentLength:21818, TransferEncoding:[]string(nil), Close:false, Host:"webhook.cosign-system.svc:443", Form:url.Values(nil), PostForm:url.Values(nil), MultipartForm:(*multipart.Form)(nil), Trailer:http.Header(nil), RemoteAddr:"10.2.0.6:39576", RequestURI:"/mutations?timeout=25s", TLS:(*tls.ConnectionState)(0xc0070c6d80), Cancel:(<-chan struct {})(nil), Response:(*http.Response)(nil), Pattern:"/mutations", ctx:(*context.cancelCtx)(0xc001e362d0), pat:(*http.pattern)(0xc000765200), matches:[]string(nil), otherValues:map[string]string(nil)}
[INFO] remote admission controller audit annotations=map[string]string(nil)
[ERROR] Failed the resource specific validation
[WARN] Failed to validate at least one policy for <IMAGE-URI> wanted 1 policies, only validated 0
[ERROR] error validating signatures: Get "https://<AccountID>.dkr.ecr.eu-west-1.amazonaws.com/v2/": context canceled
Manually killing the pod is the only way to fix this: the newly created pod starts with much lower memory consumption, but over a couple of days the process repeats itself.
We thought this might be related to the webhook timeout being too short, but we recently increased it to 15 seconds and that hasn't helped.
We think this behaviour indicates a memory leak in the Cosign application.
Version
Policy Controller version 0.12.0, chart version 0.9.1. Our EKS clusters are running Kubernetes v1.30.