
Memory Leak on EKS Clusters #1820

Open
@TomWKraken

Description

We run Cosign's policy-controller on our AWS EKS clusters to verify signatures of our in-house images. Over the course of a few days, the memory used by these pods increases significantly.

[Image: graph of pod memory usage climbing over several days]

When memory usage reaches the pod's allocated memory, we start seeing errors for failed verifications and the pod restarts continuously. The pods continuously generate the following log entries:

[INFO] Webhook ServeHTTP request=&http.Request{Method:"POST", URL:(*url.URL)(0xc0182243f0), Proto:"HTTP/1.1", ProtoMajor:1, ProtoMinor:1, Header:http.Header{"Accept":[]string{"application/json, */*"}, "Accept-Encoding":[]string{"gzip"}, "Content-Length":[]string{"21818"}, "Content-Type":[]string{"application/json"}, "User-Agent":[]string{"kube-apiserver-admission"}}, Body:(*http.body)(0xc03e010980), GetBody:(func() (io.ReadCloser, error))(nil), ContentLength:21818, TransferEncoding:[]string(nil), Close:false, Host:"webhook.cosign-system.svc:443", Form:url.Values(nil), PostForm:url.Values(nil), MultipartForm:(*multipart.Form)(nil), Trailer:http.Header(nil), RemoteAddr:"10.2.0.6:39576", RequestURI:"/mutations?timeout=25s", TLS:(*tls.ConnectionState)(0xc0070c6d80), Cancel:(<-chan struct {})(nil), Response:(*http.Response)(nil), Pattern:"/mutations", ctx:(*context.cancelCtx)(0xc001e362d0), pat:(*http.pattern)(0xc000765200), matches:[]string(nil), otherValues:map[string]string(nil)}

[INFO] remote admission controller audit annotations=map[string]string(nil)

[ERROR] Failed the resource specific validation

[WARN] Failed to validate at least one policy for <IMAGE-URI> wanted 1 policies, only validated 0

[ERROR] error validating signatures: Get "https://<AccountID>.dkr.ecr.eu-west-1.amazonaws.com/v2/": context canceled

Manually killing the pod is the only way to fix this: the newly created pod starts with much lower memory consumption, but over a couple of days the process repeats itself.
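For anyone hitting the same thing, the manual restart can be scripted as a one-liner. The namespace and label selector below are assumptions based on a default chart install, so adjust them to match your deployment:

```shell
# Delete the webhook pods so the Deployment recreates them with fresh memory.
# Namespace and label selector are placeholders for a default policy-controller install.
kubectl -n cosign-system delete pod -l app.kubernetes.io/name=policy-controller
```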

We thought this might be related to the webhook timeout being too short, but we recently increased it to 15 seconds and it hasn't helped.
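For reference, the timeout lives on the webhook registration itself. A minimal sketch of the relevant fields follows; the webhook name is a placeholder (check `kubectl get mutatingwebhookconfigurations` for the real one), while the service namespace and path match the Host and RequestURI in the log above:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: policy.sigstore.dev        # placeholder name
webhooks:
  - name: policy.sigstore.dev      # placeholder name
    timeoutSeconds: 15             # default is 10; the API allows at most 30
    clientConfig:
      service:
        name: webhook
        namespace: cosign-system   # matches Host "webhook.cosign-system.svc:443" in the log
        path: /mutations           # matches RequestURI "/mutations" in the log
```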

We think this behaviour indicates a memory leak in the Cosign application.
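To help confirm (or rule out) a leak, a heap profile diff from the running webhook would be useful. Assuming the binary exposes Go's pprof endpoints (Knative-based controllers like policy-controller can enable profiling via their observability config; the port and pod name below are assumptions), something like:

```shell
# Forward the (assumed) profiling port from the webhook pod.
kubectl -n cosign-system port-forward pod/<webhook-pod> 8008:8008 &

# Capture a heap snapshot now, and another a few hours later.
curl -s http://localhost:8008/debug/pprof/heap > heap-t0.out
curl -s http://localhost:8008/debug/pprof/heap > heap-t1.out

# Diff the two snapshots to see which allocation sites grew.
go tool pprof -base heap-t0.out heap-t1.out
```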

Version

Policy Controller: 0.12.0
Helm chart: 0.9.1
EKS Kubernetes: v1.30

Metadata

Labels

bug (Something isn't working)
