
gatekeeper-controller-manager is leaking memory #2654


Closed
dethi opened this issue Mar 29, 2023 · 33 comments · Fixed by #2690
Labels
bug Something isn't working

Comments

@dethi
Contributor

dethi commented Mar 29, 2023

What steps did you take and what happened:

On all our GKE clusters, gatekeeper-controller-manager gets OOM killed once every ~24-30h. You can see in the screenshot below that the memory keeps increasing over time, which indicates to me that there is a memory leak somewhere.

gatekeeper-controller-manager memory increase

I just enabled pprof; I will add heap profiles as soon as I have them.

What did you expect to happen:
Memory usage to be stable over time.

Anything else you would like to add:
This was already happening on the previous version we were running, openpolicyagent/gatekeeper:v3.10.0-beta.0

Environment:

  • Gatekeeper version: openpolicyagent/gatekeeper:v3.11.0
  • Kubernetes version: (use kubectl version): v1.23.14-gke.1800
  • 3 replicas of gatekeeper-controller-manager:
      --port=8443
      --health-addr=:9090
      --prometheus-port=8888
      --logtostderr
      --log-denies=true
      --emit-admission-events=true
      --log-level=DEBUG
      --exempt-namespace=gatekeeper-system
      --operation=webhook
      --enable-external-data=true
      --enable-generator-resource-expansion=false
      --log-mutations=true
      --mutation-annotations=false
      --disable-cert-rotation=false
      --max-serving-threads=-1
      --tls-min-version=1.3
      --metrics-backend=prometheus
      --enable-tls-healthcheck
      --operation=mutation-webhook
      --disable-opa-builtin={http.send}
      --exempt-namespace-prefix=kube
      --enable-pprof
      --pprof-port=6060
@dethi dethi added the bug Something isn't working label Mar 29, 2023
@dethi
Contributor Author

dethi commented Mar 29, 2023

Heap Dump: one taken a few minutes after the start and one taken 3h later
gatekeeper-controller-manager-7f58485448-d8rsl--heap.zip

Goroutines stacktrace:
gatekeeper-controller-manager-7f58485448-d8rsl--goroutine.txt

@dethi
Contributor Author

dethi commented Mar 29, 2023

Looking at the stacktrace, I think it's leaking connections (considering the growing number of readLoop/writeLoop goroutines)
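
For reference, here is a throwaway sketch to quantify that from the goroutine dump attached above. The only assumption is that net/http parks one (*persistConn).readLoop and one (*persistConn).writeLoop goroutine per open persistent connection:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    // Count goroutines parked in net/http's per-connection loops. A count that
    // keeps growing usually means connections are opened and never closed.
    data, err := os.ReadFile("gatekeeper-controller-manager-7f58485448-d8rsl--goroutine.txt")
    if err != nil {
        panic(err)
    }
    dump := string(data)
    fmt.Println("readLoop goroutines: ", strings.Count(dump, "(*persistConn).readLoop"))
    fmt.Println("writeLoop goroutines:", strings.Count(dump, "(*persistConn).writeLoop"))
}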

@davis-haba
Contributor

Thank you for the detailed write-up!

How are you analyzing the heap dump? We tried opening it with

go tool pprof -http=:8082 gatekeeper-controller-manager-7f58485448-d8rsl--3h-later-heap

but this yields an error: gatekeeper-controller-manager-7f58485448-d8rsl--3h-later-heap: parsing profile: unrecognized profile format

@dethi
Contributor Author

dethi commented Mar 30, 2023

🤔 I just did that and it opened for me

go tool pprof -http :8080 gatekeeper-controller-manager-7f58485448-d8rsl--3h-later-heap

(I'm using go version go1.20.1 darwin/amd64)

@maxsmythe
Contributor

Stack trace working for me. It definitely seems like there is a lot of space being used by http read/write buffers.

Does the memory increase over time happen on both the audit pod and the webhook pods, or just the webhook pods?

@maxsmythe
Contributor

@ritazh @sozercan

I wonder if this is from external data creating multiple clients?

From the golang docs:

The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines.

Our code appears to create a new client for every request:

https://github.com/open-policy-agent/frameworks/blob/745f2b50cfcb746bb30997d65e851d1b6ee0b9e9/constraint/pkg/externaldata/request.go#L62-L145

@dethi does disabling calls to external data remove the leak?
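
For illustration only (made-up names, not the frameworks code), the reuse pattern the docs describe would look roughly like this: build the client once and hand the same instance to every request, so the Transport and its cached TCP connections get reused.

// Hypothetical sketch of a process-wide, reusable client.
package externaldata

import (
    "net/http"
    "sync"
    "time"
)

var (
    clientOnce   sync.Once
    sharedClient *http.Client
)

// sendClient returns a shared client instead of constructing a new one per request.
func sendClient() *http.Client {
    clientOnce.Do(func() {
        sharedClient = &http.Client{Timeout: 10 * time.Second}
    })
    return sharedClient
}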

@maxsmythe
Contributor

Interesting, maybe-on-topic finds.

golang/go#43966

https://coder.com/blog/go-leak-mysteries

Note that it does look like we're closing the response body.

@dethi
Contributor Author

dethi commented Apr 1, 2023

Does the memory increase over time happen on both the audit pod and the webhook pods, or just the webhook pods?

Only the webhook pods. The audit pod has stable memory usage.

@dethi does disabling calls to external data remove the leak?

We don't have any external data configured and don't use external data at the moment, but I can try to disable it.

@dethi
Contributor Author

dethi commented Apr 1, 2023

Still leaking with external-data disabled.

@ritazh
Member

ritazh commented Apr 3, 2023

This was already happening on the previous version we were running, openpolicyagent/gatekeeper:v3.10.0-beta.0

Can you please confirm the same feature flags used with older versions of Gatekeeper did not have this issue?

@dethi
Contributor Author

dethi commented Apr 3, 2023

openpolicyagent/gatekeeper:v3.10.0-beta.0 was the first version that we deployed.

@davis-haba
Contributor

davis-haba commented Apr 3, 2023

Hey @dethi,

We're still trying to chase down this leak, but could use a little more to go on.

We noticed a lot of HTTP calls in the heap dump. One theory is that the event emitter could be leaking connections. To this end, could you try disabling admission events (set the flag --emit-admission-events=false) and see if it still leaks?

We're also curious about the admission webhook's usage. Is this a relatively static cluster, where resources are not changed very frequently? Or is it more like a test cluster where resources are getting created and destroyed frequently (particularly namespaces)? If you have metrics set up, it would be helpful if you could share the values/graphs for gatekeeper_validation_request_count, and also gatekeeper_validation_request_duration_seconds (the cumulative distribution).

A couple other quick questions:

  • Are you using mutations? If not, please try disabling the mutation operation (remove the flag --operation=mutation-webhook)
  • Are you using replicated data, i.e. setting sync values on the Config resource? If so, which GVKs are you syncing?

Thanks :)

@dethi
Contributor Author

dethi commented Apr 4, 2023

To this end, could you try disabling admission events (set the flag --emit-admission-events=false) and see if it still leaks?

I will change that now and report the results tomorrow!

Is this a relatively static cluster, where resources are not changed very frequently?

Both. We see this same leak happening on all our clusters, dev and prod. Dev clusters have a lot of churn (a new namespace/deployment is created at least every 10-15 min). Prod clusters have relatively low churn (outside of a daily deploy and a cronjob that runs every 5 min).

Are you using mutations? If not, please try disabling the mutation operation (remove the flag --operation=mutation-webhook)

Not using any mutation at the moment. I will disable it after the --emit-admission-events=false test, depending on the result.

Are you using replicated data, i.e. setting sync values on the Config resource? If so, which GVKs are you syncing?

No, we aren't using replicated data.

If you have metrics set up, it would be helpful if you could share the values/graphs

Yes we do :). Here is an example of what the metrics look like on a prod cluster. Looking at the last 12h, it's pretty quiet. Happy to share more if you think it shows something interesting.

Screenshot 2023-04-04 at 00 59 55

@dethi
Contributor Author

dethi commented Apr 4, 2023

Tried both --emit-admission-events=false and --operation=mutation-webhook separately, and the process is still leaking memory at the same rate. I'm trying both at the same time now.

EDIT: both at the same time have the same effect: still leaking.

@dethi
Contributor Author

dethi commented Apr 4, 2023

One thing that I noticed in the logs, not sure if it's normal or not, but the controller seems to reconcile the constraint every ~10mins, even when there was no change to the constraint or constraint template.

Screenshot 2023-04-04 at 10 28 15

@maxsmythe
Contributor

maxsmythe commented Apr 5, 2023

Thank you for helping us gather data!

One thing that I noticed in the logs, not sure if it's normal or not, but the controller seems to reconcile the constraint every ~10mins, even when there was no change to the constraint or constraint template.

This could be due to a watch getting interrupted, though periodic re-reconciliation is expected in how controller-runtime works (I forget the default frequency).
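
For reference, the periodic resync interval is a manager option in controller-runtime. A rough sketch of where that knob lives (the exact field has moved between controller-runtime versions, so treat this as illustrative rather than Gatekeeper's actual wiring):

// Illustrative only: tune controller-runtime's periodic resync when building the manager.
package controllers

import (
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
    // The upstream default resync is on the order of hours (jittered), not minutes.
    syncPeriod := 10 * time.Hour
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        SyncPeriod: &syncPeriod,
    })
}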

You mentioned the audit pod does not have this memory leak behavior. Is that still true? Does it also have these log messages?

How many total constraints do you have on your cluster?

Another explanation could be due to changes to ConstraintPodStatus resources:

err = c.Watch(
    &source.Kind{Type: &constraintstatusv1beta1.ConstraintPodStatus{}},
    handler.EnqueueRequestsFromMapFunc(constraintstatus.PodStatusToConstraintMapper(true, util.EventPackerMapFunc())),
)

Or reconciling changes to the status field of the constraint due to audit updating the violating resources on the constraint (this would not increment metadata.generation).

The weird thing is that if audit is unaffected, it should be something specific to the webhook process and not controllers, since the audit/status pods share the same controllers and watch the same resources.

Doubly weird is that the webhook has fairly low activity, which would point against it being due to the webhook serving requests and back toward the controllers.

If audit is still symptom-free, what flags does the audit pod have?

If the audit pod has a --disable-cert-rotation flag, what happens if you add that flag to the webhook pods? The webhook pods should still work as long as you don't touch any secrets in gatekeeper-system and don't modify Gatekeeper's ValidatingWebhookConfiguration (though I'd definitely avoid doing this in prod to be safe). The flag should be re-enabled before the self-signed cert expires (IIRC default is a decade).

@dethi
Contributor Author

dethi commented Apr 5, 2023

You mentioned the audit pod does not have this memory leak behavior. Is that still true? Does it also have these log messages?

Yes, the audit pod doesn't have this memory leak. And yes, it is also logging "handling constraint update" at the same interval.

How many total constraints do you have on your cluster?

At the moment, only 3.

If audit is still symptom-free, what flags does the audit pod have?

    Args:
      --audit-interval=60
      --log-level=DEBUG
      --constraint-violations-limit=20
      --audit-from-cache=false
      --audit-chunk-size=500
      --audit-match-kind-only=false
      --emit-audit-events=true
      --operation=audit
      --operation=status
      --operation=mutation-status
      --logtostderr
      --health-addr=:9090
      --prometheus-port=8888
      --enable-external-data=true
      --enable-generator-resource-expansion=true
      --metrics-backend=prometheus
      --disable-cert-rotation=true

what happens if you add that flag to the webhook pods?

I will try that and come back to you!

@dethi
Contributor Author

dethi commented Apr 5, 2023

Still leaking memory with --disable-cert-rotation 🙁

@ritazh
Member

ritazh commented Apr 5, 2023

Can you share your rego? Is the same rego/constraint template being used for both webhook and audit?

@dethi
Contributor Author

dethi commented Apr 5, 2023

Yes, both audit and the controller are using the exact same constraint and ConstraintTemplate.

The 3 constraints use:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockloadbalancer
  annotations:
    metadata.gatekeeper.sh/title: "Block Services with type LoadBalancer"
    description: >-
      Block Services with type LoadBalancer, except if it's in the allowed list.
spec:
  crd:
    spec:
      names:
        kind: K8sBlockLoadBalancer
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowed:
              description: "List of allowed Loadbalancer"
              type: array
              items:
                type: object
                properties:
                  namespace:
                    type: string
                  name:
                    type: string
                  allow_external:
                    type: boolean
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sblockloadbalancer

        import future.keywords.in

        violation[{"msg": msg}] {
          input.review.kind.kind == "Service"
          input.review.object.spec.type == "LoadBalancer"
          not allowed
          msg := "Service of type LoadBalancer is not allowed."
        }

        allowed {
          some item in input.parameters.allowed

          item.namespace == input.review.object.metadata.namespace
          item.name == input.review.object.metadata.name

          allowed_lb_type(item)
        }

        allowed_lb_type(item) {
          item.allow_external
        }

        allowed_lb_type(item) {
          not item.allow_external
          object.get(input.review.object.metadata.annotations, "cloud.google.com/load-balancer-type", "") == "Internal"
        }

@maxsmythe
Contributor

If you take a webhook pod and run another copy under a different name and with different labels (this ensures there will be zero webhook traffic since it's not part of the webhook service), does the memory leak still occur?

@maxsmythe
Contributor

Also curious how much memory the audit pod uses?

@maxsmythe
Contributor

The heap dump seems to show most memory usage associated with processing the webhook, which jibes with the theory that serving requests is what's causing the memory consumption. The zero-traffic test should confirm whether that's the right area of investigation.

Assuming it is, the next step is figuring out why/how, especially if the cluster is low-traffic. Do low traffic clusters leak at the same rate as clusters with higher Gatekeeper QPS?

Does setting --max-serving-threads=1 affect the memory profile? (Note this might leave Gatekeeper under-resourced to handle inbound QPS.)

How much CPU does the webhook pod have? If <1, does increasing it to somewhere between 1-2 fix the issue?

@dethi
Contributor Author

dethi commented Apr 6, 2023

I have taken a new heap dump, when the memory was much higher (180MB), to show more clearly where the leak is coming from (inuse_space):

Screenshot 2023-04-06 at 12 08 37

gatekeeper-controller-manager-754cbf66bd-r88tt--heap.zip

@dethi
Contributor Author

dethi commented Apr 6, 2023

How much CPU does the webhook pod have?

CPU req 100m, limit 1. They barely use anything at all.

Screenshot 2023-04-05 at 20 31 18

Do low traffic clusters leak at the same rate as clusters with higher Gatekeeper QPS?

Yes, they all leak at the same rate.

Also curious how much memory the audit pod uses?

image

@maxsmythe
Contributor

Thanks for the extra data! I'll take a look at the dump. Since there are a lot of potential places where HTTP connections can be initiated, I'm trying to narrow down what areas of code are the most likely candidates.

Did you get a chance to run this test?

If you take a webhook pod and run another copy under a different name and with different labels (this ensures there will be zero webhook traffic since it's not part of the webhook service), does the memory leak still occur?

@dethi
Contributor Author

dethi commented Apr 6, 2023

I missed that part. Trying it now, will report the result tomorrow!

@dethi
Contributor Author

dethi commented Apr 7, 2023

Here is the result for the webhook pods: still leaking even with no webhook traffic being served.

Screenshot 2023-04-07 at 11 00 00

Here is the new deployment I created:

Name:                   gatekeeper-controller-manager-test
Namespace:              gatekeeper-system
CreationTimestamp:      Thu, 06 Apr 2023 23:32:09 +0100
Labels:                 app=gatekeeper-test
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=gatekeeper-test
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=gatekeeper-test
  Annotations:      prometheus.io/port: 8888
                    prometheus.io/scrape: true
  Service Account:  gatekeeper-admin
  Containers:
   manager:
    Image:       openpolicyagent/gatekeeper:v3.11.0
    Ports:       8443/TCP, 8888/TCP, 9090/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /manager
    Args:
      --port=8443
      --health-addr=:9090
      --prometheus-port=8888
      --logtostderr
      --log-denies=true
      --emit-admission-events=true
      --log-level=DEBUG
      --exempt-namespace=gatekeeper-system
      --operation=webhook
      --enable-external-data=true
      --enable-generator-resource-expansion=true
      --log-mutations=true
      --mutation-annotations=false
      --disable-cert-rotation=true
      --max-serving-threads=-1
      --tls-min-version=1.3
      --metrics-backend=prometheus
      --enable-tls-healthcheck
      --operation=mutation-webhook
      --disable-opa-builtin={http.send}
      --exempt-namespace-prefix=kube
      --enable-pprof=true
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:      100m
      memory:   512Mi
    Liveness:   http-get http://:9090/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:9090/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:    (v1:metadata.namespace)
      POD_NAME:         (v1:metadata.name)
      NAMESPACE:        (v1:metadata.namespace)
      CONTAINER_NAME:  manager
    Mounts:
      /certs from cert (ro)
  Volumes:
   cert:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         gatekeeper-webhook-server-cert
    Optional:           false
  Priority Class Name:  system-cluster-critical

And confirmation that it wasn't targeted by the webhook:

Name:              gatekeeper-webhook-service
Namespace:         gatekeeper-system
Labels:            app=gatekeeper
                   app.kubernetes.io/managed-by=Helm
                   chart=gatekeeper
                   gatekeeper.sh/system=yes
                   heritage=Helm
                   release=gatekeeper
Annotations:       meta.helm.sh/release-name: gatekeeper
                   meta.helm.sh/release-namespace: gatekeeper-system
Selector:          app=gatekeeper,chart=gatekeeper,control-plane=controller-manager,gatekeeper.sh/operation=webhook,gatekeeper.sh/system=yes,heritage=Helm,release=gatekeeper
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.29.88.45
IPs:               172.29.88.45
Port:              https-webhook-server  443/TCP
TargetPort:        webhook-server/TCP
Endpoints:         10.128.12.4:8443,10.128.31.47:8443,10.128.33.48:8443
Session Affinity:  None
Events:            <none>

@dethi
Contributor Author

dethi commented Apr 8, 2023

I think I found the origin of the leak using kubectl debug and netstat: the webhook TLS healthcheck. Reading the code, it does seem that we never read or close the response body.

I'm going to validate this by first disabling the check (--enable-tls-healthcheck=false), and then submitting a patch.

dethi added a commit to dethi/gatekeeper that referenced this issue Apr 8, 2023
- The resp.Body was never closed, causing one connection to be
  leaked for each execution.

- Create a new transport based on the default transport, to inherit
  some of the default timeouts. Most importantly, this ensures there
  is a default TLSHandshakeTimeout (10s) and dial timeout (30s).

- Disable HTTP keep-alive, to avoid reuse of the same HTTP connection.
  Otherwise, we may fail the healthcheck when the cert is rotated (the new
  cert on disk wouldn't match the cert attached to the reused connection
  opened earlier).

Fix open-policy-agent#2654

Signed-off-by: Thibault Deutsch <[email protected]>
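
For anyone landing here later, a minimal sketch of the three points above (not the actual patch; the URL, the CA bundle path, and the timeouts are placeholders):

// Hypothetical sketch of the pattern described in the commit message above.
package healthcheck

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

func tlsHealthCheck(url, caPath string) error {
    caPEM, err := os.ReadFile(caPath) // placeholder path to the webhook CA bundle
    if err != nil {
        return err
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        return fmt.Errorf("invalid CA bundle at %s", caPath)
    }

    // Clone the default transport so the dial and TLS handshake timeouts are inherited,
    // and disable keep-alives so every probe handshakes against the cert currently on disk.
    transport := http.DefaultTransport.(*http.Transport).Clone()
    transport.DisableKeepAlives = true
    transport.TLSClientConfig = &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS13}

    client := &http.Client{Transport: transport, Timeout: 10 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()                 // the missing close that caused the leak
    _, err = io.Copy(io.Discard, resp.Body) // drain the body so the connection is torn down cleanly
    return err
}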
@dethi
Contributor Author

dethi commented Apr 8, 2023

That was indeed the origin of the leak. Submitted a fix!

@maxsmythe
Contributor

Thank you so much! What was the breadcrumb from netstat/kubectl debug that led you to the health check code? Only remaining inbound connection?

@dethi
Contributor Author

dethi commented Apr 10, 2023

  • netstat showed a few connections that were still open from localhost to localhost, and some had a non-empty receive queue. From this I knew it was gatekeeper calling itself.
  • From the heap profile, I knew it was not just that connections were being kept open, but also that they were using TLS.
  • I also knew that the number of goroutines was growing at a stable rate, so it was something happening regularly, every few seconds.

I quickly glanced at the flags and saw --enable-tls-healthcheck: it matched all the criteria. Moreover, the default Helm install has this flag disabled; we had enabled it. From there I was already convinced, even before looking at the code.

@maxsmythe
Contributor

Nice debugging! Thanks for sharing.

sozercan pushed a commit to sozercan/gatekeeper that referenced this issue Apr 25, 2023