Checklist
- This is a bug report, not a question. Ask questions on discuss.ipfs.tech.
- I have searched on the issue tracker for my bug.
- I am running the latest kubo version or have an issue updating.
Installation method
built from source
Version
Kubo version: 0.32.1
Repo version: 16
System version: amd64/linux
Golang version: go1.23.3
Config
{
"API": {
"HTTPHeaders": {}
},
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5001",
"Announce": [],
"AppendAnnounce": [],
"Gateway": "/ip4/127.0.0.1/tcp/8080",
"NoAnnounce": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"Swarm": [
"/ip4/0.0.0.0/tcp/4001",
"/ip6/::/tcp/4001",
"/ip4/0.0.0.0/udp/4001/quic-v1",
"/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
"/ip4/0.0.0.0/udp/4001/webrtc-direct",
"/ip6/::/udp/4001/quic-v1",
"/ip6/::/udp/4001/quic-v1/webtransport",
"/ip6/::/udp/4001/webrtc-direct"
]
},
"AutoNAT": {},
"AutoTLS": {},
"Bootstrap": [
"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
"/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
],
"DNS": {
"Resolvers": {}
},
"Datastore": {
"BloomFilterSize": 0,
"GCPeriod": "1h",
"HashOnRead": false,
"Spec": {
"mounts": [
{
"child": {
"path": "blocks",
"shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
"sync": false,
"type": "flatfs"
},
"mountpoint": "/blocks",
"prefix": "flatfs.datastore",
"type": "measure"
},
{
"child": {
"compression": "none",
"path": "datastore",
"type": "levelds"
},
"mountpoint": "/",
"prefix": "leveldb.datastore",
"type": "measure"
}
],
"type": "mount"
},
"StorageGCWatermark": 90,
"StorageMax": "31TB"
},
"Discovery": {
"MDNS": {
"Enabled": false
}
},
"Experimental": {
"FilestoreEnabled": false,
"GraphsyncEnabled": false,
"Libp2pStreamMounting": false,
"OptimisticProvide": false,
"OptimisticProvideJobsPoolSize": 0,
"P2pHttpProxy": false,
"StrategicProviding": false,
"UrlstoreEnabled": false
},
"Gateway": {
"APICommands": [],
"DeserializedResponses": null,
"DisableHTMLErrors": null,
"ExposeRoutingAPI": null,
"HTTPHeaders": {},
"NoDNSLink": false,
"NoFetch": false,
"PathPrefixes": [],
"PublicGateways": {
"cfg": {
"DeserializedResponses": null,
"InlineDNSLink": null,
"NoDNSLink": false,
"Paths": null,
"UseSubdomains": false
},
"localhost": {
"DeserializedResponses": null,
"InlineDNSLink": null,
"NoDNSLink": false,
"Paths": [
"/ipfs"
],
"UseSubdomains": false
}
},
"RootRedirect": ""
},
"Identity": {
"PeerID": "12D3KooWAJJJwXsB5b68cbq69KpXiKqQAgTKssg76heHkg6mo2qB"
},
"Import": {
"CidVersion": null,
"HashFunction": null,
"UnixFSChunker": null,
"UnixFSRawLeaves": null
},
"Internal": {
"Bitswap": {
"EngineBlockstoreWorkerCount": 128,
"EngineTaskWorkerCount": 8,
"MaxOutstandingBytesPerPeer": null,
"ProviderSearchDelay": null,
"TaskWorkerCount": 8
}
},
"Ipns": {
"RecordLifetime": "",
"RepublishPeriod": "",
"ResolveCacheSize": 128
},
"Migration": {
"DownloadSources": [],
"Keep": ""
},
"Mounts": {
"FuseAllowOther": false,
"IPFS": "/ipfs",
"IPNS": "/ipns"
},
"Peering": {
"Peers": "removed for brevity"
},
"Pinning": {},
"Plugins": {
"Plugins": null
},
"Provider": {
"Strategy": ""
},
"Pubsub": {
"DisableSigning": false,
"Router": ""
},
"Reprovider": {
"Interval": "24h0m0s",
"Strategy": "all"
},
"Routing": {
"AcceleratedDHTClient": true,
"Methods": null,
"Routers": null,
"Type": "auto"
},
"Swarm": {
"AddrFilters": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"ConnMgr": {
"GracePeriod": "30s",
"HighWater": 4096,
"LowWater": 1024
},
"DisableBandwidthMetrics": false,
"DisableNatPortMap": true,
"RelayClient": {
"Enabled": false
},
"RelayService": {
"Enabled": false
},
"ResourceMgr": {
"Limits": {},
"MaxMemory": "24GB"
},
"Transports": {
"Multiplexers": {},
"Network": {},
"Security": {}
}
},
"Version": {}
}
Description
Hello,
We started to experience an issue with Kubo in our 2-node cluster where Kubo no longer lists pins.
We have 2 nodes that both pin the entire pinset we keep track of, which is around 16.39 million pins right now.
In recent weeks (while we were still running 0.29), Kubo stopped responding to the /pin/ls queries sent by the cluster; those requests hang "indefinitely" (when testing with curl, I gave up after ~16h without a response). Our ipfs-cluster process logs the following when this happens:
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z ERROR pintracker stateless/stateless.go:540 Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmcPpbdw8k8fyhas77WjzPMqSAgVYPonbVve3xTPtxL8ab&type=recursive": context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z ERROR cluster ipfs-cluster/cluster.go:2022 12D3KooTheOtherClusterNodeHash: error in broadcast response from 12D3KooTheOtherClusterNodeHash: context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.447Z ERROR ipfshttp ipfshttp/ipfshttp.go:1276 error posting to IPFS:Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmRSGNoqrbTsKCgHbi8xCeQEn4sQXFDjfNEUgXWG5wAg6U&type=recursive": context canceled
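For what it's worth, the hang is reproducible outside of ipfs-cluster with a plain HTTP request to the RPC API. Below is a minimal Go sketch of such a probe; the endpoint, query parameters, and example CID are copied from the log lines above, and the 30-second timeout is an arbitrary choice:

```go
// pinls_probe.go: minimal probe for the Kubo RPC /api/v0/pin/ls endpoint.
// A healthy daemon answers this in well under a second; in the failed state
// the request never returns and the context deadline fires instead.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// URL and parameters taken from the ipfs-cluster log above.
	url := "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&type=recursive" +
		"&arg=QmcPpbdw8k8fyhas77WjzPMqSAgVYPonbVve3xTPtxL8ab"

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// In the failed state this prints a context deadline error,
		// matching the hangs that ipfs-cluster reports.
		fmt.Fprintln(os.Stderr, "pin/ls did not answer:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%s body=%s\n", resp.Status, body)
}
```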
This started out of the blue; nothing changed on the server. The issue persisted after upgrading to 0.32.1.
At that time we had the bloom filter enabled. Disabling it improved the situation for a while (maybe 24h), but then the issue showed up again; in retrospect, I think it may not be related to the bloom filter at all.
These are the typical metrics reported by ipfs-cluster, which show when Kubo stops responding to /pin/ls:

The graph on top is the number of pins the cluster is keeping track of, and the one on the bottom is the number of pins reported by Kubo. When Kubo is restarted, the count generally jumps back to the expected amount, and after a while it drops to 0. At that point any attempt to list pins from Kubo fails.
We only have the metrics reported by ipfs-cluster because of this Kubo bug.
The server's CPU, RAM, and disk utilization are fairly low when this issue shows up, so it doesn't look like a performance issue. The only metric that goes out of bounds is the number of open file descriptors, which keeps growing until it reaches the 128k limit we had set. I bumped it to 1.28 million, but it still reaches that (with or without the bloom filter).
The FD limit is set both at the systemd unit level and via IPFS_FD_MAX.
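To rule out a limit that was configured but never applied, a small check like the sketch below (assuming a Linux host with procfs; the daemon's PID is passed as an argument) prints the limit the running process actually ended up with and how many descriptors it currently holds:

```go
// fdcheck.go: print the "Max open files" limit of a running process and the
// number of file descriptors it currently has open, using procfs.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcheck <kubo-pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// Effective soft/hard limit as seen by the process itself.
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "Max open files") {
			fmt.Println(sc.Text())
		}
	}

	// Descriptors currently open by that process.
	fds, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("open fds: %d\n", len(fds))
}
```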
Restarting Kubo makes it work again most of the time, but sometimes it doesn't change anything and the daemon instantly starts failing again.
Here is some profiling data from one of our nodes:
- when it has just started and still responds to pin listing: https://drive.google.com/file/d/1-HagskynT6D-WoZ_JxIvHiu8kZGS1EUd/view?usp=sharing
- when it is in the failed state: https://drive.google.com/file/d/1fexP-ZBLxkvcCVtZTHPcIUNalzJsfemh/view?usp=sharing
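If more data would help, I can also grab goroutine dumps while the daemon is in the failed state, for example with the sketch below (it assumes the standard Go pprof handlers are reachable on the RPC port under /debug/pprof/; adjust the address if that doesn't match):

```go
// goroutine_dump.go: fetch a full goroutine dump from the daemon's pprof
// handler and write it to a local file for later inspection.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full, human-readable stack of every goroutine.
	resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote goroutines.txt")
}
```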
More info about the system:
- NixOS, with Nixpkgs currently at the PR that updated Kubo to 0.32.1
- AMD Ryzen 5 Pro 3600 - 6c/12t - 3.6 GHz/4.2 GHz
- 128 GB ECC 2666 MHz
- 2×512 GB NVMe SSD
  - one holds the system
  - the other is split and used as log and cache devices for ZFS
- one ZFS ZRAID-0 pool with 4×6 TB SATA HDDs
Kubo also emits a lot of:
INFO net/identify identify/id.go:457 failed negotiate identify protocol with peer {"peer": "12D3KooWMyK8arvRjtC33rxTRZfKDcyQgZTC9yWpfMFHRbpngXwK", "error": "Application error 0x1 (remote): conn-31160279: system: cannot reserve inbound connection: resource limit exceeded"}
But ipfs swarm resources doesn't report anything above 5-15% usage, so I think this error is actually on the remote node's side and not related to our issue, right?
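In case it is useful, I could also leave a watcher like the sketch below running; it simply shells out to the ipfs swarm resources command mentioned above once a minute and appends the output to a log, so we can see what the resource manager reported right around the moment pin/ls stops answering:

```go
// rcmgr_watch.go: periodically snapshot `ipfs swarm resources` output with a
// timestamp, to correlate resource-manager usage with the pin/ls failures.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	out, err := os.OpenFile("swarm-resources.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	for {
		// CombinedOutput captures stderr too, so failures of the command
		// itself also end up in the log.
		snapshot, err := exec.Command("ipfs", "swarm", "resources").CombinedOutput()
		fmt.Fprintf(out, "=== %s ===\n", time.Now().Format(time.RFC3339))
		if err != nil {
			fmt.Fprintf(out, "command error: %v\n", err)
		}
		out.Write(snapshot)
		time.Sleep(60 * time.Second)
	}
}
```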
Anything else we could gather to help solve this issue?
Right now I'm out of ideas for getting our cluster back into a working state (besides restarting Kubo every 2h, but that's not a solution since it prevents us from reproviding the pins to the rest of the network).
Edit with additional info:
- Kubo is launched without the --enable-gc flag, as prescribed by the ipfs-cluster documentation.