Checklist
- This is a bug report, not a question. Ask questions on discuss.ipfs.tech.
- I have searched on the issue tracker for my bug.
- I am running the latest kubo version or have an issue updating.
Installation method
built from source
Version
Kubo version: 0.32.1
Repo version: 16
System version: amd64/linux
Golang version: go1.23.3
Config
{
"API": {
"HTTPHeaders": {}
},
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5001",
"Announce": [],
"AppendAnnounce": [],
"Gateway": "/ip4/127.0.0.1/tcp/8080",
"NoAnnounce": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"Swarm": [
"/ip4/0.0.0.0/tcp/4001",
"/ip6/::/tcp/4001",
"/ip4/0.0.0.0/udp/4001/quic-v1",
"/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
"/ip4/0.0.0.0/udp/4001/webrtc-direct",
"/ip6/::/udp/4001/quic-v1",
"/ip6/::/udp/4001/quic-v1/webtransport",
"/ip6/::/udp/4001/webrtc-direct"
]
},
"AutoNAT": {},
"AutoTLS": {},
"Bootstrap": [
"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
"/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
],
"DNS": {
"Resolvers": {}
},
"Datastore": {
"BloomFilterSize": 0,
"GCPeriod": "1h",
"HashOnRead": false,
"Spec": {
"mounts": [
{
"child": {
"path": "blocks",
"shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
"sync": false,
"type": "flatfs"
},
"mountpoint": "/blocks",
"prefix": "flatfs.datastore",
"type": "measure"
},
{
"child": {
"compression": "none",
"path": "datastore",
"type": "levelds"
},
"mountpoint": "/",
"prefix": "leveldb.datastore",
"type": "measure"
}
],
"type": "mount"
},
"StorageGCWatermark": 90,
"StorageMax": "31TB"
},
"Discovery": {
"MDNS": {
"Enabled": false
}
},
"Experimental": {
"FilestoreEnabled": false,
"GraphsyncEnabled": false,
"Libp2pStreamMounting": false,
"OptimisticProvide": false,
"OptimisticProvideJobsPoolSize": 0,
"P2pHttpProxy": false,
"StrategicProviding": false,
"UrlstoreEnabled": false
},
"Gateway": {
"APICommands": [],
"DeserializedResponses": null,
"DisableHTMLErrors": null,
"ExposeRoutingAPI": null,
"HTTPHeaders": {},
"NoDNSLink": false,
"NoFetch": false,
"PathPrefixes": [],
"PublicGateways": {
"cfg": {
"DeserializedResponses": null,
"InlineDNSLink": null,
"NoDNSLink": false,
"Paths": null,
"UseSubdomains": false
},
"localhost": {
"DeserializedResponses": null,
"InlineDNSLink": null,
"NoDNSLink": false,
"Paths": [
"/ipfs"
],
"UseSubdomains": false
}
},
"RootRedirect": ""
},
"Identity": {
"PeerID": "12D3KooWAJJJwXsB5b68cbq69KpXiKqQAgTKssg76heHkg6mo2qB"
},
"Import": {
"CidVersion": null,
"HashFunction": null,
"UnixFSChunker": null,
"UnixFSRawLeaves": null
},
"Internal": {
"Bitswap": {
"EngineBlockstoreWorkerCount": 128,
"EngineTaskWorkerCount": 8,
"MaxOutstandingBytesPerPeer": null,
"ProviderSearchDelay": null,
"TaskWorkerCount": 8
}
},
"Ipns": {
"RecordLifetime": "",
"RepublishPeriod": "",
"ResolveCacheSize": 128
},
"Migration": {
"DownloadSources": [],
"Keep": ""
},
"Mounts": {
"FuseAllowOther": false,
"IPFS": "/ipfs",
"IPNS": "/ipns"
},
"Peering": {
"Peers": "removed for brevity"
},
"Pinning": {},
"Plugins": {
"Plugins": null
},
"Provider": {
"Strategy": ""
},
"Pubsub": {
"DisableSigning": false,
"Router": ""
},
"Reprovider": {
"Interval": "24h0m0s",
"Strategy": "all"
},
"Routing": {
"AcceleratedDHTClient": true,
"Methods": null,
"Routers": null,
"Type": "auto"
},
"Swarm": {
"AddrFilters": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"ConnMgr": {
"GracePeriod": "30s",
"HighWater": 4096,
"LowWater": 1024
},
"DisableBandwidthMetrics": false,
"DisableNatPortMap": true,
"RelayClient": {
"Enabled": false
},
"RelayService": {
"Enabled": false
},
"ResourceMgr": {
"Limits": {},
"MaxMemory": "24GB"
},
"Transports": {
"Multiplexers": {},
"Network": {},
"Security": {}
}
},
"Version": {}
}
Description
Hello,
We started to experience an issue with Kubo in our 2-node cluster where Kubo no longer lists pins.
We have 2 nodes that both pin the entire pinset we keep track of, which is around 16.39 million pins right now.
In recent weeks (while we were still running 0.29), Kubo stopped responding to the /pin/ls queries sent by the cluster; those requests hang "indefinitely" (when testing with curl, I gave up after ~16h without a response). Our ipfs-cluster process logs the following when this happens:
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z ERROR pintracker stateless/stateless.go:540 Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmcPpbdw8k8fyhas77WjzPMqSAgVYPonbVve3xTPtxL8ab&type=recursive": context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z ERROR cluster ipfs-cluster/cluster.go:2022 12D3KooTheOtherClusterNodeHash: error in broadcast response from 12D3KooTheOtherClusterNodeHash: context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.447Z ERROR ipfshttp ipfshttp/ipfshttp.go:1276 error posting to IPFS:Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmRSGNoqrbTsKCgHbi8xCeQEn4sQXFDjfNEUgXWG5wAg6U&type=recursive": context canceled
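For what it's worth, the hang is reproducible outside of ipfs-cluster with a plain HTTP request to the RPC API. Below is a minimal Go sketch of such a probe; the endpoint, query parameters, and example CID are copied from the log lines above, and the 30-second timeout is an arbitrary choice:

```go
// pinls_probe.go: minimal probe for the Kubo RPC /api/v0/pin/ls endpoint.
// A healthy daemon answers this in well under a second; in the failed state
// the request never returns and the context deadline fires instead.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// URL and parameters taken from the ipfs-cluster log above.
	url := "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&type=recursive" +
		"&arg=QmcPpbdw8k8fyhas77WjzPMqSAgVYPonbVve3xTPtxL8ab"

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// In the failed state this prints a context deadline error,
		// matching the hangs that ipfs-cluster reports.
		fmt.Fprintln(os.Stderr, "pin/ls did not answer:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%s body=%s\n", resp.Status, body)
}
```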
This started out of the blue; nothing changed on the server. The issue persisted after upgrading to 0.32.1.
At that time we had the bloom filter enabled. Disabling it improved the situation for a while (maybe 24h), but then the issue showed up again; in retrospect, I think it may not be related to the bloom filter at all.
These are the typical metrics reported by ipfs-cluster, which show when Kubo stops responding to /pin/ls:

The graph on top is the number of pins the cluster is keeping track of, and the one on the bottom is the number of pins reported by Kubo. When Kubo is restarted, the count generally jumps back to the expected amount, and after a while it drops to 0. At that point any attempt to list pins from Kubo fails.
We only have the metrics reported by ipfs-cluster because of this Kubo bug.
The server's CPU, RAM, and disk utilization are fairly low when this issue shows up, so it doesn't look like a performance issue. The only metric that goes out of bounds is the number of open file descriptors, which keeps growing until it reaches the 128k limit we had set. I bumped it to 1.28 million, but it still reaches that (with or without the bloom filter).
The FD limit is set both at the systemd unit level and via IPFS_FD_MAX.
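To rule out a limit that was configured but never applied, a small check like the sketch below (assuming a Linux host with procfs; the daemon's PID is passed as an argument) prints the limit the running process actually ended up with and how many descriptors it currently holds:

```go
// fdcheck.go: print the "Max open files" limit of a running process and the
// number of file descriptors it currently has open, using procfs.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcheck <kubo-pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// Effective soft/hard limit as seen by the process itself.
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "Max open files") {
			fmt.Println(sc.Text())
		}
	}

	// Descriptors currently open by that process.
	fds, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("open fds: %d\n", len(fds))
}
```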
Restarting Kubo makes it work again most of the time, but sometimes it doesn't change anything and the daemon instantly starts failing again.
Here is some profiling data from one of our nodes:
- when it has just started and still responds to pin listing: https://drive.google.com/file/d/1-HagskynT6D-WoZ_JxIvHiu8kZGS1EUd/view?usp=sharing
- when it is in the failed state: https://drive.google.com/file/d/1fexP-ZBLxkvcCVtZTHPcIUNalzJsfemh/view?usp=sharing
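If more data would help, I can also grab goroutine dumps while the daemon is in the failed state, for example with the sketch below (it assumes the standard Go pprof handlers are reachable on the RPC port under /debug/pprof/; adjust the address if that doesn't match):

```go
// goroutine_dump.go: fetch a full goroutine dump from the daemon's pprof
// handler and write it to a local file for later inspection.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full, human-readable stack of every goroutine.
	resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote goroutines.txt")
}
```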
More info about the system:
- NixOS, with Nixpkgs currently at the PR that updated Kubo to 0.32.1
- AMD Ryzen 5 Pro 3600 - 6c/12t - 3.6 GHz/4.2 GHz
- 128 GB ECC 2666 MHz
- 2×512 GB NVMe SSD
  - one holds the system
  - the other is split and used as log and cache devices for ZFS
- one ZFS ZRAID-0 pool with 4×6 TB SATA HDDs
Kubo also emits a lot of:
INFO net/identify identify/id.go:457 failed negotiate identify protocol with peer {"peer": "12D3KooWMyK8arvRjtC33rxTRZfKDcyQgZTC9yWpfMFHRbpngXwK", "error": "Application error 0x1 (remote): conn-31160279: system: cannot reserve inbound connection: resource limit exceeded"}
But ipfs swarm resources doesn't report anything above 5-15% usage, so I think this error is actually on the remote node's side and not related to our issue, right?
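In case it is useful, I could also leave a watcher like the sketch below running; it simply shells out to the ipfs swarm resources command mentioned above once a minute and appends the output to a log, so we can see what the resource manager reported right around the moment pin/ls stops answering:

```go
// rcmgr_watch.go: periodically snapshot `ipfs swarm resources` output with a
// timestamp, to correlate resource-manager usage with the pin/ls failures.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	out, err := os.OpenFile("swarm-resources.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	for {
		// CombinedOutput captures stderr too, so failures of the command
		// itself also end up in the log.
		snapshot, err := exec.Command("ipfs", "swarm", "resources").CombinedOutput()
		fmt.Fprintf(out, "=== %s ===\n", time.Now().Format(time.RFC3339))
		if err != nil {
			fmt.Fprintf(out, "command error: %v\n", err)
		}
		out.Write(snapshot)
		time.Sleep(60 * time.Second)
	}
}
```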
Anything else we could gather to help solve this issue?
Right now I'm out of ideas for getting our cluster back into a working state (besides restarting Kubo every 2h, but that's not a solution since it prevents us from reproviding the pins to the rest of the network).
Edit with additional info:
- Kubo is launched without the --enable-gc flag, as prescribed by the ipfs-cluster documentation.