chore: add tag code #380

Draft: wants to merge 2 commits into base: master

Conversation

vasco-santos
Member

@vasco-santos vasco-santos commented May 21, 2025

This PR adds a retrieval code as draft to hint what kind of retrieval protocols one host may run. e.g. /ip4/.../p2p/qmfoo/retrieval/bitswap/retrieval/http as discussed with @achingbrain

This PR adds a tag code for application-level context, such as hints about which retrieval protocols a host may run, e.g. /ip4/.../p2p/qmfoo/tag/bitswap/tag/http

This PR does not intend to decide on the tag names. Whether a tag reads bitswap, /ipfs/bitswap/1.2.0, etc. is out of scope here and should be left to a separate discussion. It would be great to avoid the need for encoding, though.

This is particularly interesting in the context of ipfs/specs#504, given it lets clients fetch content in a non-interactive way, relying on provider hints and (optionally) tags for them.

This lets us prioritize dialling hosts that we know how to communicate with and avoid unnecessary dials.
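As a rough illustration of the idea, a client could separate tag hints from the dialable part of a provider address before deciding whether to dial. This is only a sketch: `tag` is the code proposed in this PR, and `VALUE_PROTOCOLS` and `split_tags` are hypothetical names that cover a tiny subset of multiaddr protocols, not the API of any real multiaddr library.

```python
# Hypothetical, incomplete table of protocols that carry a value in this sketch.
VALUE_PROTOCOLS = {"ip4", "ip6", "tcp", "udp", "dns", "dns4", "dns6", "p2p", "tag"}


def split_tags(addr):
    """Return (address without tag components, list of tag values)."""
    parts = addr.strip("/").split("/")
    base, tags = [], []
    i = 0
    while i < len(parts):
        name = parts[i]
        if name in VALUE_PROTOCOLS:
            if i + 1 >= len(parts):
                raise ValueError(f"missing value for /{name}")
            value = parts[i + 1]
            if name == "tag":
                tags.append(value)          # collect the hint
            else:
                base.extend([name, value])  # keep the dialable component
            i += 2
        else:
            base.append(name)               # value-less protocol, e.g. quic-v1
            i += 1
    return "/" + "/".join(base), tags


addr, tags = split_tags("/ip4/192.0.2.1/tcp/4001/p2p/QmFoo/tag/bitswap/tag/http")
print(addr, tags)
```

A client that does not understand tags could dial the returned base address unchanged, which is what keeps the hints optional.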

More context: https://github.com/vasco-santos/provider-hinted-uri/blob/main/EXPLORATION.md

@lidel
Member

lidel commented May 23, 2025

Mind elaborating how useful /retrieval/http and /retrieval/bitswap in multiaddrs would be in practice?

Is the idea to store this information in routing systems like cid.contact or Amino DHT as part of multiaddrs (which means we don't have to modify these systems) and couple this on the web with IPIP-0484: Opt-in Filtering in Routing V1 HTTP API to only get bitswap or http peers?

As noted in comment on your exploration document, I'm skeptical /retrieval/(bitswap|http) hint in multiaddr will save us any dials. My understanding is:

  • If a client sees /tls/http or /https in multiaddr, it already has hint to use "http" and attempt request for application/vnd.ipld.raw
    • Q: why do we need an extra notation for this if /http is already in the multiaddr?
  • If a client wants to bitswap, I will connect to a peer and do libp2p identify anyway, to (1) confirm the peer supports bitswap and (2) know which version of bitswap to use
    • Q: why do we need an extra hint? Is this to support a future where we have other protocols and want to avoid libp2p identify? If so, why are we inventing a separate dictionary that loses version information? Could this reuse values from libp2p identify? Should /retrieval/bitswap be changed to /identify/%2Fipfs%2Fbitswap%2F1.2.0?

@vasco-santos
Member Author

Thanks for raising these questions @lidel . I will keep the discussion here as well for https://github.com/vasco-santos/provider-hinted-uri/pull/1/files#r2105422391 to make it easier. This is not as critical as supporting a multiaddr as a provider, but in my opinion there is value in this optional add on. I will try to make it clear in this answer.

Let's start with the broader improvement this intends to bring:

I think /retrieval/ in multiaddr won't avoid extra probing / lookups

The main goal of having clients understanding a provider query parameter acting as a Hint is to make content discovery more efficient and resilient:

  • client MAY immediately try to fetch content from providers explicitly in the URI, while in parallel do discovery, or even be configurable. The key thing here is that this approach adds a fast lane to fetch content without indirection latency and hops for discovery.
  • moreover, if discovery systems are having issues, or are slow to get updated, this can actually add a fault tolerance layer on top.

Building on this direction, where we create a fast lane for clients to retrieve trustless content, we MAY end up in situations where a provider has the content but serves it over a protocol the client does not speak. Say a provider only talks Graphsync: a client that cannot talk Graphsync gains nothing from the fast lane, which now costs extra requests and latency and becomes a slow lane. Moreover, the client may have a protocol preference, and if there are multiple providers with different protocols, knowing the protocols up front makes it easy to prioritize right away. With that in mind, I think having a way to express the retrieval protocol can definitely avoid extra probing/lookups.
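To make the prioritization point concrete, here is a minimal sketch of how a client might order providers by its own protocol preference. The function name `rank_providers`, the tag values, and the policy for untagged providers (kept as a last resort, since tags are optional) are assumptions for illustration, not something this PR specifies.

```python
def rank_providers(providers, preference):
    """providers: list of (address, tags); preference: protocols, most preferred first."""
    rank = {proto: i for i, proto in enumerate(preference)}
    candidates = []
    for addr, tags in providers:
        if not tags:
            # No hints at all: keep as a fallback after every hinted provider.
            candidates.append((len(preference), addr))
            continue
        known = [rank[t] for t in tags if t in rank]
        if known:
            candidates.append((min(known), addr))
        # Tagged only with protocols we do not speak: skip the dial entirely.
    return [addr for _, addr in sorted(candidates, key=lambda c: c[0])]


providers = [
    ("/dns4/a.example/tcp/443/https", ["http"]),
    ("/ip4/192.0.2.1/tcp/4001", ["graphsync"]),
    ("/ip4/192.0.2.2/tcp/4001", []),
    ("/ip4/192.0.2.3/tcp/4001", ["bitswap"]),
]
print(rank_providers(providers, ["bitswap", "http"]))
```

In this example the Graphsync-only provider is never dialled, which is the "avoid unnecessary dials" case described above.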

I totally agree that it is helpful to be more granular than just bitswap, as you mention @lidel, so that we can know exactly whether a protocol the client knows how to use is present. I think this comes from the table https://github.com/vasco-santos/provider-hinted-uri/blob/main/EXPLORATION.md?plain=1#L79 , which, as its title suggests, is an "Informative" code, more of an example at this stage. I am still not clear on what exactly the best structure is at this point. That is the reason why this document and general design is more of an Exploration than a Spec for now, as we are trying to figure out the best balance between ergonomics, usage, and general improvements on the client.

If a client sees /tls/http or /https in multiaddr, it already has hint to use "http" and attempt request for application/vnd.ipld.raw
Q: why do we need an extra notation for this if /http is already in the multiaddr?

This is a fair question. I know this is the current behaviour of client and that there are probings in the spec that allow the client to make sure the host behind such multiaddr can really serve this content. However, we are again facing a pattern where extra requests are required. If we try to create a routing system encoded in the URI that reduces indirection layers and hops to the bare minimum, we are trying to make this not only be for HTTP, but also for other protocols like the ones today used via libp2p. Given we try to accommodate the more general case, does it make sense that we try to find an encoding pattern that we can use across the board?

If a client wants to bitswap, I will connect to a peer and do libp2p identify anyway, to (1) confirm peer supports bitswap (2) know which version of bitswap to be used
Q: why do we need an extra hint? Is this to support a future where we have other protocols and want to avoid libp2p identify? If so, why are we inventing a separate dictionary that loses version information? Could this reuse values from libp2p identify? Should /retrieval/bitswap be changed to /identify/%2Fipfs%2Fbitswap%2F1.2.0?

I thought about libp2p identify and using that name, but we already have protocols that live outside of libp2p (HTTP, as described above), so I assume a separate dictionary that is a superset of libp2p is critical? That feels more ergonomic to me than having two different ways of expressing this. So retrieval, or whatever name we pick, should not be strictly bound to libp2p. Happy to rename retrieval if there are better suggestions though! Today libp2p MAY need to do identify anyway, but in the future it could potentially skip it if most interactions are short-lived: a single request, then the connection is closed. I totally think libp2p identify is great for use cases where peers keep long-lived connections, but here it just adds latency to fetching some bytes from a target host.

Also to be extra clear, we CAN and probably SHOULD encode the version, the exploration table is a simple example.

Is the idea to store this information in routing systems like cid.contact or Amino DHT as part of multiaddrs (which means we don't have to modify these systems) and couple this on the web with IPIP-0484: Opt-in Filtering in Routing V1 HTTP API to only get bitswap or http peers?

For now the general idea with this exploration work is to give content publishers the power to encode this in URIs, so that smart clients like helia-verified-fetch can rely on these hints to fetch data more efficiently, reducing discovery load, etc. But this question is actually a hint of how much more there is to explore in this direction :) Not only could this also be published in discovery systems like IPNI and the DHT, but Gateway servers may also be able to rely on it to avoid discovery when a client does not know how to use these hints and just delegates discovery to a Gateway.

Not sure what /retrieval/http mean. What is Trustless IPFS Gateway v1? I assume you mean some subset of https://specs.ipfs.tech/http-gateways/trustless-gateway/ ?

I mean just the trustless gateway spec in its current shape, but opening the door to improvements. Though, I am still trying to figure out what would fit better here, and maybe even separating by CAR and BLOCK support could actually be a good call?

@lidel can you see more merit in being able to encode protocols that some host accepts to retrieve some content ?

@lidel
Member

lidel commented May 26, 2025

Thank you for clarification. Yes, I do see merit in removing need for libp2p identify and/or extra HTTP probing.

Some fresh thoughts below + would like to hear from @rvagg and @aschmahmann as my perspective may also be too biased towards positives (ad-hoc webseeds, being able to publish protocol hints that allow clients to skip libp2p identify roundtrips).

On making this /tag not limited to retrieval

I thought about libp2p identify and using that name, but we already have protocols that go outside of libp2p route (HTTP as described above), so I assume having a separate dictionary that is a superset of libp2p is critical?

Hm.. I understand why you did not want to narrow down to libp2p protocols, however it is still "narrowed down" due to the name "retrieval".

Could this be /tag/foo instead? It would create a general-purpose surface across entire stack for passing protocol or app-specific hints as multiaddrs without the need for libp2p identify OR modifying existing software (in Amino DHT unknown multiaddrs are ignored, but still gossiped).

👉 Why I believe this is worth exploring

I'm looking at this as a backward-compatible way of passing arbitrary hints via existing routing systems (with focus on Amino DHT).

If we are going to "abuse" multiaddrs for passing hints, I feel going with /tag would make this more palatable by being useful beyond the narrow use case of "retrieval".

/tag would enable publishing "protocol hints" that allow clients to skip libp2p identify in many cases, and perform smarter Peer Selection based on "Features" and "Protocols" – something @guillaumemichel mentioned in his talk (Enabling More Applications to Join the libp2p DHT Ecosystem).

Because this works with existing DHT servers, we can start leveraging it on clients without doing any extra dev to add a new field to DHT messages and without waiting for existing IPNI and DHT servers to update and support it.

On ability to pass protocol versions

One realization I had when thinking about this, is that we already seem to have TWO dictionaries for signaling retrieval protocols. The delegated routing uses different strings than libp2p identify protocol strings or HTTP content-types:

  1. libp2p already signals bitswap as /ipfs/bitswap/1.2.0 and HTTP as /http/1.1 over libp2p (libp2p spec), and there is a spec for signaling a trustless gateway over libp2p as /http/1.1/ipfs/gateway (ipfs spec).

  2. At a similar time, IPNI introduced "Add a transport code to indicate content is provided via Trustless HTTP Gateway #321" for use in the delegated HTTP /routing/v1 API, and today cid.contact peer results contain Protocols: ["transport-ipfs-gateway-http"] or Protocols: ["transport-bitswap"] to signal trustless HTTP and bitswap support.

You can see the latter one is "lossy": libp2p identify returns specific bitswap version like /ipfs/bitswap/1.2.0, however delegated http routing turns that into a generic transport-bitswap, losing the version information. Similar for HTTP: there is no hint if CAR is supported, or only Blocks.
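The lossiness can be shown with a small sketch. The `TO_TRANSPORT` table below is an illustrative assumption about how identify protocol IDs might collapse into the delegated-routing strings quoted above; it is not taken from any implementation.

```python
# Illustrative mapping only: identify protocol IDs -> delegated routing strings.
TO_TRANSPORT = {
    "/ipfs/bitswap/1.1.0": "transport-bitswap",
    "/ipfs/bitswap/1.2.0": "transport-bitswap",
    "/http/1.1/ipfs/gateway": "transport-ipfs-gateway-http",
}


def to_transport(protocol_ids):
    """Collapse identify protocol IDs into the coarser transport strings."""
    return sorted({TO_TRANSPORT[p] for p in protocol_ids if p in TO_TRANSPORT})


# Two peers with different bitswap versions become indistinguishable.
print(to_transport(["/ipfs/bitswap/1.1.0"]))
print(to_transport(["/ipfs/bitswap/1.2.0"]))
```

Because the mapping is many-to-one, a client reading only the transport string cannot recover which bitswap version the peer speaks.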

It would be nice if we avoided adding a THIRD "protocol" dictionary here, and instead come up with convention for exposing libp2p identify information as /tag segments.

This would enable us to start cleaning up this mess: if we had protocol hints returned as an extra multiaddr, it could effectively mean sun-setting Protocols list in delegated routing responses, and lean on /tag instead. That would drop the number of "dictionaries" we need to maintain back to two, and solve the problem of ecosystem being under impression that introducing new retrieval protocol is gate-kept by table.csv.

Also, if we had clients announce their libp2p identify info with /tag, other new clients could read it and skip one roundtrip – a "fast track" even when ?provider is not present.

On interop with Amino DHT

Not only this could also be published in discovery systems like IPNI and DHT,

Amino DHT interop sounds sensible, as long as we are careful and deduplicate: the average Amino DHT peer that has both ip4 and ip6, tcp, quic, webrtc, webtransport, websockets, has > 10 addrs. Storing /tag/bitswap appended to each of them may be wasteful.

I imagine "global" hints would be stored as a single top level multiaddr /tag/bitswap, and not appended to each result at rest.
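A back-of-the-envelope sketch of the difference, counting characters of the string form rather than the binary multiaddr encoding (so the numbers are only indicative). The helper names and the ten sample addresses are made up for illustration.

```python
def tag_suffix(tags):
    return "".join(f"/tag/{t}" for t in tags)


def appended(addrs, tags):
    """Every address carries its own copy of the tags."""
    return [a + tag_suffix(tags) for a in addrs]


def top_level(addrs, tags):
    """One extra, non-routable address carries the tags once."""
    return [tag_suffix(tags)] + addrs


def overhead(before, after):
    """Extra characters added by the tagging scheme."""
    return sum(len(a) for a in after) - sum(len(a) for a in before)


addrs = [f"/ip4/192.0.2.{i}/tcp/4001" for i in range(10)]
print(overhead(addrs, appended(addrs, ["bitswap"])))   # grows with the number of addrs
print(overhead(addrs, top_level(addrs, ["bitswap"])))  # paid once per peer
```

With ten addresses the appended form repeats the same suffix ten times, while the top-level form pays for it once.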

On separating Block and CAR

separating [HTTP retrieval] by CAR and BLOCK support could actually be a good call?

Yes, those are effectively different retrieval protocols.
Only Block is codec-agnostic and mandatory (MUST).
CAR is suggested, but optional (SHOULD) due to complexity of server having to understand how to walk the DAG (esp. for IPIP-402).

Signaling support for them separately would allow client to skip a CAR probe.
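A sketch of how separate signals could let a client pick the request format without probing. The tag names `gateway-block` and `gateway-car` are hypothetical placeholders (no names have been agreed in this thread); the two content types are the ones used by the trustless gateway spec.

```python
RAW = "application/vnd.ipld.raw"  # single block, mandatory for trustless gateways
CAR = "application/vnd.ipld.car"  # CAR stream, optional


def pick_accept(tags, want_dag):
    """Choose an Accept header from provider tags without a CAR probe."""
    if want_dag and "gateway-car" in tags:
        return CAR
    # Block support is the mandatory baseline, so fall back to raw blocks.
    return RAW


print(pick_accept({"gateway-block", "gateway-car"}, want_dag=True))
print(pick_accept({"gateway-block"}, want_dag=True))
```

When the CAR tag is absent the client walks the DAG itself with block requests instead of first trying a CAR request that may fail.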

@achingbrain
Member

On interop with Amino DHT

"global" hints would be stored as a single top level multiaddr /tag/bitswap

If I understand correctly we'd have PeerInfo objects like:

{
  "id": "123Foo",
  "multiaddrs": [
    "/tag/bitswap-1.2.0/tag/ipfs-http-gateway",
    "/ip4/123.123.123.123/tcp/1234",
    "/ip4/123.123.123.123/udp/12345/quic-v1"
  ]
}

The first address is not routable and needs special handling. Why not add a new field instead? It has been requested previously.

{
  "id": "123Foo",
  "multiaddrs": [
    "/ip4/123.123.123.123/tcp/1234",
    "/ip4/123.123.123.123/udp/12345/quic-v1"
  ],
  "protocols": [
    "/ipfs/bitswap/1.2.0",
    "/http/1.1/ipfs/gateway"
  ]
}

Protocols are already stored in the peer store so there would be no additional storage at rest, though DHT messages would get bigger.
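The "special handling" on the client could look roughly like the sketch below, which turns the first shape into the second by pulling out any address made only of /tag components. The function name `normalize` is hypothetical, and the tag values are passed through as-is; mapping them back to identify protocol IDs would need a convention that this thread has not settled.

```python
def normalize(peer):
    """Move tag-only multiaddrs out of the address list into a protocols field."""
    routable, protocols = [], []
    for addr in peer["multiaddrs"]:
        parts = addr.strip("/").split("/")
        # Tag-only address: name/value pairs where every name is "tag".
        tag_only = len(parts) % 2 == 0 and all(p == "tag" for p in parts[0::2])
        if tag_only:
            protocols.extend(parts[1::2])
        else:
            routable.append(addr)
    return {"id": peer["id"], "multiaddrs": routable, "protocols": protocols}


peer = {
    "id": "123Foo",
    "multiaddrs": [
        "/tag/bitswap-1.2.0/tag/ipfs-http-gateway",
        "/ip4/123.123.123.123/tcp/1234",
        "/ip4/123.123.123.123/udp/12345/quic-v1",
    ],
}
print(normalize(peer))
```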

@lidel
Member

lidel commented May 29, 2025

The first address is not routable and needs special handling. Why not add a new [DHT message] field instead?

My understanding is BOTH require special handling, but /tag requires it only on the client.

IIRC if you add a new "protocol" field to DHT messages, it will be ignored by all existing DHT servers until they update to software that can understand it, read it, and persist the value in the peerstore.

This means waiting 6-12-24 months until enough DHT servers update and are able to persist and gossip this new "protocols" field. Note that Amino DHT server update rate is worse than clients (hit ipfs stats dht and see % of gala.games nodes acting as DHT servers still running Kubo 0.22 from 2023).

To simplify conversation here, the choice IPFS ecosystem has is:

  • (A) "long way, low-level": update client and server code to add protocol/metadata field to DHT messages and wait 6-12-24 months before we can use it reliably on Amino DHT. This does not solve the problem on delegated routing endpoints – we still need to update spec and implement passing protocol info in multiple places (someguy, IPNI at cid.contact). As a data point, cleaning up Reframe and /routing/v1 and getting everyone to agree on spec and align implementations across GO/JS and multiple vendors took at least a year(s). This will be similar effort.
  • (B) "short way, userland": update client-only to start announcing an extra top-level /tag multiaddr with a list of libp2p identify protocols. This works everywhere with no or minimal dev (DHT, IPNI at cid.contact, delegated routing).

That is to say, if we are not going to pivot this PR towards /tag with the intention to do (B), then the utility of /retrieval is limited to ?provider URL hints from IPIP-504: provider query parameter as hint for HTTP Gateways, and does not benefit the wider routing system.

cc @guillaumemichel @aschmahmann @robin for visibility

@aschmahmann
Contributor

Aside/Process: IMO this discussion should really have happened in the multiaddr or libp2p/specs repo rather than here in the multicodec repo. Ideally the multicodec repo is just for handling registration which would occur AFTER there's more consensus / alignment between the people who would use the code. That will hopefully allow the registration process to be a little less opinionated and instead objective with questions like:

  • Is anyone using or planning to use this code, is it speculative, is it just the author hacking on something, ....?
  • Is this the right byte range for it?
  • Does it fit an existing category vs not?
  • Does approving the request increase the likelihood of many seemingly extraneous new code requests?

@aschmahmann
Contributor

Overall this proposal seems to me like one that should be handled at the multiaddr and/or libp2p layer with feedback from those users. Fundamentally it's about putting more data into a multiaddr to save on round-trips and as a result you're likely to run into a bunch of the associated paper cuts from multiaddr being pretty neglected over the last several years.

Some common examples around saving round-trips include:

A couple of the nearby problems it also touches are:

  • "How do I encode / in multiaddr components" since currently almost all libp2p protocols have names like /ipfs/bitswap/1.2.0. It also might not be a big problem if these are terminal anyway, but it's something to watch out for
  • How do we avoid sending / storing lots of data by exponentially growing our multiaddr set?
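One option raised earlier in the thread for the first problem is percent-encoding the protocol ID inside the component, as in /identify/%2Fipfs%2Fbitswap%2F1.2.0. The sketch below shows that option with the Python standard library; `encode_tag` and `decode_tag` are hypothetical helpers, and whether tag values should be encoded at all is still undecided here.

```python
from urllib.parse import quote, unquote


def encode_tag(value):
    """Build a /tag component whose value contains no raw '/'."""
    return "/tag/" + quote(value, safe="")


def decode_tag(encoded_value):
    """Recover the original value from the encoded component value."""
    return unquote(encoded_value)


component = encode_tag("/ipfs/bitswap/1.2.0")
print(component)
print(decode_tag(component.split("/")[2]))
```

The encoded value keeps the version information but is harder to read, which is part of the trade-off being discussed.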

There's already been some questionable yolo-ing here in the last couple of years, such as with http-path (ignoring long-lived issues like multiformats/multiaddr#87, multiformats/multiaddr#55, and multiformats/multiformats#55 in exchange for just hacking http-path to work) and of course the invention of the transport category (ignoring the long-lived issues like those associated with the multistream-select 2.0/protocol select, in exchange for hacking a few protocol IDs that cid.contact wanted). If people really want to keep doing it and the other multicodec repo maintainers feel like it's a good idea I suppose it's doable.... but maybe after running into these paper cuts again it's worth seeing if pushing for progress on the underlying issues is worthwhile.

FWIW I'm not trying to bring up the thorniness (and ancientness) of some of the issues above to scare anyone off, it's more of an indication that if this is a multiaddr related spec then it could be a good idea to try moving a little bit forwards towards solving these many years ignored issues.

@guillaumemichel
Contributor

/tag would enable publishing "protocol hints" that allow clients to skip libp2p identify in many cases, and perform smarter Peer Selection based on "Features" and "Protocols" – something @guillaumemichel mentioned in his talk (Enabling More Applications to Join the libp2p DHT Ecosystem).

If we are to change DHT implementations to be upgradable/composable, we need to ship more changes anyway, so I wouldn't count a potential future Composable DHT as beneficiary of the /tag feature.


While I understand that a /tag feature could spare some time/RTTs during retrieval, I would like the potential users (both publishers and retrievers) to be clearly identified to understand better the tradeoffs. We don't want to increase the record size or transmit more bytes on the wire if only a fraction of the users can benefit from it.

To simplify conversation here, the choice IPFS ecosystem has is:

  • (A) "long way, low-level": update client and server code to add protocol/metadata field to DHT messages and wait 6-12-24 months before we can use it reliably on Amino DHT. This does not solve the problem on delegated routing endpoints – we still need to update spec and implement passing protocol info in multiple places (someguy, IPNI at cid.contact). As a data point, cleaning up Reframe and /routing/v1 and getting everyone to agree on spec and align implementations across GO/JS and multiple vendors took at least a year(s). This will be similar effort.
  • (B) "short way, userland": update client-only to start announcing an extra top-level /tag multiaddr with a list of libp2p identify protocols. This works everywhere with no or minimal dev (DHT, IPNI at cid.contact, delegated routing).

I would lean more toward (A), because clients (retrievers) could signal to the server which extra information/tags they are expecting, without forcing extra tags upon all retrievers. The adoption delay is currently long because content routing systems have been neglected over the last years. Investing into improving the content routing systems could make such protocol upgrades quick to ship.

@darobin

darobin commented Jun 2, 2025

solve the problem of ecosystem being under impression that introducing new retrieval protocol is gate-kept by table.csv.

I like /tag but… don't we need to then maintain a registry of tags that reintroduces the problem? Or am I missing something?

PS: note that @robin is someone else entirely!

@vasco-santos vasco-santos changed the title chore: add retrieval draft code chore: add label code Jun 5, 2025
@vasco-santos vasco-santos force-pushed the chore/add-retrieval-draft-code branch from 022ebfc to 3a6092a Compare June 5, 2025 13:34
@vasco-santos vasco-santos changed the title chore: add label code chore: add tag code Jun 5, 2025
@vasco-santos
Member Author

vasco-santos commented Jun 5, 2025

Thanks for all the feedback. As a follow-up, I renamed retrieval to tag, which I agree makes everything much nicer for application-level input, without this table acting as a gatekeeper.

To make it clear, the main intention is usage in smart clients like @helia/verified-fetch, so that these hints can be used for fetching content in a non-interactive way, while still being able to fall back to third-party discovery options. However, the direction of this conversation is a clear signal that this has value well outside of that scope, and could potentially help other systems, including discovery systems themselves. I am seeing this as an experiment for now: trying it out with @helia/verified-fetch, gathering feedback, and eventually iterating on it. The conversation about having something that works not only in clients but also in all the other places can of course happen, but I would wish that it does not block the possibility of experimenting while we flesh this out.

On ability to pass protocol versions

@lidel this is totally right, and something I did not explicitly add because I did not feel this was the place to agree on what the tag/retrieval names would be. But yes, I totally agree we MUST include versions, and it would be great to avoid keeping more dictionaries if possible.

a registry of tags

@darobin yes, but that can be an "application"-level problem. For instance, a tag can be a fully application-level param for some hash, or a protocol hint that is part of a spec for smart content-addressable clients. In other words, one can parse the multiaddr, get all the tags, and pick out the ones it cares about.


6 participants