Skip to content

feat: Build cached anchor/alias node list for faster alias resolving #612

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 30, 2025

Conversation

PolyMeilex
Copy link
Contributor

Currently, to resolve an alias, we have to visit the whole document tree until the alias node is found. The lower in the file we put an alias, the slower the resolution is.

This is an attempt to speed this process significantly by:

  1. Putting all nodes in a list and iterating over it, this is significantly faster than descending the tree via visitor every time
  2. If we are putting nodes into a list we might as well filter them to only the ones that are needed for alias resolution, so only aliases and anchors are cached

Not sure if this is the right approach, but it helped me reduce the load time of my dataset, which is not very alias-heavy, from 10s to 4s.

Copy link
Owner

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this might be helpful. I'd much prefer for most of the work to be done in the Alias resolve() method, though. See inline for further comments.

Would you be able to share some YAML file with which the performance benefits of the change become measurable?

@PolyMeilex
Copy link
Contributor Author

PolyMeilex commented Mar 29, 2025

Would you be able to share some YAML file with which the performance benefits of the change become measurable?

Had to anonymize it, so it looks kinda goofy, but here is one of the worst offenders in my data set: https://pastebin.com/nPB4hLyT

(usually my files are a lot smaller than this, but with each loaded file it adds up quickly)

Copy link
Owner

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there, I think. And thank you for the sample data, I'll need to find a bit of time to test how much this improves the performance on it.

@PolyMeilex
Copy link
Contributor Author

PolyMeilex commented Mar 29, 2025

I'll need to find a bit of time to test how much this improves the performance on it.

Not a proper scientific benchmark, but with the example file this is the difference on my setup:

before

parse: 141.241ms
toJs: 1.684s

after

parse: 145.326ms
toJs: 26.266ms

Also just as a fun fact, caching all nodes without any filtering still gives a significant improvement as we skip the visitor's overhead:

parse: 143.696ms
toJs: 119.525ms

@PolyMeilex
Copy link
Contributor Author

PolyMeilex commented Mar 30, 2025

We could also build an anchors map where a key would be the anchor's name and the value would be the node with all its child aliases already resolved (or maybe already JSified). Then if a duplicated anchor name is found it would simply replace the previous one, but I'm assuming that there is some reason why that's not how it's done already, that I'm not aware of.

Copy link
Owner

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @PolyMeilex for your contribution!

With you test file, this does provide something like a 12x improvement when I test it locally, though processing it does require adjusting maxAliasCount to avoid detecting the input as a resource exhaustion attack.

We could also build an anchors map where a key would be the anchor's name [...] but I'm assuming that there is some reason why that's not how it's done already, that I'm not aware of.

That's effectively ctx.anchors, but it's not as simple as one might like. The YAML spec allows for fun structures like this:

&foo
key:
- *foo
- &foo 42
- *foo

Here, the resolution of the top-level map depends on resolving all of its contents, and that includes resolving the first *foo as a circular reference to the map itself, as well as resolving the second *foo as the scalar 42, thanks to the anchor re-use.

Getting that to work right was a bit tricky.

@eemeli eemeli merged commit 55c5ef4 into eemeli:main Mar 30, 2025
21 checks passed
@PolyMeilex
Copy link
Contributor Author

That's effectively ctx.anchors, but it's not as simple as one might like. The YAML spec allows for fun structures like this:

&foo
key:
- *foo
- &foo 42
- *foo

Here, the resolution of the top-level map depends on resolving all of its contents, and that includes resolving the first *foo as a circular reference to the map itself, as well as resolving the second *foo as the scalar 42, thanks to the anchor re-use.

Getting that to work right was a bit tricky.

Oh, yeah that's fun, would rather pay the price of wasteful iteration then try to implement this 😅

Thank you for being open to this! I will try to find more low hanging fruit so hopefully we can keep generators, while still making the package usable for my usecase 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants