feat: Build cached anchor/alias node list for faster alias resolving #612
Conversation
I can see how this might be helpful. I'd much prefer for most of the work to be done in the Alias resolve() method, though. See inline for further comments.
Would you be able to share some YAML file with which the performance benefits of the change become measurable?
Had to anonymize it, so it looks kinda goofy, but here is one of the worst offenders in my data set: https://pastebin.com/nPB4hLyT (usually my files are a lot smaller than this, but with each loaded file it adds up quickly).
Getting there, I think. And thank you for the sample data, I'll need to find a bit of time to test how much this improves the performance on it.
Not a proper scientific benchmark, but with the example file this is the difference on my setup:
Also, just as a fun fact: caching all nodes without any filtering still gives a significant improvement, since we skip the visitor's overhead.
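For illustration (my addition, not code from the PR), here is roughly what that unfiltered variant could look like using what I believe is the yaml package's public API: a single visit() pass collects every node in document order, and later lookups scan that flat array instead of re-running the visitor for each alias.

```ts
// Sketch of the "cache everything, no filtering" variant. Assumes the yaml
// package's parseDocument/visit/isNode/isAlias exports; not the PR's actual diff.
import { parseDocument, visit, isNode, isAlias } from 'yaml'
import type { Node } from 'yaml'

const src = 'base: &b { x: 1 }\nuses:\n  - *b\n  - *b\n'
const doc = parseDocument(src)

// One traversal up front; every node lands in `allNodes` in visit order.
const allNodes: Node[] = []
visit(doc, (_key, node) => {
  if (isNode(node)) allNodes.push(node)
})

// Even without filtering, finding an alias's target is now a flat array scan
// rather than another full tree visit.
const firstAlias = allNodes.find(isAlias)
const target =
  firstAlias && allNodes.find(n => !isAlias(n) && n.anchor === firstAlias.source)
console.log(firstAlias?.source, target) // 'b' and the anchored map node
```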
We could also build an anchors map where the key would be the anchor's name and the value would be the node with all its child aliases already resolved (or maybe already JSified). Then, if a duplicated anchor name is found, it would simply replace the previous one. I'm assuming there is some reason I'm not aware of why that's not how it's done already.
Thank you @PolyMeilex for your contribution!
With your test file, this does provide something like a 12x improvement when I test it locally, though processing it does require adjusting maxAliasCount to avoid detecting the input as a resource exhaustion attack.
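As a usage note (my example, not from the thread): maxAliasCount is, as far as I know, the option that guards against exponential alias expansion, so alias-heavy but legitimate files like the sample may need it raised, or set to -1 to disable the check entirely.

```ts
// Hedged example: raise (or disable) the alias-count guard so a heavily
// aliased but legitimate file is not rejected as a resource exhaustion attack.
import { readFileSync } from 'node:fs'
import { parse } from 'yaml'

const src = readFileSync('anonymized-sample.yaml', 'utf8') // placeholder path
const data = parse(src, { maxAliasCount: -1 }) // -1 removes the limit
console.log(typeof data)
```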
> We could also build an anchors map where the key would be the anchor's name [...] I'm assuming there is some reason I'm not aware of why that's not how it's done already.
That's effectively ctx.anchors, but it's not as simple as one might like. The YAML spec allows for fun structures like this:
```yaml
&foo
key:
- *foo
- &foo 42
- *foo
```
Here, the resolution of the top-level map depends on resolving all of its contents, and that includes resolving the first *foo as a circular reference to the map itself, as well as resolving the second *foo as the scalar 42, thanks to the anchor re-use.
Getting that to work right was a bit tricky.
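To make that behaviour concrete (my illustration, using what I understand to be the document API), loading the snippet above should give a plain JS object where the first alias is the map itself and the last one is the scalar:

```ts
// Illustration only: how the &foo example above should resolve. The first *foo
// is a circular reference to the top-level map; the last *foo is the scalar 42,
// because the anchor name was re-used in between.
import { parseDocument } from 'yaml'

const doc = parseDocument('&foo\nkey:\n- *foo\n- &foo 42\n- *foo\n')
const obj = doc.toJS() as { key: unknown[] }

console.log(obj.key[0] === obj) // expected: true (circular reference to the map)
console.log(obj.key[1]) // expected: 42 (the re-anchored scalar itself)
console.log(obj.key[2]) // expected: 42 (alias to the re-used &foo anchor)
```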
Oh, yeah, that's fun; I'd rather pay the price of wasteful iteration than try to implement this 😅 Thank you for being open to this! I will try to find more low-hanging fruit, so hopefully we can keep generators while still making the package usable for my use case 🚀
Currently, to resolve an alias, we have to visit the whole document tree until the alias node is found. The lower in the file an alias is placed, the slower its resolution becomes.
This is an attempt to speed that process up significantly by building a cached anchor/alias node list up front, so that resolving an alias no longer requires re-walking the document tree.
Not sure if this is the right approach, but it helped me reduce the load time of my dataset, which is not very alias-heavy, from 10s to 4s.
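A rough sketch of the general idea follows, under my own naming and not necessarily matching this PR's actual diff, assuming the yaml package's visit()/isAlias()/isNode() helpers: anchored nodes and aliases are collected once, in document order, and an alias then resolves to the nearest matching anchor that precedes it, which also covers the anchor re-use case discussed above.

```ts
// Sketch of the cached anchor/alias node list idea. Function names are mine;
// this is not necessarily how the PR implements it.
import { parseDocument, visit, isAlias, isNode } from 'yaml'
import type { Alias, Document, Node } from 'yaml'

// Collect aliases and anchored nodes once, in document (visit) order.
function buildAnchorAliasList(doc: Document): Node[] {
  const list: Node[] = []
  visit(doc, (_key, node) => {
    if (isNode(node) && (isAlias(node) || node.anchor)) list.push(node)
  })
  return list
}

// Resolve an alias by scanning backwards from its own position for the
// nearest earlier node carrying the matching anchor name.
function resolveAlias(list: Node[], alias: Alias): Node | undefined {
  for (let i = list.indexOf(alias) - 1; i >= 0; i -= 1) {
    const node = list[i]
    if (!isAlias(node) && node.anchor === alias.source) return node
  }
  return undefined
}

const doc = parseDocument('&foo\nkey:\n- *foo\n- &foo 42\n- *foo\n')
const cached = buildAnchorAliasList(doc)
const aliases = cached.filter(isAlias)
console.log(resolveAlias(cached, aliases[0])) // the anchored top-level map
console.log(resolveAlias(cached, aliases[1])) // the scalar 42 (re-used anchor)
```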