Skip to content

Add AARCH64-NEON fastpath for data state #618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 16, 2025
Merged

Conversation

Pavel-Irzhauski
Copy link
Contributor

@Pavel-Irzhauski Pavel-Irzhauski commented Jun 13, 2025

Solution for #612 is almost the same as #601. It also gives significant performance improvement on aarch64 for lipsum.html and lipsum.zh.html and close to 0 change for smaller cases.

Pavel Irzhauski and others added 8 commits June 12, 2025 18:36
Signed-off-by: Pavel Irzhauski <[email protected]>
 into simd-neon

implement SIMD fast path using ARM NEON in the same way as using SSE2
 into simd-neon

implement SIMD fast path using ARM NEON in the same way as using SSE2

Signed-off-by: Pavel Irzhauski <[email protected]>
 into simd-neon

implement SIMD fast path using ARM NEON in the same way as using SSE2

Signed-off-by: Pavel Irzhauski <[email protected]>
@Pavel-Irzhauski Pavel-Irzhauski marked this pull request as ready for review June 16, 2025 13:01
@jschwe
Copy link
Member

jschwe commented Jun 16, 2025

For reference, on an M4 pro with target-cpu=native criterion reports the improvements as follows:

html5ever/html5ever on  simd-neon [?] is 📦 v0.31.0 via 🦀 v1.86.0 took 1m0s 
❯ RUSTFLAGS='-C target-cpu=native' cargo criterion --target aarch64-apple-darwin

[...]

html tokenizing lipsum.html                                                                             
                        time:   [919.57 ns 926.26 ns 933.06 ns]
                        change: [-90.834% -90.776% -90.723%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html                                                                             
                        time:   [558.04 ns 559.27 ns 560.41 ns]
                        change: [-89.129% -89.067% -89.007%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html                                                                             
                        time:   [20.532 µs 20.591 µs 20.656 µs]
                        change: [-3.8151% -3.5140% -3.1986%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing small-fragment.html                                                                             
                        time:   [1.7032 µs 1.7076 µs 1.7119 µs]
                        change: [-16.467% -16.132% -15.774%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing tiny-fragment.html                                                                            
                        time:   [196.18 ns 196.97 ns 197.85 ns]
                        change: [-0.6594% -0.2306% +0.1772%] (p = 0.29 > 0.05)
                        No change in performance detected.

html tokenizing strong.html                                                                             
                        time:   [13.787 µs 13.837 µs 13.887 µs]
                        change: [-1.0206% -0.3762% +0.2516%] (p = 0.24 > 0.05)
                        No change in performance detected.

@jschwe jschwe changed the title ARM-NEON Add AARCH64-NEON fastpath for data state Jun 16, 2025
@nicoburns
Copy link
Contributor

nicoburns commented Jun 16, 2025

I'm seeing similar results on Apple M1 even without target-cpu=native (my understanding is that target-cpu=native isn't generally necessary on Apple Silicon macs because the oldest CPU in that target (M1) supports most of the cpu features (e.g. neon) that you'd be targeting anyway):

html tokenizing lipsum.html
                        time:   [1.3519 µs 1.3535 µs 1.3553 µs]
                        change: [-91.299% -91.265% -91.231%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html
                        time:   [803.92 ns 806.68 ns 809.73 ns]
                        change: [-89.009% -88.980% -88.950%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html
                        time:   [31.325 µs 31.364 µs 31.406 µs]
                        change: [-3.1117% -2.7794% -2.4573%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing small-fragment.html
                        time:   [2.6697 µs 2.6742 µs 2.6787 µs]
                        change: [-14.902% -14.606% -14.309%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing tiny-fragment.html
                        time:   [282.97 ns 283.55 ns 284.26 ns]
                        change: [+0.0774% +0.3738% +0.6806%] (p = 0.02 < 0.05)
                        Change within noise threshold.

html tokenizing strong.html
                        time:   [20.848 µs 20.869 µs 20.892 µs]
                        change: [-3.4900% -2.9871% -2.5518%] (p = 0.00 < 0.05)
                        Performance has improved.

@nicoburns
Copy link
Contributor

Have you considered using the "safe architecture intrinsics" (aka target features v1.1) from Rust 1.87? https://blog.rust-lang.org/2025/05/15/Rust-1.87.0/#safe-architecture-intrinsics

I think this would allow the body of the SIMD function(s) to be safe (even though it would still be unsafe to call.

@jschwe
Copy link
Member

jschwe commented Jun 16, 2025

html5ever still has an MSRV of 1.70 and for libraries imho it is preferable to update the MSRV conservatively. The safe architecture intrinsic is a nice change, but imho not important enough to warrant an MSRV increase, since it just makes the code look slightly nicer.

@jdm jdm added this pull request to the merge queue Jun 16, 2025
Merged via the queue into servo:main with commit f1cb6e7 Jun 16, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants