Add AARCH64-NEON fastpath for data state #618

Pavel-Irzhauski · 2025-06-13T14:30:29Z

Solution for #612 is almost the same as #601. It also gives significant performance improvement on aarch64 for lipsum.html and lipsum.zh.html and close to 0 change for smaller cases.

… x86/x86_64

Signed-off-by: Pavel Irzhauski <[email protected]>

into simd-neon implement SIMD fast path using ARM NEON in the same way as using SSE2

into simd-neon implement SIMD fast path using ARM NEON in the same way as using SSE2 Signed-off-by: Pavel Irzhauski <[email protected]>

jschwe · 2025-06-16T14:45:03Z

For reference, on an M4 pro with target-cpu=native criterion reports the improvements as follows:

html5ever/html5ever on  simd-neon [?] is 📦 v0.31.0 via 🦀 v1.86.0 took 1m0s 
❯ RUSTFLAGS='-C target-cpu=native' cargo criterion --target aarch64-apple-darwin

[...]

html tokenizing lipsum.html                                                                             
                        time:   [919.57 ns 926.26 ns 933.06 ns]
                        change: [-90.834% -90.776% -90.723%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html                                                                             
                        time:   [558.04 ns 559.27 ns 560.41 ns]
                        change: [-89.129% -89.067% -89.007%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html                                                                             
                        time:   [20.532 µs 20.591 µs 20.656 µs]
                        change: [-3.8151% -3.5140% -3.1986%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing small-fragment.html                                                                             
                        time:   [1.7032 µs 1.7076 µs 1.7119 µs]
                        change: [-16.467% -16.132% -15.774%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing tiny-fragment.html                                                                            
                        time:   [196.18 ns 196.97 ns 197.85 ns]
                        change: [-0.6594% -0.2306% +0.1772%] (p = 0.29 > 0.05)
                        No change in performance detected.

html tokenizing strong.html                                                                             
                        time:   [13.787 µs 13.837 µs 13.887 µs]
                        change: [-1.0206% -0.3762% +0.2516%] (p = 0.24 > 0.05)
                        No change in performance detected.

nicoburns · 2025-06-16T15:02:36Z

I'm seeing similar results on Apple M1 even without target-cpu=native (my understanding is that target-cpu=native isn't generally necessary on Apple Silicon macs because the oldest CPU in that target (M1) supports most of the cpu features (e.g. neon) that you'd be targeting anyway):

html tokenizing lipsum.html
                        time:   [1.3519 µs 1.3535 µs 1.3553 µs]
                        change: [-91.299% -91.265% -91.231%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html
                        time:   [803.92 ns 806.68 ns 809.73 ns]
                        change: [-89.009% -88.980% -88.950%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html
                        time:   [31.325 µs 31.364 µs 31.406 µs]
                        change: [-3.1117% -2.7794% -2.4573%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing small-fragment.html
                        time:   [2.6697 µs 2.6742 µs 2.6787 µs]
                        change: [-14.902% -14.606% -14.309%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing tiny-fragment.html
                        time:   [282.97 ns 283.55 ns 284.26 ns]
                        change: [+0.0774% +0.3738% +0.6806%] (p = 0.02 < 0.05)
                        Change within noise threshold.

html tokenizing strong.html
                        time:   [20.848 µs 20.869 µs 20.892 µs]
                        change: [-3.4900% -2.9871% -2.5518%] (p = 0.00 < 0.05)
                        Performance has improved.

html5ever/src/tokenizer/mod.rs

nicoburns · 2025-06-16T15:24:46Z

Have you considered using the "safe architecture intrinsics" (aka target features v1.1) from Rust 1.87? https://blog.rust-lang.org/2025/05/15/Rust-1.87.0/#safe-architecture-intrinsics

I think this would allow the body of the SIMD function(s) to be safe (even though it would still be unsafe to call.

jschwe · 2025-06-16T15:37:24Z

html5ever still has an MSRV of 1.70 and for libraries imho it is preferable to update the MSRV conservatively. The safe architecture intrinsic is a nice change, but imho not important enough to warrant an MSRV increase, since it just makes the code look slightly nicer.

Pavel Irzhauski and others added 8 commits June 12, 2025 18:36

optimize tokenizer using NEON for aarch64 similarly to using SSE2 for…

e76e74c

… x86/x86_64

refine NEON optimization

6d481ed

meet format check requirements

9cdd12e

meet format check requirements

b1a5316

Signed-off-by: Pavel Irzhauski <[email protected]>

Merge branch 'simd-neon' of https://github.com/Pavel-Irzhauski/html5ever

602d756

into simd-neon implement SIMD fast path using ARM NEON in the same way as using SSE2

Merge branch 'simd-neon' of https://github.com/Pavel-Irzhauski/html5ever

c810331

into simd-neon implement SIMD fast path using ARM NEON in the same way as using SSE2 Signed-off-by: Pavel Irzhauski <[email protected]>

Merge branch 'simd-neon' of https://github.com/Pavel-Irzhauski/html5ever

7ffe27b

into simd-neon implement SIMD fast path using ARM NEON in the same way as using SSE2 Signed-off-by: Pavel Irzhauski <[email protected]>

Merge branch 'main' into simd-neon

49b538b

Pavel-Irzhauski marked this pull request as ready for review June 16, 2025 13:01

jschwe changed the title ~~ARM-NEON~~ Add AARCH64-NEON fastpath for data state Jun 16, 2025

jschwe reviewed Jun 16, 2025

View reviewed changes

html5ever/src/tokenizer/mod.rs Show resolved Hide resolved

jschwe approved these changes Jun 16, 2025

View reviewed changes

jdm added this pull request to the merge queue Jun 16, 2025

Merged via the queue into servo:main with commit f1cb6e7 Jun 16, 2025
6 checks passed

jschwe mentioned this pull request Jun 16, 2025

Add SIMD fast path for data state on aarch64 #612

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add AARCH64-NEON fastpath for data state #618

Add AARCH64-NEON fastpath for data state #618

Pavel-Irzhauski commented Jun 13, 2025 •

edited

Loading

Uh oh!

jschwe commented Jun 16, 2025

Uh oh!

nicoburns commented Jun 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

nicoburns commented Jun 16, 2025

Uh oh!

jschwe commented Jun 16, 2025

Uh oh!

Uh oh!

Uh oh!

Add AARCH64-NEON fastpath for data state #618

Add AARCH64-NEON fastpath for data state #618

Conversation

Pavel-Irzhauski commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jschwe commented Jun 16, 2025

Uh oh!

nicoburns commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

nicoburns commented Jun 16, 2025

Uh oh!

jschwe commented Jun 16, 2025

Uh oh!

Uh oh!

Uh oh!

Pavel-Irzhauski commented Jun 13, 2025 •

edited

Loading

nicoburns commented Jun 16, 2025 •

edited

Loading