Optimize `is_ascii` for `str` and `[u8]` further #130733

okaneco · 2024-09-23T06:05:11Z

Replace the existing optimized function with one that enables auto-vectorization.

This is especially beneficial on x86-64 as pmovmskb can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time instead of the current usize at a time check.

The resulting implementation is completely safe.

case00_libcore is the current implementation, case04_while_loop is this PR.

benchmarks:
    ascii::is_ascii_slice::long::case00_libcore                             22.25/iter  +/- 1.09
    ascii::is_ascii_slice::long::case04_while_loop                           6.78/iter  +/- 0.92
    ascii::is_ascii_slice::medium::case00_libcore                            2.81/iter  +/- 0.39
    ascii::is_ascii_slice::medium::case04_while_loop                         1.56/iter  +/- 0.78
    ascii::is_ascii_slice::short::case00_libcore                             5.55/iter  +/- 0.85
    ascii::is_ascii_slice::short::case04_while_loop                          3.75/iter  +/- 0.22
    ascii::is_ascii_slice::unaligned_both_long::case00_libcore              26.59/iter  +/- 0.66
    ascii::is_ascii_slice::unaligned_both_long::case04_while_loop            5.78/iter  +/- 0.16
    ascii::is_ascii_slice::unaligned_both_medium::case00_libcore             2.97/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop          2.41/iter  +/- 0.10
    ascii::is_ascii_slice::unaligned_head_long::case00_libcore              23.71/iter  +/- 0.79
    ascii::is_ascii_slice::unaligned_head_long::case04_while_loop            7.83/iter  +/- 1.31
    ascii::is_ascii_slice::unaligned_head_medium::case00_libcore             3.69/iter  +/- 0.54
    ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop          7.05/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_tail_long::case00_libcore              24.44/iter  +/- 1.41
    ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop            5.12/iter  +/- 0.18
    ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore             3.24/iter  +/- 0.40
    ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop          2.86/iter  +/- 0.14

unaligned_head_medium is the main regression in the benchmarks. It is a 32 byte string being sliced bytes[1..].

The first commit can be used to run the benchmarks against the current core implementation.

Previous implementation was done in #74066

Two potential drawbacks of this implementation are that it increases instruction count and may regress other platforms/architectures. The benches here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36

rustbot · 2024-09-23T06:05:19Z

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

okaneco · 2024-09-23T06:20:41Z

I tried to rename the benchmark file to is_ascii_slice.rs but Github didn't like that and said it was a merge conflict. I didn't know how to resolve that so I dropped the commit.

workingjubilee · 2024-09-23T07:00:29Z

This is especially beneficial on x86-64 as pmovmskb can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time instead of the current usize at a time check.

Are you taking into account that this can cause contention over a single execution port whereas the current implementation may be able to execute multiple checks at once in the same cycles, despite having more apparent instructions?

workingjubilee · 2024-09-23T07:49:56Z

apparently some more recent CPUs have more ports for pmovmskb, interestingly, so they should indeed go significantly faster.

jhorstmann · 2024-09-23T09:08:16Z

library/core/src/slice/ascii.rs

+    // Process the remaining `bytes.len() % N` bytes.
+    let mut is_ascii = true;
+    while i < bytes.len() {
+        is_ascii &= bytes[i] <= 127;


On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts. The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

// Process the remaining `bytes.len() % N` bytes. while i < bytes.len() { if bytes[i] > 127 { return false; } } true

If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

// Constant chosen to enable `pmovmskb` instruction on x86-64 const N1: usize = 32; const N2: usize = 8; // or 16 let mut i = 0; while i + N1 <= bytes.len() { let chunk_end = i + N1; let mut count = 0; while i < chunk_end { count += (bytes[i] <= 127) as u8; i += 1; } // All bytes should be <= 127 so count is equal to chunk size. if count != N1 as u8 { return false; } } while i + N2 <= bytes.len() { let chunk_end = i + N2; let mut count = 0; while i < chunk_end { count += (bytes[i] <= 127) as u8; i += 1; } // All bytes should be <= 127 so count is equal to chunk size. if count != N2 as u8 { return false; } } // Process the remaining `bytes.len() % N` bytes. while i < bytes.len() { if bytes[i] > 127 { return false; } } true

On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts.

I reported this when I had the time to #130795

The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

This ended up being a slowdown, also note it's missing the i += 1; for anyone wanting to experiment with this.
I made that mistake refactoring a few times trying out different loops.

I tried slice::from_raw_parts on the tail to use the is_ascii_simple function above and that was slower too. It seemed faster to just eat the extra unrolled code since I couldn't get a scalar remainder loop construction that was as fast.

If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

I tried a few permutations of that, and reductions of the constant like 16, 8, 4, 2, 1. The second loop of 8 seemed to perform the best on the benchmark suite, but only performed markedly better on the unaligned_both_medium and unaligned_tail_medium cases.

ascii::is_ascii_slice::long::case00_libcore 24.39/iter +/- 1.16 ascii::is_ascii_slice::long::case04_while_loop_32 8.14/iter +/- 0.80 ascii::is_ascii_slice::long::case05_while_loop_32_8 9.28/iter +/- 1.17 ascii::is_ascii_slice::medium::case00_libcore 2.93/iter +/- 0.06 ascii::is_ascii_slice::medium::case04_while_loop_32 1.64/iter +/- 0.30 ascii::is_ascii_slice::medium::case05_while_loop_32_8 1.86/iter +/- 0.09 ascii::is_ascii_slice::short::case00_libcore 5.34/iter +/- 0.99 ascii::is_ascii_slice::short::case04_while_loop_32 3.91/iter +/- 0.13 ascii::is_ascii_slice::short::case05_while_loop_32_8 3.82/iter +/- 0.11 ascii::is_ascii_slice::unaligned_both_long::case00_libcore 23.13/iter +/- 1.18 ascii::is_ascii_slice::unaligned_both_long::case04_while_loop_32 5.29/iter +/- 0.26 ascii::is_ascii_slice::unaligned_both_long::case05_while_loop_32_8 5.31/iter +/- 0.18 ascii::is_ascii_slice::unaligned_both_medium::case00_libcore 3.04/iter +/- 0.35 ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop_32 2.78/iter +/- 0.08 ascii::is_ascii_slice::unaligned_both_medium::case05_while_loop_32_8 1.62/iter +/- 0.10 ascii::is_ascii_slice::unaligned_head_long::case00_libcore 23.03/iter +/- 2.51 ascii::is_ascii_slice::unaligned_head_long::case04_while_loop_32 8.04/iter +/- 0.24 ascii::is_ascii_slice::unaligned_head_long::case05_while_loop_32_8 13.04/iter +/- 0.34 ascii::is_ascii_slice::unaligned_head_medium::case00_libcore 3.78/iter +/- 0.52 ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop_32 6.86/iter +/- 0.61 ascii::is_ascii_slice::unaligned_head_medium::case05_while_loop_32_8 8.39/iter +/- 0.16 ascii::is_ascii_slice::unaligned_tail_long::case00_libcore 24.26/iter +/- 0.22 ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop_32 4.54/iter +/- 0.15 ascii::is_ascii_slice::unaligned_tail_long::case05_while_loop_32_8 4.69/iter +/- 0.05 ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore 2.89/iter +/- 0.40 ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop_32 2.63/iter +/- 0.09 ascii::is_ascii_slice::unaligned_tail_medium::case05_while_loop_32_8 1.98/iter +/- 0.13

the8472 · 2024-09-23T19:58:41Z

The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.

okaneco · 2024-09-27T01:15:42Z

The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.

I've reinstated the SWAR loop.

I know str::contains has an optimized path for SSE2.
Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

I tried integrating this fast path into the current function and then using slice::from_raw_parts to handle the tail with SWAR but the speed was 2-4x slower than the new standalone function. So I'm not sure it's worth making the SWAR function any more complicated than it is, unless there are other tricks that can be used.

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(...)]
    {
        let mut i = 0;
        while i + 32 <= bytes.len() {}
        let new_len = s.len() - i;
        s = crate::slice::from_raw_parts(s.as_ptr().add(i), new_len);
    }
    // proceed with SWAR
}

the8472 · 2024-09-28T22:49:47Z

Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

If the speedup requires platform-specific code or shaping the code to benefit x86 would negatively impact other platforms then that's acceptable, yes.

bors · 2024-11-04T09:53:49Z

☔ The latest upstream changes (presumably #132586) made this pull request unmergeable. Please resolve the merge conflicts.

Add LONG benchmarks for more comparison between the methods

The new `is_ascii` function is optimized to use the `pmovmskb` vector instruction which tests the high bit in a lane. This corresponds to the same check of whether a byte is ASCII so ASCII validity checking can be vectorized. This instruction does not exist on other platforms so it is likely to regress performance and is gated to all(target_arch = "x86_64", target_feature = "sse2"). Add codegen test Remove crate::mem import for functions included in the prelude

scottmcm · 2024-12-20T08:39:55Z

tests/codegen/slice-is-ascii.rs

+    // CHECK-NEXT: icmp slt <32 x i8>
+    // CHECK-NEXT: bitcast <32 x i1>
+    // CHECK-NEXT: icmp eq i32


This sure makes me tempted to write it with portable simd, but I don't know where we are on that just yet, so probably not something to do this PR.

scottmcm · 2024-12-20T08:42:07Z

library/core/src/slice/ascii.rs

+        }
+    }
+
+    // Process the remaining `bytes.len() % N` bytes.


My brain sees this and things as_chunks::<CHUNK_SIZE>(), but I guess that's not really worth doing right now when we can't loop over slice iterators in const fn anyway. Maybe in the Glorious Const Future™.

We can just call is_ascii_simple in const, so there's no reason to do any const-specific adjustments IMO.

scottmcm · 2024-12-20T08:49:00Z

Neat! Thanks for including the codegen test so it's less likely to break in future refactorings, and I like that the code itself is pretty straight-forward despite the contortions to get the pmovmskb.

@bors r+ rollup=iffy (new codegen tests are always scary)

bors · 2024-12-20T08:49:03Z

📌 Commit 1b5c02b has been approved by scottmcm

It is now in the queue for this repository.

bors · 2024-12-20T10:09:12Z

⌛ Testing commit 1b5c02b with merge 770b134...

bors · 2024-12-20T14:24:13Z

💥 Test timed out

okaneco · 2024-12-21T21:28:12Z

Could this get a retry?

x86_64-apple-1 timed out, not sure if it's spurious
https://github.com/rust-lang-ci/rust/actions/runs/12429499450/job/34703145750

RalfJung · 2024-12-21T22:08:27Z

@bors retry

bors · 2024-12-22T02:44:17Z

⌛ Testing commit 1b5c02b with merge c113247...

bors · 2024-12-22T05:28:38Z

☀️ Test successful - checks-actions
Approved by: scottmcm
Pushing c113247 to master...

rust-timer · 2024-12-22T06:58:27Z

Finished benchmarking commit (c113247): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary 4.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	4.3%	[3.8%, 4.8%]	2
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	4.3%	[3.8%, 4.8%]	2

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.1%	[-0.1%, -0.1%]	4
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.1%	[-0.1%, -0.1%]	4

Bootstrap: 760.466s -> 761.198s (0.10%)
Artifact size: 330.63 MiB -> 330.59 MiB (-0.01%)

rustbot assigned scottmcm Sep 23, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 23, 2024

okaneco force-pushed the is_ascii branch from bc78676 to bf7bb44 Compare September 23, 2024 06:18

jhorstmann reviewed Sep 23, 2024

View reviewed changes

okaneco force-pushed the is_ascii branch from bf7bb44 to 44ccc3b Compare September 27, 2024 01:15

okaneco added 2 commits November 5, 2024 22:34

Add new implementation benchmark

d7d67ad

Add LONG benchmarks for more comparison between the methods

okaneco force-pushed the is_ascii branch from 44ccc3b to 1b5c02b Compare November 6, 2024 07:29

scottmcm reviewed Dec 20, 2024

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 20, 2024

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Dec 20, 2024

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 21, 2024

bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 22, 2024

bors merged commit c113247 into rust-lang:master Dec 22, 2024
7 checks passed

rustbot added this to the 1.85.0 milestone Dec 22, 2024

okaneco deleted the is_ascii branch December 23, 2024 21:41

Optimize is_ascii for str and [u8] further #130733

Optimize is_ascii for str and [u8] further #130733

Uh oh!

Conversation

okaneco commented Sep 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustbot commented Sep 23, 2024

Uh oh!

okaneco commented Sep 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

workingjubilee commented Sep 23, 2024

Uh oh!

workingjubilee commented Sep 23, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

the8472 commented Sep 23, 2024

Uh oh!

okaneco commented Sep 27, 2024

Uh oh!

the8472 commented Sep 28, 2024

Uh oh!

bors commented Nov 4, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scottmcm commented Dec 20, 2024

Uh oh!

bors commented Dec 20, 2024

Uh oh!

bors commented Dec 20, 2024

Uh oh!

bors commented Dec 20, 2024

Uh oh!

okaneco commented Dec 21, 2024

Uh oh!

RalfJung commented Dec 21, 2024

Uh oh!

bors commented Dec 22, 2024

Uh oh!

bors commented Dec 22, 2024

Uh oh!

Uh oh!

rust-timer commented Dec 22, 2024

Overall result: no relevant changes - no action needed

Uh oh!

Uh oh!

Optimize `is_ascii` for `str` and `[u8]` further #130733

Optimize `is_ascii` for `str` and `[u8]` further #130733

okaneco commented Sep 23, 2024 •

edited

Loading

okaneco commented Sep 23, 2024 •

edited

Loading