Skip to content

Optimize is_ascii for str and [u8] further #130733

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 22, 2024
Merged

Conversation

okaneco
Copy link
Contributor

@okaneco okaneco commented Sep 23, 2024

Replace the existing optimized function with one that enables auto-vectorization.

This is especially beneficial on x86-64 as pmovmskb can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time instead of the current usize at a time check.

The resulting implementation is completely safe.

case00_libcore is the current implementation, case04_while_loop is this PR.

benchmarks:
    ascii::is_ascii_slice::long::case00_libcore                             22.25/iter  +/- 1.09
    ascii::is_ascii_slice::long::case04_while_loop                           6.78/iter  +/- 0.92
    ascii::is_ascii_slice::medium::case00_libcore                            2.81/iter  +/- 0.39
    ascii::is_ascii_slice::medium::case04_while_loop                         1.56/iter  +/- 0.78
    ascii::is_ascii_slice::short::case00_libcore                             5.55/iter  +/- 0.85
    ascii::is_ascii_slice::short::case04_while_loop                          3.75/iter  +/- 0.22
    ascii::is_ascii_slice::unaligned_both_long::case00_libcore              26.59/iter  +/- 0.66
    ascii::is_ascii_slice::unaligned_both_long::case04_while_loop            5.78/iter  +/- 0.16
    ascii::is_ascii_slice::unaligned_both_medium::case00_libcore             2.97/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop          2.41/iter  +/- 0.10
    ascii::is_ascii_slice::unaligned_head_long::case00_libcore              23.71/iter  +/- 0.79
    ascii::is_ascii_slice::unaligned_head_long::case04_while_loop            7.83/iter  +/- 1.31
    ascii::is_ascii_slice::unaligned_head_medium::case00_libcore             3.69/iter  +/- 0.54
    ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop          7.05/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_tail_long::case00_libcore              24.44/iter  +/- 1.41
    ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop            5.12/iter  +/- 0.18
    ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore             3.24/iter  +/- 0.40
    ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop          2.86/iter  +/- 0.14

unaligned_head_medium is the main regression in the benchmarks. It is a 32 byte string being sliced bytes[1..].

The first commit can be used to run the benchmarks against the current core implementation.

Previous implementation was done in #74066


Two potential drawbacks of this implementation are that it increases instruction count and may regress other platforms/architectures. The benches here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36

@rustbot
Copy link
Collaborator

rustbot commented Sep 23, 2024

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 23, 2024
@okaneco
Copy link
Contributor Author

okaneco commented Sep 23, 2024

I tried to rename the benchmark file to is_ascii_slice.rs but Github didn't like that and said it was a merge conflict. I didn't know how to resolve that so I dropped the commit.

@workingjubilee
Copy link
Member

This is especially beneficial on x86-64 as pmovmskb can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time instead of the current usize at a time check.

Are you taking into account that this can cause contention over a single execution port whereas the current implementation may be able to execute multiple checks at once in the same cycles, despite having more apparent instructions?

@workingjubilee
Copy link
Member

apparently some more recent CPUs have more ports for pmovmskb, interestingly, so they should indeed go significantly faster.

// Process the remaining `bytes.len() % N` bytes.
let mut is_ascii = true;
while i < bytes.len() {
is_ascii &= bytes[i] <= 127;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts. The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

    // Process the remaining `bytes.len() % N` bytes.
    while i < bytes.len() {
        if bytes[i] > 127 {
            return false;
        }
    }

    true

If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

    // Constant chosen to enable `pmovmskb` instruction on x86-64
    const N1: usize = 32;
    const N2: usize = 8; // or 16

    let mut i = 0;

    while i + N1 <= bytes.len() {
        let chunk_end = i + N1;

        let mut count = 0;
        while i < chunk_end {
            count += (bytes[i] <= 127) as u8;
            i += 1;
        }

        // All bytes should be <= 127 so count is equal to chunk size.
        if count != N1 as u8 {
            return false;
        }
    }

    while i + N2 <= bytes.len() {
        let chunk_end = i + N2;

        let mut count = 0;
        while i < chunk_end {
            count += (bytes[i] <= 127) as u8;
            i += 1;
        }

        // All bytes should be <= 127 so count is equal to chunk size.
        if count != N2 as u8 {
            return false;
        }
    }

    // Process the remaining `bytes.len() % N` bytes.
    while i < bytes.len() {
        if bytes[i] > 127 {
            return false;
        }
    }

    true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the godbolt link it looks like the compiler tries to unroll and autovectorize this loop again, which might not be beneficial for usually low loop counts.

I reported this when I had the time to #130795

The code size would be much smaller if the remainder loop has a simple early exit, but I haven't run the benchmarks on that:

This ended up being a slowdown, also note it's missing the i += 1; for anyone wanting to experiment with this.
I made that mistake refactoring a few times trying out different loops.

I tried slice::from_raw_parts on the tail to use the is_ascii_simple function above and that was slower too. It seemed faster to just eat the extra unrolled code since I couldn't get a scalar remainder loop construction that was as fast.

If that turns out to be a slowdown, another thing to try could be a second loop with a smaller chunk size:

I tried a few permutations of that, and reductions of the constant like 16, 8, 4, 2, 1. The second loop of 8 seemed to perform the best on the benchmark suite, but only performed markedly better on the unaligned_both_medium and unaligned_tail_medium cases.

ascii::is_ascii_slice::long::case00_libcore                          24.39/iter +/- 1.16
ascii::is_ascii_slice::long::case04_while_loop_32                     8.14/iter +/- 0.80
ascii::is_ascii_slice::long::case05_while_loop_32_8                   9.28/iter +/- 1.17
ascii::is_ascii_slice::medium::case00_libcore                         2.93/iter +/- 0.06
ascii::is_ascii_slice::medium::case04_while_loop_32                   1.64/iter +/- 0.30
ascii::is_ascii_slice::medium::case05_while_loop_32_8                 1.86/iter +/- 0.09
ascii::is_ascii_slice::short::case00_libcore                          5.34/iter +/- 0.99
ascii::is_ascii_slice::short::case04_while_loop_32                    3.91/iter +/- 0.13
ascii::is_ascii_slice::short::case05_while_loop_32_8                  3.82/iter +/- 0.11
ascii::is_ascii_slice::unaligned_both_long::case00_libcore           23.13/iter +/- 1.18
ascii::is_ascii_slice::unaligned_both_long::case04_while_loop_32      5.29/iter +/- 0.26
ascii::is_ascii_slice::unaligned_both_long::case05_while_loop_32_8    5.31/iter +/- 0.18
ascii::is_ascii_slice::unaligned_both_medium::case00_libcore          3.04/iter +/- 0.35
ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop_32    2.78/iter +/- 0.08
ascii::is_ascii_slice::unaligned_both_medium::case05_while_loop_32_8  1.62/iter +/- 0.10
ascii::is_ascii_slice::unaligned_head_long::case00_libcore           23.03/iter +/- 2.51
ascii::is_ascii_slice::unaligned_head_long::case04_while_loop_32      8.04/iter +/- 0.24
ascii::is_ascii_slice::unaligned_head_long::case05_while_loop_32_8   13.04/iter +/- 0.34
ascii::is_ascii_slice::unaligned_head_medium::case00_libcore          3.78/iter +/- 0.52
ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop_32    6.86/iter +/- 0.61
ascii::is_ascii_slice::unaligned_head_medium::case05_while_loop_32_8  8.39/iter +/- 0.16
ascii::is_ascii_slice::unaligned_tail_long::case00_libcore           24.26/iter +/- 0.22
ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop_32      4.54/iter +/- 0.15
ascii::is_ascii_slice::unaligned_tail_long::case05_while_loop_32_8    4.69/iter +/- 0.05
ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore          2.89/iter +/- 0.40
ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop_32    2.63/iter +/- 0.09
ascii::is_ascii_slice::unaligned_tail_medium::case05_while_loop_32_8  1.98/iter +/- 0.13

@the8472
Copy link
Member

the8472 commented Sep 23, 2024

The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.

@okaneco
Copy link
Contributor Author

okaneco commented Sep 27, 2024

The old code operates on usizes to at least get SWAR on targets without SIMD, even on those where unaligned loads are more expensive. So this needs to be checked on more than x86.

I've reinstated the SWAR loop.

I know str::contains has an optimized path for SSE2.
Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

I tried integrating this fast path into the current function and then using slice::from_raw_parts to handle the tail with SWAR but the speed was 2-4x slower than the new standalone function. So I'm not sure it's worth making the SWAR function any more complicated than it is, unless there are other tricks that can be used.

fn is_ascii(mut s: &[u8]) -> bool {
    #[cfg(...)]
    {
        let mut i = 0;
        while i + 32 <= bytes.len() {}
        let new_len = s.len() - i;
        s = crate::slice::from_raw_parts(s.as_ptr().add(i), new_len);
    }
    // proceed with SWAR
}

@the8472
Copy link
Member

the8472 commented Sep 28, 2024

Would it be okay to conditionally compile this is_ascii function for the same #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]?

If the speedup requires platform-specific code or shaping the code to benefit x86 would negatively impact other platforms then that's acceptable, yes.

@bors
Copy link
Collaborator

bors commented Nov 4, 2024

☔ The latest upstream changes (presumably #132586) made this pull request unmergeable. Please resolve the merge conflicts.

Add LONG benchmarks for more comparison between the methods
The new `is_ascii` function is optimized to use the
`pmovmskb` vector instruction which tests the high bit in a lane.
This corresponds to the same check of whether a byte is ASCII so
ASCII validity checking can be vectorized. This instruction
does not exist on other platforms so it is likely to regress performance
and is gated to all(target_arch = "x86_64", target_feature = "sse2").

Add codegen test
Remove crate::mem import for functions included in the prelude
Comment on lines +12 to +14
// CHECK-NEXT: icmp slt <32 x i8>
// CHECK-NEXT: bitcast <32 x i1>
// CHECK-NEXT: icmp eq i32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sure makes me tempted to write it with portable simd, but I don't know where we are on that just yet, so probably not something to do this PR.

}
}

// Process the remaining `bytes.len() % N` bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain sees this and things as_chunks::<CHUNK_SIZE>(), but I guess that's not really worth doing right now when we can't loop over slice iterators in const fn anyway. Maybe in the Glorious Const Future™.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just call is_ascii_simple in const, so there's no reason to do any const-specific adjustments IMO.

@scottmcm
Copy link
Member

Neat! Thanks for including the codegen test so it's less likely to break in future refactorings, and I like that the code itself is pretty straight-forward despite the contortions to get the pmovmskb.

@bors r+ rollup=iffy (new codegen tests are always scary)

@bors
Copy link
Collaborator

bors commented Dec 20, 2024

📌 Commit 1b5c02b has been approved by scottmcm

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 20, 2024
@bors
Copy link
Collaborator

bors commented Dec 20, 2024

⌛ Testing commit 1b5c02b with merge 770b134...

@bors
Copy link
Collaborator

bors commented Dec 20, 2024

💥 Test timed out

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Dec 20, 2024
@okaneco
Copy link
Contributor Author

okaneco commented Dec 21, 2024

Could this get a retry?

x86_64-apple-1 timed out, not sure if it's spurious
https://github.com/rust-lang-ci/rust/actions/runs/12429499450/job/34703145750

@RalfJung
Copy link
Member

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 21, 2024
@bors
Copy link
Collaborator

bors commented Dec 22, 2024

⌛ Testing commit 1b5c02b with merge c113247...

@bors
Copy link
Collaborator

bors commented Dec 22, 2024

☀️ Test successful - checks-actions
Approved by: scottmcm
Pushing c113247 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 22, 2024
@bors bors merged commit c113247 into rust-lang:master Dec 22, 2024
7 checks passed
@rustbot rustbot added this to the 1.85.0 milestone Dec 22, 2024
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (c113247): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary 4.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
4.3% [3.8%, 4.8%] 2
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 4.3% [3.8%, 4.8%] 2

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-0.1% [-0.1%, -0.1%] 4
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -0.1% [-0.1%, -0.1%] 4

Bootstrap: 760.466s -> 761.198s (0.10%)
Artifact size: 330.63 MiB -> 330.59 MiB (-0.01%)

@okaneco okaneco deleted the is_ascii branch December 23, 2024 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants