More fixed_regex_linter features #1166

Closed
AshesITR opened this issue May 21, 2022 · 7 comments
Labels
feature a feature request or enhancement

Comments

@AshesITR
Collaborator

AshesITR commented May 21, 2022

There are some more cases where regexes can be optimized away:

For regex detection functions, we can safely replace

  • grepl("^static_rx", x) by startsWith(x, "static_rx")
  • grepl("static_rx$", x) by endsWith(x, "static_rx")
  • grepl("^static_rx$", x) by x == "static_rx"
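
As a quick check of the three replacements above, here is a minimal sketch on a small, made-up character vector. It assumes NA-free character input; the last two lines show one case where the forms differ, since grepl() reports FALSE for a missing value while startsWith() reports NA.

```r
x <- c("static_rx", "static_rx_tail", "head_static_rx", "other")

# Each pair should agree element-wise on NA-free character input.
same_prefix <- identical(grepl("^static_rx", x), startsWith(x, "static_rx"))
same_suffix <- identical(grepl("static_rx$", x), endsWith(x, "static_rx"))
same_whole  <- identical(grepl("^static_rx$", x), x == "static_rx")

# Missing values are the exception: grepl() gives FALSE, startsWith() gives NA.
na_grepl  <- grepl("^static_rx", NA_character_)
na_starts <- startsWith(NA_character_, "static_rx")
```

If code relies on the result being strictly TRUE/FALSE (for example inside if()), the NA behaviour may need handling before swapping grepl() out.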

For substitution functions, we can replace

  • gsub("^static_rx$", "replacement", x) by (function(.) { .[. == "static_rx"] <- "replacement"; . })(x)
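
A minimal sketch of the substitution case, again on made-up data. Besides the anonymous function from the bullet above, it uses base R's replace(), which performs the same subset assignment; treating that as the suggested replacement is an assumption here, not something proposed in this issue.

```r
x <- c("static_rx", "static_rx_tail", NA, "other")

via_gsub    <- gsub("^static_rx$", "replacement", x)
via_lambda  <- (function(.) { .[. == "static_rx"] <- "replacement"; . })(x)
# replace(x, list, values) is essentially `x[list] <- values; x`.
via_replace <- replace(x, x == "static_rx", "replacement")
```

On character input the three results should be identical, including the NA element. They can differ for non-character input such as factors, where gsub() returns a character vector.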

@AshesITR
Collaborator Author

AshesITR commented May 21, 2022

Benchmarks:

> x <- sample(letters, 1e3, TRUE)

> bench::mark(grepl("^a", x), startsWith(x, "a"))
# A tibble: 2 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("^a", x)      29.86µs  31.15µs    31909.        NA     3.19  9999     1    313.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 startsWith(x, "a")   2.22µs   2.28µs   426737.        NA    42.7   9999     1     23.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(grepl("a$", x), endsWith(x, "a"))
# A tibble: 2 × 13
  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("a$", x)    29.81µs  30.27µs    32644.        NA     3.26  9999     1    306.3ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 endsWith(x, "a")   3.55µs   3.61µs   273457.        NA    54.7   9998     2     36.6ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(grepl("^a$", x), x == "a")
# A tibble: 2 × 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("^a$", x)  30.06µs  30.54µs    32414.        NA     6.48  9998     2    308.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 x == "a"          2.54µs   2.58µs   383640.        NA    38.4   9999     1     26.1ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(gsub("^a$", "", x), ifelse(x == "a", "", x))
# A tibble: 2 × 13
  expression                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time               gc                  
  <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>             <list>              
1 gsub("^a$", "", x)        85.3µs   86.2µs    11326.        NA     4.09  5542     2      489ms <chr [1,000]> <NULL> <bench_tm [5,544]> <tibble [5,544 × 3]>
2 ifelse(x == "a", "", x)  102.2µs  103.5µs     9609.        NA    17.3   4452     8      463ms <chr [1,000]> <NULL> <bench_tm [4,460]> <tibble [4,460 × 3]>

> bench::mark(gsub("^a$", "", x), dplyr::recode(x, "a" = ""))
# A tibble: 2 × 13
  expression                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time               gc                  
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>             <list>              
1 gsub("^a$", "", x)         85.5µs   86.9µs    11181.        NA     2.02  5535     1      495ms <chr [1,000]> <NULL> <bench_tm [5,536]> <tibble [5,536 × 3]>
2 dplyr::recode(x, a = "")   66.2µs   68.5µs    14303.        NA    50.9   5341    19      373ms <chr [1,000]> <NULL> <bench_tm [5,360]> <tibble [5,360 × 3]>

> bench::mark(gsub("^a$", "", x), (\(.) {.[. == "a"] <- ""; .})(x))
# A tibble: 2 × 13
  expression                                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>                               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 gsub("^a$", "", x)                        82.84µs   85.9µs    11458.        NA     2.02  5673     1    495.1ms <chr [1,000]> <NULL> <bench_tm [5,674]>  <tibble [5,674 × 3]> 
2 (function(.) { .[. == "a"] <- "" . })(x)   6.33µs   6.79µs   145736.        NA    58.3   9996     4     68.6ms <chr [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

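Interestingly, gsub("^a$", "", x) is faster than ifelse() here (median 86.2µs vs. 103.5µs), while the subset-assignment version (median 6.79µs) is faster than both.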

@MichaelChirico
Collaborator

We actually have a different linter for this. It looks for grepl() and substr() usages that can become startsWith()/endsWith().

Because of the substr() part, it was made a separate linter.

@AshesITR
Collaborator Author

Fine by me, although the regex that detects static regexes would need to be duplicated in that case.
Also for the == case?

@MichaelChirico
Collaborator

using the C version, we just reused is_not_regex after skipping the initial ^

@AshesITR
Collaborator Author

Oh, yeah. We can do the same. if (startsWith(x, "^") && is_not_regex(substr(x, 2L, nchar(x)))) ...
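
The check above could be sketched roughly as follows. Note that is_static_sketch() is a hypothetical stand-in for lintr's internal is_not_regex(), whose actual logic is not shown in this thread, and suggest_replacement() and its return labels are made up for illustration.

```r
# Hypothetical stand-in for lintr's internal is_not_regex(): a pattern counts
# as static here if it contains none of these metacharacters.
metachars <- c("\\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "]", "{", "}")
is_static_sketch <- function(s) {
  !any(vapply(metachars, grepl, logical(1L), x = s, fixed = TRUE))
}

# Strip a leading "^" and/or trailing "$", then test what is left.
suggest_replacement <- function(rx) {
  starts <- startsWith(rx, "^")
  ends <- endsWith(rx, "$")
  n <- nchar(rx)
  core <- substr(rx, if (starts) 2L else 1L, if (ends) n - 1L else n)
  if (!nzchar(core) || !is_static_sketch(core)) {
    return("keep regex")
  }
  if (starts && ends) {
    "=="
  } else if (starts) {
    "startsWith"
  } else if (ends) {
    "endsWith"
  } else {
    "fixed = TRUE"
  }
}
```

The sketch is deliberately conservative: an escaped trailing dollar sign leaves a backslash in the remainder, so such a pattern falls through to "keep regex".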

AshesITR added the feature label May 22, 2022

@MichaelChirico
Collaborator

This is mostly handled by string_boundary_linter.

What's left is to consider regexes like ^static$, but I am not sure how common they'll be.
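
To see what string_boundary_linter reports, one could lint a snippet directly. This is a sketch assuming lintr 3.0.0 or later is installed (the version in which, to my knowledge, both string_boundary_linter() and the text argument of lint() are available); the linted snippet is made up.

```r
# Guarded so the example is a no-op when a suitable lintr is not installed.
if (requireNamespace("lintr", quietly = TRUE) &&
    utils::packageVersion("lintr") >= "3.0.0") {
  lints <- lintr::lint(
    text = 'grepl("^static", x)',
    linters = lintr::string_boundary_linter()
  )
  print(lints)
}
```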

@MichaelChirico
Collaborator

^ Following the above comment, I'll close this and replace it with a more focused issue extending string_boundary_linter to cases like ^static$.
