Use axpy for scaled_add #278


Merged: 4 commits merged into rust-ndarray:master on Mar 22, 2017

Conversation

@SuperFluffy (Contributor) commented Mar 18, 2017

This adds a method to use saxpy and daxpy when doing scaled addition. It gives a nice performance boost:

test scaled_add_2d_f32_axpy           ... bench:         312 ns/iter (+/- 13)
test scaled_add_2d_f32_regular        ... bench:         571 ns/iter (+/- 23)

The main addition in this code is Dimension::equispaced_stride, which is a generalization of the code formerly inside Dimension::is_contiguous: if all elements of an array are one ptr-width away from each other, we call the array contiguous. If all elements of an array are exactly n ptr-widths away from each other, we call the array equispaced (not sure if that's the right terminology). If n = 1, then an equispaced array is contiguous. As long as an array is equispaced, it is admissible to blas via the incx and incy variables (in the case of *axpy).
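
As a rough illustration of the idea, here is a minimal sketch of such a check (the name, signature, and axis ordering are assumptions for this example, not the actual ndarray code):

fn equispaced_stride(shape: &[usize], strides: &[isize]) -> Option<isize> {
    // Assumes the axes are ordered so that the fastest-varying axis comes last.
    let n = shape.len();
    if n == 0 {
        return Some(1);
    }
    let base = strides[n - 1];
    let mut expected = base;
    // Each slower axis must step over exactly what the faster axes cover,
    // otherwise the elements are not all `base` index steps apart.
    for i in (0..n - 1).rev() {
        expected *= shape[i + 1] as isize;
        if strides[i] != expected {
            return None;
        }
    }
    // base == 1 is the contiguous case; any other base still fits blas' incx/incy.
    Some(base)
}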

I think we might be able to use this for the gemm implementations, too.

EDIT: I notice that there are inlined versions of the Dimension trait functions for 1- to 3-dimensional arrays. I guess I should extend those too?

EDIT 2: I am also not sure I like the extra allocations in that continuity/equidistance check. But I guess it doesn't matter much for large arrays.

@@ -73,6 +73,11 @@ impl<A, S, D> ArrayBase<S, D> where S: Data<Elem=A>, D: Dimension
        self.dim.clone()
    }

    /// Return the strides of the array as they are stored in the array.
    pub fn raw_strides(&self) -> D {
Member:

This should not be exposed. You can use the strides field directly.

Contributor Author:

I rolled back this change by amending the original PR and am using the strides field directly now (I didn't think about the linalg module having direct access to ArrayBase's private fields, since ArrayBase is defined in lib and linalg is a child module of lib).

Member:

children have access to it all (which is why some methods are defined in lib.rs directly)

@bluss (Member) commented Mar 19, 2017

Just like matrix multiplication, this should not be a new method, it should use scaled_add (It's the reason scaled_add exists, so that it can get this improvement).

    }

    unsafe {
        Some(::std::ptr::read(&base_stride as *const _ as *const isize))
Member:

I'm unsure what this is doing but it should not be unsafe.

Contributor Author:

Rolled that back. I was following some of the other casting examples in the code, but a direct base_stride as isize makes more sense.

Member:

Hm surely ndarray doesn't do anything silly like this. The exception would be for the Any cast, where we have no choice.

Contributor Author:

In fact I got the idea from this:

// Read pointer to type `A` as type `B`.
//
// **Panics** if `A` and `B` are not the same type
fn cast_as<A: 'static + Copy, B: 'static + Copy>(a: &A) -> B {
    assert!(same_type::<A, B>());
    unsafe {
        ::std::ptr::read(a as *const _ as *const B)
    }
}

The way these casts are used in the gemm definition, e.g. cast_as(&alpha), you could just replace them by alpha as $ty, since c_float and c_double are just type aliases for f32 and f64, respectively: https://doc.rust-lang.org/libc/x86_64-unknown-linux-gnu/libc/type.c_double.html

Or is there something I don't see?

@bluss (Member) commented Mar 19, 2017

I don't see any extra allocations, where is that?

@SuperFluffy (Contributor Author)

The allocations happen due to _fastest_varying_stride_order around https://github.com/SuperFluffy/rust-ndarray/blob/936b9aced096b60707fff9b003c23353f0f6f8ef/src/linalg/impl_linalg.rs#L330.

blas_compat is also called twice, which is also calling _fastest_varying_stride_order via equispaced_stride.

@SuperFluffy (Contributor Author)

I moved the blas call into scaled_add_impl, similar to the dot one, together with a blas feature switch.

I had an extra _axpy method because it made benching and testing easier. Now I am not sure how to explicitly test the performance between the two.

@SuperFluffy (Contributor Author)

Since ndarray will most likely be used by performance-sensitive people, is it wise to automatically fall back to the non-blas version of dot and scaled_add when the arrays are deemed non-blas-compatible? Maybe the methods should panic instead (or return an error) so that the user knows what's up?

@bluss (Member) commented Mar 19, 2017

There's no dynamic memory allocation in _fastest_varying_stride_order that I can see. (The exception is for dynamic dimensionality IxD, but then that's something that happens everywhere due to necessity.)

This is what _fastest_varying_stride_order does for the 2d (Ix2) case:

    #[inline]
    fn _fastest_varying_stride_order(&self) -> Self {
        if get!(self, 0) as Ixs <= get!(self, 1) as Ixs { Ix2(0, 1) } else { Ix2(1, 0) }
    }

Much better with the code moved.

I don't agree with that view on performance, ndarray will simply try its best to be performant.

@bluss (Member) commented Mar 19, 2017

How to benchmark: Record benchmark results into a file with and without the blas feature. Then use cargo benchcmp to compare. I'll give it a spin.

@SuperFluffy (Contributor Author)

Re: _fastest_varying_stride_order

Ah yes! I was looking at the generic version which always comes with clone()s.

Should equispaced_stride also be included and #[inline]d in the Dimension impls?

@bluss (Member) commented Mar 19, 2017

.clone() doesn't automatically mean allocations. Ix2's clone is copying a [usize; 2]

@bluss (Member) commented Mar 19, 2017

TBH I haven't looked in detail at every line of the PR yet. I'm meaning to look through with a thought towards integer maximum values in blas first.

@bluss (Member) commented Mar 19, 2017

This is the 64x64 f32 case in benchmarks. However! This is only with default codegen

cargo-benchcmp noblas1 blas2
 name                       noblas1 ns/iter  blas2 ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  552              512                     -40  -7.25%

If I use target-cpu=native and let regular rust use AVX (on this configuration) then regular Rust improves

RUSTFLAGS=-Ctarget-cpu=native cargo bench scaled_add
RUSTFLAGS=-Ctarget-cpu=native cargo bench --features=test scaled_add
cargo-benchcmp noblas_native.1  blas_native.2
 name                       noblas_native.1 ns/iter  blas_native.2 ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  510                      493                             -17  -3.33% 

@bluss (Member) commented Mar 19, 2017

(Patch used to compile without blas)

diff --git a/benches/bench1.rs b/benches/bench1.rs
index 08f70a9..b71341f 100644
--- a/benches/bench1.rs
+++ b/benches/bench1.rs
@@ -1,6 +1,7 @@
 #![feature(test)]
 #![allow(unused_imports)]
 
+#[cfg(feature = "blas")]
 extern crate blas_sys;
 extern crate test;
 #[macro_use(s)]

@bluss (Member) commented Mar 19, 2017

General before/after without blas for this PR, just to see if there are any changes due to the Dimension changes. From what I can see there aren't any, just fluctuations.

https://gist.github.com/bluss/1e2b7fd3141da40603d7d98885299525

@SuperFluffy (Contributor Author)

I wasn't aware how much RUSTFLAGS=-Ctarget-cpu=native does:

% cargo benchcmp native_blas native_noblas             
 name                       native_blas ns/iter  native_noblas ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  337                  312                             -25  -7.42%
% cargo benchcmp nonnative_blas nonnative_noblas
 name                       nonnative_blas ns/iter  nonnative_noblas ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  353                     588                                235  66.57% 

@SuperFluffy (Contributor Author)

What happened here?

map_stride_double_f64 794 520 -274 -34.51%

@bluss (Member) commented Mar 19, 2017

Random fluctuations. There simply isn't one benchmark run that is consistent in all the variables. To find a regression you have to find a suspicious case and then try to find more evidence for a regression. In this case there is none. A different run:

cargo-benchcmp before.1 after.0 | grep map_stride
 map_stride                        87                88                          1    1.15% 
 map_stride_double_f64             516               520                         4    0.78% 
 map_stride_f64                    543               543                         0    0.00% 
 map_stride_u32                    565               558                        -7   -1.24% 

@bluss (Member) commented Mar 19, 2017

I don't think anyone escapes CPUs that change their cpufreq dynamically nowadays.

@bluss (Member) commented Mar 20, 2017

blas wins if we use arrays that are equispaced by 2 (scaled_add_2d_f32_stride):

cargo-benchcmp noblas_stride.1 blas_stride.1
 name                       noblas_stride.1 ns/iter  blas_stride.1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  515                      506                              -9   -1.75% 
 scaled_add_2d_f32_stride   2,801                    2,068                          -733  -26.17%

const SCALE_ADD_SZ: usize = 64;

#[bench]
fn scaled_add_2d_f32_regular(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}

#[bench]
fn scaled_add_2d_f32_stride(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let mut av = av.slice_mut(s![.., ..;2]);
    let bv = bv.slice(s![.., ..;2]);
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}

@SuperFluffy (Contributor Author) commented Mar 20, 2017 via email

@SuperFluffy (Contributor Author)

I cranked up SCALE_ADD_SZ to 512. While OpenBLAS is still faster, the difference is less pronounced for single-threaded code. But you can see how the bench really benefits from threading:

 name                       native_noblas ns/iter  native_blas_thread_4 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 37,479                             -39,103  -51.06% 
 scaled_add_2d_f32_stride   261,775                121,237                           -140,538  -53.69% 

 name                       native_noblas ns/iter  native_blas_thread_2 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 40,386                             -36,196  -47.26% 
 scaled_add_2d_f32_stride   261,775                131,919                           -129,856  -49.61% 

 name                       native_noblas ns/iter  native_blas_thread_1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 76,494                                 -88   -0.11% 
 scaled_add_2d_f32_stride   261,775                226,802                            -34,973  -13.36%

@SuperFluffy (Contributor Author) commented Mar 20, 2017

It further looks like OpenBLAS is smart when it comes to parallelizing. For SCALE_ADD_SZ = 64 there is a) no difference between the different thread numbers, and b) it's always a lot faster than the non-blas implementation:

 name                       native_noblas ns/iter  native_blas_thread_4 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    327                                     14    4.47% 
 scaled_add_2d_f32_stride   3,163                  2,359                                 -804  -25.42% 

 name                       native_noblas ns/iter  native_blas_thread_2 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    333                                     20    6.39% 
 scaled_add_2d_f32_stride   3,163                  2,332                                 -831  -26.27% 

 name                       native_noblas ns/iter  native_blas_thread_1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    319                                      6    1.92% 
 scaled_add_2d_f32_stride   3,163                  2,333                                 -830  -26.24%

So what's happening between the regular and the strided arrays that ndarray loses so much speed?

Note: I compiled all these benches with RUSTFLAGS=-Ctarget-cpu=native, and also OpenBLAS was compiled with Haswell optimizations, i.e. I have this:

ls -1 libopenblas*
libopenblas.a -> libopenblas_haswellp-r0.2.19.a
libopenblas.so -> libopenblas_haswellp-r0.2.19.so
libopenblas.so.0 -> libopenblas_haswellp-r0.2.19.so
libopenblas_haswellp-r0.2.19.a
libopenblas_haswellp-r0.2.19.so

@bluss (Member) commented Mar 20, 2017

Right. I forgot about openblas's threading.

ndarray doesn't contain a lot of special case code. As you know we have efficient common cases (array is contiguous) and fall back to more general loops for other cases.
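
The shape of that fallback is roughly the following (an illustrative sketch only, not ndarray's internal code; scaled_add_sketch is a made-up name):

use ndarray::{ArrayView2, ArrayViewMut2};

fn scaled_add_sketch(mut a: ArrayViewMut2<f32>, b: ArrayView2<f32>, alpha: f32) {
    // Fast path: both arrays are contiguous and expose their data as plain slices.
    if let (Some(xs), Some(ys)) = (a.as_slice_mut(), b.as_slice()) {
        for (x, &y) in xs.iter_mut().zip(ys) {
            *x += alpha * y;
        }
        return;
    }
    // General path: element-wise traversal that follows arbitrary strides.
    a.zip_mut_with(&b, |x, &y| *x += alpha * y);
}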

@bluss (Member) commented Mar 20, 2017

Besides, isn't it clear that blas is also losing a lot due to the strided case?

It's pretty simple I think: If we have contiguous data, we can load simd words of 4 or 8 or 16 values at once. Can't do that if there are “holes”.

Unfortunately an equidistance=2 array has to act as if something could be writing to the other elements in between, so we're not allowed to even read those IIUC.

@bluss (Member) commented Mar 20, 2017

Clearly we should go for using blas here, right?

@bluss (Member) commented Mar 20, 2017

For matrix multiplication we always pack the matrices into contiguous buffers, so we can always use SIMD.

@bluss (Member) commented Mar 20, 2017

However, that is a particularity of mm: there we have enough arithmetic operations on the data that packing makes sense.

@SuperFluffy (Contributor Author)

  1. Yeah, BLAS is clearly losing speed, and the loss is in the same ballpark.
  2. Blas is clearly the way to go.

I guess there is no way to do an as_slice_mut with holes to get an extra case for zip_mut_with_same_shape?

@bluss (Member) commented Mar 20, 2017

There can always be more cases, but I'm not sure that's the problem.

@bluss (Member) commented Mar 20, 2017

Consider existing overhead in the iterators / loops of the current cases, for example we don't have zip specialization.

@SuperFluffy (Contributor Author)

What changes would you like on the PR to merge it? I guess I will add the strided scaled_add to the benches.

@@ -1,6 +1,7 @@
#![feature(test)]
#![allow(unused_imports)]

extern crate blas_sys;
Member:

cfg(feature = "blas") on top of this one

Member:

This line wasn't actually needed, was it?

Contributor Author:

Yeah, that one was not necessary.


    if a.len() > blas_index::max_value() as usize {
        return false;
    }
Member:

I think this should not be a.len() really but some kind of upper bound on the size * stride. I'll have to think about it.

Contributor Author:

Looking at saxpy.f:

DO i = 1,n
    sy(iy) = sy(iy) + sa*sx(ix)
    ix = ix + incx
    iy = iy + incy
END DO

It looks like we need a.len() * stride?

Member:

Right, and ndarray will already guarantee that len * stride does not overflow, so we can just multiply them

Member:

Not overflow usize that is.

Contributor Author:

Changed it around a bit. I am not sure about whether we should cast to usize or isize though...

Member:

Right, stride signedness. I'd use stride.abs() and usize.

I also see how you mean it dispatches at runtime. It does; the type lookup is all compile time (it's statically known), but the size information is at runtime.
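
Spelled out, the bound under discussion might look something like this (fits_blas_index is a hypothetical helper and i32 stands in for blas_index; a sketch, not the code in the PR):

fn fits_blas_index(len: usize, stride: isize) -> bool {
    // The last element blas touches lies about len * |stride| elements from the
    // start, so that product, not len alone, has to fit into the blas index type.
    let max = i32::max_value() as usize;
    match len.checked_mul(stride.abs() as usize) {
        Some(span) => span <= max,
        None => false, // multiplication overflowed usize: certainly too large
    }
}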

Contributor Author:

Bit confused: do you want me to change to stride.abs() and cast to usize, or is the PR approved as is now?

        return false;
    }

    match D::equispaced_stride(&a.raw_dim(), &a.strides) {
Member:

Is it possible that this stride is 0 here? I don't think blas supports that.

Contributor Author:

Depends, can the stride ever be 0 in any dimension and for any kind of vector? An n-dimensional vector with one dimension set to 0 doesn't really make sense, so we should make sure that such a state can never be reached.

A special case is a 0-dimensional array. It looks like its stride is set to 0.

Member:

Yes it's very common, that's how broadcasting works (same in numpy)

>>> import numpy as np
>>> np.broadcast_to(np.arange(9.), (9, 9)).strides
(0, 8)
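
For reference, the same thing with ndarray's own broadcast (a small sketch; note that ndarray strides are counted in elements, not bytes like numpy's):

use ndarray::Array;

fn main() {
    let a = Array::from_vec((0..9).map(f64::from).collect());
    // Broadcasting repeats the array along the new axis without copying,
    // so that axis gets a stride of 0.
    let b = a.broadcast((9, 9)).expect("shapes are broadcast-compatible");
    println!("{:?}", b.strides()); // [0, 1]
}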

@SuperFluffy (Contributor Author) commented Mar 22, 2017

Ok, so the way the equispaced_stride method is implemented right now, it grabs the smallest stride and calculates the rest based on that. That means that if we indeed have one stride set to 0, the demanded/calculated stride will always be 0 irrespective of the dimension (because it predicts the necessary stride as s_{i+1} = s_i * d_i). It follows that the demanded stride will never match the rest of the stride info in the struct, so the function will return None and we won't be able to use blas.
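
Plugging the broadcast case into the illustrative equispaced_stride sketch from earlier in the thread shows the same behaviour (hypothetical helper, element strides):

assert_eq!(equispaced_stride(&[9, 9], &[0, 1]), None);    // broadcast axis: no blas
assert_eq!(equispaced_stride(&[3, 4], &[8, 2]), Some(2)); // stride-2 slice: blas ok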

@bluss bluss merged commit 245fcde into rust-ndarray:master Mar 22, 2017
@bluss (Member) commented Mar 22, 2017

Thank you! Merging to fix up the last bits myself like we talked about. Will need to document this a bit more explicitly.

    E: Dimension,
{
    debug_assert_eq!(self.len(), rhs.len());
    assert!(self.len() == rhs.len());
Member:

Lol, I missed a lot. This is not what we want to do.

Contributor Author:

We don't? :( I basically took this from dot_impl:

if self.len() >= DOT_BLAS_CUTOFF {
    debug_assert_eq!(self.len(), rhs.len());
    assert!(self.len() == rhs.len());

Member:

rhs is broadcast to self's dim if they mismatch. Yeah, dot doesn't broadcast.

@bluss (Member) commented Mar 22, 2017

Sorry, the mess is on my side. I've added more rigorous tests so that it has something to aspire to.

Contributor Author:

Right, the assertion there doesn't make any sense. With broadcasting, one stride would be 0 so that equispaced_stride returns None (as per the argument above), which would skip axpy and fall through to the default implementation. The assertion would simply prevent broadcasting with the generic scaled_add from triggering.

@bluss (Member) commented Mar 23, 2017

It's basically stuck on the fact that equispaced stride (just like is_contiguous) does not handle negative strides. If I don't find more time to fix this, I'll reduce the blas case to be less general until we can implement equispaced stride better.

bluss added a commit that referenced this pull request Mar 23, 2017
This reverts commit 245fcde, reversing
changes made to d9492d7.