Use axpy for scaled_add #278


Merged: 4 commits merged into rust-ndarray:master on Mar 22, 2017

Conversation

@SuperFluffy (Contributor) commented Mar 18, 2017

This adds a method to use saxpy and daxpy when doing scaled addition. It gives a nice performance boost:

test scaled_add_2d_f32_axpy           ... bench:         312 ns/iter (+/- 13)
test scaled_add_2d_f32_regular        ... bench:         571 ns/iter (+/- 23)

The main addition in this code is Dimension::equispaced_stride, which is a generalization of the code formerly inside Dimension::is_contiguous: if all elements of an array are one ptr-width away from each other, we call the array contiguous. If all elements of an array are exactly n ptr-widths away from each other, we call the array equispaced (not sure if that's the right terminology). If n = 1, then an equispaced array is contiguous. As long as an array is equispaced, it is admissible to blas via the incx and incy variables (in the case of *axpy).
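
As a rough illustration of the idea, here is a minimal sketch of such a check (the name, signature, and axis ordering are assumptions for this example, not the actual ndarray code):

fn equispaced_stride(shape: &[usize], strides: &[isize]) -> Option<isize> {
    // Assumes the axes are ordered so that the fastest-varying axis comes last.
    let n = shape.len();
    if n == 0 {
        return Some(1);
    }
    let base = strides[n - 1];
    let mut expected = base;
    // Each slower axis must step over exactly what the faster axes cover,
    // otherwise the elements are not all `base` index steps apart.
    for i in (0..n - 1).rev() {
        expected *= shape[i + 1] as isize;
        if strides[i] != expected {
            return None;
        }
    }
    // base == 1 is the contiguous case; any other base still fits blas' incx/incy.
    Some(base)
}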

I think we might be able to use this for the gemm implementations, too.

EDIT: I notice that there are inlined versions of the Dimension trait functions for 1- to 3-dimensional arrays. I guess I should extend those too?

EDIT 2: I am also not sure I like the extra allocations in that continuity/equidistance check. But I guess it doesn't matter much for large arrays.

@@ -73,6 +73,11 @@ impl<A, S, D> ArrayBase<S, D> where S: Data<Elem=A>, D: Dimension
        self.dim.clone()
    }

    /// Return the strides of the array as they are stored in the array.
    pub fn raw_strides(&self) -> D {
Member:

This should not be exposed. You can use the strides field directly.

Contributor Author:

I rolled back this change by amending the original PR and am using the strides field directly now (I didn't think about the linalg module having direct access to ArrayBase's private fields, since ArrayBase is defined in lib and linalg is a child module of lib).

Member:

children have access to it all (which is why some methods are defined in lib.rs directly)

@bluss (Member) commented Mar 19, 2017

Just like matrix multiplication, this should not be a new method, it should use scaled_add (It's the reason scaled_add exists, so that it can get this improvement).

    }

    unsafe {
        Some(::std::ptr::read(&base_stride as *const _ as *const isize))
Member:

I'm unsure what this is doing but it should not be unsafe.

Contributor Author:

Rolled that back. I was following some of the other casting examples in the code, but a direct base_stride as isize makes more sense.

Member:

Hm surely ndarray doesn't do anything silly like this. The exception would be for the Any cast, where we have no choice.

Contributor Author:

In fact I got the idea from this:

// Read pointer to type `A` as type `B`.
//
// **Panics** if `A` and `B` are not the same type
fn cast_as<A: 'static + Copy, B: 'static + Copy>(a: &A) -> B {
    assert!(same_type::<A, B>());
    unsafe {
        ::std::ptr::read(a as *const _ as *const B)
    }
}

The way these casts are used in the gemm definition, e.g. cast_as(&alpha), you could just replace them by alpha as $ty, since c_float and c_double are just type aliases for f32 and f64, respectively: https://doc.rust-lang.org/libc/x86_64-unknown-linux-gnu/libc/type.c_double.html

Or is there something I don't see?

@bluss (Member) commented Mar 19, 2017

I don't see any extra allocations, where is that?

@SuperFluffy (Contributor Author)

The allocations happen due to _fastest_varying_stride_order around https://github.com/SuperFluffy/rust-ndarray/blob/936b9aced096b60707fff9b003c23353f0f6f8ef/src/linalg/impl_linalg.rs#L330.

blas_compat is also called twice, which is also calling _fastest_varying_stride_order via equispaced_stride.

@SuperFluffy (Contributor Author)

I moved the blas call into scaled_add_impl, similar to the dot one, together with a blas feature switch.

I had an extra _axpy method because it made benching and testing easier. Now I am not sure how to explicitly test the performance between the two.

@SuperFluffy (Contributor Author)

Since ndarray will most likely be used by performance-sensitive people, is it wise to automatically fall back to the non-blas version of dot and scaled_add when the arrays are deemed non-blas-compatible? Maybe the methods should panic instead (or return an error) so that the user knows what's up?

@bluss (Member) commented Mar 19, 2017

There's no dynamic memory allocation in _fastest_varying_stride_order that I can see. (The exception is for dynamic dimensionality IxD, but then that's something that happens everywhere due to necessity.)

This is what _fastest_varying_stride_order does for the 2d (Ix2) case:

    #[inline]
    fn _fastest_varying_stride_order(&self) -> Self {
        if get!(self, 0) as Ixs <= get!(self, 1) as Ixs { Ix2(0, 1) } else { Ix2(1, 0) }
    }

Much better with the code moved.

I don't agree with that view on performance, ndarray will simply try its best to be performant.

@bluss (Member) commented Mar 19, 2017

How to benchmark: Record benchmark results into a file with and without the blas feature. Then use cargo benchcmp to compare. I'll give it a spin.

@SuperFluffy (Contributor Author)

Re: _fastest_varying_stride_order

Ah yes! I was looking at the generic version which always comes with clone()s.

Should equispaced_stride also be included and #[inline]d in the Dimension impls?

@bluss (Member) commented Mar 19, 2017

.clone() doesn't automatically mean allocations. Ix2's clone is copying a [usize; 2]

@bluss (Member) commented Mar 19, 2017

TBH I haven't looked in detail at every line of the PR yet. I'm meaning to look through with a thought towards integer maximum values in blas first.

@bluss (Member) commented Mar 19, 2017

This is the 64x64 f32 case in benchmarks. However! This is only with default codegen

cargo-benchcmp noblas1 blas2
 name                       noblas1 ns/iter  blas2 ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  552              512                     -40  -7.25%

If I use target-cpu=native and let regular rust use AVX (on this configuration) then regular Rust improves

RUSTFLAGS=-Ctarget-cpu=native cargo bench scaled_add
RUSTFLAGS=-Ctarget-cpu=native cargo bench --features=test scaled_add
cargo-benchcmp noblas_native.1  blas_native.2
 name                       noblas_native.1 ns/iter  blas_native.2 ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  510                      493                             -17  -3.33% 

@bluss (Member) commented Mar 19, 2017

(Patch used to compile without blas)

diff --git a/benches/bench1.rs b/benches/bench1.rs
index 08f70a9..b71341f 100644
--- a/benches/bench1.rs
+++ b/benches/bench1.rs
@@ -1,6 +1,7 @@
 #![feature(test)]
 #![allow(unused_imports)]
 
+#[cfg(feature = "blas")]
 extern crate blas_sys;
 extern crate test;
 #[macro_use(s)]

@bluss (Member) commented Mar 19, 2017

General before/after without blas for this PR, just to see if there are any changes due to the Dimension changes. From what I can see there aren't any, just fluctuations.

https://gist.github.com/bluss/1e2b7fd3141da40603d7d98885299525

@SuperFluffy (Contributor Author)

I wasn't aware how much RUSTFLAGS=-Ctarget-cpu=native does:

% cargo benchcmp native_blas native_noblas             
 name                       native_blas ns/iter  native_noblas ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  337                  312                             -25  -7.42%
% cargo benchcmp nonnative_blas nonnative_noblas
 name                       nonnative_blas ns/iter  nonnative_noblas ns/iter  diff ns/iter  diff % 
 scaled_add_2d_f32_regular  353                     588                                235  66.57% 

@SuperFluffy (Contributor Author)

What happened here?

map_stride_double_f64 794 520 -274 -34.51%

@bluss (Member) commented Mar 19, 2017

Random fluctuations. There simply isn't one benchmark run that is consistent in all the variables. To find a regression you have to find a suspicious case and then try to find more evidence for a regression. In this case there is none. A different run:

cargo-benchcmp before.1 after.0 | grep map_stride
 map_stride                        87                88                          1    1.15% 
 map_stride_double_f64             516               520                         4    0.78% 
 map_stride_f64                    543               543                         0    0.00% 
 map_stride_u32                    565               558                        -7   -1.24% 

@bluss (Member) commented Mar 19, 2017

I don't think anyone escapes CPUs that change their cpufreq dynamically nowadays.

@bluss (Member) commented Mar 20, 2017

blas wins if we use arrays that are equispaced by 2 (scaled_add_2d_f32_stride):

cargo-benchcmp noblas_stride.1 blas_stride.1
 name                       noblas_stride.1 ns/iter  blas_stride.1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  515                      506                              -9   -1.75% 
 scaled_add_2d_f32_stride   2,801                    2,068                          -733  -26.17%

const SCALE_ADD_SZ: usize = 64;

#[bench]
fn scaled_add_2d_f32_regular(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}

#[bench]
fn scaled_add_2d_f32_stride(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let mut av = av.slice_mut(s![.., ..;2]);
    let bv = bv.slice(s![.., ..;2]);
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}

@SuperFluffy (Contributor Author) commented Mar 20, 2017 via email

@SuperFluffy (Contributor Author)

I cranked up SCALE_ADD_SZ to 512. While OpenBLAS is still faster, the difference is less pronounced for single-threaded code. But you can see how the bench really benefits from threading:

 name                       native_noblas ns/iter  native_blas_thread_4 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 37,479                             -39,103  -51.06% 
 scaled_add_2d_f32_stride   261,775                121,237                           -140,538  -53.69% 

 name                       native_noblas ns/iter  native_blas_thread_2 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 40,386                             -36,196  -47.26% 
 scaled_add_2d_f32_stride   261,775                131,919                           -129,856  -49.61% 

 name                       native_noblas ns/iter  native_blas_thread_1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  76,582                 76,494                                 -88   -0.11% 
 scaled_add_2d_f32_stride   261,775                226,802                            -34,973  -13.36%

@SuperFluffy (Contributor Author) commented Mar 20, 2017

It further looks like OpenBLAS is smart when it comes to parallelizing. For SCALE_ADD_SZ = 64 there is a) no difference between the different thread numbers, and b) it's always a lot faster than the non-blas implementation:

 name                       native_noblas ns/iter  native_blas_thread_4 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    327                                     14    4.47% 
 scaled_add_2d_f32_stride   3,163                  2,359                                 -804  -25.42% 

 name                       native_noblas ns/iter  native_blas_thread_2 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    333                                     20    6.39% 
 scaled_add_2d_f32_stride   3,163                  2,332                                 -831  -26.27% 

 name                       native_noblas ns/iter  native_blas_thread_1 ns/iter  diff ns/iter   diff % 
 scaled_add_2d_f32_regular  313                    319                                      6    1.92% 
 scaled_add_2d_f32_stride   3,163                  2,333                                 -830  -26.24%

So what's happening between the regular and the strided arrays that ndarray loses so much speed?

Note: I compiled all these benches with RUSTFLAGS=-Ctarget-cpu=native, and also OpenBLAS was compiled with Haswell optimizations, i.e. I have this:

ls -1 libopenblas*
libopenblas.a -> libopenblas_haswellp-r0.2.19.a
libopenblas.so -> libopenblas_haswellp-r0.2.19.so
libopenblas.so.0 -> libopenblas_haswellp-r0.2.19.so
libopenblas_haswellp-r0.2.19.a
libopenblas_haswellp-r0.2.19.so

@bluss (Member) commented Mar 20, 2017

Right. I forgot about openblas's threading.

ndarray doesn't contain a lot of special case code. As you know we have efficient common cases (array is contiguous) and fall back to more general loops for other cases.
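
The shape of that fallback is roughly the following (an illustrative sketch only, not ndarray's internal code; scaled_add_sketch is a made-up name):

use ndarray::{ArrayView2, ArrayViewMut2};

fn scaled_add_sketch(mut a: ArrayViewMut2<f32>, b: ArrayView2<f32>, alpha: f32) {
    // Fast path: both arrays are contiguous and expose their data as plain slices.
    if let (Some(xs), Some(ys)) = (a.as_slice_mut(), b.as_slice()) {
        for (x, &y) in xs.iter_mut().zip(ys) {
            *x += alpha * y;
        }
        return;
    }
    // General path: element-wise traversal that follows arbitrary strides.
    a.zip_mut_with(&b, |x, &y| *x += alpha * y);
}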

@bluss (Member) commented Mar 20, 2017

Besides, isn't it clear that blas is also losing a lot due to the strided case?

It's pretty simple I think: If we have contiguous data, we can load simd words of 4 or 8 or 16 values at once. Can't do that if there are “holes”.

Unfortunately an equidistance=2 array has to act as if something could be writing to the other elements in between, so we're not allowed to even read those IIUC.

@bluss (Member) commented Mar 20, 2017

Clearly we should go for using blas here, right?

@bluss (Member) commented Mar 20, 2017

For matrix multiplication we always pack the matrices into contiguous buffers, so we can always use SIMD.

@bluss (Member) commented Mar 20, 2017

However, that is a particularity of mm: there we have enough arithmetic operations on the data that packing makes sense.

@SuperFluffy (Contributor Author)

  1. Yeah, BLAS is clearly losing speed, and the loss is in the same ballpark.
  2. Blas is clearly the way to go.

I guess there is no way to do an as_slice_mut with holes to get an extra case for zip_mut_with_same_shape?

@bluss (Member) commented Mar 20, 2017

There can always be more cases, but I'm not sure that's the problem.

@bluss (Member) commented Mar 20, 2017

Consider existing overhead in the iterators / loops of the current cases, for example we don't have zip specialization.

@SuperFluffy (Contributor Author)

What changes would you like on the PR to merge it? I guess I will add the strided scaled_add to the benches.

@@ -1,6 +1,7 @@
#![feature(test)]
#![allow(unused_imports)]

extern crate blas_sys;
Member:

cfg(feature = "blas") on top of this one

Member:

This line wasn't actually needed, was it?

Contributor Author:

Yeah, that one was not necessary.


    if a.len() > blas_index::max_value() as usize {
        return false;
    }
Member:

I think this should not be a.len() really but some kind of upper bound on the size * stride. I'll have to think about it.

Contributor Author:

Looking at saxpy.f:

DO i = 1,n
    sy(iy) = sy(iy) + sa*sx(ix)
    ix = ix + incx
    iy = iy + incy
END DO

It looks like we need a.len() * stride?

Member:

Right, and ndarray will already guarantee that len * stride does not overflow, so we can just multiply them

Member:

Not overflow usize that is.

Contributor Author:

Changed it around a bit. I am not sure about whether we should cast to usize or isize though...

Member:

Right, stride signedness. I'd use stride.abs() and usize.

I also see how you mean it dispatches at runtime. It does; the type lookup is all compile time (it's statically known), but the size information is at runtime.
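
Spelled out, the bound under discussion might look something like this (fits_blas_index is a hypothetical helper and i32 stands in for blas_index; a sketch, not the code in the PR):

fn fits_blas_index(len: usize, stride: isize) -> bool {
    // The last element blas touches lies about len * |stride| elements from the
    // start, so that product, not len alone, has to fit into the blas index type.
    let max = i32::max_value() as usize;
    match len.checked_mul(stride.abs() as usize) {
        Some(span) => span <= max,
        None => false, // multiplication overflowed usize: certainly too large
    }
}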

Contributor Author:

Bit confused: do you want me to change to stride.abs() and cast to usize, or is the PR approved as is now?

        return false;
    }

    match D::equispaced_stride(&a.raw_dim(), &a.strides) {
Member:

Is it possible that this stride is 0 here? I don't think blas supports that.

Contributor Author:

Depends, can the stride ever be 0 in any dimension and for any kind of vector? An n-dimensional vector with one dimension set to 0 doesn't really make sense, so we should make sure that such a state can never be reached.

A special case is a 0-dimensional array. It looks like its stride is set to 0.

Member:

Yes it's very common, that's how broadcasting works (same in numpy)

>>> import numpy as np
>>> np.broadcast_to(np.arange(9.), (9, 9)).strides
(0, 8)
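
For reference, the same thing with ndarray's own broadcast (a small sketch; note that ndarray strides are counted in elements, not bytes like numpy's):

use ndarray::Array;

fn main() {
    let a = Array::from_vec((0..9).map(f64::from).collect());
    // Broadcasting repeats the array along the new axis without copying,
    // so that axis gets a stride of 0.
    let b = a.broadcast((9, 9)).expect("shapes are broadcast-compatible");
    println!("{:?}", b.strides()); // [0, 1]
}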

@SuperFluffy (Contributor Author) commented Mar 22, 2017

Ok, so the way the equispaced_stride method is implemented right now, it grabs the smallest stride and calculates the rest based on that. That means that if we indeed have one stride set to 0, the demanded/calculated stride will always be 0 irrespective of the dimension (because it predicts the necessary stride as s_{i+1} = s_i * d_i). It follows that the demanded stride will never match the rest of the stride info in the struct, so the function will return None and we won't be able to use blas.
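
Plugging the broadcast case into the illustrative equispaced_stride sketch from earlier in the thread shows the same behaviour (hypothetical helper, element strides):

assert_eq!(equispaced_stride(&[9, 9], &[0, 1]), None);    // broadcast axis: no blas
assert_eq!(equispaced_stride(&[3, 4], &[8, 2]), Some(2)); // stride-2 slice: blas ok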

@bluss bluss merged commit 245fcde into rust-ndarray:master Mar 22, 2017
@bluss (Member) commented Mar 22, 2017

Thank you! Merging to fix up the last bits myself like we talked about. Will need to document this a bit more explicitly.

    E: Dimension,
{
    debug_assert_eq!(self.len(), rhs.len());
    assert!(self.len() == rhs.len());
Member:

Lol, I missed a lot. This is not what we want to do.

Contributor Author:

We don't? :( I basically took this from dot_impl:

if self.len() >= DOT_BLAS_CUTOFF {
    debug_assert_eq!(self.len(), rhs.len());
    assert!(self.len() == rhs.len());

Member:

rhs is broadcast to self's dim if they mismatch. Yeah, dot doesn't broadcast.

@bluss (Member) commented Mar 22, 2017

Sorry, the mess is on my side. I've added more rigorous tests so that it has something to aspire to.

Contributor Author:

Right, the assertion there doesn't make any sense. With broadcasting, one stride would be 0 so that equispaced_stride returns None (as per the argument above), which would skip axpy and fall through to the default implementation. The assertion would simply prevent broadcasting with the generic scaled_add from triggering.

@bluss (Member) commented Mar 23, 2017

It's basically stuck on the fact that equispaced stride (just like is_contiguous) does not handle negative strides. If I don't find more time to fix this, I'll reduce the blas case to be less general until we can implement equispaced stride better.

bluss added a commit that referenced this pull request Mar 23, 2017
This reverts commit 245fcde, reversing
changes made to d9492d7.