Use axpy for scaled_add #278
Conversation
src/impl_methods.rs (Outdated)
@@ -73,6 +73,11 @@ impl<A, S, D> ArrayBase<S, D> where S: Data<Elem=A>, D: Dimension
        self.dim.clone()
    }

    /// Return the strides of the array as they are stored in the array.
    pub fn raw_strides(&self) -> D {
This should not be exposed. You can use the strides field directly.
I rolled back this change by amending the original PR. I'm now using the `strides` field directly (I didn't think about the `linalg` module having direct access to `ArrayBase`'s private fields, since `ArrayBase` is defined in `lib` and `linalg` is a child module of `lib`).
children have access to it all (which is why some methods are defined in lib.rs directly)
Just like matrix multiplication, this should not be a new method; it should use `scaled_add` (it's the reason `scaled_add` exists, so that it can get this improvement).
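To make the suggestion concrete, here is a rough sketch of how such internal dispatch could look (the `try_scaled_add_blas` helper is hypothetical, not the PR's code; the fallback mirrors a generic elementwise scaled_add):

```rust
use ndarray::{ArrayBase, Data, DataMut, Dimension, LinalgScalar};

impl<A, S, D> ArrayBase<S, D>
    where S: DataMut<Elem = A>,
          D: Dimension,
          A: LinalgScalar,
{
    // Rough sketch, not the PR's exact code: scaled_add keeps its public
    // signature and only tries BLAS internally, falling back to the
    // generic elementwise loop when BLAS can't be used.
    pub fn scaled_add<S2, E>(&mut self, alpha: A, rhs: &ArrayBase<S2, E>)
        where S2: Data<Elem = A>,
              E: Dimension,
    {
        #[cfg(feature = "blas")]
        {
            // `try_scaled_add_blas` is a hypothetical helper returning true
            // when layout and element type allowed dispatching to saxpy/daxpy.
            if self.try_scaled_add_blas(alpha, rhs) {
                return;
            }
        }
        // Generic fallback: elementwise self += alpha * rhs.
        self.zip_mut_with(rhs, move |y, &x| *y = *y + (alpha * x));
    }
}
```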
src/dimension/dimension_trait.rs (Outdated)
    }

    unsafe {
        Some(::std::ptr::read(&base_stride as *const _ as *const isize))
I'm unsure what this is doing but it should not be unsafe.
Rolled that back. I was following some of the other casting examples in the code, but a direct `base_stride as isize` makes more sense.
Hm, surely ndarray doesn't do anything silly like this. The exception would be for the `Any` cast, where we have no choice.
In fact I got the idea from this:

// Read pointer to type `A` as type `B`.
//
// **Panics** if `A` and `B` are not the same type
fn cast_as<A: 'static + Copy, B: 'static + Copy>(a: &A) -> B {
    assert!(same_type::<A, B>());
    unsafe {
        ::std::ptr::read(a as *const _ as *const B)
    }
}
As these casts are used in the `gemm` definition (`cast_as(&alpha)`), you could just replace them by `alpha as $ty`, since `c_single` and `c_double` are just type aliases for `f32` and `f64`, respectively: https://doc.rust-lang.org/libc/x86_64-unknown-linux-gnu/libc/type.c_double.html
Or is there something I don't see?
I don't see any extra allocations, where is that?
The allocations happen due to
I moved the blas call into I had an extra
Since
There's no dynamic memory allocation in _fastest_varying_stride_order that I can see. (The exception is for dynamic dimensionality IxD, but then that's something that happens everywhere due to necessity.) This is what _fastest_varying_stride_order does for the 2d (Ix2) case:

#[inline]
fn _fastest_varying_stride_order(&self) -> Self {
    if get!(self, 0) as Ixs <= get!(self, 1) as Ixs { Ix2(0, 1) } else { Ix2(1, 0) }
}

Much better with the code moved. I don't agree with that view on performance; ndarray will simply try its best to be performant.
How to benchmark: record benchmark results into a file with and without the blas feature, then use cargo benchcmp to compare.

I'll give it a spin.
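In practice the comparison workflow could look something like this (the file names here are just placeholders; cargo-benchcmp must be installed):

cargo bench > noblas.txt
cargo bench --features blas > blas.txt
cargo benchcmp noblas.txt blas.txt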
Re: Ah yes! I was looking at the generic version which always comes with Should
.clone() doesn't automatically mean allocations. Ix2's clone is copying a [usize; 2]
TBH I haven't looked in detail at every line of the PR yet. I'm meaning to look through with a thought towards integer maximum values in blas first.
This is the 64x64 f32 case in benchmarks. However! This is only with default codegen
If I use
(Patch used to compile without blas)

diff --git a/benches/bench1.rs b/benches/bench1.rs
index 08f70a9..b71341f 100644
--- a/benches/bench1.rs
+++ b/benches/bench1.rs
@@ -1,6 +1,7 @@
 #![feature(test)]
 #![allow(unused_imports)]
+#[cfg(feature = "blas")]
 extern crate blas_sys;
 extern crate test;
 #[macro_use(s)]
General before / after without blas for this PR, just to see if there are any changes due to the Dimension changes; as far as I can see there aren't any, just fluctuations. https://gist.github.com/bluss/1e2b7fd3141da40603d7d98885299525
I wasn't aware how much
What happened here?
Random fluctuations. There simply isn't one benchmark run that is consistent in all the variables. To find a regression you have to find a suspicious case and then try to find more evidence for a regression. In this case there is none. A different run:
I don't think anyone escapes CPUs that change their cpufreq dynamically nowadays.
blas wins if we use equispaced by 2 arrays (scaled_add_2d_f32_stride):

const SCALE_ADD_SZ: usize = 64;

#[bench]
fn scaled_add_2d_f32_regular(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, SCALE_ADD_SZ));
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}

#[bench]
fn scaled_add_2d_f32_stride(bench: &mut test::Bencher)
{
    let mut av = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let bv = Array::<f32, _>::zeros((SCALE_ADD_SZ, 2 * SCALE_ADD_SZ));
    let mut av = av.slice_mut(s![.., ..;2]);
    let bv = bv.slice(s![.., ..;2]);
    let scalar = 3.1415926535;
    bench.iter(|| {
        av.scaled_add(scalar, &bv);
    });
}
It took me by surprise yesterday that OpenBlas turns out to be multithreaded. Could you redo the bench with threading disabled? See this answer in the OpenBlas FAQs: https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded

On Mar 20, 2017 09:46, "bluss" wrote:

blas wins if we use equispaced by 2 arrays (scaled_add_2d_f32_stride):
cargo-benchcmp noblas_stride.1 blas_stride.1
name noblas_stride.1 ns/iter blas_stride.1 ns/iter diff ns/iter diff %
scaled_add_2d_f32_regular 515 506 -9 -1.75%
scaled_add_2d_f32_stride 2,801 2,068 -733 -26.17%
I cranked up
It further looks like OpenBLAS is smart when it comes to parallelizing. For
So what's happening between the regular and the strided arrays such that ndarray loses so much speed? Note: I compiled all these benches with
Right. I forgot about openblas's threading. ndarray doesn't contain a lot of special case code. As you know, we have efficient common cases (array is contiguous) and fall back to more general loops for other cases.

Besides, isn't it clear that blas is also losing a lot due to the strided case? It's pretty simple I think: if we have contiguous data, we can load SIMD words of 4 or 8 or 16 values at once. We can't do that if there are “holes”. Unfortunately an equidistance=2 array has to act as if something could be writing to the other elements in between, so we're not allowed to even read those IIUC.

Clearly we should go for using blas here, right?

For matrix multiplication we always pack the matrices into contiguous buffers, so we can always use SIMD.

However, that is a particularity of mm: we have enough arithmetic operations on the data that it makes sense to do it.

I guess there is no way to do a

There can always be more cases, but I'm not sure that's the problem.

Consider the existing overhead in the iterators / loops of the current cases; for example, we don't have zip specialization.

What changes would you like on the PR to merge it? I guess I will add the strided
benches/bench1.rs (Outdated)
@@ -1,6 +1,7 @@
#![feature(test)]
#![allow(unused_imports)]

extern crate blas_sys;
cfg(feature = "blas") on top of this one
This line wasn't actually needed, was it?
Yeah, that one was not necessary.
src/linalg/impl_linalg.rs (Outdated)

    if a.len() > blas_index::max_value() as usize {
        return false;
    }
I think this should not really be a.len() but some kind of upper bound on the size * stride. I'll have to think about it.
Looking at saxpy.f:

DO i = 1,n
    sy(iy) = sy(iy) + sa*sx(ix)
    ix = ix + incx
    iy = iy + incy
END DO

It looks like we need a.len() * stride?
Right, and ndarray will already guarantee that len * stride does not overflow, so we can just multiply them
Not overflow usize that is.
Changed it around a bit. I am not sure about whether we should cast to usize or isize though...
Right, stride signedness. I'd use stride.abs() and usize.
I also see how you mean it dispatches at runtime. It does; the type lookup is all compile time (it's statically known), but the size information is at runtime.
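To sketch what that check could look like with stride.abs() and usize (the names here are hypothetical, not the PR's code; the BLAS integer type is assumed to be a 32-bit int):

```rust
// Hypothetical helper, not the PR's actual code. The furthest element
// *axpy touches is roughly len * |stride|, so that product, not just len,
// has to be representable in the BLAS integer type.
type BlasIndex = i32; // stand-in for the bindings' blas_index / c_int

fn compatible_with_blas_index(len: usize, stride: isize) -> bool {
    let step = stride.abs() as usize;
    match len.checked_mul(step) {
        // The whole span of touched elements must fit in the BLAS index type.
        Some(span) => span <= BlasIndex::max_value() as usize,
        None => false,
    }
}
```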
Bit confused: do you want me to change to stride.abs() and cast to usize, or is the PR approved as is now?
        return false;
    }

    match D::equispaced_stride(&a.raw_dim(), &a.strides) {
Is it possible that this stride is 0 here? I don't think blas supports that.
Depends, can the stride ever be 0 in any dimension and for any kind of vector? An n-dimensional vector with one dimension set to 0 doesn't really make sense, so we should make sure that such a state can never be reached.
A special case is a 0-dimensional array; it looks like its stride is set to 0.
Yes it's very common, that's how broadcasting works (same in numpy)
>>> import numpy as np
>>> np.broadcast_to(np.arange(9.), (9, 9)).strides
(0, 8)
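The equivalent in ndarray (a small illustrative snippet added here, not from the thread): broadcasting gives the new axis a stride of 0 as well.

```rust
use ndarray::Array;

fn main() {
    let a = Array::<f64, _>::from_elem(9, 1.0);
    // Broadcasting a 1-D array to (9, 9): the inserted axis gets stride 0,
    // so every "row" aliases the same underlying 9 elements.
    let b = a.broadcast((9, 9)).unwrap();
    assert_eq!(b.strides(), &[0, 1]);
}
```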
Ok, so the way the equispaced_stride method is implemented right now is that it grabs the smallest stride and calculates the rest based on that. That means that if we indeed have one stride set to 0, the demanded/calculated stride will always be 0 irrespective of the dimension (because it predicts the necessary stride s_{i+1} as s_i * d_i). It follows that the demanded stride will never match the rest of the stride info in the struct. Thus the function will return None and we won't be able to use blas.
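For reference, a minimal sketch of such a check under those assumptions (C order, positive strides only; this is not ndarray's actual implementation):

```rust
// Sketch only: returns Some(step) if all elements are exactly `step`
// ptr-widths apart, i.e. the strides look like a contiguous C-order layout
// scaled by `step`. Zero (broadcast) and negative strides return None.
fn equispaced_stride(dim: &[usize], strides: &[isize]) -> Option<isize> {
    let n = dim.len();
    if n == 0 {
        return Some(1);
    }
    // Candidate step: the innermost (fastest varying) stride.
    let step = strides[n - 1];
    if step <= 0 {
        return None;
    }
    // Predict each outer stride as inner stride * inner dimension.
    let mut expected = step;
    for k in (0..n - 1).rev() {
        expected *= dim[k + 1] as isize;
        if strides[k] != expected {
            return None;
        }
    }
    Some(step)
}
```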
Thank you! Merging, to fix up the last bits myself like we talked about. Will need to document this a bit more explicitly.
          E: Dimension,
    {
        debug_assert_eq!(self.len(), rhs.len());
        assert!(self.len() == rhs.len());
Lol, I missed a lot. This is not what we want to do.
We don't? :( I basically took this from dot_impl:

if self.len() >= DOT_BLAS_CUTOFF {
    debug_assert_eq!(self.len(), rhs.len());
    assert!(self.len() == rhs.len());
rhs is broadcast to self's dim if they mismatch. Yeah, dot doesn't broadcast.
Sorry, the mess is on my side. I've added more rigorous tests so that it has something to aspire to.
Right, the assertion there doesn't make any sense. With broadcasting, one stride would be 0, so equispaced_stride returns None (as per the argument above), which would skip axpy and fall through to the default implementation. The assertion would simply prevent broadcasting with the generic scaled_add from triggering.
It's basically stuck on the fact that equispaced stride (just like is_contiguous) does not handle negative strides. If I don't find more time to fix this, I'll reduce the blas case to be less general until we can implement equispaced stride better.
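To illustrate the negative-stride case (an example added here, not from the PR): reversing an axis yields a stride of -1, which a positive-step-only check rejects, so scaled_add would fall back to the generic loop.

```rust
use ndarray::{Array, s};

fn main() {
    let a = Array::<f32, _>::zeros(16);
    // Reversing the axis makes the stride negative (-1 here), so a check
    // that only accepts positive steps cannot hand this view to BLAS.
    let rev = a.slice(s![..;-1]);
    assert_eq!(rev.strides(), &[-1]);
}
```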
This adds a method to use saxpy and daxpy when doing scaled addition. It gives a nice performance boost:

The main addition in this code is Dimension::equispaced_stride, which is a generalization of the code formerly inside Dimension::is_contiguous: if all elements of an array are one ptr-width away from each other, we call the array contiguous. If all elements of an array are exactly n ptr-widths away from each other, we call the array equispaced (not sure if that's the right terminology). If n = 1, then an equispaced array is contiguous. As long as an array is equispaced, it is admissible to blas via the incx and incy variables (in the case of *axpy).

I think we might be able to use this for the gemm implementations, too.

EDIT: I notice that there are inlined versions of the Dimension trait functions for 1 to 3 dimensional arrays. I guess I should extend those too?

EDIT 2: I am also not sure I like the extra allocations in that continuity/equidistance check. But I guess it doesn't matter much for large arrays.
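As a small illustration of an equispaced (but non-contiguous) view (an added example, not part of the original description): slicing every second column of a C-order 2-D array keeps all elements exactly two ptr-widths apart, so it could be handed to *axpy with incx = 2.

```rust
use ndarray::{Array, s};

fn main() {
    // Shape (4, 8), C order: strides are (8, 1), i.e. contiguous (n = 1).
    let a = Array::<f32, _>::zeros((4, 8));
    // Every second column: shape (4, 4), strides (8, 2). The predicted outer
    // stride is inner stride * inner dim = 2 * 4 = 8, so the view is
    // "equispaced" with n = 2 and admissible to BLAS via incx = 2.
    let v = a.slice(s![.., ..;2]);
    assert_eq!(v.strides(), &[8, 2]);
}
```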