Description
Is there an optimization available for multiplying a matrix by its own transpose? I'm trying to optimize a program where the slowest part is m.t() * m (a fairly large matrix in the innermost loop). I read up on this: the result is always a symmetric matrix, so only n(n+1)/2 of its n^2 entries need to be computed. The lapack function dsyrk is supposed to handle that. I don't know if it would actually help, but I'm curious to test.
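To make the symmetry idea concrete, here is a minimal sketch in plain Rust of what I have in mind. It is not ndarray's API and not dsyrk itself: the function name `gram_upper_then_mirror` and the row-major slice layout are my own assumptions for illustration. It accumulates only the entries on and above the diagonal, then copies them into the lower half, so the inner loop does about rows * n(n+1)/2 multiply-adds instead of rows * n^2.

```rust
/// Hypothetical helper (not part of ndarray): computes g = m^T * m for a
/// row-major `rows x cols` matrix `m`, returning a row-major `cols x cols` matrix.
fn gram_upper_then_mirror(m: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    assert_eq!(m.len(), rows * cols);
    let mut g = vec![0.0; cols * cols];

    // Accumulate only the entries with j >= i (upper triangle and diagonal).
    // Each row of `m` contributes the outer product of that row with itself.
    for r in 0..rows {
        let row = &m[r * cols..(r + 1) * cols];
        for i in 0..cols {
            let a = row[i];
            for j in i..cols {
                g[i * cols + j] += a * row[j];
            }
        }
    }

    // Mirror the upper triangle into the lower triangle, since g is symmetric.
    for i in 0..cols {
        for j in 0..i {
            g[i * cols + j] = g[j * cols + i];
        }
    }
    g
}

fn main() {
    // 3x2 matrix [[1, 2], [3, 4], [5, 6]] stored row by row.
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let g = gram_upper_then_mirror(&m, 3, 2);
    println!("{:?}", g); // prints [35.0, 44.0, 44.0, 56.0]
}
```

A naive loop like this will probably not beat a tuned gemm kernel in practice. It only shows where the saving would come from if an optimized symmetric routine were available.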
Also, is there a way to know if I'm using lapack? I didn't enable any special feature for ndarray in my Cargo.toml file. A perf profile showed 29.26% _ZN14matrixmultiply4gemm13masked_kernel, so I think I'm using it because gemm is a lapack name. But is there a simpler way?
EDIT: Oh, sorry, I meant BLAS everywhere in my text. I wasn't aware of the difference :)