bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.

Home Page: https://docs.rs/matrixmultiply/

License: Apache License 2.0

Rust 97.65% Python 2.12% Shell 0.23%
rust rust-sci

matrixmultiply's Introduction

matrixmultiply

General matrix multiplication for f32, f64, and complex matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).

Please read the API documentation at https://docs.rs/matrixmultiply/

We presently provide a few good microkernels (portable ones as well as x86-64 and AArch64 NEON kernels) and only one operation: the general matrix-matrix multiplication (“gemm”).
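For orientation, here is a minimal example of calling sgemm on small row-major matrices (a sketch based on the documented API; the stride arguments are in elements, so a row-major m×n matrix has row stride n and column stride 1):

use matrixmultiply::sgemm;

fn main() {
    let (m, k, n) = (2, 3, 2);
    let a: [f32; 6] = [1., 2., 3., 4., 5., 6.]; // m×k, row major
    let b: [f32; 6] = [1., 0., 0., 1., 1., 1.]; // k×n, row major
    let mut c = [0f32; 4];                      // m×n, row major

    // C ← 1.0 · A B + 0.0 · C
    unsafe {
        sgemm(
            m, k, n, 1.0,
            a.as_ptr(), k as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0,
            c.as_mut_ptr(), n as isize, 1,
        );
    }
    assert_eq!(c, [4., 5., 10., 11.]);
}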

This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS project.


Development Goals

  • Code clarity and maintainability
  • Portability and stable Rust
  • Performance: provide target-specific microkernels when it is beneficial
  • Testing: Test diverse inputs and test and benchmark all microkernels
  • Small code footprint and fast compilation
  • We are not reimplementing BLAS.

Benchmarks

  • cargo bench is useful for special cases and small matrices
  • The best gemm and threading benchmark is examples/benchmarks.rs, which supports custom sizes, some configuration, and CSV output. Use the script benches/benchloop.py to run benchmarks over parameter ranges.

Blog Posts About This Crate

Recent Changes

  • 0.3.8

    • Lower the alignment requirement for the thread local storage value on macOS, since it was not respected and caused a debug assertion. (Previous issue #55)
  • 0.3.7

    • Rename a directory, avoiding spaces in filenames, to be compatible with Bazel. By @xander-zitara
  • 0.3.6

    • Fix the build for the combination of cgemm and no_std (#76)
  • 0.3.5

    • Significant improvements to complex matrix packing and kernels (#75)
    • Use a specialized AVX2 matrix packing function for sgemm, dgemm when this feature is detected on x86-64
  • 0.3.4

    • Sgemm, dgemm microkernel implementations for AArch64 NEON (ARM)

      Matrixmultiply now uses autocfg to detect the Rust version, enabling these kernels when AArch64 intrinsics are available (from Rust 1.61).

    • Small change to matrix packing functions so that they optimize better in some cases, thanks to improved pointer aliasing information.

  • 0.3.3

    • Attempt to fix macOS bug #55 again (manifesting as a debug assertion, only in debug builds).
    • Updated comments for x86 kernels by @Tastaturtaste
    • Updates to MIRI/CI by @jturner314
    • Silenced Send/Sync future compatibility warnings for a raw pointer wrapper
  • 0.3.2

    • Add optional feature cgemm for the complex matmul functions cgemm and zgemm
    • Add optional feature constconf for compile-time configuration of matrix kernel parameters for chunking. Improved scripts for benchmarking over ranges of different settings. With thanks to @DutchGhost for the const-time parsing functions.
    • Improved benchmarking and testing.
    • Threading is now slightly more eager to use threads (depending on matrix element count).
  • 0.3.1

    • Attempt to fix bug #55 where the mask buffer in TLS did not seem to get its requested alignment on macOS. The mask buffer pointer is now aligned manually (again, as it was in 0.2.x).
    • Fix a minor issue where we were passing a buffer pointer as &T when it should have been &[T].
  • 0.3.0

    • Implement initial support for threading using a bespoke thread pool with little contention. To use, enable the feature threading (and configure the number of threads with the environment variable MATMUL_NUM_THREADS).

      Initial support is for up to 4 threads; this will be updated with more experience in coming versions.

    • Added a better benchmarking program for arbitrary sizes and layouts; see examples/benchmark.rs. It supports CSV output for better recording of measurements.

    • The minimum supported Rust version is 1.41.1, and the version update policy has been updated.

    • Updated to Rust 2018 edition

    • Moved CI to github actions (so long travis and thanks for all the fish).

  • 0.2.4

    • Support no-std mode, by @vadixidav and @jturner314. New (default) feature flag "std"; use default-features = false to disable it and build for no-std. Note that runtime CPU feature detection requires std.
    • Fix tests so that they build correctly on non-x86 platforms (#49), and manage the release, by @bluss
  • 0.2.3

    • Update rawpointer dependency to 0.2
    • Minor changes to inlining for -Ctarget-cpu=native use (not recommended; use automatic runtime feature detection).
    • Minor improvements to kernel masking (#42, #41) by @bluss and @SuperFluffy
  • 0.2.2

    • New dgemm avx and fma kernels implemented by R. Janis Goldschmidt (@SuperFluffy). With fast cases for both row and column major output.

      Benchmark improvements: Using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see issue #35

      Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.

    • New fma adaption of the sgemm avx kernel by R. Janis Goldschmidt (@SuperFluffy).

      Benchmark improvement: Using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see issue #35

    • More flexible kernel selection allows kernels to individually set all their parameters, ensures the fallback (plain Rust) kernels can be tuned for performance as well, and moves feature detection out of the gemm loop.

      Benchmark improvement: Reduces execution time on various benchmarks by 1-2% in the avx kernels, see #37.

    • Improved testing to cover input/output strides of more diversity.

  • 0.2.1

    • Improve matrix packing by taking better advantage of contiguous inputs.

      Benchmark improvement: execution time for a 64×64 problem where inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)

    • In the sgemm avx kernel, handle column major output arrays just like it does row major arrays.

      Benchmark improvement: execution time for 32×32 problem where output is column major changed by -11%. (#27)

  • 0.2.0

    • Use runtime feature detection on x86 and x86-64 platforms, to enable AVX-specific microkernels at runtime if available on the currently executing configuration (see the sketch after this changelog).

      This means no special compiler flags are needed to enable native instruction performance!

    • Implement a specialized 8×8 sgemm (f32) AVX microkernel, this speeds up matrix multiplication by another 25%.

    • Use std::alloc for allocation of aligned packing buffers

    • We now require Rust 1.28 as the minimum version

  • 0.1.15

    • Fix bug where the result matrix C was not updated in the case of a M × K by K × N matrix multiplication where K was zero. (This resulted in the output C potentially being left uninitialized or with incorrect values in this specific scenario.) By @jturner314 (PR #21)
  • 0.1.14

    • Avoid an unused code warning
  • 0.1.13

    • Pick 8x8 sgemm (f32) kernel when AVX target feature is enabled (with Rust 1.14 or later, no effect otherwise).
    • Use rawpointer, a µcrate with raw pointer methods taken from this project.
  • 0.1.12

    • Internal cleanup with retained performance
  • 0.1.11

    • Adjust sgemm (f32) kernel to optimize better on recent Rust.
  • 0.1.10

    • Update doc links to docs.rs
  • 0.1.9

    • Workaround optimization regression in rust nightly (1.12-ish) (#9)
  • 0.1.8

    • Improved docs
  • 0.1.7

    • Reduce overhead slightly for small matrix multiplication problems by using only one allocation call for both packing buffers.
  • 0.1.6

    • Disable manual loop unrolling in debug mode (quicker debug builds)
  • 0.1.5

    • Update sgemm to use a 4x8 microkernel (“still in simplistic rust”), which improves throughput by 10%.
  • 0.1.4

    • Prepare support for aligned packed buffers
    • Update dgemm to use a 8x4 microkernel, still in simplistic rust, which improves throughput by 10-20% when using AVX.
  • 0.1.3

    • Silence some debug prints
  • 0.1.2

    • Major performance improvement for sgemm and dgemm (20-30% when using AVX). Since it all depends on what the optimizer does, I'd love to get issue reports that report good or bad performance.
    • Made the kernel masking generic, which is a cleaner design
  • 0.1.1

    • Minor improvement in the kernel
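As context for the 0.2.0 entry above, this is the general shape of runtime feature dispatch on x86/x86-64 (a simplified sketch using std's detection macro, not the crate's actual kernel-selection code):

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn kernel_choice() -> &'static str {
    // std's runtime CPUID-based check; this is why no special compiler
    // flags are needed to benefit from the AVX microkernels.
    if is_x86_feature_detected!("avx") {
        "avx microkernel"
    } else {
        "fallback microkernel"
    }
}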

matrixmultiply's People

Contributors

atouchet, bluss, felixrabe, ignatenkobrain, jturner314, superfluffy, tastaturtaste, xander-zitara



matrixmultiply's Issues

ICEs on nightly Rust: resolving bounds after type-checking

The reason is bug rust-lang/rust#43357:


error: internal compiler error: /checkout/src/librustc/infer/mod.rs:573: Encountered errors [FulfillmentError(Obligation(predicate=Binder(TraitPredicate(<K as kernel::GemmKernel>)),depth=0),Unimplemented)] resolving bounds after type-checking

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports

note: rustc 1.21.0-nightly (599be0d18 2017-07-26) running on x86_64-unknown-linux-gnu

0.3.5 fails to build with `--no-default-features --features cgemm`

The errors look like this:

cargo build --no-default-features --features cgemm
   Compiling matrixmultiply v0.3.5 (/home/autarch/projects/matrixmultiply)
error[E0433]: failed to resolve: use of undeclared crate or module `std`
  --> src/cgemm_common.rs:10:5
   |
10 | use std::ptr::copy_nonoverlapping;
   |     ^^^ use of undeclared crate or module `std`

error[E0432]: unresolved import `std`
 --> src/cgemm_common.rs:9:5
  |
9 | use std::mem;
  |     ^^^ use of undeclared crate or module `std`

error[E0599]: no method named `mul_add` found for type `f32` in the current scope
   --> src/cgemm_common.rs:21:23
    |
21  |               $dst = $a.mul_add($b, $dst);
    |                         ^^^^^^^ method not found in `f32`
    |
   ::: src/cgemm_kernel.rs:199:1
    |
199 | / kernel_fallback_impl_complex! {
200 | |     // instantiate separately
201 | |     [inline target_feature(enable="avx2") target_feature(enable="fma")] [fma_yes]
202 | |     kernel_target_avx2, T, TReal, KernelAvx2::MR, KernelAvx2::NR, 4
203 | | }
    | |_- in this macro invocation
    |
    = note: this error originates in the macro `fmuladd` which comes from the expansion of the macro `kernel_fallback_impl_complex` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0599]: no method named `mul_add` found for type `f64` in the current scope
   --> src/cgemm_common.rs:21:23
    |
21  |               $dst = $a.mul_add($b, $dst);
    |                         ^^^^^^^ method not found in `f64`
    |
   ::: src/zgemm_kernel.rs:196:1
    |
196 | / kernel_fallback_impl_complex! {
197 | |     // instantiate fma separately
198 | |     [inline target_feature(enable="fma") target_feature(enable="avx2")] [fma_yes]
199 | |     kernel_target_avx2, T, TReal, KernelAvx2::MR, KernelAvx2::NR, 4
200 | | }
    | |_- in this macro invocation
    |
    = note: this error originates in the macro `fmuladd` which comes from the expansion of the macro `kernel_fallback_impl_complex` (in Nightly builds, run with -Z macro-backtrace for more info)

Some errors have detailed explanations: E0432, E0433, E0599.
For more information about an error, try `rustc --explain E0432`.
error: could not compile `matrixmultiply` due to 14 previous errors

Panic when benchmarking with target-feature=sse

I wanted to check performance differences between avx and sse targets, and noticed that sse targets fail with a strange index out of bounds error:

% RUST_BACKTRACE=1 RUSTFLAGS="-C target-feature=sse" cargo bench ccc
    Finished release [optimized] target(s) in 0.00s
     Running target/release/deps/benchmarks-71c575834a86be6e

running 3 tests
test layout_f32_032::ccc ... thread 'main' panicked at 'index out of bounds: the len is 50 but the index is 9223372036854775808', /home/janis/.cargo/registry/src/github.com-1ecc6299db9ec823/bencher-0.1.5/stats.rs:300:14
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:211
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:476
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:390
   6: rust_begin_unwind
             at src/libstd/panicking.rs:325
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:77
   8: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:59
   9: bencher::stats::percentile_of_sorted
  10: bencher::stats::winsorize
  11: bencher::Bencher::auto_bench
  12: bencher::run_tests_console
  13: benchmarks::main
  14: std::rt::lang_start::{{closure}}
  15: std::panicking::try::do_call
             at src/libstd/rt.rs:59
             at src/libstd/panicking.rs:310
  16: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:102
  17: std::rt::lang_start_internal
             at src/libstd/panicking.rs:289
             at src/libstd/panic.rs:398
             at src/libstd/rt.rs:58
  18: main
  19: __libc_start_main
  20: _start
error: bench failed
zsh: exit 101   RUST_BACKTRACE=1 RUSTFLAGS="-C target-feature=sse" cargo bench ccc

This does not happen when building with target-feature=avx.

Test involving many 6x6 matrices fails randomly on Mac OS

As part of a bigger project, I have some test code that multiplies matrices of various sizes over and over. One of the tests multiplies 6×6 matrices with one another over and over, about 200k times or so. Even though the test typically takes 15s to complete on other platforms, on Mac OS specifically, about half of the time, it panics on the following line at 2s:

debug_assert_eq!(ptr as usize % KERNEL_MAX_ALIGN, 0);

Here's a link to the backtrace in one specific run where it failed.

I'm using nalgebra as a dependency, but due to the OS-specific nature of the bug, I've ruled out it being a bug over there. I haven't been able to isolate exactly how to reproduce this, but hopefully the backtrace helps.

Don't shadow c in sgemm_kernel::kernel_x86_avx

What's happening in this chunk of code is enormously confusing: https://github.com/bluss/matrixmultiply/blob/master/src/sgemm_kernel.rs#L301-L318:

macro_rules! c {
    ($i:expr, $j:expr) => (c.offset(rsc * $i as isize + csc * $j as isize));
}
// C ← α A B + β C
let mut c = [_mm256_setzero_ps(); MR];
let betav = _mm256_set1_ps(beta);
if beta != 0. {
    // Read C
    if csc == 1 {
        loop_m!(i, c[i] = _mm256_loadu_ps(c![i, 0]));
    // Handle rsc == 1 case with transpose?
    } else {
        loop_m!(i, c[i] = _mm256_set_ps(*c![i, 7], *c![i, 6], *c![i, 5], *c![i, 4], *c![i, 3], *c![i, 2], *c![i, 1], *c![i, 0]));
    }
// Compute β C
    loop_m!(i, c[i] = _mm256_mul_ps(c[i], betav));
}

What I assume happens (must happen): at the definition site of the c! macro, the variable c refers to the *mut T passed in as a function argument to the kernel. Later on, though, c refers to the MR-sized array containing packed SIMD f32 8-vectors.

For a reader trying to get familiar with the code base, this shadowing of the variables is enormously confusing and should probably be changed.
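A minimal sketch of the suggested cleanup (assumed names, not an actual patch): give the SIMD accumulator its own name, so that c always means the raw output pointer the macro captures.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

const MR: usize = 8;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn load_c(c: *const f32, rsc: isize, csc: isize) -> [__m256; MR] {
    macro_rules! c {
        ($i:expr, $j:expr) => (c.offset(rsc * $i as isize + csc * $j as isize));
    }
    // `acc` was previously also called `c`, shadowing the pointer above
    let mut acc = [_mm256_setzero_ps(); MR];
    if csc == 1 {
        for i in 0..MR {
            acc[i] = _mm256_loadu_ps(c![i, 0]);
        }
    }
    acc
}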

Integer matrices

Would you consider also implementing matrix multiplication for integer matrices, or do you want to keep this purely floating point?

Use fma, fused multiply add, for architectures supporting fma

Modern Intel architectures supporting fma instruction sets can perform the first loop calculating the matrix-matrix product between panels a and b in one go using _mm256_fmadd_pd. We should implement these and see how they affect performance.
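For illustration, a single fused step of that accumulation might look like this (a sketch; the real kernels would do this across whole register tiles):

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256d, _mm256_fmadd_pd};

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma_accumulate(acc: __m256d, a: __m256d, b: __m256d) -> __m256d {
    // acc ← a·b + acc in one instruction, with a single rounding
    _mm256_fmadd_pd(a, b, acc)
}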

ref_mat_mul not always the slower version

I'm getting these results on ARM, NEON:

running 18 tests
test mat_mul_f32::m004     ... bench:       2,026 ns/iter (+/- 198)
test mat_mul_f32::m005     ... bench:       2,852 ns/iter (+/- 112)
test mat_mul_f32::m006     ... bench:       2,955 ns/iter (+/- 55)
test mat_mul_f32::m007     ... bench:       3,191 ns/iter (+/- 85)
test mat_mul_f32::m008     ... bench:       3,404 ns/iter (+/- 89)
test mat_mul_f32::m009     ... bench:       6,722 ns/iter (+/- 313)
test mat_mul_f32::m012     ... bench:       8,575 ns/iter (+/- 1,451)
test mat_mul_f32::m016     ... bench:      12,836 ns/iter (+/- 255)
test mat_mul_f32::m032     ... bench:      72,610 ns/iter (+/- 3,202)

test ref_mat_mul_f32::m004 ... bench:         611 ns/iter (+/- 7)
test ref_mat_mul_f32::m005 ... bench:       1,076 ns/iter (+/- 28)
test ref_mat_mul_f32::m006 ... bench:       1,745 ns/iter (+/- 28)
test ref_mat_mul_f32::m007 ... bench:       2,653 ns/iter (+/- 142)
test ref_mat_mul_f32::m008 ... bench:       3,871 ns/iter (+/- 81)
test ref_mat_mul_f32::m009 ... bench:       5,705 ns/iter (+/- 133)
test ref_mat_mul_f32::m012 ... bench:      12,146 ns/iter (+/- 237)
test ref_mat_mul_f32::m016 ... bench:      27,906 ns/iter (+/- 1,232)
test ref_mat_mul_f32::m032 ... bench:     213,525 ns/iter (+/- 4,080)

Coming from ver 0.1.6

Use optimal kernel parameters (architectures, matrix layouts)

I am trying to figure out what to use as optimal kernel parameter for different architectures.

For example, it looks like blis is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What led them to this setup? Specifically, since operations are usually on 4 doubles at a time, how does the 6 fit in there? Is Haswell able to execute a _mm256 and a _mm operation separately at the same time?

Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?

Building tests fails on non-x86 architectures

error[E0433]: failed to resolve: use of undeclared type or module `KernelAvx`
   --> src/sgemm_kernel.rs:507:27
    |
507 |             Alloc::new(n, KernelAvx::align_to()).init_with(elt)
    |                           ^^^^^^^^^ use of undeclared type or module `KernelAvx`
error: cannot find macro `loop_m!` in this scope
   --> src/sgemm_kernel.rs:542:9
    |
542 |         loop_m!(i, loop_n!(j, m[i][j] += 1));
    |         ^^^^^^ help: you could try the macro: `loop_n`
error[E0433]: failed to resolve: use of undeclared type or module `KernelAvx`
   --> src/sgemm_kernel.rs:541:26
    |
541 |         let mut m = [[0; KernelAvx::NR]; KernelAvx::MR];
    |                          ^^^^^^^^^ use of undeclared type or module `KernelAvx`
error[E0433]: failed to resolve: use of undeclared type or module `KernelAvx`
   --> src/sgemm_kernel.rs:541:42
    |
541 |         let mut m = [[0; KernelAvx::NR]; KernelAvx::MR];
    |                                          ^^^^^^^^^ use of undeclared type or module `KernelAvx`
error[E0433]: failed to resolve: use of undeclared type or module `KernelAvx`
   --> src/dgemm_kernel.rs:820:27
    |
820 |             Alloc::new(n, KernelAvx::align_to()).init_with(elt)
    |                           ^^^^^^^^^ use of undeclared type or module `KernelAvx`
error: cannot find macro `loop_m!` in this scope
   --> src/dgemm_kernel.rs:855:9
    |
855 |         loop_m!(i, loop4!(j, m[i][j] += 1));
    |         ^^^^^^ help: you could try the macro: `loop_n`
error[E0433]: failed to resolve: use of undeclared type or module `KernelAvx`
   --> src/dgemm_kernel.rs:854:30
    |
854 |         let mut m = [[0; 4]; KernelAvx::MR];
    |                              ^^^^^^^^^ use of undeclared type or module `KernelAvx`

Give the user the ability to allocate all memory himself

Hello! I noticed that each gemm call allocates memory on the heap for "packing buffers".
Maybe there is a way to let the user pre-allocate all the necessary memory and simply pass it in as an argument, without going through the global allocator?

For example: add helper function(s) to the public API to calculate the required memory for a particular type/shape/kernel, and then initialize an aligned_alloc::Alloc struct from a pointer, or something like that.

I think this feature could help make the crate usable in real-time programs where hidden allocations are not welcome.
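For concreteness, one hypothetical shape the proposal could take; none of these names exist in matrixmultiply today:

/// Hypothetical: number of f32 elements of packing scratch an m×k×n sgemm
/// would need. The placeholder formula stands in for a real calculation
/// derived from the kernel's blocking parameters.
pub fn sgemm_scratch_len(m: usize, k: usize, n: usize) -> usize {
    m * k + k * n
}

fn main() {
    // The caller allocates once, up front, and would hand the buffer to a
    // (hypothetical) allocation-free gemm entry point on every call.
    let scratch = vec![0f32; sgemm_scratch_len(256, 256, 256)];
    assert!(!scratch.is_empty());
}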

Bazel: link or target filename contains space

Hi 👋

I just ran into an issue after a bump of dep, it seems that Bazel doesn't handle paths with spaces. Would you mind replacing the spaces with an underscore or some other character?

link or target filename contains space on line 17: '.../external/crate_index__matrixmultiply-0.3.6/spare kernels/aarch64_neon_4x4.rs

Explore performance of _mm256_blend_ps vs _mm256_shuffle_ps

While implementing the dgemm kernel, I noticed that one can choose to either a) use _mm256_blend_ps followed by _mm256_permute2f128_ps or b) use _mm256_shuffle_ps followed by _mm256_permute2f128_ps to achieve the same goal (this is at the end, when scaling the product of a and b by alpha, and c by beta).

Doing the first operation leads to packed simd vectors containing a column (of 8 rows) each, while doing the second operation gives rows (containing 8 columns each).

Currently, the sgemm kernel implements option b), where _mm256_shuffle_ps has latency 1 and throughput 1. Doing option a) we'd get latency 1 but throughput 0.33 (on most Intel architectures).

It's worth investigating if this improves performance.

Transposed operations segfault

I'm using matrixmultiply for machine learning, and I'm trying to perform the operation:

GB = A^T * GC

where:
GB = (6272, 100)
A = (100, 6272)
GC = (100, 100)

After transposition of A, we end up with a valid matrix multiplication.

However, when I run the following function, I get SIGSEGV: Invalid Memory Reference. I'm not sure if I'm doing something wrong or if it's an internal error, so I thought I'd post here. I was under the impression that for transposition, all you did was swap the rsa and csa, but again I could be wrong.

#[inline]
pub unsafe fn matmul_wrt_b(
    a: *const f32,
    gc: *const f32,
    gb: *mut f32,
    a_dim: [usize; 2],
    b_dim: [usize; 2],
    beta: f32,
) {
    // Check that the dimensions are compatible for multiplication
    assert_eq!(a_dim[1], b_dim[0]);

    // Dimensions of GC are (a_dim[0], b_dim[1])
    // Dimensions of GB are the same as B.

    // Compute the gradient of the matrix product
    sgemm(
        a_dim[0],
        a_dim[1],
        b_dim[1],
        1.0,
        a,
        1,
        a_dim[1] as isize,
        gc,
        a_dim[1] as isize,
        1,
        beta,
        gb,
        b_dim[1] as isize,
        1,
    );
}
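For reference, here is my reading of the stride bookkeeping for GB = A^T · GC with the shapes given above (a hedged sketch, not a confirmed fix): since A is stored row-major as (100, 6272), A^T is m×k = 6272×100, so m must come from a_dim[1], and A's row/column strides are swapped relative to plain row-major.

use matrixmultiply::sgemm;

unsafe fn matmul_a_transpose(a: *const f32, gc: *const f32, gb: *mut f32) {
    let (m, k, n) = (6272usize, 100usize, 100usize);
    sgemm(
        m, k, n, 1.0,
        a, 1, m as isize,   // A^T: swapped strides of the (k, m) storage
        gc, n as isize, 1,  // GC: row major k×n
        0.0,
        gb, n as isize, 1,  // GB: row major m×n
    );
}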

Test fma instructions on travis

It looks like not all build servers travis uses support modern fma instructions, see for example #36 (comment)

One workaround is to specify os: linux, sudo: true, and dist: trusty, which turns out to activate fma. The reason why is unknown, but it was first documented here: https://github.com/uclouvain/openjpeg/blob/master/.travis.yml#L29-L33 Are these images for whatever reason tied to certain servers with a modern architecture?

This is prone to fail in the future, should this trick stop working.

error: array lengths can't depend on generic parameters

Ran into this error with nightly (via $ cargo check on master at 4ac62c2):

error: array lengths can't depend on generic parameters
   --> src/sgemm_kernel.rs:223:40
    |
223 |     let mut ab = [_mm256_setzero_ps(); MR];
    |                                        ^^
$ rustc --version
rustc 1.42.0-nightly (a9dd56ff9 2019-12-30)

I haven't been able to make much headway in understanding what changed in rustc, but it seems likely this is a compiler bug.

It looks like substituting a const fn implemented on the struct does work (the current implementation relies on trait associated constants).

Update: for a quick workaround, roll back to nightly-2019-12-25 (ho ho ho!)
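A minimal reproduction of the pattern the error points at (a sketch, not the crate's exact code): the array length goes through a generic parameter's associated const, which rustc rejects, while a concrete type works.

trait GemmKernel {
    const MR: usize;
}

fn pack<K: GemmKernel>() {
    // error: array lengths can't depend on generic parameters
    // let ab = [0f32; K::MR];
}

struct KernelAvx;

impl GemmKernel for KernelAvx {
    const MR: usize = 8;
}

fn pack_avx() {
    let _ab = [0f32; KernelAvx::MR]; // fine: no generic parameter involved
}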

Allow operations on transposed matrices, i.e. Op(A) and Op(B), and DSYRK

I wanted to use the integer gemm code from 0430cf0, and realized that there is currently no way of performing an operation on transposed matrices, while I wanted to compute A^T A. In the BLAS context, the transpose or complex conjugate of a matrix is usually expressed as Op(A), where Op is selected through the parameter TRANSA, given by the character 'N', 'T', or 'C'.

I realize that since we actually have dimensions and slices as part of our matrix ArrayBase structures, we can just circumvent the issue by transposing the matrix view via fn t(mut self). The questions are:

  • Performance: does code specific to a transposed matrix with unchanged memory layout have the same performance as generic code given different stride information?
  • To what extent does the gemm kernel for transposed matrices differ from that for non-transposed matrices?

While addressing this issue, it's probably also worth investigating how DSYRK, for the specific case of A^T A, is implemented differently in the BLIS library.

I have a hard time understanding how BLIS defines its kernels, specifically how the different cases of Op(A) Op(B) are implemented. I am happy to dig in and write a benchmark comparing the current ndarray approach to writing a specific kernel. Can you point me to the right spot to look at?
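As a side note on the stride-swap route mentioned above, a sketch (my own illustration, not from the thread) of computing A^T A with the existing dgemm by swapping A's strides for the first operand:

use matrixmultiply::dgemm;

fn main() {
    let (m, k) = (2usize, 3usize); // A is m×k, row major; A^T A is k×k
    let a: [f64; 6] = [1., 2., 3., 4., 5., 6.];
    let mut c = [0f64; 9];
    unsafe {
        dgemm(
            k, m, k, 1.0,
            a.as_ptr(), 1, k as isize, // A^T: swapped row-major strides of A
            a.as_ptr(), k as isize, 1, // A itself
            0.0,
            c.as_mut_ptr(), k as isize, 1,
        );
    }
    // A^T A for A = [[1, 2, 3], [4, 5, 6]]
    assert_eq!(c, [17., 22., 27., 22., 29., 36., 27., 36., 45.]);
}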

Revert the workaround for array zeroing

See issue #9

The base issue is now fixed in the current nightly channel, but remains in the beta channel for now, so the workaround has to stay. The fix may be backported to beta; otherwise we have to wait until the current nightly graduates to stable.

Tiny matmul discrepancy between single- and multi- thread(?)

There seems to be a slight discrepancy in the matmul results between using a single thread and multiple threads. At first it felt like a rounding issue, but since I couldn't find anything relevant inside the code, I then managed to more or less replicate the results by enabling constconf and manually setting the cache-sensitive MC, KC and NC parameters of the sgemm/dgemm algorithm.

Can it be that single- and multi-threaded runs internally set their defaults for the three constants of the algorithm differently (and also differently between AMD and Intel-MKL), hence the tiny discrepancies? Do we have an idea what the culprit might be?

The related issue where this was observed: coreylowman/dfdx#560
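One plausible explanation (my own note, not confirmed in the thread): different blocking constants change the order in which partial products are summed, and floating-point addition is not associative, so the last bits of the result can legitimately differ. A minimal demonstration:

fn main() {
    let (x, y, z) = (0.1f64, 0.2, 0.3);
    // The same three numbers summed in two different orders differ in the
    // last bits, which is the kind of discrepancy a blocking change causes.
    assert_ne!((x + y) + z, x + (y + z));
}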

SNB Performance

Hi Bluss,

I've fiddled with the library a bit and managed to get a ~25% performance boost on Sandy Bridge.
I had to rearrange things a bit to get it to work (llvm is truly capricious), so I thought I'd see if it works for other setups before sending a PR.

I was also thinking that the ~b packing could be combined into a single step with im2col for reasonably fast, low-memory convolutions. I might give it a go some time soon, along with rayon multithreading. Any thoughts on the intended scope of the library?
