Comments (10)
@dholladay00: Can we discuss this further over e-mail? Include me in the e-mail chain along with @kyungjoo-kim; this will help us plan for KokkosKernels.
from kokkos-kernels.
Include me as well.
@dholladay00 You're welcome to include me if you like.
What are the problem sizes of interest? When you mention a team- or thread-level functor interface for dense linear algebra, you probably want to solve small or mid-range problems. Depending on the problem sizes, the implementation may differ in how it uses fast memory. Do you also need to solve the same problem size across teams, or different sizes?
From my experience, the team-level interface is effective on GPUs, but on KNL, MKL already provides good performance for almost all problem sizes (except for tiny problems with dimensions of 3, 5, 10, etc.).
Please let me know the application and workflow scenario. The advantage of KokkosKernels lies in understanding the workflow (not in providing generic versions of libraries that already exist).
Each team must solve a block tri-diagonal linear system with non-uniform block sizes (the size of block row 1 could differ from the size of block row 2, etc.). Those sizes tend to range from ~10 up to ~1000. While at some point in the problem each team will have the same-sized block-tridiagonal linear system, later on those sizes can differ, so it is probably best to assume that each team is solving a different-sized matrix.
I currently use MKL for the LU decomposition (dgetrf and dgetrs), but I use a hand-written team-level function for dgemm and dgemv. However, I might go back to using MKL for everything, as I have been running into issues on machines that have more than 1 thread/core despite enforcing a team size of 1 (non-deterministic behavior, difficult to reproduce in tests, etc.).
I see.
- there are multiple tridiagonal systems (a parallel for can be used)
- each tridiagonal system is composed of irregular blocks, with sizes ranging between 10 and 100.
- however, those tridiagonal systems have the same length and the same internal pattern (which possibly allows stacking and vectorizing across them).
Do you get any performance benefit from your hand-written team-level code compared to MKL? Since this is a dense tridiagonal factorization and solve, do you measure the performance on KNL in terms of gflop/s? We can move to e-mail for the details.
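The "stacking and vectorizing" idea in the last bullet is usually realized with an interleaved (compact) batch layout, where entry (i, j) of every system in the batch is stored contiguously so the batch index maps onto SIMD lanes. A minimal plain-C++ sketch of the idea (the struct and member names are illustrative, not the KokkosKernels API):

```cpp
#include <cstddef>
#include <vector>

// Interleaved ("compact") batch storage: entry (i, j) of all batch
// members sits contiguously, so running the same block operation
// across systems vectorizes over the batch index.
struct CompactBatch {
  std::size_t nbatch, n;     // number of systems, block dimension
  std::vector<double> data;  // n * n * nbatch values
  CompactBatch(std::size_t nb, std::size_t dim)
      : nbatch(nb), n(dim), data(dim * dim * nb, 0.0) {}
  // Batch index varies fastest: the natural SIMD (vector) dimension.
  double& at(std::size_t b, std::size_t i, std::size_t j) {
    return data[(i * n + j) * nbatch + b];
  }
};
```

With this layout, one scalar step of a factorization written over `at(b, i, j)` becomes a unit-stride loop over `b`, which is exactly what compact batched kernels vectorize.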
Recent versions of MKL have batched BLAS, at least for DGEMM. You might just be able to call that.
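Conceptually, a batched GEMM (such as MKL's `cblas_dgemm_batch`) applies the same C := alpha*A*B + beta*C independently to each matrix triple in a batch. A plain-C++ sketch of that semantics, not the MKL signature itself:

```cpp
#include <cstddef>

// One dense GEMM: C := alpha * A * B + beta * C, row-major,
// A is m x k, B is k x n, C is m x n.
void gemm(std::size_t m, std::size_t n, std::size_t k,
          double alpha, const double* A, const double* B,
          double beta, double* C) {
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      double acc = 0.0;
      for (std::size_t p = 0; p < k; ++p)
        acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
}

// Batched GEMM: the same operation applied to nbatch independent
// problems; a library implementation parallelizes over the batch.
void gemm_batch(std::size_t nbatch, std::size_t m, std::size_t n,
                std::size_t k, double alpha, const double* const* A,
                const double* const* B, double beta, double* const* C) {
  for (std::size_t b = 0; b < nbatch; ++b)
    gemm(m, n, k, alpha, A[b], B[b], beta, C[b]);
}
```

Note the constraint this exposes: every problem in a group shares m, n, and k, which is exactly what breaks down when each team's matrix sizes differ.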
I vote for e-mail for much of this.
But to answer some questions:
- Yes, I am using a parallel for with a team policy.
- That is roughly correct; sizes could be > 100 but are probably < 1000.
- I'll send an e-mail regarding this, as it's somewhat complicated.
The majority of the time is spent calculating the matrix elements, so it is difficult to measure, but either way the performance difference between MKL and my version is in the noise of the total calculation time. This is because the matrix build costs roughly c_build * N * N while the matrix solve costs roughly c_solve * N * N * N; when N is small, the build constant is large enough that the build still takes more time.
This project started with the idea of using batched BLAS, but we have since moved away because we cannot always rely on each team having the same matrix sizes.
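The scaling argument above can be made concrete with a toy cost model (the constants below are illustrative placeholders, not measurements): build work grows as N^2 and solve work as N^3, so the solve only overtakes the build once N exceeds c_build / c_solve.

```cpp
// Toy cost model for the argument above. Constants are made up:
// matrix build ~ c_build * N^2, matrix solve ~ c_solve * N^3.
// Crossover happens near N = c_build / c_solve.
double build_cost(double c_build, double n) { return c_build * n * n; }
double solve_cost(double c_solve, double n) { return c_solve * n * n * n; }
```

For example, with c_build = 100 and c_solve = 1, the build dominates for any N below 100, which matches the observation that solver differences stay in the noise at small N.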
@mhoemmen Batched BLAS does not make sense for this tridiagonal factorization. A batch operation applies one BLAS operation to "a set of matrices". Multiple (parallel) tridiagonal factorizations can be implemented as a sequence of batched GETRF, TRSM, and GEMM calls, but with batched BLAS we do not exploit data locality at all, even though the sequence of operations completely reuses the previous computation's results. That is why we need a functor-level interface inside the parallel for.
We have a compact batched version of the tridiagonal factorization (the LU is implemented without pivoting, since the factorization is used as a preconditioner; do you really need pivoting?). It is optimized for problem sizes < 32. For problem sizes between 100 and 1000, I would need to repack the data (this is not yet implemented).
While we could get away without pivoting in most cases, it would be preferable to have it. Also, @kyungjoo-kim I sent you an e-mail. @mhoemmen do you wish to be included in the e-mails?
There were ways to include batching, but it eats up one of our levels of parallelism (each thread team gets a batch of inputs rather than a single set of inputs). When certain physics is enabled, each element of the batch can have a different matrix size and structure, which removes the ability to use batched calls.