Comments (39)
From Discussion with LANL
Team level: GEMM, GETRF, GETRS, GEMV (for solving a block tridiagonal matrix)
from kokkos-kernels.
@jeffhammond : Our tests compare against LIBXSMM where available, but we don't give that as a callable option within a thread/team yet.
from kokkos-kernels.
@dholladay00 The beginning of a hand-rolled QR lives in Trilinos/packages/tpetra/tsqr. I also have a strong interest in finishing this for Kokkos, and a bit of funding to do it, but time is short. I'm in all-day meetings this week, but if you want to talk more about it, just let me know.
from kokkos-kernels.
@mhoemmen I'll message you on the Kokkos users slack and continue this conversation there.
from kokkos-kernels.
From Email:
Christian,
At our last CoPA telecon, you had asked which BLAS calls we were using in our QMD code and libs. We currently only use the following.
It sounded like there is a Kokkos wrapper for BLAS or something like that? We would be interested in looking at that in our ExaSP2 proxy work. Thanks.
- Sue
BLAS/LAPACK routines used:
{s,d,c,z}gemm
{s,d,c,z}scal
{s,d,c,z}axpy
{s,d}syev
{c,z}heevr
from kokkos-kernels.
@crtrott are these batched or plain or both?
from kokkos-kernels.
GETRI (global, team, and thread) would be useful too. That should actually reduce the priority of optimizing GETRS. (GETRI computes explicit inverses; use that in the setup phase, and you only need GEMV in the solve phase, instead of GETRS.)
from kokkos-kernels.
For now the requested ones were just plain (i.e., non-batched). The team ones could obviously be used to build a batched one, but the actual use case we talked about builds the matrices on the fly for the solve.
from kokkos-kernels.
@nmhamster BlockCrsMatrix could use either global batched or team.
from kokkos-kernels.
Matt wants:
@bathmatt
Team level:
BLAS:
- gemm (with transpose)
LAPACK:
- getf2
- getri
He really cares about getting A_inverse from A. A is generated on the fly in the team.
from kokkos-kernels.
At the low-level concept, there is no need to distinguish batched from plain, because "batched" just means we use a parallel_for and put the team or serial interface inside. The global ones are simply wrappers, as we interface them to MKL and cuBLAS.
@bathmatt, I do not think it is necessary to specialize "h". For double/float, conjugate just returns the same value. At the implementation level it does not do anything different either, as conj(val) returns val if the value type is double.
I am strongly against using characters as function arguments. The use of characters in BLAS and LAPACK is just inherited from Fortran, and we do not really follow that convention.
from kokkos-kernels.
@kyungjoo-kim It's nice to specialize "h" even for real value types, so that people can write code that works for any value type.
from kokkos-kernels.
So, do we want to have Kokkos::conj(v) work for a double then? That works for me.
It does work for a double. Users can always call T instead of H if they don't want to call conj on doubles.
from kokkos-kernels.
@mhoemmen When we work on this collection of functionality, we should care about code complexity and avoid having many specializations. Since we mostly need serial- and team-level implementations (assuming they are decorated as KOKKOS_INLINE_FUNCTION), conj(v) on a double will be stripped out by the compiler anyway.
from kokkos-kernels.
Hi, I would be interested in dense GETRF, GETRS, and GEMV at thread level for small matrices, up to 3x3, in order to do linear interpolation between mesh nodes. This would look like solving the following:
x = sum_i [N_i(xi, eta, zeta)*x_i]
y = sum_i [N_i(xi, eta, zeta)*y_i]
z = sum_i [N_i(xi, eta, zeta)*z_i]
where my unknowns are xi, eta, and zeta, and a linearization is applied (the N_i functions have cross terms like xi*eta*zeta).
These calls would be wrapped inside a functor used in a parallel_for that sweeps over my mesh, so they would need to work at thread level.
from kokkos-kernels.
MueLu would furthermore need:
Team-level:
LAPACK:
- GEQRF
- UNGQR
to do local QR decomposition (we need both R and Q built for the TentativePFactory).
So far, I have my own Kokkos implementation of a QR decomposition which does the job for us, but it might make sense to optimize it and adapt the interface to fit the LAPACK calls...
from kokkos-kernels.
We would need the following BLAS routines at Team and Thread level:
- Dgemm
- Dgemv
- Daxpy
- Ddot
from kokkos-kernels.
Jan: is there a particular problem-size range we should optimize for?
from kokkos-kernels.
That would be ideal, I guess. Depending on the polynomial order of our elements the problem size can vary quite significantly.
from kokkos-kernels.
@crtrott It would be really cool if Kokkos used LIBXSMM under the hood on supported platforms. LIBXSMM doesn't use threads so there should be no composition issues with Kokkos.
There are a bunch of nice demonstrations of LIBXSMM for spectral/finite element codes, so this would likely align with @JanRobertE's request.
from kokkos-kernels.
I'd love you to help out with this ;-)
If you look at the BLAS interface you will see that we have a pretty generic way of plugging in TPLs. Putting LIBXSMM in is certainly an option we should consider. The basic idea for team-level/thread-level versions (and it looks like LIBXSMM is targeted only at the latter right now) would be something like this:
template<class ViewA, class ViewB, class ViewC>
void team_gemm(team_handle& team, ViewA a, ViewB b, ViewC c) {
  Impl::TeamGEMM<ViewA, ViewB, ViewC>::team_gemm(team, a, b, c);
}

template<class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail {
  enum { value = 0 };
};

template<class ViewA, class ViewB, class ViewC,
         int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  static inline
  void team_gemm(team_handle& team, ViewA a, ViewB b, ViewC c) {}
};
Plugging in another TPL would basically require you to specialize the TeamGEMM_TPL_Avail and TeamGEMM structs.
from kokkos-kernels.
GETRF and GETRS for applying the inverse of a sparse matrix's block diagonal. For the application I'm currently considering, blocks are 5x5. An application with more physics would cause these blocks to grow.
from kokkos-kernels.
@jhux2 mentioned yesterday wanting batch versions of those functions. Would all blocks in a batch have the same dimensions, or could they possibly differ?
from kokkos-kernels.
Would all blocks in a batch have the same dimensions, or could they possibly differ?
@mhoemmen For the case I'm currently considering, all the blocks would have the same dimensions.
from kokkos-kernels.
We still need a fundamental decision on whether to provide a LAPACK interface. My personal preference is yes, but only stuff people actually ask for, and (at least initially) only calling through to TPL LAPACK.
from kokkos-kernels.
If you do LAPACK, I would be interested in xGESVD, xGESDD.
from kokkos-kernels.
@crtrott For the batch interface, will we assume that the LAPACK implementation can be called in a thread-parallel context? That's a bit risky if it's very old; it wasn't long ago that DLAMCH and friends had the equivalent of static variables in them that were set on the first call to whatever routine needed them.
@egboman The above paragraph is directly relevant to you. Eigensolvers are hard, and the logical first parallel batch implementation of dense eigensolvers for non-GPU architectures would just call LAPACK in parallel.
from kokkos-kernels.
It seems likely that I will be moving away from LU and toward QR decomposition, which means that instead of needing getr{s,f}, I will be using geqrf, ormqr, and trs{m,v} at the team level. Other requests remain unchanged.
I see that these have been requested by @tawiesn. I have a non-pivoted getr{f,s} so that I could run on the CUDA backend. This change will render the CUDA backend unusable again. Any chance the Kokkos team-level QR decomposition is available somewhere, and/or references to see how it could be implemented? Info on this would be greatly appreciated.
from kokkos-kernels.
@dholladay00 Can you explain a little bit more about your use case? @tawiesn's use case is a bit different from the general use case of QR factorization. His use case is more or less that updating the QR is a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in a matrix and he uses QR in a conventional way, but updating is probably better in my opinion (@tawiesn, if you are still checking GitHub and find I am wrong, please correct me).
On the other hand, @dholladay00, your use case seems to be that you just want to invert a matrix or solve a problem. What is the range of your problem sizes? Variable-size batch, or fixed-size batch? If you don't mind, could you also point out your project? I am wondering how these developed functions are used.
from kokkos-kernels.
@tawiesn's use case is a bit different from the general use case of QR factorization. His use case is more or less that updating the QR is a more efficient way to compute some prolongation operator. Recalling the conversation with him, he has a repeated pattern in a matrix and he uses QR in a conventional way, but updating is probably better in my opinion
@kyungjoo-kim You are correct, the use case for the prolongator is small blocks, many with the same repeated pattern. I'm not sure what you mean by updating.
from kokkos-kernels.
@kyungjoo-kim I solve a block tridiagonal matrix in which each block row can have a different size. The size of a given block matrix will generally range from 10x10 to 1000x1000. The solve is accomplished via the Thomas algorithm, and it requires applying the inverse (previously done by LU, now done by QR due to increased stability for ill-conditioned problems).
Each Kokkos thread team is in charge of a block tridiagonal matrix, so I would like to pass a team handle into these calls so that they can be executed with the appropriate level of parallelism. With the OpenMP backend, I currently place the BLAS/LAPACK call inside a Kokkos::single(PerTeam(...), ...). On the CUDA backend (currently), the wrapper calls into a handwritten version that uses the correct level of parallelism (TeamThread and ThreadVector), but I don't have a handwritten geqrf/ormqr.
from kokkos-kernels.
@dholladay00 wrote:
but I don't have handwritten geqrf/ormqr.
I have the start of that, but it's not at all parallel. I also need this. It's a "team collective" interface; the intent on CPU is to pick cache-block-sized chunks on which to work.
from kokkos-kernels.
Is Tpetra planning to be the place for team-based, batched QR? What are the design decisions in having this in Tpetra? @kyungjoo-kim is implementing the compact QR in Kokkos Kernels as well. It would be good to sync up.
from kokkos-kernels.
The point is not to put this in Tpetra; the point is that a partial implementation exists in Tpetra that implementers could use as a starting point.
from kokkos-kernels.
Point taken. My concerns about scope creep remain. Plus, I think the use case here is not TS.
from kokkos-kernels.
There is no notion of compact QR. All I wrote under the KokkosBatched namespace is just generic serial and team implementations of a subset of BLAS and LAPACK. The compact approach via the KokkosBatched code is one use case. I want to note that different algorithms and optimizations are needed for different use cases. One may want to use the serial and team code in the KokkosBatched routines, but the developed code may not be suited to your use cases. I am writing an eigenvalue solver targeting problem sizes O(10~100). I already have a Householder QR, which is a by-product of the eigenvalue solver. I will merge them later.
from kokkos-kernels.
I have a use case where my matrices start small and grow larger than what the currently supported team-level versions of these routines could handle efficiently.
I am also interested in a unified interface, so that Kokkos can be called for device-level calls too, instead of dispatching directly to a device-specific cuSOLVER or similar where needed.
Device level:
LAPACK:
- geqrf (unpivoted QR)
- unmqr (apply Q)
(Also referencing #567 which is another request for the same kernels)
from kokkos-kernels.