
Comments (39)

crtrott commented on June 5, 2024

From Discussion with LANL
Team level: GEMM, GETRF, GETRS, GEMV (for solving a block tridiagonal matrix)

from kokkos-kernels.

srajama1 commented on June 5, 2024

@jeffhammond : Our tests compare against LIBXSMM where available, but we don't give that as a callable option within a thread/team yet.

mhoemmen commented on June 5, 2024

@dholladay00 The beginning of a hand-rolled QR lives in Trilinos/packages/tpetra/tsqr. I also have a strong interest in finishing this for Kokkos, and a bit of funding to do it, but time is short. I'm in all-day meetings this week, but if you want to talk more about it, just let me know.

dholladay00 commented on June 5, 2024

@mhoemmen I'll message you on the Kokkos users slack and continue this conversation there.

crtrott commented on June 5, 2024

From Email:

Christian,

At our last CoPA telecon, you had asked which BLAS calls we were using in our QMD code and libs. We currently only use the following.
It sounded like there is a Kokkos wrapper for BLAS or something like that? We would be interested in looking at that in our ExaSP2 proxy work. Thanks.

- Sue

BLAS/LAPACK routines used:

  {s,d,c,z}gemm
  {s,d,c,z}scal
  {s,d,c,z}axpy

  {s,d}syev
  {c,z}heevr

nmhamster commented on June 5, 2024

@crtrott are these batched or plain or both?

mhoemmen commented on June 5, 2024

GETRI (global, team, and thread) would be useful too. That should actually reduce the priority of optimizing GETRS. (GETRI computes explicit inverses; use that in the setup phase, and you only need GEMV in the solve phase, instead of GETRS.)
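The GETRI-versus-GETRS point can be sketched in plain C++ (no Kokkos; a 2x2 explicit inverse stands in for GETRF+GETRI, and all names here are illustrative only): pay for the inverse once in setup, and every subsequent solve is a single GEMV.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Sketch (plain C++, not Kokkos): why an explicit inverse (GETRI-style) can
// simplify the solve phase. Setup inverts A once; every later solve is just
// a GEMV, x = Ainv * b, instead of a GETRS triangular solve.
using Mat2 = std::array<std::array<double, 2>, 2>;
using Vec2 = std::array<double, 2>;

// Setup phase: explicit 2x2 inverse (stand-in for GETRF + GETRI).
inline Mat2 invert2x2(const Mat2& A) {
  const double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
  const double s = 1.0 / det;  // assumes A is nonsingular
  return {{{ A[1][1] * s, -A[0][1] * s},
           {-A[1][0] * s,  A[0][0] * s}}};
}

// Solve phase: one GEMV per right-hand side, no factorization work left.
inline Vec2 gemv2(const Mat2& M, const Vec2& v) {
  return {M[0][0] * v[0] + M[0][1] * v[1],
          M[1][0] * v[0] + M[1][1] * v[1]};
}
```

This trades extra setup cost (and some accuracy for ill-conditioned blocks) for a cheaper, simpler solve phase.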

crtrott commented on June 5, 2024

For now the requested ones were just plain (i.e., non-batched). The team ones could obviously be used to build a batched version, but the actual use case we talked about builds the matrices on the fly for the solve.

mhoemmen commented on June 5, 2024

@nmhamster BlockCrsMatrix could use either global batched or team.

crtrott commented on June 5, 2024

Matt wants:
@bathmatt

Team Level:
Blas

  • gemm (with transpose)

Lapack:

  • getf2
  • getri

Really cares about A_inverse from A. A is generated on the fly in the team.

kyungjoo-kim commented on June 5, 2024

At the low level there is no need to distinguish batched from plain, because batched just means we use a parallel_for and put a team or serial interface inside. The global ones are simply wrappers, as we interface them to MKL and cuBLAS.

@bathmatt , I do not think it is necessary to specialize "h". For double/float, conjugate just returns the same value. At the implementation level it does not do anything different either, since conj(val) returns val when the value type is double.

I am strongly against using characters as function arguments. The use of characters in BLAS and LAPACK is just inherited from Fortran, and we do not really follow that convention.
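The layering described in the first paragraph can be sketched in plain C++ (names here are hypothetical; in Kokkos the outer loop would be a parallel_for): the batched routine is nothing but a loop invoking the serial kernel once per problem.

```cpp
#include <cassert>
#include <vector>

// Sketch of the batched-vs-plain layering (plain C++, hypothetical names):
// a "serial" kernel handles one problem; the "batched" version just applies
// that kernel once per problem. In Kokkos the loop body would run inside a
// parallel_for with a serial or team interface inside.
struct SerialScal {
  static void invoke(double alpha, std::vector<double>& x) {
    for (double& xi : x) xi *= alpha;  // one problem: x := alpha * x
  }
};

// Batched wrapper: the same serial building block, applied per problem.
inline void batched_scal(double alpha, std::vector<std::vector<double>>& batch) {
  for (auto& x : batch)  // would be a parallel_for over the batch in Kokkos
    SerialScal::invoke(alpha, x);
}
```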

mhoemmen commented on June 5, 2024

@kyungjoo-kim It's nice to specialize "h" even for real value types, so that people can write one code for any value type.

mhoemmen commented on June 5, 2024

So, do we want to have Kokkos::conj(v) work for a double then? That works for me.

It does work for a double. Users can always call T instead of H if they don't want to call conj on doubles.

kyungjoo-kim commented on June 5, 2024

@mhoemmen When we work on this collection of functionality, we should watch code complexity and avoid having many specializations. Since we mostly need serial and team level implementations (assuming they are decorated as KOKKOS_INLINE_FUNCTION), conj(v) on a double will be stripped out anyway.
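As a plain C++ sketch of the point about conj on reals (my_conj and cdot are hypothetical names; Kokkos provides its own conj): a conj that is the identity on real types lets a single template handle the "h" (conjugate-transpose) case for any value type, and for double the call is a no-op the compiler can strip out.

```cpp
#include <cassert>
#include <complex>

// Generic conjugate: the identity for real types...
template <class T>
T my_conj(T v) { return v; }

// ...and the actual complex conjugate for std::complex.
template <class T>
std::complex<T> my_conj(std::complex<T> v) { return std::conj(v); }

// One generic dot product then covers both the real and the Hermitian
// ("h") complex cases with the same source code.
template <class T>
T cdot(const T* x, const T* y, int n) {
  T sum{};
  for (int i = 0; i < n; ++i) sum += my_conj(x[i]) * y[i];
  return sum;
}
```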

lucbv commented on June 5, 2024

Hi, I would be interested in dense GETRF, GETRS, GEMV at thread level for small matrices, up to 3x3, in order to do linear interpolation between mesh nodes. This would look like solving the following:
x = sum_i [N_i(xi, eta, zeta)*x_i]
y = sum_i [N_i(xi, eta, zeta)*y_i]
z = sum_i [N_i(xi, eta, zeta)*z_i]
where my unknowns are xi, eta and zeta and a linearization is applied (N_i functions have cross terms xietazeta).

These calls would be wrapped inside a functor used in a parfor that sweeps over my mesh so would need to work at thread level.
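A minimal sketch of the kind of thread-level kernel described above, in plain C++ (no Kokkos; tiny_solve is a hypothetical name): an in-place GETRF/GETRS-style solve with partial pivoting for an N<=3 system, the sort of thing each thread would run per mesh element during the Newton update for (xi, eta, zeta).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

// Tiny dense solve for fixed N (e.g. N = 3): LU with partial pivoting
// (the "GETRF" part) followed by back substitution (the "GETRS" part).
// Solution overwrites b. Small enough to live entirely in registers/stack.
template <int N>
void tiny_solve(std::array<std::array<double, N>, N> A, std::array<double, N>& b) {
  for (int k = 0; k < N; ++k) {
    // Partial pivoting: pick the largest entry in column k.
    int p = k;
    for (int i = k + 1; i < N; ++i)
      if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
    std::swap(A[k], A[p]);
    std::swap(b[k], b[p]);
    // Eliminate below the pivot.
    for (int i = k + 1; i < N; ++i) {
      const double m = A[i][k] / A[k][k];
      for (int j = k; j < N; ++j) A[i][j] -= m * A[k][j];
      b[i] -= m * b[k];
    }
  }
  // Back substitution.
  for (int k = N - 1; k >= 0; --k) {
    for (int j = k + 1; j < N; ++j) b[k] -= A[k][j] * b[j];
    b[k] /= A[k][k];
  }
}
```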

tawiesn commented on June 5, 2024

MueLu would furthermore need:

Team-level:

Lapack:

  • GEQRF
  • UNGQR

to do local QR decomposition (we need both R and Q built for the TentativePFactory).

So far, I have my own Kokkos implementation of a QR decomposition which does the job for us, but it might make sense to optimize it and adapt the interface to match the LAPACK calls...
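For illustration, a plain C++ sketch of a local QR that builds both Q and R explicitly, as GEQRF+UNGQR would (modified Gram-Schmidt here for brevity, not the actual MueLu implementation; all names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Modified Gram-Schmidt QR of a column-major m x n matrix, producing an
// explicit Q (m x n, orthonormal columns) and R (n x n, upper triangular),
// the same outputs one would get from GEQRF followed by UNGQR.
struct QRResult {
  std::vector<double> Q;  // column-major m x n
  std::vector<double> R;  // column-major n x n
};

inline QRResult mgs_qr(std::vector<double> A, int m, int n) {
  QRResult out{std::move(A), std::vector<double>(n * n, 0.0)};
  for (int j = 0; j < n; ++j) {
    double* qj = &out.Q[j * m];
    // Project out previously orthonormalized columns; fills column j of R.
    for (int i = 0; i < j; ++i) {
      const double* qi = &out.Q[i * m];
      double r = 0.0;
      for (int k = 0; k < m; ++k) r += qi[k] * qj[k];
      out.R[j * n + i] = r;
      for (int k = 0; k < m; ++k) qj[k] -= r * qi[k];
    }
    double nrm = 0.0;
    for (int k = 0; k < m; ++k) nrm += qj[k] * qj[k];
    nrm = std::sqrt(nrm);
    out.R[j * n + j] = nrm;
    for (int k = 0; k < m; ++k) qj[k] /= nrm;  // assumes full column rank
  }
  return out;
}
```

A Householder-based version would be more robust for ill-conditioned columns, but this shows the shape of the interface.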

JanRobertE commented on June 5, 2024

We would need the following BLAS routines at Team and Thread level:

  • Dgemm
  • Dgemv
  • Daxpy
  • Ddot

crtrott commented on June 5, 2024

Jan: is there a particular problem size we should optimize for?

JanRobertE commented on June 5, 2024

That would be ideal, I guess. Depending on the polynomial order of our elements the problem size can vary quite significantly.

jeffhammond commented on June 5, 2024

@crtrott It would be really cool if Kokkos used LIBXSMM under the hood on supported platforms. LIBXSMM doesn't use threads so there should be no composition issues with Kokkos.

There are a bunch of nice demonstrations of LIBXSMM for spectral/finite element codes, so this would likely align with @JanRobertE's request.

crtrott commented on June 5, 2024

I'd love you to help out with this ;-)

If you look at the BLAS interface you will see that we have a pretty generic way of plugging in TPLs. Putting LIBXSMM in is certainly an option we should consider. The basic idea for team-level / thread-level versions (and it looks like LIBXSMM is targeted only at the latter right now) would be something like this:

template<class team_handle, class ViewA, class ViewB, class ViewC>
void team_gemm(team_handle& team, ViewA a, ViewB b, ViewC c) {
  Impl::TeamGEMM<ViewA, ViewB, ViewC>::team_gemm(team, a, b, c);
}

template<class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail {
  enum { value = 0 };
};

template<class ViewA, class ViewB, class ViewC,
         int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  template<class team_handle>
  static inline
  void team_gemm(team_handle& team, ViewA a, ViewB b, ViewC c) { /* generic impl */ }
};

Plugging in another TPL would basically require you to specialize the TeamGEMM_TPL_Avail and TeamGEMM structs.
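As a standalone illustration of that dispatch pattern (the view types and the LIBXSMM tie-in are hypothetical; no Kokkos involved), specializing the availability trait to 1 for a view-type combination routes those calls to the TPL branch at compile time:

```cpp
#include <cassert>

// Standalone illustration of trait-based TPL dispatch (hypothetical types).
struct PlainView {};  // a view type with no TPL support
struct XsmmView {};   // pretend a TPL (e.g. LIBXSMM) handles this one

// Default: no TPL available for this type combination.
template <class ViewA, class ViewB, class ViewC>
struct TeamGEMM_TPL_Avail { enum { value = 0 }; };

// Opting a type combination in is a one-line specialization.
template <>
struct TeamGEMM_TPL_Avail<XsmmView, XsmmView, XsmmView> { enum { value = 1 }; };

// Generic fallback implementation.
template <class ViewA, class ViewB, class ViewC,
          int TPL_AVAIL = TeamGEMM_TPL_Avail<ViewA, ViewB, ViewC>::value>
struct TeamGEMM {
  static int which() { return 0; }  // would run the native kernel
};

// TPL-backed implementation, selected when the trait says value == 1.
template <class ViewA, class ViewB, class ViewC>
struct TeamGEMM<ViewA, ViewB, ViewC, 1> {
  static int which() { return 1; }  // would call into the TPL
};
```

The selection happens entirely at compile time, so the fallback and TPL paths carry no runtime branching cost.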

jhux2 commented on June 5, 2024

GETRF and GETRS for applying the inverse of a sparse matrix's block diagonal. For the application I'm currently considering, blocks are 5x5. An application with more physics would cause these blocks to grow.

mhoemmen commented on June 5, 2024

@jhux2 mentioned yesterday wanting batch versions of those functions. Would all blocks in a batch have the same dimensions, or could they possibly differ?

jhux2 commented on June 5, 2024

Would all blocks in a batch have the same dimensions, or could they possibly differ?

@mhoemmen For the case I'm currently considering, all the blocks would have the same dimensions.

crtrott commented on June 5, 2024

We still need a fundamental decision on whether to provide a LAPACK interface. My personal preference is yes, but only stuff people actually ask for, and (at least initially) only calling through to TPL LAPACK.

egboman commented on June 5, 2024

If you do LAPACK, I would be interested in xGESVD, xGESDD.

mhoemmen commented on June 5, 2024

@crtrott For the batch interface, will we assume that the LAPACK implementation can be called in a thread-parallel context? That's a bit risky if it's very old; it wasn't long ago that DLAMCH and friends had the equivalent of static variables in them that were set on first call to whatever routine needed them.

@egboman The above paragraph is directly relevant to you. Eigensolvers are hard, and the logical first parallel batch implementation of dense eigensolvers for non-GPU architectures would just call LAPACK in parallel.

dholladay00 commented on June 5, 2024

It seems likely that I will be moving away from LU and moving to QR decomposition, which means that instead of needing getr{s,f}, I will be using geqrf, ormqr, and trs{m,v} at the team level. Other requests remain unchanged.

I see that these have been requested by @tawiesn. I have a non-pivoted getr{f,s} so that I could run on the CUDA backend; this change will render the CUDA backend unusable again. Is the Kokkos team-level QR decomposition available somewhere, and/or are there references showing how it could be implemented? Info on this would be greatly appreciated.

kyungjoo-kim commented on June 5, 2024

@dholladay00 Can you explain a little bit more about your use case? @tawiesn 's use case is a bit different from the general use of the QR factorization: for him, updating a QR factorization is a more efficient way to compute a prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion (@tawiesn if you are still checking github and find I am wrong, please correct me).

On the other hand, @dholladay00 your use case seems to be that you just want to invert a matrix or solve a problem. What is the range of your problem sizes? Variable batch or fixed size batch? If you don't mind, could you also point me to your project? I am wondering how these functions will be used.

jhux2 commented on June 5, 2024

@tawiesn 's use case is a bit different from the general use of the QR factorization: for him, updating a QR factorization is a more efficient way to compute a prolongation operator. Recalling the conversation with him, he has a repeated pattern in the matrix and he uses QR in the conventional way, but updating is probably better in my opinion

@kyungjoo-kim You are correct, the use case for the prolongator is small blocks, many with the same repeated pattern. I'm not sure what you mean by updating.

dholladay00 commented on June 5, 2024

@kyungjoo-kim I solve a block tridiagonal matrix in which each block row can have a different size. The size of a given block will generally range from 10x10 to 1000x1000. The solve is accomplished via the Thomas algorithm, and it requires applying the inverse of each diagonal block (previously done by LU, now done by QR for better stability on ill-conditioned problems).

Each Kokkos thread team is in charge of one block tridiagonal matrix, so I would like to pass a team handle into these calls so that they can be executed with the appropriate level of parallelism. With the OpenMP backend, I currently place the BLAS/LAPACK call inside a Kokkos::single(PerTeam(...) ...). On the CUDA backend, the wrapper currently calls into a handwritten version that uses the correct levels of parallelism (TeamThread and ThreadVector), but I don't have a handwritten geqrf/ormqr.
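A scalar sketch of the Thomas algorithm in question (plain C++; in the block version each scalar divide becomes "apply the inverse of a diagonal block", which is where getrf/getrs or geqrf/ormqr come in):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar Thomas algorithm for a tridiagonal system.
// a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal
// (c[n-1] unused), d: right-hand side; the solution overwrites d.
inline void thomas(std::vector<double> a, std::vector<double> b,
                   std::vector<double> c, std::vector<double>& d) {
  const int n = static_cast<int>(b.size());
  // Forward elimination; in the block version w = a[i] * inv(b[i-1]).
  for (int i = 1; i < n; ++i) {
    const double w = a[i] / b[i - 1];
    b[i] -= w * c[i - 1];
    d[i] -= w * d[i - 1];
  }
  // Back substitution; each divide becomes a block solve/apply-inverse.
  d[n - 1] /= b[n - 1];
  for (int i = n - 2; i >= 0; --i)
    d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```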

mhoemmen commented on June 5, 2024

@dholladay00 wrote:

but I don't have handwritten geqrf/ormqr.

I have the start of that, but it's not at all parallel. I also need this. It's a "team collective" interface; the intent on CPU is to pick cache-block-sized chunks on which to work.

srajama1 commented on June 5, 2024

Is Tpetra planning to be the place for team-based, batched QR? What are the design decisions behind having this in Tpetra? @kyungjoo-kim is implementing the compact QR in Kokkos Kernels as well. It would be good to sync up.

mhoemmen commented on June 5, 2024

The point is not to put this in Tpetra; the point is that a partial implementation exists in Tpetra that implementers could use as a starting point.

srajama1 commented on June 5, 2024

Point taken. My concerns about scope creep remain. Plus, I think the use case here is not TS (tall and skinny).

kyungjoo-kim commented on June 5, 2024

There is no notion of a "compact QR". Everything I wrote under the KokkosBatched namespace is just a generic implementation of serial and team interfaces for a subset of BLAS and LAPACK; the compact approach is one use case of the KokkosBatched code. I want to note that different algorithms and optimizations are needed for different use cases: one may want to use the serial and team code in the KokkosBatched routines, but the developed code may not be suited to your use case. I am writing an eigenvalue solver targeting problem sizes of O(10-100). I already have a Householder QR, which is a by-product of the eigenvalue solver; I will merge it later.

wlruys commented on June 5, 2024

I have a use case where my matrices start small and grow larger than the currently supported team-level versions of these routines can handle efficiently.

I am interested in a unified interface, so that Kokkos can be called for device-level operations too instead of dispatching directly to a device-specific cuSOLVER or similar where needed.

Device level:
Lapack:

  • geqrf (unpivoted QR)
  • unmqr (apply Q)

(Also referencing #567 which is another request for the same kernels)
