drtimothyaldendavis / graphblas

SuiteSparse:GraphBLAS: graph algorithms in the language of linear algebra. For production: (default) STABLE branch. Code development: ask me for the right branch before submitting a PR. video intro: https://youtu.be/Tj5y6d7FegI .

Home Page: http://faculty.cse.tamu.edu/davis/GraphBLAS.html

License: Other

CMake 0.31% Makefile 0.05% Shell 0.04% C 91.78% MATLAB 4.78% C++ 2.27% Cuda 0.43% Awk 0.01% HTML 0.30% Dockerfile 0.01% Assembly 0.03%

graphblas's Introduction

SuiteSparse:GraphBLAS

SuiteSparse:GraphBLAS, Timothy A. Davis, (c) 2017-2024, All Rights Reserved.

SPDX-License-Identifier: Apache-2.0

VERSION 9.2.0, May 22, 2024

SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.
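
For example, one level of a breadth-first search can be written as a single masked vector-matrix multiply over the Boolean OR-AND semiring. The following is a minimal sketch, not taken from this package's documentation; error checking is omitted.

#include "GraphBLAS.h"

// One BFS level: the new frontier is every not-yet-visited node that is
// adjacent to the current frontier.
void bfs_level (GrB_Vector frontier, GrB_Vector visited, GrB_Matrix A)
{
    // frontier<!visited,replace> = frontier OR.AND A
    GrB_vxm (frontier, visited, NULL, GrB_LOR_LAND_SEMIRING_BOOL,
        frontier, A, GrB_DESC_RC) ;
}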

SuiteSparse:GraphBLAS is used heavily in production. It appears as the underlying graph engine in the RedisGraph database by Redis, and as the built-in sparse matrix multiply in MATLAB R2021a, where C=A*B is now up to 30x faster than in prior versions of MATLAB (on my 20-core NVIDIA DGX Station).

The development of this package is supported by Intel, NVIDIA (including the donation of the 20-core DGX Station), Redis, MIT Lincoln Lab, MathWorks, IBM, and Julia Computing.

See the user guide in Doc/GraphBLAS_UserGuide.pdf for documentation on the SuiteSparse implementation of GraphBLAS, and how to use it in your applications.

See http://graphblas.org for more information on GraphBLAS, including the GraphBLAS C API. See https://github.com/GraphBLAS/GraphBLAS-Pointers for additional resources on GraphBLAS.

QUICK START: To compile and install, do these commands in this directory:

make
sudo make install

Please be patient; some files can take several minutes to compile. A C11 compiler is required, so cmake will fail if your compiler is not C11 compliant. See the User Guide PDF in Doc/ for directions on how to use another compiler.

For faster compilation, use the following instead of just "make"; it compiles the package with 32 parallel jobs:

make JOBS=32

The output of the demo programs will be compared with their expected output.

To remove all compiled files:

make clean

To compile and run the demos:

make demos

See the GraphBLAS/ subfolder for the Octave/MATLAB interface, which contains a README.md file with further details.


Files and folders in this GraphBLAS directory:

CMakeLists.txt: cmake instructions to compile GraphBLAS

cmake_modules: additional cmake files

Config: version-dependent files used by CMake

Demo: a set of demos on how to use GraphBLAS

Doc: SuiteSparse:GraphBLAS User Guide and license

GraphBLAS: the @GrB Octave/MATLAB interface, including its test suite and demos. This folder is called 'GraphBLAS' so that typing 'help graphblas' or 'doc graphblas' in the Octave or MATLAB Command Window can locate the Contents.m file.

Include: user-accessible include file, GraphBLAS.h

Makefile: to compile the SuiteSparse:GraphBLAS library and demos

README.md: this file

Source: source files of the SuiteSparse:GraphBLAS library.

Tcov: test coverage, requires Octave or MATLAB. See the log...txt files, which certify 100% test coverage.

Test: extensive tests, not meant for general usage. To compile and run them, go to this directory and type make ; testall in Octave or MATLAB (one of which is required).

build: build directory for CMake, initially empty

CUDA: GPU interface, a work in progress. This is being developed in collaboration with Joe Eaton, Corey Nolet and others at NVIDIA, with support from NVIDIA. It appears in this release but the CUDA folder is a draft that isn't ready to use yet.

CONTRIBUTOR-LICENSE.txt: how to contribute to SuiteSparse:GraphBLAS

cpu_features: (c) Google.com, Apache 2.0 license.

logo: the (awesome!) GraphBLAS logo by Jakab Rokob, CC BY 4.0 license

lz4: LZ4 compression, (c) 2011-2016, Yann Collet, BSD2 license

zstd: ZSTD compression, (c) Meta, by Yann Collet, BSD3 license

xxHash: xxHash code, (c) 2012-2021, Yann Collet

rmm_wrap: Rapids Memory Manager, (c) NVIDIA, to use with CUDA. (draft; not yet in use)

JITpackage: a small program that packages the GraphBLAS source code into the GraphBLAS library itself so it can compile the JIT kernels. If you edit the GraphBLAS source code, see the README.txt file in this directory for instructions.

LICENSE: licenses for GraphBLAS and its 3rd party dependencies

PreJIT: a folder for JIT kernels that are to be integrated into the compiled GraphBLAS library.


GraphBLAS C API Specification:

Versions v5.2.0 and earlier conform to version 1.3.0 (Sept 25, 2019) of the GraphBLAS C API Specification. Versions v6.0.0 and later conform to version 2.0.0 (Nov. 2021), and versions 9.0.0 and later conform to the v2.1.0 C API. This library also includes several additional functions and features as extensions to the spec.

All functions, objects, and macros with the prefix GxB are extensions to the spec. Functions, objects, and macros with prefix GB must not be accessed by user code. They are for internal use in GraphBLAS only.


About Benchmarking

Do not use the demos in GraphBLAS/Demos for benchmarking or in production. Those are simple methods for illustration only, and can be slow. Use LAGraph for benchmarking and production uses.

I have tested this package extensively on multicore single-socket systems, but have not yet optimized it for multi-socket systems with a NUMA architecture. That will be done in a future release. If you publish benchmarks with this package, please state the SuiteSparse:GraphBLAS version, and a caveat if appropriate. If you see significant performance issues when going from a single-socket to multi-socket system, I would like to hear from you so I can look into it.

Contact me at [email protected] for any questions about benchmarking SuiteSparse:GraphBLAS and LAGraph.


Contributing to SuiteSparse:GraphBLAS

To report a bug (gasp!) or request a feature, use the issue tracker on github.com, at https://github.com/DrTimothyAldenDavis/GraphBLAS/issues or https://github.com/DrTimothyAldenDavis/SuiteSparse/issues.

To contribute code, you can submit a pull request. To do so, you must first agree to the Contributor License Agreement in CONTRIBUTOR-LICENSE.txt. Print a copy of the file (as a PDF), sign and date it, and email it to me at [email protected]. Pull requests will only be merged into SuiteSparse after I receive your email with the signed PDF.


Licensing and supporting SuiteSparse:GraphBLAS

SuiteSparse:GraphBLAS is released primarily under the Apache-2.0 license because of how the project is supported by many organizations (NVIDIA, Redis, MIT Lincoln Lab, Intel, IBM, and Julia Computing), primarily through gifts to the Texas A&M Foundation. Because of this support, and to facilitate the widespread use of GraphBLAS, the decision was made to give this library a permissive open-source license (Apache-2.0). Currently all source code required to create the C-callable library libgraphblas.so is licensed under Apache-2.0, and there are no plans to change this.

However, just because this code is free to use doesn't make it zero-cost to create. If you are using GraphBLAS in a commercial closed-source product and are not supporting its development, please consider supporting this project to ensure that it will continue to be developed and enhanced in the future.

To support the development of GraphBLAS, contact the author ([email protected]) or the Texas A&M Foundation (True Brown, [email protected]; or Kevin McGinnis, [email protected]) for details.

SuiteSparse:GraphBLAS is copyrighted by Timothy A. Davis, (c) 2017-2024, All Rights Reserved. [email protected].


For distro maintainers (Linux, homebrew, spack, R, Octave, Trilinos, ...):

Thanks for packaging SuiteSparse! Here are some suggestions:

* GraphBLAS takes a long time to compile because it creates many fast
    "FactoryKernels" at compile-time.  If you want to reduce the compile
    time and library size, enable the GRAPHBLAS_COMPACT mode, but keep the
    JIT enabled.  Then GraphBLAS will compile the kernels it needs at
    run-time, via its JIT.  Performance will be the same as the
    FactoryKernels once the JIT kernels are compiled.  User-compiled
    kernels are placed in ~/.SuiteSparse by default.  You do not need to
    distribute the GraphBLAS source to enable the JIT: libgraphblas.so
    and GraphBLAS.h are enough.

* GraphBLAS needs OpenMP!  It's fundamentally a parallel code so please
    distribute it with OpenMP enabled.  Performance will suffer
    otherwise.

References:

To cite this package, please use the following:

T. A. Davis. Algorithm 1037: SuiteSparse:GraphBLAS: Parallel Graph
Algorithms in the Language of Sparse Linear Algebra. ACM Trans. Math.
Softw. 49, 3, Article 28 (September 2023), 30 pages.
https://doi.org/10.1145/3577195

T. A. Davis. Algorithm 1000: SuiteSparse:GraphBLAS: Graph Algorithms in
the Language of Sparse Linear Algebra. ACM Trans. Math. Softw. 45, 4,
Article 44 (December 2019). https://doi.org/10.1145/3322125

Software Acknowledgements

SuiteSparse:GraphBLAS relies on the following packages (details in the LICENSE file, and in the GraphBLAS User Guide):

(1) LZ4, xxHash, and ZSTD by Yann Collet, appearing here under the BSD2 or BSD3 licenses.

(2) cpu_features (c) Google, Apache 2.0 license, with components (c) IBM and Intel (also Apache 2.0), and the cpu_features/ndk_compat component (c) The Android Open Source Project (BSD-2-clause)

graphblas's People

Contributors

cjnolet, drtimothyaldendavis, eriknw, gruenich, jamesetsmith, jeaton32, mmuetzel, scottkolo, swilly22, szarnyasg, vidithm, yvdriess


graphblas's Issues

GxB_Vector_diag should respect transpose

On the main diagonal, setting desc.in0=transpose should obviously do nothing. But if k != 0, it seems to me that desc.in0=transpose should perform the transpose, right?

OpenMP issue in Windows (doesn't like `int64_t` in for loop initialization)

We are trying to build 6.1.3 and latest master branch on Windows and encountered the following OpenMP error:

[ 28%] Building C object CMakeFiles/graphblas.dir/Source/Generated2/GB_AxB__bxnor_band_uint16.c.obj
GB_AxB__bxnor_band_uint16.c
d:\bld\graphblas_1641574797935\work\source\template\GB_AxB_dot4_template.c(592): error C3015: initialization in OpenMP 'for' statement has improper form
d:\bld\graphblas_1641574797935\work\source\template\GB_AxB_dot4_template.c(650): error C3015: initialization in OpenMP 'for' statement has improper form
d:\bld\graphblas_1641574797935\work\source\template\GB_AxB_dot4_template.c(712): error C3015: initialization in OpenMP 'for' statement has improper form
NMAKE : fatal error U1077: 'D:\bld\graphblas_1641574797935\_build_env\Library\bin\cmake.exe' : return code '0x2'
Stop.
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX64\x64\nmake.exe"' : return code '0x2'
Stop.
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX64\x64\nmake.exe"' : return code '0x2'
Stop.

as can be seen here:
https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=437194&view=logs&j=a70f640f-cc53-5cd3-6cdc-236a1aa90802&t=f5d15007-a01c-5ad8-c9ce-4d519d3b275f&l=1985

I was able to fix it by removing int64_t from the for loop initialization (for example, changing for (int64_t k = 0 ; k < vlen ; k++) to for (k = 0 ; k < vlen ; k++)). I used the patch below:

diff --git a/Source/Template/GB_AxB_dot4_template.c b/Source/Template/GB_AxB_dot4_template.c
index 5ab19ce822..61f5bc32ee 100644
--- a/Source/Template/GB_AxB_dot4_template.c
+++ b/Source/Template/GB_AxB_dot4_template.c
@@ -588,14 +588,17 @@
                     //----------------------------------------------------------

                     GB_BTYPE *restrict G = W ;
-                    #pragma omp parallel for num_threads(nthreads) \
-                        schedule(static)
-                    for (int64_t k = 0 ; k < vlen ; k++)
                     {
-                        // G (k,0:1) = B (k,j1:j1+1)
-                        const int64_t k2 = k << 1 ;
-                        G [k2    ] = Bx [k + (j1    ) * vlen] ;
-                        G [k2 + 1] = Bx [k + (j1 + 1) * vlen] ;
+                        int64_t k;
+                        #pragma omp parallel for num_threads(nthreads) \
+                            schedule(static)
+                        for (k = 0 ; k < vlen ; k++)
+                        {
+                            // G (k,0:1) = B (k,j1:j1+1)
+                            const int64_t k2 = k << 1 ;
+                            G [k2    ] = Bx [k + (j1    ) * vlen] ;
+                            G [k2 + 1] = Bx [k + (j1 + 1) * vlen] ;
+                        }
                     }

                     //----------------------------------------------------------
@@ -646,15 +649,18 @@
                     //----------------------------------------------------------

                     GB_BTYPE *restrict G = W ;
-                    #pragma omp parallel for num_threads(nthreads) \
-                        schedule(static)
-                    for (int64_t k = 0 ; k < vlen ; k++)
                     {
-                        // G (k,0:2) = B (k,j1:j1+2)
-                        const int64_t k3 = k * 3 ;
-                        G [k3    ] = Bx [k + (j1    ) * vlen] ;
-                        G [k3 + 1] = Bx [k + (j1 + 1) * vlen] ;
-                        G [k3 + 2] = Bx [k + (j1 + 2) * vlen] ;
+                        int64_t k;
+                        #pragma omp parallel for num_threads(nthreads) \
+                            schedule(static)
+                        for (k = 0 ; k < vlen ; k++)
+                        {
+                            // G (k,0:2) = B (k,j1:j1+2)
+                            const int64_t k3 = k * 3 ;
+                            G [k3    ] = Bx [k + (j1    ) * vlen] ;
+                            G [k3 + 1] = Bx [k + (j1 + 1) * vlen] ;
+                            G [k3 + 2] = Bx [k + (j1 + 2) * vlen] ;
+                        }
                     }

                     //----------------------------------------------------------
@@ -708,16 +714,19 @@
                     //----------------------------------------------------------

                     GB_BTYPE *restrict G = W ;
-                    #pragma omp parallel for num_threads(nthreads) \
-                        schedule(static)
-                    for (int64_t k = 0 ; k < vlen ; k++)
                     {
-                        // G (k,0:3) = B (k,j1:j1+3)
-                        const int64_t k4 = k << 2 ;
-                        G [k4    ] = Bx [k + (j1    ) * vlen] ;
-                        G [k4 + 1] = Bx [k + (j1 + 1) * vlen] ;
-                        G [k4 + 2] = Bx [k + (j1 + 2) * vlen] ;
-                        G [k4 + 3] = Bx [k + (j1 + 3) * vlen] ;
+                        int64_t k;
+                        #pragma omp parallel for num_threads(nthreads) \
+                            schedule(static)
+                        for (k = 0 ; k < vlen ; k++)
+                        {
+                            // G (k,0:3) = B (k,j1:j1+3)
+                            const int64_t k4 = k << 2 ;
+                            G [k4    ] = Bx [k + (j1    ) * vlen] ;
+                            G [k4 + 1] = Bx [k + (j1 + 1) * vlen] ;
+                            G [k4 + 2] = Bx [k + (j1 + 2) * vlen] ;
+                            G [k4 + 3] = Bx [k + (j1 + 3) * vlen] ;
+                        }
                     }

                     //----------------------------------------------------------


After applying the above fix, we still get an error when trying to build with commit: https://github.com/DrTimothyAldenDavis/GraphBLAS/commit/17f431150924edacfc70d08640f239893300dbe1 with cpu_features disabled.

The new error is:

[ 54%] Building C object CMakeFiles/graphblas.dir/Source/Generated2/GB_AxB__plus_times_fp32.c.obj
GB_AxB__plus_times_fp32.c
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(447): error C2485: 'target': unrecognized extended attribute
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(447): error C2059: syntax error: ')'
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(447): error C2059: syntax error: 'type'
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(485): error C2485: 'target': unrecognized extended attribute
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(485): error C2059: syntax error: ')'
%SRC_DIR%\Source\Generated2\GB_AxB__plus_times_fp32.c(485): error C2059: syntax error: 'type'
%SRC_DIR%\Source\Template\GB_AxB_saxpy5_meta.c(118): warning C4013: 'GB_AxB_saxpy5_unrolled_avx512f' undefined; assuming extern returning int
%SRC_DIR%\Source\Template\GB_AxB_saxpy5_meta.c(129): warning C4013: 'GB_AxB_saxpy5_unrolled_avx2' undefined; assuming extern returning int
NMAKE : fatal error U1077: 'D:\bld\graphblas_1641583385327\_build_env\Library\bin\cmake.exe' : return code '0x2'
Stop.
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX64\x64\nmake.exe"' : return code '0x2'
Stop.
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX64\x64\nmake.exe"' : return code '0x2'
Stop.

as seen here:
https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=437290&view=logs&j=a70f640f-cc53-5cd3-6cdc-236a1aa90802&t=f5d15007-a01c-5ad8-c9ce-4d519d3b275f&l=3393

We added -D GBNCPUFEAT=1 to the cmake command:
https://github.com/conda-forge/graphblas-feedstock/blob/7e60db768b0116b7c9cfc5c5baded573c63b24b8/recipe/bld.bat

Export/unpack without values

Sometimes when I export/unpack an object, I only want the indices. It would be nice to be able to pass NULL in for the values array. I believe you already support this behavior in extract tuples.

Sometimes exporting in a given format requires SuiteSparse to perform work (such as transposing or sorting). If we know the values aren't needed, then maybe less work could be done.

I don't expect this to be optimized any time soon, but I suspect it may be optimized at some point in the future. If I could write code that passes NULL in for the values today, then I could benefit from any future optimization.

I suspect passing in NULL for any array in export/unpack would be super-tedious work. I'm not asking for this. Only values.

Alternatively, is there some trick that can be done that effectively does this efficiently? Perhaps one could create a new iso-Matrix with the same structure as the non-iso-Matrix, then export the new iso-Matrix.
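
For what it's worth, here is a rough sketch of the iso-matrix trick suggested in the last paragraph (my own reconstruction, not from SuiteSparse; whether the result is actually stored as iso is an implementation detail, and error checking is omitted):

// Build a structural copy S of A whose entries are all 1, then export or
// unpack S instead of A when only the pattern is needed.
GrB_Matrix S ;
GrB_Index nrows, ncols ;
GrB_Matrix_nrows (&nrows, A) ;
GrB_Matrix_ncols (&ncols, A) ;
GrB_Matrix_new (&S, GrB_BOOL, nrows, ncols) ;
GrB_Matrix_apply (S, NULL, NULL, GxB_ONE_BOOL, A, NULL) ;
// ... unpack/export S; its values array is trivial ...
GrB_Matrix_free (&S) ;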

Add GxB_diag - assign GrB_vector to k-th diagonal

As discussed on the forum adding this here for tracking.

GrB_Vector_size (&n, v) reports n = 5, and if k = 0, then GxB_diag (&A, v, 0) would create a 5-by-5 diagonal matrix.

If k = 1 or -1, then A would be 6-by-6. Here's what I think it should do, just like MATLAB, except it would always
produce a sparse matrix, not a full one like MATLAB.

If nvals(v) << n then I would create A as hypersparse. This GxB_diag function would be very simple to write,
and it would take O(nvals(v)) time.

>> v = rand (1,5)
v =
    0.8147    0.9058    0.1270    0.9134    0.6324
>> diag (v, 1)
ans =
         0    0.8147         0         0         0         0
         0         0    0.9058         0         0         0
         0         0         0    0.1270         0         0
         0         0         0         0    0.9134         0
         0         0         0         0         0    0.6324
         0         0         0         0         0         0
>> size(ans)
ans =
     6     6
>> diag (v, -2)
ans =
         0         0         0         0         0         0         0
         0         0         0         0         0         0         0
    0.8147         0         0         0         0         0         0
         0    0.9058         0         0         0         0         0
         0         0    0.1270         0         0         0         0
         0         0         0    0.9134         0         0         0
         0         0         0         0    0.6324         0         0

[Bug] GrB_apply one

I've not tested this in C yet, but in both Rust and Java, running GrB_Matrix_apply with the unary op one gives 0 instead of 1.
Steps to reproduce:

fn test_matrix_apply_transpose_on_self_fail_java() {
        let mut mat1 = SparseMatrix::<i32>::empty((1, 3));
        mat1.insert((0, 0), -3); // GrB_Matrix_setElement
        mat1.insert((0, 2), -4);
        mat1.insert((0, 1), 10);

        mat1.wait();

        // GrB_Matrix_apply
        let mut mat2 = mat1.apply_mut::<i32, bool>(
            UnaryOp::<i32, i32>::one(), //GxB_ONE_INT32
            None,
            None,
            &Descriptor::default(),
        ); 

        mat2.wait();

        assert_eq!(mat2.shape(), (1, 3));
        assert_eq!(mat2.get((0, 0)), Some(1)); //this fails
        assert_eq!(mat2.get((0, 2)), Some(1)); //this fails
        assert_eq!(mat2.get((0, 1)), Some(1)); //this fails
    }
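
A roughly equivalent C reconstruction of the report (mine, not the reporter's; error checking omitted):

#include "GraphBLAS.h"

int main (void)
{
    GrB_init (GrB_NONBLOCKING) ;

    GrB_Matrix A, C ;
    GrB_Matrix_new (&A, GrB_INT32, 1, 3) ;
    GrB_Matrix_setElement_INT32 (A, -3, 0, 0) ;
    GrB_Matrix_setElement_INT32 (A, 10, 0, 1) ;
    GrB_Matrix_setElement_INT32 (A, -4, 0, 2) ;

    // apply the "one" unary op: every entry of C should be 1, not 0
    GrB_Matrix_new (&C, GrB_INT32, 1, 3) ;
    GrB_Matrix_apply (C, NULL, NULL, GxB_ONE_INT32, A, NULL) ;
    GxB_print (C, GxB_COMPLETE) ;

    GrB_finalize ( ) ;
}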

Extract matrix values (edges) used in a VxM

Is there any easy way to produce a matrix consisting only of the values interacted with in a VxM operation?

Specifically, I am using one step of a breadth-first search-like procedure to produce an s-t cut in the graph. I know that the edges traversed in this VxM correspond to the cut I need, but I am unsure how to extract them.

The VxM produces the full set of vertices on the "t-side" of the cut, and by searching one step "backwards" again (VxM using the t-set and the transposed graph) I could produce all the parent-vertices. By taking all the edges then implicitly stored in the two sets of vertices, I would likely get most but not all of the cut.

I could also utilize these two sets of vertices to find the edges manually using matrix_extract, but that would lead to the kind of run-time I am trying to avoid.

I would appreciate any help, or an indication that this kind of procedure isn't really supported in GraphBLAS.

Intel compiler error (icc)

Hello,

I am trying to use Intel Compiler to compile GraphBLAS but I get the following error:

GraphBLAS/Source/GB_ops.c(531): error: expression must have a constant value
GB_MONOID_DEFT ( GrB_, MIN_FP32     , float     , INFINITY    , -INFINITY )

My intel compiler version is
icc (ICC) 19.0.3.199 20190206

Here it says most of C11 features are supported by icc
https://software.intel.com/content/www/us/en/develop/articles/c11-support-in-intel-c-compiler.html

I am using GraphBLAS version 4.0.3 and the following commands to compile it:

mkdir build & cd build 
CC=icc cmake ..
cd ..
make library JOBS=32

Any idea what is wrong?

GrB_eraseElement

Is there any specific reason why there isn't an erase function? And can we get one in a future release?

unpack/pack functions sometimes switch ncols and nrows of the underlying Matrix

For example, if you unpack a Matrix via FullR and then pack it with FullC, then the shape of the Matrix is transposed. I confirmed this by printing nrows and ncols after the unpack function, and they are reversed afterwards.

I have observed this reversing behavior with the following combinations of unpack-then-pack functions:

unpack pack
GxB_Matrix_unpack_FullR GxB_Matrix_pack_FullC
GxB_Matrix_unpack_FullC GxB_Matrix_pack_FullR
GxB_Matrix_unpack_BitmapR GxB_Matrix_pack_BitmapC
GxB_Matrix_unpack_BitmapC GxB_Matrix_pack_BitmapR

Perhaps this is behaving as intended (I didn't read the fine print), but, nevertheless, I find it surprising that a pack operation can change the shape of a Matrix.
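
A hedged C reconstruction of one such round trip, based on my reading of the pack/unpack signatures in GraphBLAS.h (my sketch, not the reporter's code):

// Unpack the values of A (held by row), then pack them back by column.
// The reporter observes that nrows/ncols of A are swapped afterwards.
void *Ax = NULL ;
GrB_Index Ax_size = 0 ;
bool iso = false ;
GxB_Matrix_unpack_FullR (A, &Ax, &Ax_size, &iso, NULL) ;
GxB_Matrix_pack_FullC (A, &Ax, Ax_size, iso, NULL) ;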

Growing size of compiled shared library

Not sure this is something that is easy to do anything about, but the size trend of the shared library seems to be blowing up:

  • v3.3.3 => 71MB uncompressed, 31MB compressed
  • v4.0.0draft4 => 109MB uncompressed, 40MB compressed
  • v4.0.0draft5 => 448MB uncompressed, 135MB compressed

This matters when including prebuilt graphblas libraries in docker images (where the image size is important)

For reference, I've published the docker images here:
https://hub.docker.com/r/graphblas/graphblas


# Dockerfile/how the images are built:
FROM ubuntu:20.04 as builder
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -yq cmake make wget

ARG VERSION

WORKDIR /home/build
RUN wget https://github.com/DrTimothyAldenDavis/GraphBLAS/archive/v$VERSION.tar.gz
RUN tar -xf v$VERSION.tar.gz

WORKDIR /home/build/GraphBLAS-$VERSION
RUN make JOBS=8 && make install

FROM scratch
COPY --from=builder /usr/local/lib/ /usr/local/lib/
COPY --from=builder /usr/local/include/ /usr/local/include/

Compile with cosmopolitan libc

https://github.com/jart/cosmopolitan

"Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable."

Could be really interesting to try building GraphBLAS/LAGraph using cosmopolitan libc instead of glibc.

GraphBLAS initialization and separate calls in Python wrapper

I built a C++ library with two functions, f1 and f2, and call them from a Python wrapper using ctypes.

In f1 I call GrB_init(GrB_NONBLOCKING); and do some computations on GraphBLAS matrices with a user-defined type. f1 returns a pointer to the C++ class holding these matrices back to the Python wrapper.

After that I call f2 from the Python wrapper and use this class and its GraphBLAS matrices, but I get a segmentation fault (core dumped) when I try to use the GraphBLAS matrices.

Is there a way to initialize GraphBLAS in one call from the wrapper and continue to use the matrices in subsequent calls, or are there some limitations here?

vxm does not return

Unexpected behavior

When GrB_vxm is called with empty w, u, A, and non-empty mask, GrB_vxm never returns.

Example code

#include "GraphBLAS.h"

int main(int argc, char *argv[]) {
    GrB_init(GrB_BLOCKING);

    size_t size = 100;

    GrB_Vector w;
    GrB_Vector_new(&w, GrB_BOOL, size);

    GrB_Vector u;
    GrB_Vector_new(&u, GrB_BOOL, size);

    GrB_Vector mask;
    GrB_Vector_new(&mask, GrB_BOOL, size);
    for (int i = 0; i < 5; i++) {
        GrB_Vector_setElement_BOOL(mask, true, i);
    }

    GrB_Matrix A;
    GrB_Matrix_new(&A, GrB_BOOL, size, size);

    GrB_vxm(w, mask, GrB_NULL, GxB_LOR_LAND_BOOL, u, A, GrB_NULL);

    GrB_finalize();
}

Detailed Description

If the inputs u and A are empty, the hash table used in loop GB_AxB_saxpy3_symbolic.c#L173 will have space for 4 elements.
If the mask has more than 4 elements, loop GB_AxB_saxpy3_symbolic.c#L167 will not be able to find space for all elements, and it will loop forever.

GxB for vector dot product?

How much trouble would it be for you to finally add this? I would use it and love it.

To work around the lack of this functionality, I make a Matrix with one column or row, do mxv with a semiring and a length 1 vector for the output, then extract the first element from the vector. This is pretty annoying. It's come up a few times for me.

I don't have a preference for the name. vxv, dot, inner, or really anything would work for me. Output types of GxB_Scalar or C scalar would work for me; GxB_Scalar is preferred/expected, because it makes everything so much easier. Is there really any question about what the signature or behavior would be?
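
Another workaround (a sketch of my own, not the mxv recipe described above): multiply the two vectors elementwise and reduce with the PLUS monoid; error checking omitted.

#include "GraphBLAS.h"

double dot (GrB_Vector u, GrB_Vector v, GrB_Index n)
{
    GrB_Vector t ;
    GrB_Vector_new (&t, GrB_FP64, n) ;
    // t = u .* v (on the intersection of the two patterns)
    GrB_Vector_eWiseMult_BinaryOp (t, NULL, NULL, GrB_TIMES_FP64, u, v, NULL) ;
    // d = sum (t)
    double d = 0 ;
    GrB_Vector_reduce_FP64 (&d, NULL, GrB_PLUS_MONOID_FP64, t, NULL) ;
    GrB_Vector_free (&t) ;
    return (d) ;
}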

Best Way to Construct GrB_Vector of all Trues but a Few Falses

I'm constructing a mask, and I have a list of GrB_Index that specifies which indexes are to be made false, but all the rest I want true (so I can probe them in the matrix).

I wanted to just use GrB_assign to set all the values to true, and then use GrB_Vector_build to feed in tuples of the indexes I want set to False... but obviously we aren't allowed to build on the GrB_Vector if it is nonempty.

So what's the most efficient way to accomplish this?

GrB_Info batch(GrB_Matrix A, GrB_Index src, GrB_Index *dests, GrB_Index destsCount)
{
    GrB_Index n;
    GrB_Matrix_nrows(&n, A);             

    GrB_Index *destsValues;
    destsValues = malloc(destsCount * sizeof(GrB_Index));
    memset(destsValues, 0, destsCount * sizeof(GrB_Index));

    GrB_Vector mask;                                
    GrB_Vector_new(&mask, GrB_BOOL, n);             
    GrB_assign(mask, NULL, GrB_NULL, true, GrB_ALL, n, GrB_NULL);
    GrB_Vector_build(mask, dests, destsValues, destsCount, GrB_LOR);

    // etc...
}

Apply with identity and mask doesn't preserve iso-valued-ness

For example:

  GrB_Matrix_new(&M_0, GrB_INT64, 4, 4);
  GrB_Matrix_new(&M_1, GrB_INT64, 4, 4);
  GrB_Matrix_assign_INT64(M_0, NULL, NULL, 1, GrB_ALL, 4, (GrB_Index[]){0, 1}, 2, NULL);
  GrB_Matrix_assign_INT64(M_1, NULL, NULL, 1, GrB_ALL, 4, (GrB_Index[]){1, 2}, 2, NULL);
  GrB_Matrix_new(&M_2, GrB_INT64, 4, 4);
  GrB_Matrix_apply(M_2, M_1, NULL, GrB_IDENTITY_INT64, M_0, GrB_DESC_S);

M_0 and M_1 are both iso-valued with the same datatype and value of 1. I expect M_2 to be iso-valued, but it is not.

For this example, I can use assign instead of apply, which would preserve iso-valued-ness.

Typecast mystery: value-dependent coercion weirdness with accumulate?

Using SuiteSparse:GraphBLAS 3.2.2.

I apologize for not having these examples in C (GraphBLAS is still building...). Here's an example using grblas Python API. I will explain some hidden details:

w = gb.Vector.from_values([0, 1], [0, -20], dtype='INT64')
u = gb.Vector.from_values([0, 1], [1, 2], dtype='FP64')
w(accum=gb.binary.plus) << u

which results in

                    0                    1
  4886396799603965952  4886405595696988140

As you can probably guess, u and w are length-two vectors. u is FP64 dtype with values 1 and 2. w is INT64 dtype with values 0 and -20.

Under the hood, this is calling something like:

GrB_apply(
    w,
    GrB_NULL,           // no mask
    GrB_PLUS_INT64,     // accum; the dtype is very important here!
    GrB_IDENTITY_FP64,  // unary op
    u,
    GrB_NULL            // no descriptors
)

Surprisingly, the results in w are 4886396799603965952 and 4886405595696988140.

If we use GrB_PLUS_FP64 for the accumulation instead, then we get the expected results 1 and -18.

Here's the kicker. Let's extend w and u while keeping the first two elements the same:

w = gb.Vector.from_values([0, 1, 2, 4, 5], [0, -20, 30, 40, 50], dtype='INT64')
u = gb.Vector.from_values([0, 1, 3, 4, 5], [1, 2, 3, -4, 0], dtype='FP64')
w(accum=gb.binary.plus) << u

which results in

  0    1   2  3   4   5
  1  -18  30  3  36  50

These values are correct! But, why don't the first two values match the first example?!?!

I'm quite confused and surprised, although it is very possible that I am doing something silly here.

GrB_MINUS_REMOVE

I've got a situation where I'm extracting vectors and then adding or subtracting them to another column or row using GxB_Col_subassign and GxB_Row_subassign with GrB_PLUS_UINT16 and GrB_MINUS_UINT16.

The problem is when an entry becomes 0 after subtracting and should be removed... currently the only way I see to do this would be iteratively checking every entry after each subtraction operation, which I'm very wary of given that extreme outliers might arise in my application on the order of many thousands (hopefully millions) of entries. But maybe such concerns are moot given that each value would have to be iteratively touched inside the library anyway, and they aren't operated on in some SIMD-esque way?

In either case, ideally there would be a GrB_MINUS_REMOVE_UINT16 operator that removed any value that hit zero... regardless of iteration or not, only requiring the data to be visited and operated on once. But I'm not sure how beyond the pale that is currently?

I considered the idea of a GrB_Col_extract that let me filter for 0 values and then iterate over them, but I assume inside the library all the entries would still have to be visited iteratively to test whether their values equal 0?

Maybe you have another idea of how to best accomplish this?

Worst case, I might decide to just leave them as zombie zeros, since they would not interfere with the operation of my database, just consume excess memory.

Matrix reduce to Vector with FIRST or SECOND

Per the spec, one can pass a BinaryOp to GrB_Matrix_reduce_BinaryOp to reduce the rows or columns of a Matrix to a Vector. SuiteSparse 4.0 changed so that only BinaryOps that correspond to known Monoids may be used. FIRST and SECOND do not have Monoids, therefore they cannot be used.

Nevertheless, I expected FIRST and SECOND to work in SuiteSparse. The intent is clear, and I suspect efficient implementations exist.

(Note: per the spec, the outcome of a non-commutative or non-associative binary operator is undefined. FIRST and SECOND are non-commutative--they commute to each other--so their behavior would also be undefined with the current spec. SuiteSparse often implies an ordering when applying BinaryOps and chooses to be well-defined).

Failed GxB_Vector_serialize/deserialize Boolean v5.2.2

Working with GraphBLAS with all default settings and attempting to build a JNI using a mixture of SWIG and hand-written code. One of the functions I am using fails completely within the JNI code written in C. The GxB function fails when passed a Boolean Vector. The following snippet leads to the failure.

Important notes: this vector is dense, and sparse vectors do not seem to suffer from the same issue (tested with a vector of length 1000 with 3 filled positions in multiple configurations). However, vectors with approximately >1% fill seem to fail.

  GrB_Vector vector = NULL;
  GrB_Vector_new(&vector, GrB_BOOL, 10);

  for (GrB_Index i = 0; i < 10; i++) {
    GrB_Vector_setElement_BOOL(vector, true, i); // Individually set all values to true.
  }

  GrB_Type type;
  GxB_Vector_type(&type, vector);

  GrB_Index size_handle = 0;
  void *blob = NULL;
  GxB_Vector_serialize(&blob, &size_handle, vector, NULL);

  GrB_Vector v = NULL;
  GxB_Vector_deserialize(&v, type, blob, size_handle, NULL);
  GrB_Index v_size;
  GrB_Vector_size(&v_size, v);

  for (GrB_Index i = 0; i < v_size; i++) {
    bool val;
    GrB_Vector_extractElement_BOOL(&val, v, i);
    printf("%s, \t", val?"true":"false");
  }
  printf("\n");

The Response I receive from the final print statement is

true, 	true, 	true, 	true, 	true, 	true, 	false, 	false, 	true, 	false, 

This behavior is not replicated for any other types. int8_t through int64_t are all serialized and de-serialized correctly. In addition, I have tested GrB_Matrix_serialize and GrB_Matrix_deserialize and neither of them suffer the same issues regardless of sparsity.

Serialize and deserialize vectors and matrices

Hi Tim! Is there a way to (de)serialize vectors and matrices to bytes or a stream of bytes? I want to send these bytes over the wire, not to disk.

I am creating a Python library for doing distributed and/or out-of-core GraphBLAS computation using the grblas API. A single distributed matrix is comprised of many SuiteSparse matrices (chunked any which way). To have this work well in a distributed setting, the ability to serialize objects will of course be very valuable. And just a heads up: this will also exercise running GraphBLAS with lots of user threads (fingers crossed!).
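
A minimal sketch of an in-memory round trip, mirroring the GxB_Vector_serialize/deserialize calls shown in the previous issue but for a matrix (my reconstruction; error checking omitted):

#include "GraphBLAS.h"

// Serialize A into a heap-allocated blob of bytes (which can be sent over
// a socket rather than written to disk), then rebuild it as C.
void round_trip (GrB_Matrix A, GrB_Matrix *C)
{
    GrB_Type type ;
    GxB_Matrix_type (&type, A) ;                // needed to deserialize

    void *blob = NULL ;
    GrB_Index blob_size = 0 ;
    GxB_Matrix_serialize (&blob, &blob_size, A, NULL) ;

    // ... transmit (blob, blob_size) over the wire ...

    GxB_Matrix_deserialize (C, type, blob, blob_size, NULL) ;
    // the caller is responsible for freeing the blob when done
}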

Failing test (gbtest76) on Apple Silicon

I made some progress on porting the library to Apple Silicon (#84, #86). This issue documents how to build Octave v7 on macOS and run the test. Finally, it contains the output of a single failing test.

Building Octave

  1. Grab the brew formula:

    wget https://raw.githubusercontent.com/Homebrew/homebrew-core/master/Formula/octave.rb
  2. Edit octave.rb

    1. Add "disable-docs" to args (or ensure that you have a working texinfo installation).
    2. Edit Mercurial (hg) repository: switch from the default branch (containing code for Octave v8.0) to stable (v7.0).
  3. Run

    brew install --head ./octave.rb

    This takes about 10 minutes.

Building the tests (gbmake)

  1. Grab the OpenMP binaries as described at https://mac.r-project.org/openmp/:

    curl -O https://mac.r-project.org/openmp/openmp-13.0.0-darwin21-Release.tar.gz
    sudo tar fvxz openmp-13.0.0-darwin21-Release.tar.gz -C /
  2. Run

    sed -i.bkp 's/-fopenmp/-Xclang -fopenmp/g' @GrB/private/gbmake.m
  3. Run octave and follow the instructions in GraphBLAS/README.md, including gbmake:

    cd @GrB/private
    gbmake

Running the tests

Run the tests:

cd test
gbtest

Currently, all tests pass except one.

Failing test: gbtest76

gbtest76 fails on macOS / Apple Silicon. (I tried running it in both Octave 7 and 8 on macOS. It passes on Linux with Octave 8. I did not try it on Linux with Octave 7.)

The output of gbtest76 is the following:

> gbtest76

gbtest76: testing trig and special functions
single
double
single complex
double complex
error: assert (err == 0) failed
error: called from
    assert at line 107 column 11
    gbtest76>gbtest76b at line 432 column 5
    gbtest76 at line 47 column 13

Upon closer inspection, the only test case that fails is the following:

case { 'double complex' }
    A = rand (4) + 1i* rand (4) ;
    B = rand (4) + 1i* rand (4) ;
    tol = 1e-10 ;
    G = GrB (A) ;
    H = GrB (B) ;
    gbtest76b (A, B, G, H, tol) ;

I printed the output before the assertion in line 432:

C1 = A.^1 ;
C2 = G.^1 ;
err = norm (C1-C2, 1) ;
display (err)
assert (err == 0) ;

It indeed has a non-zero result:

double complex
err = 2.7497e-16

GRB_VERSION / GrB_SUBVERSION

On page 45 of "GraphBLAS_API_C_v13.pdf", I see:

#define GRB_VERSION     1
#define GrB_SUBVERSION  3

I am surprised by the use of GRB (all caps), and also the inconsistency (GRB and GrB). Do you know whether these should both be GRB (as you define them in your header file), or GrB? I know this is more of a GraphBLAS question, but I wasn't sure where else to ask.

GrB_Vector_assign_BOOL results in Memory Error for Large Vectors

I am having an issue with boolean assign, and perhaps with my usage of iso-valued vectors. I create an iso-valued vector (GrB_Vector_assign_BOOL(v, NULL, NULL, true, GrB_ALL, size, NULL);). The presented output is then iso status = true, size_t = 264. This is expected behavior for the vector. However, once I assign the opposite of the default value (false) to one position (GrB_Vector_assign_BOOL(v, NULL, NULL, false, indices, 1, NULL);), the code returns an error (GrB_OUT_OF_MEMORY) and seems to dump the vector, with output iso status = false, size_t = 0.

    GrB_Vector v = NULL;
    GrB_Index n = GxB_INDEX_MAX;
    GrB_Vector_new(&v, GrB_BOOL, n);
    GrB_Index size = NULL;
    GrB_Vector_size(&size, v);
    GrB_Vector_assign_BOOL(v, NULL, NULL, true, GrB_ALL, size, NULL);

    {
        bool iso = NULL;
        size_t memsize = NULL;
        GxB_Vector_iso(&iso, v);
        GxB_Vector_memoryUsage(&memsize, v);
        printf("iso status = %s,\t size_t = %zu\n", iso?"true":"false", memsize);
    }

    GrB_Index indices[1] = {2};
    GrB_Vector_assign_BOOL(v, NULL, NULL, false, indices, 1, NULL);

    {
        bool iso = NULL;
        size_t memsize = NULL;
        GxB_Vector_iso(&iso, v);
        GxB_Vector_memoryUsage(&memsize, v);
        printf("iso status = %s,\t size_t = %zu\n", iso?"true":"false", memsize);
    }

This issue may be caused by a misunderstanding of the nature of the iso-valued vector. Does the iso value swap the underlying "default" value, or does it only claim that the entire vector is now equal to one value? Under the former interpretation, is the error I get expected behavior? Under the latter interpretation, is there some way to manipulate the vector so that it stores the significantly smaller number of false positions rather than the extremely dense true positions? Thank you for your help!

PLUS_POW semirings

I am creating aggregators in grblas (because they are oh-so-nice to use!) using straightforward recipes in anticipation of SuiteSparse 5 with O(1) dense vectors. I currently have 10 aggregators that use PLUS_POW semirings.

Would you consider adding PLUS_POW semirings? Am I correct in thinking that they can be faster if you include them? Fwiw, it's not difficult for me to create the semirings from plus monoid and pow binaryop.

Please don't deprecate GxB_EQ_BOOL_MONOID

It appears that you list this as being deprecated in your header file. I prefer the name GxB_EQ_BOOL_MONOID over GrB_LXNOR_MONOID_BOOL.

I see that you chose to have EQ_{EQ,NE,GT,LT,GE,LE,...} semirings instead of LXNOR_{...}. If you have semirings that use EQ for a monoid, then it's reasonable to expect to have EQ monoids as well.

Similarly, it appears that the GxB_EQ_LOR_BOOL semiring is deprecated in favor of GrB_LXNOR_LOR_SEMIRING_BOOL. It would be nice to keep the original name around.

If you choose not to deprecate these, please highlight this fact in the header file (perhaps you already have a blanket statement like "EQ is always preferred over LXNOR and won't be deprecated", but I may have missed this).

Apple-Darwin: Undefined symbols for architecture arm64

This may be just a missing flag on my part. Cross compiling from Alpine Linux to aarch64-apple-darwin I get this command/error pair:

/opt/bin/aarch64-apple-darwin20-libgfortran5-cxx11/aarch64-apple-darwin20-clang --sysroot=/opt/aarch64-apple-darwin20/aarch64-apple-darwin20/sys-root  -Wno-pointer-sign  -O3 -DNDEBUG   -O3 -DNDEBUG -dynamiclib -Wl,-headerpad_max_install_names -compatibility_version 5.0.0 -current_version 5.0.2 -o libgraphblasdemo.5.0.2.dylib -install_name @rpath/libgraphblasdemo.5.dylib CMakeFiles/graphblasdemo.dir/Demo/Source/bfs5m.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/bfs5m_check.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/bfs6.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/bfs6_check.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/dpagerank.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/dpagerank2.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/drowscale.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/get_matrix.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/import_test.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/ipagerank.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/irowscale.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/isequal.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/mis.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/mis_check.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/mis_score.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/prand.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/random_matrix.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/read_matrix.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/simple_rand.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/simple_timer.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/tricount.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/usercomplex.c.o CMakeFiles/graphblasdemo.dir/Demo/Source/wathen.c.o  -Wl,-rpath,/workspace/srcdir/GraphBLAS/build -lm libgraphblas.5.0.2.dylib -lm 
Undefined symbols for architecture arm64:
 "___divdc3", referenced from:
    _complex_div in usercomplex.c.o
    _complex_rdiv in usercomplex.c.o
    _complex_minv in usercomplex.c.o
 ld: symbol(s) not found for architecture arm64
 clang-11: error: linker command failed with exit code 1 (use -v to see invocation)
 make[3]: *** [CMakeFiles/graphblasdemo.dir/build.make:435: libgraphblasdemo.5.0.2.dylib] Error 1

Is there a way to avoid building the demo libraries altogether, perhaps in the next patch? I didn't see a flag for it in CMakeLists.txt. The other question is whether that symbol being undefined silently affects the complex types in the library?

Descriptor for GxB_Matrix_sort

In section 6.13.2 of the User Guide for GxB_Matrix_sort, you mention to use the descriptor GrB_DESC_IN0. A descriptor of this name does not exist in GraphBLAS.h. Do you mean GrB_DESC_T0?

Advocating for Aggregator objects to perform reductions

I think there is a strong case to be made to add Aggregator objects to GraphBLAS to perform reductions. In software, it's powerful to give names to common operations, or, as is the case for aggregations, to reuse names users are already familiar with. This hides implementation details behind a cleaner API that better captures what the user wants to do instead of how to do it. I expect this would result in cleaner, more portable, and more performant code.

Such a change to a GraphBLAS library (or the spec) should, of course, be done with care. Generic user-defined aggregations are complex, and systems that implement them are often too simple and lack functionality, performance, or both. Expressive, performant user-defined aggregations are necessarily complicated to support (imho), and we shouldn't lose sight of this. Having Aggregator objects allows a library to incrementally support more and more advanced aggregations (if desired), the details of which are hidden behind a nice API and opaque types. But one shouldn't start advanced. Let's begin simple.

I think SuiteSparse:GraphBLAS can actually get really far by reusing its existing machinery. Most aggregations people care about do use monoids during a reduction step, which can be done with your O(1) dense vectors and suitable semirings. Many aggregations should be "simple" recipes for SuiteSparse (I quote "simple" so as to not make light of it--SuiteSparse is anything but simple!).

To help myself think through this, I have assembled a spreadsheet that describes how some common aggregations (and a few uncommon ones) may be performed in GraphBLAS. It's incomplete, but I think it's fairly illustrative:

https://docs.google.com/spreadsheets/d/1dQNrxLBVq8Y3JOdvNnIXRPZsPTDaf7lydztiG7Gbx8E/edit?usp=sharing

I split the aggregations into six groups of increasing complexity or likely difficultness to implement. The 0th group (not included) are the aggregations that are already well-handled by reductions with monoids, such as SUM, PROD, ANY, ALL, MIN, MAX, etc. These could, of course, be rolled into Aggregators.

Group 1

  • COUNT
  • COUNT_NONZERO
  • COUNT_ZERO
  • SUMOFSQUARES
// Group 1 (blue)
struct Aggregator {
    // May define init_value, init_unaryop, or neither
    // If none are defined, then an O(1) dense vector with irrelevant data can be used in a matrix-vector multiply.
    TYPE *init_value,  // Optional; e.g., the values of an O(1) dense vector.
    UnaryOp *init_unaryop,  // Optional; e.g., is applied to the input matrix at the beginning.
    BinaryOp *update_binaryop,  // Optional; e.g., the BinaryOp in a matrix-vector multiply.
                                // If not provided, then `first` may be assumed such as `plus_first(A @ dense)`
    Monoid merge_monoid  // Required; e.g., the Monoid in a matrix-vector multiply.
                         // If only this is provided, then it is equivalent to existing reduce with Monoid.
};

Group 2

  • HYPOT
  • LOGADDEXP
  • LOGADDEXP2
// Group 2 (green)
// Add finalize.
struct Aggregator {
    TYPE *init_value,
    UnaryOp *init_unaryop,
    BinaryOp *update_binaryop,
    Monoid merge_monoid,
    UnaryOp *finalize  // Optional; is applied to the output at the very end.
};

Group 3

  • MEAN
  • VARP
  • VARS
  • STDEVP
  • STDEVS
  • GEOMETRIC_MEAN
  • HARMONIC_MEAN
  • PTP
  • SKEW (not included)
  • KURTOSIS (not included)
// Group 3 (yellow)
// Same fields as group 2, but requires tuple types or composite aggregations.

Group 4

  • ARGMIN
  • ARGMAX
// Group 4 (orange)
// Add init_indexunaryop.
struct Aggregator {
    // May define init_value, init_unaryop, init_indexunaryop and init_value, or none of these.
    TYPE *init_value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,  // Optional; for indices; uses `init_value` as thunk value.
    BinaryOp *update_binaryop,
    Monoid merge_monoid,
    UnaryOp *finalize
};

Group 5

  • FIRST
  • LAST
  • FIRSTI
  • LASTI
  • FIRST_NONZERO
  • IS_MONOTONIC_DECREASING
// Group 5 (red)
// Add merge_binaryop and terminate_unaryop, and don't use update_binaryop.
// Exactly one of merge_binaryop or merge_monoid must be defined.
struct Aggregator {
    TYPE *init_value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,
    BinaryOp *update_binaryop,
    BinaryOp *merge_binaryop,  // elements must be processed in order,
                               // but can still be decomposed into groups and done in parallel
    Monoid *merge_monoid,  // this is now optional
    UnaryOp *terminate_unaryop,  // call on each aggregation value to check whether we can terminate
    UnaryOp *finalize
};

Group 6

  • LASTI (alternative)
  • IS_MONOTONIC_DECREASING (alternative)
// Group 6 (brown)
// Add update_indexunaryop, and allow use of update_binaryop.
struct Aggregator {
    // init options
    TYPE *init_value,
    UnaryOp *init_unaryop,
    IndexUnaryOp *init_indexunaryop,
    // update options
    BinaryOp *update_binaryop,
    IndexUnaryOp* update_indexunaryop,  // use previous agg value as thunk value
    // merge options
    BinaryOp *merge_binaryop,
    Monoid *merge_monoid,
    // final options
    UnaryOp *terminate_unaryop,
    UnaryOp *finalize
};

In group 5, we used init_unaryop or init_indexunaryop with merge_binaryop, but didn't define any update options. In this case, the init operation is applied to every element, and the elements are merged together (while maintaining order) with merge_binaryop.

In group 6, we use an init operation, an update operation, and merge_binaryop. This means if we split the elements being aggregated into sublists (for parallel processing), then the init operation is only called on the first element in each sublist, and the update operation is used for all the rest.

More behaviors and fields could be added. For example, if an update operation is specified without specifying a merge operation, then this could mean the elements must be processed in order, serially (i.e., no within-group parallelism!). For another example, terminate_value could be added to serve the same purpose as terminate_unaryop.

Known shortcomings:

  • Parametrized aggregations require user-defined functions with the parameters "baked in".
    • Examples:
      • FREQUENCY(val)
      • MOMENT(n)
      • NTH(n)
      • POWERMEAN(p)
      • STDEV(ddof)
      • VAR(ddof)
      • ZSCORE(val)
      • PERCENT_GT(val), etc
      • MEAN_ABSOLUTE_DIFFERENCE(x) (abs(x1 - x) + abs(x2 - x) + ...)
  • It's unclear how an aggregation on a non-empty set can return NO_VALUE.
    • For example, ONLY could return the first element if and only if there is one element.
    • Also, should sample variance (VARS) return inf, nan, or NO_VALUE if a single value in group? (I say nan)
  • Aggregations such as MIN_BY are challenging.
    • BY could be a UnaryOp or a Vector or Matrix the same shape as the input.
    • Examples:
      • Return the complex number with the smallest imaginary component.
      • ARGMAX (probably in some way)
  • Aggregations such as weighted average aren't directly supported.
  • Aggregations or scans with non-scalar results (such as histograms or cumsum) aren't directly supported.
  • The following reductions aren't supported (yet):
    • MEDIAN
    • MODE
    • PERCENTILES
    • MAD or MEAN_ABSOLUTE_DEVIATION (around mean, median, mode, etc)
    • NUNIQUE or COUNT_UNIQUE
  • ARGMIN and ARGMAX don't naturally generalize to Matrix to Scalar reduction.
    • I think the natural generalization would be to return both I and J indices of the min or max value.
  • User-defined aggregators could become complicated and error-prone.
  • It's not obvious to me where to draw the line for how to create and support user-defined aggregators.
    • Since groups 5 and 6 don't use monoids, these are most likely in the "not yet" category for SuiteSparse.

I'll be honest, the thing I like most about this proposal is COUNT. I really want a canonical, portable way to calculate row degrees and column degrees 😃.

Please let me know if anything needs more explaining (or corrected). I intend this issue to be as much for the GraphBLAS committee and community as for SuiteSparse.

Cheers! 🍻

How to interpret larger than expected arrays in import/export

In version 4, the documentation says the size of arrays may be larger than expected. For example, in export CSR, it says Aj_size >= nvals(A). Similarly, for FullR, Ax_size >= nrows*ncols. If we're given an array larger than is expected, do we simply use the first n elements where n is the size we expect? The documentation for this is unclear to me. I understand the rationale of transferring control in O(1) time and memory if possible, which naturally exposes some underlying implementation details.

Similarly, are there any advantages to passing a larger-than-necessary array to import?

Reshape matrix

I would find it useful to be able to reshape a matrix while not modifying the values. The new shape must have the same number of elements as the old shape--i.e., nrows1 * ncols1 == nrows2 * ncols2. It would be sufficient for me to assume row-oriented layout.

For example (with dense data):

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

>>> a.reshape(5, 3)
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

The purpose of this is to rearrange what is effectively multidimensional data. For example, if we have axes coordinates [C, D] x [X, Y] x [A, B], reshaping allows us to go from:

    XA XB YA YB
C    1  2  3  4
D    5  6  7  8

to

    A  B
CX  1  2
CY  3  4
DX  5  6
DY  7  8

There are at least two ways we could generalize this to do more than reshape, which we can discuss if you're open to this functionality.

A nice thing about reshape is that for row-oriented storage that has sorted indices, the values don't need to be touched at all, and the indices can be processed and remain sorted. For bitmap and full formats, only the shape metadata needs to change.
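
To make the index bookkeeping concrete, here is the remapping a row-oriented reshape implies (a sketch of the arithmetic only; the variable names are hypothetical):

// An entry at (i, j) in an nrows1-by-ncols1 matrix keeps its linear position
// p and is simply re-factored against the new column count ncols2.
GrB_Index p    = i * ncols1 + j ;   // linear index in the old shape
GrB_Index inew = p / ncols2 ;       // row index in the new shape
GrB_Index jnew = p % ncols2 ;       // column index in the new shape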

Static link GraphBLAS 6.1.3

When trying to integrate GraphBLAS 6.1.3 into RedisGraph, we use the graphblas_static target, and it looks like cpu_features is not statically linked.
Trying to resolve it by adding the code below to CMakeLists.txt didn't help:

if ( BUILD_GRB_STATIC_LIBRARY )
    target_link_libraries ( graphblas_static PUBLIC cpu_features )
endif ( )

The linker error was resolved, but at runtime we get Symbol not found: _GetX86Info.

GxB_PRINT_1BASED doesn't affect printing of pending tuples

5x5 GraphBLAS double matrix, sparse by col
  1 entry, memory: 6.5 KB
  pending tuples: 9 max pending: 256 zombies: 0

    (1,1)    0.570652
  pending tuples:
    GraphBLAS type:  double size: 8
    row: 1 col: 3    0.0304623
    row: 1 col: 4    0.987438
    row: 2 col: 3    0.10812
    row: 3 col: 0    0.124603
    row: 3 col: 3    0.411447
    row: 4 col: 0    0.953935
    row: 4 col: 1    0.581752
    row: 4 col: 2    0.22532
    row: 4 col: 3    0.634672
  pending operator: implicit 2nd

I can try to find this later this week and make a PR to fix it, unless this is intentional. It looks correct once the pending tuples are "materialized". (In the output above, the materialized entry prints 1-based as (1,1), while the pending tuples still print 0-based row/col indices.)

Masking groups of rows/cols in a matrix

Hello,
I am quite new to GraphBLAS, so I may have missed something, but I could not find an easy way to remove/mask groups of rows and/or columns of a matrix. Formally, I have a sparse vector, say U (containing integers, but actually all equal to 1), and I want to remove the rows or columns of a matrix A where U has no entry. In MATLAB, something like A.*U or U.*A would work well. In GraphBLAS, I am currently:
  • Creating a matrix Du = diag(U)
  • Computing A*Du or Du*A
Is there another, more efficient way to do this? (A sketch of my current diagonal approach, in C, follows below.)
Thanks
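For reference, here is the approach described above written out in C; only a sketch with placeholder names (A, U, C, n), assuming a version that provides GrB_Matrix_diag, and using the GxB_ANY_SECOND semiring so no real arithmetic is performed:

// Keep only the rows of A selected by the structure of U: C = diag(U) * A.
// ANY_SECOND passes A's values through unchanged; to select columns instead,
// compute A * diag(U) with ANY_FIRST.
GrB_Matrix Du = NULL, C = NULL ;
GrB_Matrix_diag (&Du, U, 0) ;                               // Du = diag(U)
GrB_Matrix_new (&C, GrB_FP64, n, n) ;
GrB_mxm (C, NULL, NULL, GxB_ANY_SECOND_FP64, Du, A, NULL) ;
GrB_Matrix_free (&Du) ;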

Bug with `GrB_{Matrix,Vector}_assign`? 3.3.0 and 3.2.2

Best just to show. It's possible I'm not doing things quite right:

#include <GraphBLAS.h>
#include <stdio.h>
#include <stdlib.h>

// OK: minimal error-check wrapper, assumed here (not shown in the original report)
#define OK(method)                                          \
{                                                           \
    info = (method) ;                                       \
    if (info != GrB_SUCCESS)                                \
    {                                                       \
        printf ("failed: info = %d\n", (int) info) ;        \
        exit (1) ;                                          \
    }                                                       \
}

int main (void)
{
    GrB_Vector w, u ;
    GrB_Info info ;
    OK (GrB_init (GrB_NONBLOCKING)) ;
    OK (GrB_Vector_new (&w, GrB_INT64, 1)) ;
    OK (GrB_Vector_setElement_INT64 (w, 1, 0)) ;
    OK (GxB_print (w, 3)) ;

    OK (GrB_Vector_new (&u, GrB_FP64, 1)) ;
    OK (GrB_Vector_assign (u, NULL, NULL, w, GrB_ALL, 1, NULL)) ;
    OK (GxB_print (u, 3)) ;

    OK (GrB_Vector_setElement_FP64 (u, 98.5, 0)) ;
    OK (GxB_print (u, 3)) ;

    OK (GrB_finalize ( )) ;
}

w looks good so far:

  1x1 GraphBLAS int64_t vector, sparse by col:
  w, 1 entry

    (0,0)   1

u after assign. Wait, why is u an int64_t now?!

  1x1 GraphBLAS int64_t vector, sparse by col:
  u, 1 entry

    (0,0)   1

u after set element. Indeed, I can't assign a float to u:

  1x1 GraphBLAS int64_t vector, sparse by col:
  u, 1 entry

    (0,0)   98

Now for another example:

// (same includes and OK macro as in the first example)
int main (void)
{
    GrB_Vector w, u ;
    GrB_Info info ;
    OK (GrB_init (GrB_NONBLOCKING)) ;
    OK (GrB_Vector_new (&w, GrB_INT64, 1)) ;
    OK (GrB_Vector_setElement_INT64 (w, 1, 0)) ;
    OK (GxB_print (w, 3)) ;

    OK (GrB_Vector_new (&u, GrB_FP64, 1)) ;
    OK (GrB_Vector_setElement_FP64(u, 42.5, 0)) ;   // NEW
    OK (GxB_print (u, 3)) ;                         // NEW

    OK (GrB_Vector_assign (u, NULL, NULL, w, GrB_ALL, 1, NULL)) ;
    OK (GxB_print (u, 3)) ;

    OK (GrB_Vector_setElement_FP64(u, 98.5, 0)) ;
    OK (GxB_print (u, 3)) ;

    OK (GrB_finalize ( )) ;
}

This outputs:

  1x1 GraphBLAS int64_t vector, sparse by col:
  w, 1 entry

    (0,0)   1


  1x1 GraphBLAS double vector, sparse by col:
  u, 1 entry

    (0,0)    42.5


  1x1 GraphBLAS double vector, sparse by col:
  u, 1 entry

    (0,0)    4.94066e-324


  1x1 GraphBLAS double vector, sparse by col:
  u, 1 entry

    (0,0)    98.5

So, u stays a double, but now the assignment into it from w gives bizarre results.

I tested this on my personal build on OS X with versions 3.2.2 and 3.3.0.

xref: python-graphblas/python-graphblas#25

Feature idea: add `LXNOR` (the opposite of `LXOR`)

This is more of an idea/suggestion than a feature request. What do you think of adding an LXNOR binary operator and monoid as a GxB extension?

Here's a use case: triangle count given the incidence matrix E and either the adjacency matrix A or the lower triangle of the adjacency matrix L.

A value- and dtype-agnostic way to write triangle count is (in grblas syntax):

# with the full adjacency matrix A (each triangle is counted three times):
val1 = A.mxm(E, semiring.lxnor_pair[dtypes.BOOL]).new()
num_triangles = val1.reduce_scalar(monoid.plus[dtypes.UINT64]).value // 3
# with the lower triangle L (each triangle is counted once):
val1 = L.mxm(E, semiring.lxnor_pair[dtypes.BOOL]).new()
num_triangles = val1.reduce_scalar(monoid.plus[dtypes.UINT64]).value

In my limited benchmarking, it appears to be better to reduce val1 to a scalar than to apply a mask based on the values of val1 and then count the number of nonzeros (and I'm eager to try apply with a binary operator, as per the 1.3 spec). Using LXOR_PAIR followed by LNOT isn't a terrible option either.

Of course, if one has L or A in these examples, then there are other ways to compute the total number of triangles.
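Until something like this exists as a built-in, a user-defined version is possible; here is a hedged sketch in C (the names are mine, and a built-in would presumably be faster, since user-defined operators can't use the factory kernels):

#include <GraphBLAS.h>
#include <stdbool.h>

// z = LXNOR(x,y) = !(x XOR y), i.e. true exactly when x and y are equal
void lxnor_func (void *z, const void *x, const void *y)
{
    *((bool *) z) = (*((const bool *) x) == *((const bool *) y)) ;
}

// Build an LXNOR monoid (identity: true) and an LXNOR_PAIR semiring.
// The op and monoid must stay alive as long as the semiring is in use.
GrB_Info make_lxnor_pair (GrB_Semiring *LXNOR_PAIR)
{
    GrB_BinaryOp LXNOR ;
    GrB_Monoid LXNOR_MONOID ;
    GrB_BinaryOp_new (&LXNOR, lxnor_func, GrB_BOOL, GrB_BOOL, GrB_BOOL) ;
    GrB_Monoid_new_BOOL (&LXNOR_MONOID, LXNOR, true) ;
    return (GrB_Semiring_new (LXNOR_PAIR, LXNOR_MONOID, GxB_PAIR_BOOL)) ;
}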

How to avoid realizing huge results from matrix multiply?

Any advice for how to avoid creating huge results from matrix multiply? Preferably, this won't entail trying (and failing) to allocate memory.

I have three use-cases in mind:

  1. I want to be "nice" to users and preferably give them a useful error message
  2. I want to be "nice" to other processes running on the machine by not using too much memory (so I want to avoid creating objects that are too large, even if they would fit in memory)
  3. When executing a task in a distributed setting, I want to discover when a task is too large for one machine so I can distribute it over multiple machines.

Perhaps the easiest approach would be to add a setting that says "don't allow this object (or any object, if it's a global setting) to hold more than MAX_NNZ elements". Perhaps a maximum size in bytes would be better than NNZ. Then, if you detect during a computation that the limit would be violated, you could return GrB_INSUFFICIENT_SPACE, GrB_OUT_OF_MEMORY, or an error code of your own creation. This approach would be "good enough" for me.

Is this a reasonable request? There are, of course, other, fancier ways of computing or estimating the nnz of a sparse matrix product (symbolic multiplication, sketches, sampling, etc.), but I'll leave it up to you whether you want them in SuiteSparse any time soon.
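For what it's worth, a cheap upper bound on nnz(A*B), namely the number of multiplications sum_k nnz(A(:,k)) * nnz(B(k,:)), can already be computed with a couple of PLUS_PAIR matrix-vector products before attempting the mxm; a hedged sketch with placeholder names:

// If bound exceeds a chosen limit, skip or repartition the multiply.
GrB_Index m, k, n ;
GrB_Matrix_nrows (&m, A) ; GrB_Matrix_ncols (&k, A) ; GrB_Matrix_ncols (&n, B) ;
GrB_Vector xm, xn, colA, rowB, prod ;
GrB_Vector_new (&xm, GrB_INT64, m) ;
GrB_Vector_assign_INT64 (xm, NULL, NULL, 1, GrB_ALL, m, NULL) ;        // dense 1s
GrB_Vector_new (&xn, GrB_INT64, n) ;
GrB_Vector_assign_INT64 (xn, NULL, NULL, 1, GrB_ALL, n, NULL) ;
GrB_Vector_new (&colA, GrB_INT64, k) ;
GrB_mxv (colA, NULL, NULL, GxB_PLUS_PAIR_INT64, A, xm, GrB_DESC_T0) ;  // colA(j) = nnz(A(:,j))
GrB_Vector_new (&rowB, GrB_INT64, k) ;
GrB_mxv (rowB, NULL, NULL, GxB_PLUS_PAIR_INT64, B, xn, NULL) ;         // rowB(i) = nnz(B(i,:))
GrB_Vector_new (&prod, GrB_INT64, k) ;
GrB_Vector_eWiseMult_BinaryOp (prod, NULL, NULL, GrB_TIMES_INT64, colA, rowB, NULL) ;
int64_t bound = 0 ;
GrB_Vector_reduce_INT64 (&bound, NULL, GrB_PLUS_MONOID_INT64, prod, NULL) ;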

`GrB_Matrix_apply_BinaryOp{1st,2nd}` use the wrong transpose descriptors.

According to the spec (and what one would expect from the argument list to the functions), GrB_Matrix_apply_BinaryOp1st uses GrB_INP1 to indicate the input matrix is transposed, and GrB_Matrix_apply_BinaryOp2nd uses GrB_INP0. SuiteSparse has these reversed both in code and in documentation.
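A minimal sketch of the behavior the spec describes, as I read it (C, A, and the scalar 2.0 are placeholders):

// In apply_BinaryOp1st the scalar binds to the op's first input, so the
// matrix is operand 1 and GrB_INP1 (GrB_DESC_T1) transposes it:
GrB_Matrix_apply_BinaryOp1st_FP64 (C, NULL, NULL, GrB_PLUS_FP64, 2.0, A, GrB_DESC_T1) ;  // C = 2.0 + A'
// In apply_BinaryOp2nd the matrix is operand 0, so GrB_INP0 (GrB_DESC_T0) applies:
GrB_Matrix_apply_BinaryOp2nd_FP64 (C, NULL, NULL, GrB_PLUS_FP64, A, 2.0, GrB_DESC_T0) ;  // C = A' + 2.0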

This looks easy enough to fix. Care if I take a stab at it?

findmax and findmin built-ins

I'd like to request the addition of two (or more) binary operators/monoids, findmax and findmin, which act like max and min but return the position of the max or min value. While argmin/argmax exist in the MATLAB interface, this is slightly different.

My use case is something like FINDMAXJ_PLUS. The typical MAX_PLUS semiring computes C_ij = max_k(A_ik + B_kj). FINDMAXJ_PLUS would instead set C_ij to the index k that attains that maximum (hopefully that's clear).

This should be done for MIN, and for the I index (FINDMAXI), as well.

As far as I can tell, this cannot be constructed as a user-defined operator at the moment. Do the internal positional binary ops have access to the values as well? Or is there a different set of algorithms for positional and value-based operators that makes this difficult?

Transpose mask

Is it possible to transpose the mask in a GraphBLAS operation?
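The only way I can see right now is to transpose the mask explicitly first; a sketch of what I mean, with placeholder names:

GrB_Matrix MT ;
GrB_Matrix_new (&MT, GrB_BOOL, nrows, ncols) ;   // dimensions of the output C
GrB_transpose (MT, NULL, NULL, M, NULL) ;        // MT = M'
GrB_mxm (C, MT, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, A, B, GrB_DESC_S) ;  // structural mask
GrB_Matrix_free (&MT) ;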

GrB_MINUS_INT8 interacts interestingly with implicit zero elements

Hi,

I wish to subtract the negative transpose of a matrix from itself as part of a max-flow algorithm.

I created a descriptor to transpose the second input and used element-wise addition (eWiseAdd) with the GrB_MINUS_INT8 binary operator. As you can see below, instead of subtracting, the operation performs an addition, and indeed the result is the same if I replace GrB_MINUS_INT8 with GrB_PLUS_INT8. Is this the intended behaviour when GrB_MINUS_INT8 interacts with the implicit zeros of a matrix?

I can achieve the intended result if I perform three successive subtractions, but that feels a bit excessive.

[Two screenshots (2021-04-30, 14:28) illustrating the behaviour described above.]
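If the goal is C = A - A', one workaround sketch (mine, with placeholder names) is to negate the transposed operand first with GrB_AINV and then use PLUS, so entries present in only one operand already carry the correct sign:

GrB_Matrix NT ;
GrB_Matrix_new (&NT, GrB_INT8, n, n) ;
GrB_Matrix_apply (NT, NULL, NULL, GrB_AINV_INT8, A, GrB_DESC_T0) ;          // NT = -A'
GrB_Matrix_eWiseAdd_BinaryOp (C, NULL, NULL, GrB_PLUS_INT8, A, NT, NULL) ;  // C = A + (-A')
GrB_Matrix_free (&NT) ;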

How to use GraphBLAS on CUDA?

Hi,

I have installed GraphBLAS on a CUDA-enabled machine, and I can run the demos in Demo/ successfully on the CPU. But how do I use the library on CUDA? Does it have a separate API, distinct from the CPU one?

Best,
Kangfei

GCC 11.1 fails with -Werror=misleading-indentation

Building with GCC 11.1 and -Werror=misleading-indentation stops at:

/pub/devel/suitesparse/suitesparse-5.9.0-1.i686/build/GraphBLAS/Source/Generated/GB_AxB__any_first_int32.c:278:5:
error: this ‘if’ clause does not guard... [-Werror=misleading-indentation]
278 | if (exists && !cb) cx = (ax) ; cb |= exists

Is this a misunderstanding by the compiler, or a real code issue?
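For context, my reading of the expanded macro is that the second statement is intended to run unconditionally (the bitmap flag is updated whether or not the value is copied), so the code is probably correct; adding braces would make that explicit and silence the warning:

// equivalent code with the guarded part made explicit
if (exists && !cb) { cx = (ax) ; }
cb |= exists ;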
