BLAS-like Library Instantiation Software Framework

Recipient of the 2023 James H. Wilkinson Prize for Numerical Software

Recipient of the 2020 SIAM Activity Group on Supercomputing Best Paper Prize

Introduction

BLIS is an award-winning portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

For a thorough presentation of our framework, please read our ACM Transactions on Mathematical Software (TOMS) journal article, "BLIS: A Framework for Rapidly Instantiating BLAS Functionality". For those who just want an executive summary, please see the Key Features section below.

In a follow-up article (also in ACM TOMS), "The BLIS Framework: Experiments in Portability", we investigate using BLIS to instantiate level-3 BLAS implementations on a variety of general-purpose, low-power, and multicore architectures.

An IPDPS'14 conference paper titled "Anatomy of High-Performance Many-Threaded Matrix Multiplication" systematically explores the opportunities for parallelism within the five loops that BLIS exposes in its matrix multiplication algorithm.

For other papers related to BLIS, please see the Citations section below.

It is our belief that BLIS offers substantial benefits in productivity when compared to conventional approaches to developing BLAS libraries, as well as a much-needed refinement of the BLAS interface, and thus constitutes a major advance in dense linear algebra computation. While BLIS remains a work-in-progress, we are excited to continue its development and further cultivate its use within the community.

The BLIS framework is primarily developed and maintained by individuals in the Science of High-Performance Computing (SHPC) group in the Oden Institute for Computational Engineering and Sciences at The University of Texas at Austin and in the Matthews Research Group at Southern Methodist University. Please visit the SHPC website for more information about our research group, such as a list of people and collaborators, funding sources, publications, and other educational projects (such as MOOCs).

Education and Learning

Want to understand what's under the hood? Many of the same concepts and principles employed when developing BLIS are introduced and taught in a basic pedagogical setting as part of LAFF-On Programming for High Performance (LAFF-On-PfHP), one of several massive open online courses (MOOCs) in the Linear Algebra: Foundations to Frontiers series, all of which are available for free via the edX platform.

What's New

  • BLIS selected for the 2023 James H. Wilkinson Prize for Numerical Software! We are thrilled to announce that Field Van Zee and Devin Matthews were chosen to receive the 2023 James H. Wilkinson Prize for Numerical Software. The selection committee sought to recognize the recipients "for the development of BLIS, a portable open-source software framework that facilitates rapid instantiation of high-performance BLAS and BLAS-like operations targeting modern CPUs." This prize is awarded once every four years to the authors of an outstanding piece of numerical software, or to individuals who have made an outstanding contribution to an existing piece of numerical software. It is awarded to an entry that best addresses all phases of the preparation of high-quality numerical software, and is intended to recognize innovative software in scientific computing and to encourage researchers in the earlier stages of their career. The prize will be awarded at the 2023 SIAM Conference on Computational Science and Engineering in Amsterdam.

  • Join us on Discord! In 2021, we soft-launched our Discord server by privately inviting current and former collaborators, attendees of our BLIS Retreat, as well as other participants within the BLIS ecosystem. We've been thrilled by the results thus far, and are happy to announce that our new community is now open to the broader public! If you'd like to hang out with other BLIS users and developers, ask a question, discuss future features, or just say hello, please feel free to join us! We've put together a step-by-step guide for creating an account and joining our cozy enclave. We even have a monthly "BLIS happy hour" event where people can casually come together for a video chat, Q&A, brainstorm session, or whatever it happens to unfold into!

  • Addons feature now available! Have you ever wanted to quickly extend BLIS's operation support or define new custom BLIS APIs for your application, but were unsure of how to add your source code to BLIS? Do you want to isolate your custom code so that it only gets enabled when the user requests it? Do you like sandboxes, but wish you didn't have to provide an implementation of gemm? If so, you should check out our new addons feature. Addons act like optional extensions that can be created, enabled, and combined to suit your application's needs, all without formally integrating your code into the core BLIS framework.

  • Multithreaded small/skinny matrix support for sgemm now available! Thanks to funding and hardware support from Oracle, we have now accelerated gemm for single-precision real matrix problems where one or two dimensions are exceedingly small. This work is similar to the gemm optimization announced last year. For now, we have only gathered performance results on an AMD Epyc Zen2 system, but we hope to publish additional graphs for other architectures in the future. You may find these Zen2 graphs via the PerformanceSmall document.

  • BLIS awarded SIAM Activity Group on Supercomputing Best Paper Prize for 2020! We are thrilled to announce that the paper that we internally refer to as the second BLIS paper,

    "The BLIS Framework: Experiments in Portability." Field G. Van Zee, Tyler Smith, Bryan Marker, Tze Meng Low, Robert A. van de Geijn, Francisco Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John A. Gunnels, Lee Killough. ACM Transactions on Mathematical Software (TOMS), 42(2):12:1--12:19, 2016.

    was selected for the SIAM Activity Group on Supercomputing Best Paper Prize for 2020. The prize is awarded once every two years to a paper judged to be the most outstanding paper in the field of parallel scientific and engineering computing, and has only been awarded once before (in 2016) since its inception in 2015 (the committee did not award the prize in 2018). The prize was awarded at the 2020 SIAM Conference on Parallel Processing for Scientific Computing in Seattle. Robert was present at the conference to give a talk on BLIS and accept the prize alongside other coauthors. The selection committee sought to recognize the paper, "which validates BLIS, a framework relying on the notion of microkernels that enables both productivity and high performance." Their statement continues, "The framework will continue having an important influence on the design and the instantiation of dense linear algebra libraries."

  • Multithreaded small/skinny matrix support for dgemm now available! Thanks to contributions made possible by our partnership with AMD, we have dramatically accelerated gemm for double-precision real matrix problems where one or two dimensions are exceedingly small. A natural byproduct of this optimization is that the traditional case of small m = n = k (i.e. square matrices) is also accelerated, even though it was not targeted specifically. And though only dgemm was optimized for now, support for other datatypes and/or other operations may be implemented in the future. We've also added new graphs to the PerformanceSmall document to showcase multithreaded performance when one or more matrix dimensions are small.

  • Performance comparisons now available! We recently measured the performance of various level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes. The results speak for themselves! Check out our extensive performance graphs and background info in our new Performance document.

  • BLIS is now in Debian Unstable! Thanks to Debian developer-maintainers M. Zhou and Nico Schlömer for sponsoring our package in Debian. Their participation, contributions, and advocacy were key to getting BLIS into the second-most popular Linux distribution (behind Ubuntu, which Debian packages feed into). The Debian tracker page may be found here.

  • BLIS now supports mixed-datatype gemm! The gemm operation may now be executed on operands of mixed domains and/or mixed precisions. Any combination of storage datatype for A, B, and C is now supported, along with a separate computation precision that can differ from the storage precision of A and B. And even the 1m method now supports mixed-precision computation. For more details, please see our ACM TOMS journal article submission (current draft).

  • BLIS now implements the 1m method. Let's face it: writing complex assembly gemm microkernels for a new architecture is never a priority--and now, it almost never needs to be. The 1m method leverages existing real domain gemm microkernels to implement all complex domain level-3 operations. For more details, please see our ACM TOMS journal article submission (current draft).
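
As a rough illustration of the idea behind the 1m method (a simplified sketch, not BLIS's actual packing code), the C fragment below computes a complex matrix product using a single real-domain matrix multiply. Each complex element of A is expanded into a 2x2 real block, while the real and imaginary parts of B are stored in separate rows. The function names real_gemm and complex_gemm_via_real are hypothetical and are not part of any BLIS API; the row-major layout and the explicit workspace copies are assumptions made for clarity.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a real-domain gemm kernel (not a BLIS
   function): C := A * B with A (m x k), B (k x n), C (m x n), all
   row-major and contiguous. */
static void real_gemm(size_t m, size_t n, size_t k,
                      const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
}

/* Complex C := A * B computed with one call to real_gemm.
   a (m x k), b (k x n), and c (m x n) hold interleaved (re, im)
   pairs in row-major order. Assumes m, n, k > 0.
   Returns 0 on success, -1 if workspace allocation fails. */
int complex_gemm_via_real(size_t m, size_t n, size_t k,
                          const double *a, const double *b, double *c)
{
    double *ae = malloc(2 * m * 2 * k * sizeof *ae); /* expanded A: 2m x 2k  */
    double *bs = malloc(2 * k * n * sizeof *bs);     /* separated B: 2k x n  */
    double *cs = malloc(2 * m * n * sizeof *cs);     /* separated C: 2m x n  */
    if (!ae || !bs || !cs) {
        free(ae); free(bs); free(cs);
        return -1;
    }

    /* Expand each complex element ar + i*ai of A into the real block
       [ ar -ai ]
       [ ai  ar ]. */
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p) {
            double ar = a[2 * (i * k + p)];
            double ai = a[2 * (i * k + p) + 1];
            ae[(2 * i)     * (2 * k) + 2 * p]     =  ar;
            ae[(2 * i)     * (2 * k) + 2 * p + 1] = -ai;
            ae[(2 * i + 1) * (2 * k) + 2 * p]     =  ai;
            ae[(2 * i + 1) * (2 * k) + 2 * p + 1] =  ar;
        }

    /* Split each complex element of B: real part in row 2p,
       imaginary part in row 2p + 1. */
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < n; ++j) {
            bs[(2 * p)     * n + j] = b[2 * (p * n + j)];
            bs[(2 * p + 1) * n + j] = b[2 * (p * n + j) + 1];
        }

    /* One real matrix multiply yields Re(C) in even rows and Im(C)
       in odd rows. */
    real_gemm(2 * m, n, 2 * k, ae, bs, cs);

    /* Re-interleave the result into (re, im) pairs. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            c[2 * (i * n + j)]     = cs[(2 * i)     * n + j];
            c[2 * (i * n + j) + 1] = cs[(2 * i + 1) * n + j];
        }

    free(ae); free(bs); free(cs);
    return 0;
}
```

For example, with m = n = k = 1, a = {1, 2}, and b = {3, 4} (that is, (1+2i)(3+4i)), the routine stores {-5, 10} in c. A production implementation would presumably fold the rearrangement into its packing step rather than copy into a separate workspace as done here.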

What People Are Saying About BLIS

"I noticed a substantial increase in multithreaded performance on my own machine, which was extremely satisfying." ... "[I was] happy it worked so well!" (Justin Shea)

"This is an awesome library." ... "I want to thank you and the blis team for your efforts." (@Lephar)

"Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL... [T]his is why Intel funds BLIS development." (@jeffhammond)

"So BLIS is now a part of Elk." ... "We have found that zgemm applied to a 15000x15000 matrix with multi-threaded BLIS on a 32-core Ryzen 2990WX processor is about twice as fast as MKL" ... "I'm starting to like this a lot." (@jdk2016)

"I [found] BLIS because I was looking for BLAS operations on C-ordered arrays for NumPy. BLIS has that, but even better is the fact that it's developed in the open using a more modern language than Fortran." (@nschloe)

"The specific reason to have BLIS included [in Linux distributions] is the KNL and SKX [AVX-512] BLAS support, which OpenBLAS doesn't have." (@loveshack)

"All tests pass without errors on OpenBSD. Thanks!" (@ararslan)

"Thank you very much for your great help!... Looking forward to benchmarking." (@mrader1248)

"Thanks for the beautiful work." (@mmrmo)

"[M]y software currently uses BLIS for its BLAS interface..." (@ShadenSmith)

"[T]hanks so much for your work on this! Excited to test." ... "[On AMD Excavator], BLIS is competitive to / slightly faster than OpenBLAS for dgemms in my tests." (@iotamudelta)

"BLIS provided the only viable option on KNL, whose ecosystem is at present dominated by blackbox toolchains. Thanks again. Keep on this great work." (@heroxbd)

"I want to definitely try this out..." (@ViralBShah)

Key Features

BLIS offers several advantages over traditional BLAS libraries:

  • Portability that doesn't impede high performance. Portability was a top priority of ours when creating BLIS. With virtually no additional effort on the part of the developer, BLIS is configurable as a fully-functional reference implementation. But more importantly, the framework identifies and isolates a key set of computational kernels which, when optimized, immediately and automatically optimize performance across virtually all level-2 and level-3 BLIS operations. In this way, the framework acts as a productivity multiplier. And since the optimized (non-portable) code is compartmentalized within these few kernels, instantiating a high-performance BLIS library on a new architecture is a relatively straightforward endeavor.

  • Generalized matrix storage. The BLIS framework exports interfaces that allow one to specify both the row stride and column stride of a matrix. This allows one to compute with matrices stored in column-major order, row-major order, or by general stride. (This latter storage format is important for those seeking to implement tensor contractions on multidimensional arrays.) Furthermore, since BLIS tracks stride information for each matrix, operands of different storage formats can be used within the same operation invocation. By contrast, BLAS requires column-major storage. And while the CBLAS interface supports row-major storage, it does not allow mixing storage formats.

  • Rich support for the complex domain. BLIS operations are developed and expressed in their most general form, which is typically in the complex domain. These formulations then simplify elegantly down to the real domain, with conjugations becoming no-ops. Unlike the BLAS, all input operands in BLIS that allow transposition and conjugate-transposition also support conjugation (without transposition), which obviates the need for thread-unsafe workarounds. Also, where applicable, both complex symmetric and complex Hermitian forms are supported. (BLAS omits some complex symmetric operations, such as symv, syr, and syr2.) Another great example of BLIS serving as a portability lever is its implementation of the 1m method for complex matrix multiplication, a novel mechanism of providing high-performance complex level-3 operations using only real domain microkernels. This innovation guarantees automatic level-3 support in the complex domain even when the kernel developers entirely forgo writing complex kernels.

  • Advanced multithreading support. BLIS allows multiple levels of symmetric multithreading for nearly all level-3 operations. (Currently, users may choose to obtain parallelism via OpenMP, POSIX threads, or HPX.) This means that matrices may be partitioned in multiple dimensions simultaneously to attain scalable, high-performance parallelism on multicore and many-core architectures. The key to this innovation is a thread-specific control tree infrastructure which encodes information about the logical thread topology and allows threads to query and communicate data amongst one another. BLIS also employs so-called "quadratic partitioning" when computing dimension sub-ranges for each thread, so that arbitrary diagonal offsets of structured matrices with unreferenced regions are taken into account to achieve proper load balance. More recently, BLIS introduced a runtime abstraction to specify parallelism on a per-call basis, which is useful for applications that want to manage most of the parallelism themselves.

  • Ease of use. The BLIS framework, and the library of routines it generates, are easy to use for end users, experts, and vendors alike. An optional BLAS compatibility layer provides application developers with backwards compatibility to existing BLAS-dependent codes. Or, one may adjust or write their application to take advantage of new BLIS functionality (such as generalized storage formats or additional complex operations) by calling one of BLIS's native APIs directly. BLIS's typed API will feel familiar to many veterans of BLAS since these interfaces use BLAS-like calling sequences. And many will find BLIS's object-based APIs a delight to use when customizing or writing their own BLIS operations. (Objects are relatively lightweight structs and passed by address, which helps tame function calling overhead.)

  • Multilayered API and exposed kernels. The BLIS framework exposes its implementations in various layers, allowing expert developers to access exactly the functionality desired. This layered interface includes that of the lowest-level kernels, for those who wish to bypass the bulk of the framework. Optimizations can occur at various levels, in part thanks to exposed packing and unpacking facilities, which by default are highly parameterized and flexible.

  • Functionality that grows with the community's needs. As its name suggests, the BLIS framework is not a single library or static API, but rather a nearly-complete template for instantiating high-performance BLAS-like libraries. Furthermore, the framework is extensible, allowing developers to leverage existing components to support new operations as they are identified. If such operations require new kernels for optimal efficiency, the framework and its APIs will be adjusted and extended accordingly. Community developers who wish to experiment with creating new operations or APIs in BLIS can quickly and easily do so via the Addons feature.

  • Code re-use. Auto-generation approaches to achieving the aforementioned goals tend to quickly lead to code bloat due to the multiple dimensions of variation supported: operation (e.g. gemm, herk, trmm, etc.); parameter case (i.e. side, [conjugate-]transposition, upper/lower storage, unit/non-unit diagonal); datatype (i.e. single-/double-precision real/complex); matrix storage (i.e. row-major, column-major, generalized); and algorithm (i.e. partitioning path and kernel shape). These "brute force" approaches often consider and optimize each operation or case combination in isolation, which is less than ideal when the goal is to provide entire libraries. BLIS was designed to be a complete framework for implementing basic linear algebra operations, but supporting this vast amount of functionality in a manageable way required a holistic design that employed careful abstractions, layering, and recycling of generic (highly parameterized) codes, subject to the constraint that high performance remain attainable.

  • A foundation for mixed domain and/or mixed precision operations. BLIS was designed with the hope of one day allowing computation on real and complex operands within the same operation. Similarly, we wanted to allow mixing operands' numerical domains, floating-point precisions, or both domain and precision, and to optionally compute in a precision different than one or both operands' storage precisions. This feature has been implemented for the general matrix multiplication (gemm) operation, providing 128 different possible type combinations, which, when combined with existing transposition, conjugation, and storage parameters, enables 55,296 different gemm use cases. For more details, please see the documentation on mixed datatype support and/or our ACM TOMS journal paper on mixed-domain/mixed-precision gemm (linked below).
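
To make the generalized matrix storage feature above concrete, here is a minimal sketch in plain C (it does not call BLIS) of how a row stride rs and a column stride cs describe where element (i, j) lives: at offset i*rs + j*cs from the start of the matrix. For an m x n matrix with no padding, column-major storage corresponds to rs = 1, cs = m, and row-major storage to rs = n, cs = 1. The routine name ref_gemm_strided is hypothetical and chosen only for illustration; it is an unoptimized reference loop, not how BLIS implements gemm.

```c
#include <stddef.h>

/* Hypothetical reference routine (not part of BLIS):
   C := C + A * B, where A is m x k, B is k x n, and C is m x n.
   Each operand carries its own row stride (rs_*) and column stride
   (cs_*), both counted in elements, so element (i, j) of a matrix x
   is found at x[i * rs_x + j * cs_x]. */
void ref_gemm_strided(size_t m, size_t n, size_t k,
                      const double *a, size_t rs_a, size_t cs_a,
                      const double *b, size_t rs_b, size_t cs_b,
                      double *c, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; ++p)
                sum += a[i * rs_a + p * cs_a] * b[p * rs_b + j * cs_b];
            c[i * rs_c + j * cs_c] += sum;
        }
}
```

For instance, with a = {1, 2, 3, 4} stored row-major (rs = 2, cs = 1), b = {5, 7, 6, 8} stored column-major (rs = 1, cs = 2), and a zeroed row-major c, the call ref_gemm_strided(2, 2, 2, a, 2, 1, b, 1, 2, c, 2, 1) leaves c = {19, 22, 43, 50}. Mixing storage formats in one call needs no special casing because every access goes through the strides.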

How to Download BLIS

There are a few ways to download BLIS. We list the four most common ways below. We highly recommend using either Option 1 or 2. Otherwise, we recommend Option 3 (over Option 4) so your compiler can perform optimizations specific to your hardware.

  1. Download a source repository with git clone. Generally speaking, we prefer using git clone to clone a git repository. Having a repository allows the user to periodically pull in the latest changes and quickly rebuild BLIS whenever they wish. Also, implicit in cloning a repository is that the repository defaults to using the master branch, which contains the latest "stable" commits since the most recent release. (This is in contrast to Option 3 in which the user is opting for code that may be slightly out of date.)

    In order to clone a git repository of BLIS, please obtain a repository URL by clicking on the green button above the file/directory listing near the top of this page (as rendered by GitHub). Generally speaking, it will amount to executing the following command in your terminal shell:

    git clone https://github.com/flame/blis.git
    

    At this point, you will have the latest commit of the master branch checked out. If you wish to check out a particular version x.y.z, execute the following:

    git checkout x.y.z
    

    git will then transform your working copy to match the state of the commit associated with version x.y.z. You can view a list of tags at any time by executing:

    git tag --list
    
  2. Download a source repository via a zip file. If you are uncomfortable with using git but would still like the latest stable commits, we recommend that you download BLIS as a zip file.

    In order to download a zip file of the BLIS source distribution, please click on the green button above the file listing near the top of this page. This should reveal a link for downloading the zip file.

  3. Download a source release via a tarball/zip file. Alternatively, if you would like to stick to the code that is included in official releases, you may download either a tarball or zip file of BLIS's latest release. Some older releases are only available as tagged commits. (Note: downloading release x.y.z is equivalent to downloading, or checking out, tag x.y.z.) We consider this option to be less than ideal for most people since it will likely mean you miss out on the latest bugfix or feature commits (in contrast to Options 1 or 2), and you also will not be able to update your code with a simple git pull command (in contrast to Option 1).

  4. Download a binary package specific to your OS. While we don't recommend this as the first choice for most users, we provide links to community members who generously maintain BLIS packages for various Linux distributions such as Debian Unstable and EPEL/Fedora. Please see the External Packages section below for more information.

Getting Started

NOTE: This section assumes you've either cloned a BLIS source code repository via git, downloaded the latest source code via a zip file, or downloaded the source code for a tagged version release (Options 1, 2, or 3, respectively, as discussed in the previous section).

If you just want to build a sequential (not parallelized) version of BLIS in a hurry and come back and explore other topics later, you can configure and build BLIS as follows:

$ ./configure auto
$ make [-j]

You can then verify your build by running BLAS- and BLIS-specific test drivers via make check:

$ make check [-j]

And if you would like to install BLIS to the directory specified to configure via the --prefix option, run the install target:

$ make install

Please read the output of ./configure --help for a full list of configure-time options. If/when you have time, we strongly encourage you to read the detailed walkthrough of the build system found in our Build System guide.

If you are still having trouble, you are welcome to join us on Discord for further information and/or assistance.

Example Code

The BLIS source distribution provides example code in the examples directory. Example code focuses on using BLIS APIs (not BLAS or CBLAS), and resides in two subdirectories: examples/oapi (which demonstrates the object API) and examples/tapi (which demonstrates the typed API).

Each directory contains several files, each containing various pieces of code that exercise core functionality of the BLIS API in question (object or typed). These example files should be thought of collectively like a tutorial, and therefore it is recommended to start from the beginning (the file whose name starts with 00).

You can build all of the examples by simply running make from either example subdirectory (examples/oapi or examples/tapi). (You can also run make clean.) The local Makefile assumes that you've already configured and built (but not necessarily installed) BLIS two directories up, that is, in ../.. relative to the example subdirectory. If you have already installed BLIS to some permanent directory, you may refer to that installation by setting the environment variable BLIS_INSTALL_PATH prior to running make:

export BLIS_INSTALL_PATH=/usr/local; make

or by setting the same variable as part of the make command:

make BLIS_INSTALL_PATH=/usr/local

Once the executable files have been built, we recommend reading the code and the corresponding executable output side by side. This will help you see the effects of each section of code.

This tutorial is not exhaustive or complete; several object API functions were omitted (mostly for brevity's sake) and thus more examples could be written.

Documentation

We provide extensive documentation on the BLIS build system, APIs, test infrastructure, and other important topics. All documentation is formatted in markdown and included in the BLIS source distribution (usually in the docs directory). Slightly longer descriptions of each document may be found in the project's wiki section.

Documents for everyone:

  • Build System. This document covers the basics of configuring and building BLIS libraries, as well as related topics.

  • Testsuite. This document describes how to run BLIS's highly parameterized and configurable test suite, as well as the included BLAS test drivers.

  • BLIS Typed API Reference. Here we document the so-called "typed" (or BLAS-like) API. This is the API that many users who are already familiar with the BLAS will likely want to use.

  • BLIS Object API Reference. Here we document the object API. This API abstracts away properties of vectors and matrices within obj_t structs that can be queried with accessor functions. Many developers and experts prefer this API over the typed API.

  • Hardware Support. This document maintains a table of supported microarchitectures.

  • Multithreading. This document describes how to use the multithreading features of BLIS.

  • Mixed-Datatypes. This document provides an overview of BLIS's mixed-datatype functionality and provides a brief example of how to take advantage of this new code.

  • Performance. This document reports empirically measured performance of a representative set of level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes.

  • PerformanceSmall. This document reports empirically measured performance of gemm on select hardware architectures within BLIS and other BLAS libraries when performing matrix problems where one or two dimensions are exceedingly small.

  • Discord. This document describes how to: create an account on Discord (if you don't already have one); obtain a private invite link; and use that invite link to join our BLIS server on Discord.

  • Release Notes. This document tracks a summary of changes included with each new version of BLIS, along with contributor credits for key features.

  • Frequently Asked Questions. If you have general questions about BLIS, please read this FAQ. If you can't find the answer to your question, please feel free to join the blis-devel mailing list and post a question. We also have a blis-discuss mailing list that anyone can post to (even without joining).

Documents for GitHub contributors:

  • Contributing bug reports, feature requests, PRs, etc. Interested in contributing to BLIS? Please read this document before getting started. It provides a general overview of how best to report bugs, propose new features, and offer code patches.

  • Coding Conventions. If you are interested in, or planning on, contributing code to BLIS, please read this document so that you can format your code in accordance with BLIS's standards.

Documents for BLIS developers:

  • Kernels Guide. If you would like to learn more about the types of kernels that BLIS exposes, their semantics, the operations that each kernel accelerates, and various implementation issues, please read this guide.

  • Configuration Guide. If you would like to learn how to add new sub-configurations or configuration families, or are simply interested in learning how BLIS organizes its configurations and kernel sets, please read this thorough walkthrough of the configuration system.

  • Addon Guide. If you are interested in learning about using BLIS addons--that is, enabling existing (or creating new) bundles of operation or API code that are built into a BLIS library--please read this document.

  • Sandbox Guide. If you are interested in learning about using sandboxes in BLIS--that is, providing alternative implementations of the gemm operation--please read this document.

Performance

We provide graphs that report performance of several implementations across a range of hardware types, multithreading configurations, problem sizes, operations, and datatypes. These pages also document most of the details needed to reproduce these experiments.

  • Performance. This document reports empirically measured performance of a representative set of level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes.

  • PerformanceSmall. This document reports empirically measured performance of gemm on select hardware architectures within BLIS and other BLAS libraries when performing matrix problems where one or two dimensions are exceedingly small.

External Packages

Generally speaking, we highly recommend building from source whenever possible using the latest git clone. (Tarballs of each tagged release are also available, but we consider them to be less ideal since they are not as easy to upgrade as git clones.)

That said, some users may prefer binary and/or source packages through their Linux distribution. Thanks to generous involvement/contributions from our community members, the following BLIS packages are now available:

  • Debian. M. Zhou has volunteered to sponsor and maintain BLIS packages within the Debian Linux distribution. The Debian package tracker can be found here. (Also, thanks to Nico Schlömer for previously volunteering his time to set up a standalone PPA.)

  • Gentoo. M. Zhou also maintains the BLIS package entry for Gentoo, a Linux distribution known for its source-based portage package manager and distribution system.

  • EPEL/Fedora. There are official BLIS packages in Fedora and EPEL (for RHEL7+ and compatible distributions) with versions for 64-bit integers, OpenMP, and pthreads, and shims which can be dynamically linked instead of reference BLAS. (NOTE: For architectures other than intel64, amd64, and maybe arm64, the performance of packaged BLIS will be low because it uses unoptimized generic kernels; for those architectures, OpenBLAS may be a better solution.) Dave Love provides additional packages for EPEL6 in a Fedora Copr, and possibly versions more recent than the official repo for other EPEL/Fedora releases. The source packages may build on other rpm-based distributions.

  • OpenSuSE. The copr referred to above has rpms for some OpenSuSE releases; the source rpms may build for others.

  • GNU Guix. Guix has BLIS packages, but provides builds only for the generic target and some specific x86_64 micro-architectures.

  • Conda. The conda channel conda-forge has Linux, OSX, and Windows binary packages for x86_64.

Discussion

Most of the active discussions are now happening on our Discord server. Users and developers alike are welcome! Please see the BLIS Discord guide for a walkthrough of how to join us.

You can also still stay in touch by using either of the following mailing lists:

  • blis-devel: Please join and post to this mailing list if you are a BLIS developer, or if you are trying to use BLIS beyond simply linking to it as a BLAS library.

  • blis-discuss: Please join and post to this mailing list if you have general questions or feedback regarding BLIS. Application developers (end users) may wish to post here, unless they have bug reports, in which case they should open a new issue on github.

Contributing

For information on how to contribute to our project, including preferred coding conventions, please refer to the CONTRIBUTING file at the top-level of the BLIS source distribution.

Citations

For those of you looking for the appropriate article to cite regarding BLIS, we recommend citing our first ACM TOMS journal paper (unofficial backup link):

@article{BLIS1,
   author      = {Field G. {V}an~{Z}ee and Robert A. {v}an~{d}e~{G}eijn},
   title       = {{BLIS}: A Framework for Rapidly Instantiating {BLAS} Functionality},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {41},
   number      = {3},
   pages       = {14:1--14:33},
   month       = {June},
   year        = {2015},
   issue_date  = {June 2015},
   url         = {https://doi.acm.org/10.1145/2764454},
}

You may also cite the second ACM TOMS journal paper (unofficial backup link):

@article{BLIS2,
   author      = {Field G. {V}an~{Z}ee and Tyler Smith and Francisco D. Igual and
                  Mikhail Smelyanskiy and Xianyi Zhang and Michael Kistler and Vernon Austel and
                  John Gunnels and Tze Meng Low and Bryan Marker and Lee Killough and
                  Robert A. {v}an~{d}e~{G}eijn},
   title       = {The {BLIS} Framework: Experiments in Portability},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {42},
   number      = {2},
   pages       = {12:1--12:19},
   month       = {June},
   year        = {2016},
   issue_date  = {June 2016},
   url         = {https://doi.acm.org/10.1145/2755561},
}

We also have a third paper, submitted to IPDPS 2014, on achieving multithreaded parallelism in BLIS (unofficial backup link):

@inproceedings{BLIS3,
   author      = {Tyler M. Smith and Robert A. {v}an~{d}e~{G}eijn and Mikhail Smelyanskiy and
                  Jeff R. Hammond and Field G. {V}an~{Z}ee},
   title       = {Anatomy of High-Performance Many-Threaded Matrix Multiplication},
   booktitle   = {28th IEEE International Parallel \& Distributed Processing Symposium
                  (IPDPS 2014)},
   year        = {2014},
   url         = {https://doi.org/10.1109/IPDPS.2014.110},
}

A fourth paper, submitted to ACM TOMS, proposes an analytical model for determining blocksize parameters in BLIS (unofficial backup link):

@article{BLIS4,
   author      = {Tze Meng Low and Francisco D. Igual and Tyler M. Smith and
                  Enrique S. Quintana-Ort\'{\i}},
   title       = {Analytical Modeling Is Enough for High-Performance {BLIS}},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {43},
   number      = {2},
   pages       = {12:1--12:18},
   month       = {August},
   year        = {2016},
   issue_date  = {August 2016},
   url         = {https://doi.acm.org/10.1145/2925987},
}

A fifth paper, submitted to ACM TOMS, begins the study of so-called induced methods for complex matrix multiplication (unofficial backup link):

@article{BLIS5,
   author      = {Field G. {V}an~{Z}ee and Tyler Smith},
   title       = {Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {44},
   number      = {1},
   pages       = {7:1--7:36},
   month       = {July},
   year        = {2017},
   issue_date  = {July 2017},
   url         = {https://doi.acm.org/10.1145/3086466},
}

A sixth paper, published in the SIAM Journal on Scientific Computing, revisits the topic of the previous article and derives a superior induced method (unofficial backup link):

@article{BLIS6,
   author      = {Field G. {V}an~{Z}ee},
   title       = {Implementing High-Performance Complex Matrix Multiplication via the 1m Method},
   journal     = {SIAM Journal on Scientific Computing},
   volume      = {42},
   number      = {5},
   pages       = {C221--C244},
   month       = {September},
   year        = {2020},
   issue_date  = {September 2020},
   url         = {https://doi.org/10.1137/19M1282040}
}

A seventh paper, submitted to ACM TOMS, explores the implementation of gemm for mixed-domain and/or mixed-precision operands (unofficial backup link):

@article{BLIS7,
   author      = {Field G. {V}an~{Z}ee and Devangi N. Parikh and Robert A. {v}an~{d}e~{G}eijn},
   title       = {Supporting Mixed-domain Mixed-precision Matrix Multiplication
                  within the {BLIS} Framework},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {47},
   number      = {2},
   pages       = {12:1--12:26},
   month       = {April},
   year        = {2021},
   issue_date  = {April 2021},
   url         = {https://doi.org/10.1145/3402225},
}

Awards

Funding

This project and its associated research were partially sponsored by grants from Microsoft, Intel, Texas Instruments, AMD, HPE, Oracle, Huawei, Facebook, and ARM, as well as grants from the National Science Foundation (Awards CCF-0917167, ACI-1148125/1340293, CCF-1320112, and ACI-1550493).

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

blis's People

Contributors

ajaypanyala, bartoldeman, biplabraut, cpulib-git, ct-clmsn, devinamatthews, dnparikh, fgvanzee, figual, hominhquan, iotamudelta, isuruf, jdiamondgithub, jeffhammond, kali, kiran-amd, kvaragan, leekillough, loveshack, maratyszcza, meghana-vankadari, mkv14, nicholaitukanov, nisanthmp, nisanthmpamd, pradeeptrgit, shadensmith, tlrmchlsmth, xianyi, xrq-phys

blis's Issues

Edge cases for row-major ukernels not handled quite right

For gemm ukernels which prefer contiguous rows, the temporary buffer for edge cases (when m != mr or n != nr) is treated as column-major (i.e. rs_c = 1 and cs_c = mr). If the general-stride pathway in the ukernel is very slow then this might have a performance impact for small to medium size matrices.

The ukernel wrapper should treat the temporary buffer as row-major (rs_c = nr and cs_c = 1) in this case.
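
The proposed stride choice can be sketched as follows. This is only an illustration: `tmp_strides_t` and `choose_tmp_strides` are hypothetical names, not part of BLIS, and the row preference is assumed to be known to the wrapper.

```c
#include <stdbool.h>

/* Hypothetical helper: strides for the mr x nr temporary buffer used
   when the ukernel handles an edge case (m != mr or n != nr). */
typedef struct { int rs; int cs; } tmp_strides_t;

static tmp_strides_t choose_tmp_strides(bool prefers_rows, int mr, int nr)
{
    tmp_strides_t s;
    if (prefers_rows) {
        /* Row-major: the elements of a row are contiguous. */
        s.rs = nr;
        s.cs = 1;
    } else {
        /* Column-major: the elements of a column are contiguous. */
        s.rs = 1;
        s.cs = mr;
    }
    return s;
}
```

With strides chosen this way, a row-preferring ukernel would write the temporary buffer through its contiguous-row path rather than the general-stride path.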

Issue with pthreads and memory broker abstraction

When compiling with pthreads now, I get a bunch of warnings and errors like those below. If I back up to the commit ce59f81, these go away.

In file included from ./frame/thread/bli_mutex.h:43:0,
from ./frame/thread/bli_thread.h:56,
from ./frame/include/blis.h:74,
from frame/base/bli_membrk.c:36:
frame/base/bli_membrk.c: In function ‘bli_membrk_init’:
./frame/base/bli_membrk.h:56:2: warning: passing argument 1 of ‘pthread_mutex_init’ from incompatible pointer type
( &( (membrk_p)->mutex ) )
^
./frame/thread/bli_mutex_pthreads.h:53:22: note: in definition of macro ‘bli_mutex_init’
pthread_mutex_init( mtx_p );
^
frame/base/bli_membrk.c:44:18: note: in expansion of macro ‘bli_membrk_mutex’
bli_mutex_init( bli_membrk_mutex( membrk ) );
^
In file included from ./frame/thread/bli_mutex_pthreads.h:42:0,
from ./frame/thread/bli_mutex.h:43,
from ./frame/thread/bli_thread.h:56,
from ./frame/include/blis.h:74,
from frame/base/bli_membrk.c:36:
/usr/include/pthread.h:723:12: note: expected ‘union pthread_mutex_t *’ but argument is of type ‘struct mtx_s *’
extern int pthread_mutex_init (pthread_mutex_t *__mutex,
^
In file included from ./frame/thread/bli_mutex.h:43:0,
from ./frame/thread/bli_thread.h:56,
from ./frame/include/blis.h:74,
from frame/base/bli_membrk.c:36:
./frame/thread/bli_mutex_pthreads.h:53:2: error: too few arguments to function ‘pthread_mutex_init’
pthread_mutex_init( mtx_p );
^
frame/base/bli_membrk.c:44:2: note: in expansion of macro ‘bli_mutex_init’
bli_mutex_init( bli_membrk_mutex( membrk ) );
^
In file included from ./frame/thread/bli_mutex_pthreads.h:42:0,
from ./frame/thread/bli_mutex.h:43,
from ./frame/thread/bli_thread.h:56,
from ./frame/include/blis.h:74,
from frame/base/bli_membrk.c:36:
/usr/include/pthread.h:723:12: note: declared here
extern int pthread_mutex_init (pthread_mutex_t *__mutex,
^
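
The two diagnostics above (a pointer to the wrapper struct passed where a `pthread_mutex_t *` is expected, and a missing second argument) suggest a wrapper along these lines. This is a minimal sketch, not the BLIS fix: the member name `mutex` and the functions `mtx_init` and `mtx_finalize` are assumptions made for illustration.

```c
#include <pthread.h>

/* Assumption: the wrapper type holds a pthread_mutex_t member. */
typedef struct mtx_s { pthread_mutex_t mutex; } mtx_t;

/* pthread_mutex_init() takes a pthread_mutex_t* and an attribute
   pointer; NULL selects the default attributes. */
static int mtx_init(mtx_t *m)
{
    return pthread_mutex_init(&m->mutex, NULL);
}

static int mtx_finalize(mtx_t *m)
{
    return pthread_mutex_destroy(&m->mutex);
}
```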

BLAS-like kernel for symmetric updates in LDL factorization without pivoting

There is a long history of proposals to include simple modifications of ?syrk and ?herk to support C += alpha A (D A)^T, where D is diagonal. This kernel turns out to be important for Interior Point Methods, which often make use of LDL factorizations (without pivoting) of modified saddle-point systems.

Does such a routine already exist in BLIS? If not, is there a good place to start for adding support? Or a preferred name/convention?
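
For concreteness, here is an unoptimized reference sketch of such an update. It assumes the intended operation is C += alpha A D A^T with A of size m x k and D a k x k diagonal matrix stored as a vector, which is the form that arises in LDL updates. The name `ref_syrk_diag_lower` is hypothetical and is not an existing BLIS routine.

```c
#include <stddef.h>

/* Reference sketch: C += alpha * A * D * A^T, lower triangle only.
   A is m x k, column-major with leading dimension lda; d holds the k
   diagonal entries of D; C is m x m, column-major with leading
   dimension ldc. */
static void ref_syrk_diag_lower(size_t m, size_t k, double alpha,
                                const double *a, size_t lda,
                                const double *d,
                                double *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p) {
        for (size_t j = 0; j < m; ++j) {
            /* Scale column p of A by alpha * d[p] once per (p, j). */
            double t = alpha * d[p] * a[j + p * lda];
            for (size_t i = j; i < m; ++i)
                c[i + j * ldc] += a[i + p * lda] * t;
        }
    }
}
```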

Incorrect result in axpys (theoretically)

The ???axpys family of macros could give the incorrect result for alpha and x complex and y real. For example, zzdaxpys calls daxpyris which computes yr += ar * xr instead of the correct result yr += ar * xr - ai * xi.

One way to fix this is to drop all of the scalar macros and just use C99 complex operators like normal people.
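
A minimal sketch of the C99 approach for the complex-alpha, complex-x, real-y case. The function name `axpy_zzd` is made up for this example and is not a BLIS macro.

```c
#include <complex.h>

/* y := y + Re(alpha * x) for complex alpha and x and real y.
   The complex product supplies ar*xr - ai*xi as its real part. */
static void axpy_zzd(double complex alpha, double complex x, double *y)
{
    *y += creal(alpha * x);
}
```

For alpha = 1 + 2i and x = 3 + 4i the real part of the product is 1*3 - 2*4 = -5, whereas the ar * xr shortcut would give 3.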

Intel Skylake is not autodetected.

On my i5-6500, the configuration is auto-detected as 'reference'. Since there is no skylake ukernel, the config should be set to haswell.

Kernel symlinks cause build failure in msys2

Another one for you, following up on #9

When I try to build the Sandy Bridge configuration on Windows in MSYS2 with MinGW compiler (on an i7-2630QM), I get a failure to link the test executable:

Archiving lib/sandybridge/libblis.a
Linking test_libblis.x against './lib/sandybridge/libblis.a -lm'
./lib/sandybridge/libblis.a(bli_gemm_cntl.o):bli_gemm_cntl.c:(.text+0x1bc): undefined reference to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1'
./lib/sandybridge/libblis.a(bli_gemm_ukernel.o):bli_gemm_ukernel.c:(.text+0x11): undefined reference to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1'
./lib/sandybridge/libblis.a(bli_gemmtrsm_l_ukr_ref.o):bli_gemmtrsm_l_ukr_ref.c:(.text+0x10d): undefined reference to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1'
./lib/sandybridge/libblis.a(bli_gemmtrsm_u_ukr_ref.o):bli_gemmtrsm_u_ukr_ref.c:(.text+0x10d): undefined reference to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1'
./lib/sandybridge/libblis.a(bli_gemm4m_ukr_ref.o):bli_gemm4m_ukr_ref.c:(.text+0xe94): undefined reference to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1'
./lib/sandybridge/libblis.a(bli_gemm4m_ukr_ref.o):bli_gemm4m_ukr_ref.c:(.text+0xedb): more undefined references to `bli_dgemm_opt_8x4_ref_u4_nodupl_avx1' follow
d:/code/mingw-builds/x64-4.8.1-win32-seh-rev5/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.8.1/../../../../x86_64-w64-mingw32/bin/ld.exe: ./lib/sandybridge/libblis.a(bli_gemm4m_ukr_ref.o): bad reloc address 0x0 in section `.pdata'
collect2.exe: error: ld returned 1 exit status
Makefile:531: recipe for target 'test_libblis.x' failed
make: *** [test_libblis.x] Error 1

If I try in Cygwin, setting CC := x86_64-w64-mingw32-gcc and AR := x86_64-w64-mingw32-ar in config/sandybridge/make_defs.mk to use the MinGW cross-compiler, then the executable links correctly but segfaults when running the tests. The backtrace is more interesting if I set BLIS_SIMD_ALIGN_SIZE to 1 in config/sandybridge/config.h, since my patch in #9 didn't completely fix the alignment problems. Backtrace with alignment=1 (also uncommented CDBGFLAGS := -g to get debug info) posted here https://gist.github.com/tkelman/25d290b131c0a1205b27. Everything passes until blis_dgemm_nn_ccc. The same bli_dgemm_opt_8x4_ref_u4_nodupl_avx1 that was an undefined reference in MSYS2 is causing the segfault in Cygwin-to-MinGW cross-compile.

Illegal Instruction in bli_init()

I've just set up my first blis program, but calling bli_init() causes the program to trigger a SIGILL. I'm using Ubuntu 14.04 GNU/Linux using the Sandymount configuration for BLIS. It also breaks here when I run the test_gemm_blis.x test in blis/test/ or the test suite with the given makefiles.

I've traced it down to inside of bli_const_init() using gdb.

bli_obj_create_const(2.0, &BLIS_TWO);

I've tried re-configuring & rebuilding the library several times but it doesn't seem to help.

Intermittent NaNs appearing in results from calling shared-library blis from Julia

CPU: Core2 Duo E8400 (old machine)
OS: Ubuntu 14.04, x86-64

Compiled BLIS reference configuration, setting BLIS_ENABLE_DYNAMIC_BUILD := yes. By itself, BLIS passes its own make test.

I'm calling into BLIS from Julia by the following steps:

git clone https://github.com/julialang/julia
cd julia
mkdir -p $PWD/usr/lib
cp /path/to/libblis.so $PWD/usr/lib
echo 'override USE_SYSTEM_BLAS = 1' > Make.user
echo 'override USE_BLAS64 = 0' >> Make.user
echo 'override LIBBLAS = -L$(build_libdir) -lblis -lm' >> Make.user
echo 'override LIBBLASNAME = libblis' >> Make.user
make -j8   # this will take quite a while - Julia has lots of big dependencies
make testall

This gives me a different failure each time I repeat make testall. Here are some examples, from the first couple of files in Julia's test suite (linalg1 and linalg2):
https://gist.github.com/47dbd5517c4a6f56fb2e
https://gist.github.com/51835795c8ada7c0f2a1
https://gist.github.com/552ec09e5e78d1cd3da7
https://gist.github.com/7cce0057bf9a3009a92a
https://gist.github.com/0f01364072634cc95be9
https://gist.github.com/9292fc3fe1a2e9afe09d

I'll see if I can translate a few of these test cases into C to figure out whether the problem is reproducible outside of Julia. I'll also try setting the BLIS integer size to 64-bit and deleting the USE_BLAS64 = 0 line, to see whether that changes anything.

BLIS test suite uses aligned ldims for column storage cases, but not row (or general) cases

Due to the current implementation of libblis_test_mobj_create(), tests with column-stored matrix operands cause those operands to be created with aligned leading dimensions. However, when row storage is tested, matrix operands are created with leading dimensions that are NOT aligned. This is because bli_obj_create() applies alignment to the default storage case (when "0, 0" is passed in for rs, cs), which currently is column storage. However, when explicit strides are passed in, such as is necessary in order to request row storage, alignment is not applied.

Proposed solution: Add a new parameter to input.general that controls globally whether the test suite will align its operands or not. Then, update libblis_test_mobj_create() so that it keys off of this parameter and then manually aligns the strides (using the SIMD alignment value), if needed, regardless of whether row, column, or general storage is being used, and then passes those strides into bli_obj_create().
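
The manual alignment step could look roughly like this. It is a sketch under stated assumptions: `align_ldim` and `aligned_strides` are hypothetical helpers, and the alignment (in bytes) is assumed to be a multiple of the element size. The resulting strides are what would then be passed into bli_obj_create().

```c
#include <stddef.h>

/* Round a leading dimension (in elements) up so that
   ldim * elem_size is a multiple of align_bytes.
   Assumes align_bytes is a multiple of elem_size. */
static size_t align_ldim(size_t ldim, size_t elem_size, size_t align_bytes)
{
    size_t elems_per_align = align_bytes / elem_size;
    size_t rem = ldim % elems_per_align;
    return rem == 0 ? ldim : ldim + (elems_per_align - rem);
}

/* Aligned strides for an m x n matrix in row or column storage. */
static void aligned_strides(size_t m, size_t n, int row_stored,
                            size_t elem_size, size_t align_bytes,
                            size_t *rs, size_t *cs)
{
    if (row_stored) {
        *rs = align_ldim(n, elem_size, align_bytes);
        *cs = 1;
    } else {
        *rs = 1;
        *cs = align_ldim(m, elem_size, align_bytes);
    }
}
```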

BLIS Test Failure in BlueGene/Q

Hi,

I compiled BLIS 0.1.8 on our BG/Q system without any modification, but I got the following failures when I ran the tests. The complete test log is given here: https://goo.gl/vGXz6S . (libblis.a binary: https://goo.gl/8lQ999 )

Note: The version of our software stack is V1R2M0; jobs are submitted interactively through the SLURM scheduler.

e.g.
...
blis_cgemm4mh_ct_ccc 200 200 200 0.582 1.28e-04 FAILURE
blis_cgemm4mh_ch_ccc 100 100 100 0.515 3.80e-04 FAILURE
blis_cgemm4mh_ch_ccc 200 200 200 0.582 1.30e-04 FAILURE
blis_cgemm4mh_tn_ccc 100 100 100 0.528 3.60e-04 FAILURE
...
blis_zgemm4mh_hc_ccc 200 200 200 2.446 1.40e-01 FAILURE
blis_zgemm4mh_hc_ccc 300 300 300 2.798 1.51e-01 FAILURE
blis_zgemm4mh_hc_ccc 400 400 400 3.141 1.72e-01 FAILURE
blis_zgemm4mh_ht_ccc 100 100 100 1.449 1.43e-01 FAILURE
blis_zgemm4mh_ht_ccc 200 200 200 2.308 1.56e-01 FAILURE
blis_zgemm4mh_ht_ccc 300 300 300 2.682 1.63e-01 FAILURE
...
blis_zsymm4mh_ruch_ccc 100 100 1.449 3.44e-01 FAILURE
blis_zsymm4mh_ruch_ccc 200 200 2.332 3.72e-01 FAILURE
blis_zsymm4mh_ruch_ccc 300 300 2.718 3.99e-01 FAILURE
blis_zsymm4mh_ruch_ccc 400 400 3.051 3.70e-01 FAILURE
blis_csyrk4mh_ln_cc 100 100 0.421 3.72e-04 FAILURE
blis_csyrk4mh_ln_cc 200 200 0.528 1.26e-04 FAILURE
blis_csyrk4mh_lc_cc 100 100 0.421 3.38e-04 FAILURE
blis_csyrk4mh_lc_cc 200 200 0.527 1.25e-04 FAILURE
blis_csyrk4mh_lt_cc 100 100 0.439 3.34e-04 FAILURE
...

Please advise.

Thanks!

Rgds,
Dominic Chien

sandybridge: Does not compile with -dnoopt

The sandybridge configuration does not compile with -dnoopt because its kernels require AVX.

The following would fix it:

COPTFLAGS      := -O0 -march=native

Sample error, using gcc (gcc version 5.3.1 20160316 (Debian 5.3.1-12))

config/sandybridge/kernels/3/bli_gemm_int_d8x4.c: In function ‘bli_dgemm_int_8x4’:
config/sandybridge/kernels/3/bli_gemm_int_d8x4.c:111:10: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
  va0_3b0 = _mm256_setzero_pd();
          ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:41:0,
                 from config/sandybridge/kernels/3/bli_gemm_int_d8x4.c:36:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:834:1: error: inlining failed in call to always_inline ‘_mm256_load_pd’: target specific option mismatch
 _mm256_load_pd (double const *__P)
 ^

Test suite always exits with code 0

Because the test suite always exits with code 0 (success), it is hard to use as a unit test. The test suite should return non-zero if any of the tests failed.
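
One way to do this is to keep a running tally and derive the exit status from it. The sketch below is illustrative only: `test_tally_t`, `tally_record`, and `tally_exit_code` are hypothetical names, not part of the BLIS test suite.

```c
#include <stdlib.h>

/* Hypothetical running tally of test outcomes. */
typedef struct { unsigned passed; unsigned failed; } test_tally_t;

static void tally_record(test_tally_t *t, int passed)
{
    if (passed) t->passed++;
    else        t->failed++;
}

/* Value to return from main(): non-zero if any test failed. */
static int tally_exit_code(const test_tally_t *t)
{
    return t->failed > 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

A driver would call tally_record() wherever it currently prints PASS or FAILURE and end main() with `return tally_exit_code(&tally);`.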

Fused gemvs and trmvs?

BLIS often refers to fused level 1 BLAS-like operations, but I have not seen any fused level 2 operations (e.g., a single-sweep y := A x and u := A^T v). Is there a plan to support such kernels?
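
To make the request concrete, here is an unoptimized sketch of the single-sweep operation: each column of A is read once and used for both products. The function name `fused_gemv_gemvt` is hypothetical, and column-major storage is assumed.

```c
#include <stddef.h>

/* Single sweep over column-major A (m x n, leading dimension lda)
   computing y := A x (length m) and u := A^T v (length n). */
static void fused_gemv_gemvt(size_t m, size_t n,
                             const double *a, size_t lda,
                             const double *x, double *y,
                             const double *v, double *u)
{
    for (size_t i = 0; i < m; ++i) y[i] = 0.0;

    for (size_t j = 0; j < n; ++j) {
        const double *aj = a + j * lda;
        double xj = x[j];
        double uj = 0.0;
        for (size_t i = 0; i < m; ++i) {
            y[i] += aj[i] * xj;   /* axpy contribution to y := A x   */
            uj   += aj[i] * v[i]; /* dot contribution to u := A^T v  */
        }
        u[j] = uj;
    }
}
```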

support LLVM OpenMP

The build system's assumption that OpenMP is not supported with Clang has been false for almost a year now (http://blog.llvm.org/2015/05/openmp-support_22.html). While Mac does not appear to support it in the default toolchain, I've been using Homebrew's clang-omp just fine.

I think that the BLIS build system should test for OpenMP support "the old fashioned way" (i.e. like configure does) and use OpenMP if it is available.
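
A configure-style check typically tries to compile and link a small probe with the compiler's OpenMP flag (for example -fopenmp with GCC and Clang) and enables OpenMP only if that succeeds. The probe below is a sketch of the code such a check might build; the function names are made up for this example. It relies on the `_OPENMP` macro, which the OpenMP specification requires conforming compilers to define when OpenMP is enabled.

```c
/* _OPENMP is defined when OpenMP is enabled; its value is the yyyymm
   date of the supported specification. */
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the OpenMP specification date, or 0 without OpenMP. */
static int openmp_spec_date(void)
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

/* Returns the maximum thread count, or 1 without OpenMP. */
static int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```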

Allow BLIS developer to specify arbitrary malloc()-style allocation functions

As part of the configuration, BLIS should allow the developer to specify the function to call for allocating memory for the following three use cases that occur in BLIS:

  • blocks that are allocated within the BLIS memory pools, which get used for packing buffers.
  • internally-invoked allocation for things such as control tree nodes.
  • user-invoked allocation, primarily exemplified by bli_obj_create() and friends.

The BLIS developer would specify, in bli_kernel.h, cpp macros to identify the names of the malloc()-style functions to use in any of the above cases. It would then be the developer's responsibility to ensure that the object code that defines the malloc() substitutes are available at link-time. The developer does not need to provide a prototype for the malloc() substitutes since we will require those functions to adhere to the same function signature as malloc(), and therefore BLIS can/will provide those prototypes on behalf of the developer, similar to what is done when a developer defines custom kernels/micro-kernels.
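
As an illustration of the proposal, the configuration might look something like the following. All macro and function names here (`EXAMPLE_MALLOC_POOL`, `my_pool_malloc`, and so on) are hypothetical placeholders, not names that BLIS defines.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical macros naming the malloc()-style function per use case. */
#define EXAMPLE_MALLOC_POOL  my_pool_malloc  /* packing-buffer pool blocks */
#define EXAMPLE_MALLOC_INTL  malloc          /* internal allocations       */
#define EXAMPLE_MALLOC_USER  malloc          /* user-invoked allocations   */

/* The framework can declare the prototype itself, because every
   substitute must share malloc()'s signature. */
void *EXAMPLE_MALLOC_POOL(size_t size);

/* Developer-supplied substitute, made available at link time.
   Here it only counts the bytes requested and defers to malloc(). */
static size_t pool_bytes_requested = 0;

void *my_pool_malloc(size_t size)
{
    pool_bytes_requested += size;
    return malloc(size);
}
```

A matching free()-style substitute would presumably be needed as well, but that is outside what the proposal above describes.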

Assignment between restrict pointers "alpha" and "alpha_in" is not allowed. Only outer-to-inner scope assignments between restrict pointers are allowed.

XLC is the only compiler I've ever seen that gives these warnings, but I think they are formally legit. On the other hand, they may be false positives in practice.

[jhammond@vestalac1 blis]$ ./configure -p $HOME/BLIS/bgq bgq  && make -j32 && make install
configure: checking whether we need to update the version file.
configure: checking version file './version'.
configure: found '.git' directory; assuming git clone.
configure: executing git describe --tags.
configure: got back 0.1.1-14-gd531a24.
configure: truncating to 0.1.1-14.
configure: updating version file './version'.
configure: starting configuration of BLIS 0.1.1-14.
configure: configuring with 'bgq' configuration sub-directory.
configure: detected -p option; using install prefix '/home/jhammond/BLIS/bgq'.
configure: creating ./config.mk from ./build/config.mk.in
configure: creating ./obj/bgq
configure: creating ./obj/bgq/config
configure: creating ./obj/bgq/frame
configure: creating ./obj/bgq/testsuite
configure: creating ./lib/bgq
configure: mirroring ./config/bgq to ./obj/bgq/config
configure: mirroring ./frame to ./obj/bgq/frame
configure: creating makefile fragment in ./config/bgq
configure: creating makefile fragment in ./config/bgq/kernels
configure: creating makefile fragment in ./config/bgq/kernels/1
configure: creating makefile fragment in ./config/bgq/kernels/1f
configure: creating makefile fragment in ./config/bgq/kernels/3
configure: creating makefile fragment in ./frame
configure: creating makefile fragment in ./frame/0
configure: creating makefile fragment in ./frame/0/absqsc
configure: creating makefile fragment in ./frame/0/addsc
configure: creating makefile fragment in ./frame/0/copysc
configure: creating makefile fragment in ./frame/0/divsc
configure: creating makefile fragment in ./frame/0/getsc
configure: creating makefile fragment in ./frame/0/mulsc
configure: creating makefile fragment in ./frame/0/normfsc
configure: creating makefile fragment in ./frame/0/setsc
configure: creating makefile fragment in ./frame/0/sqrtsc
configure: creating makefile fragment in ./frame/0/subsc
configure: creating makefile fragment in ./frame/0/unzipsc
configure: creating makefile fragment in ./frame/0/zipsc
configure: creating makefile fragment in ./frame/1
configure: creating makefile fragment in ./frame/1/addv
configure: creating makefile fragment in ./frame/1/axpyv
configure: creating makefile fragment in ./frame/1/copyv
configure: creating makefile fragment in ./frame/1/dotv
configure: creating makefile fragment in ./frame/1/dotxv
configure: creating makefile fragment in ./frame/1/invertv
configure: creating makefile fragment in ./frame/1/packv
configure: creating makefile fragment in ./frame/1/scal2v
configure: creating makefile fragment in ./frame/1/scalv
configure: creating makefile fragment in ./frame/1/setv
configure: creating makefile fragment in ./frame/1/subv
configure: creating makefile fragment in ./frame/1/swapv
configure: creating makefile fragment in ./frame/1/unpackv
configure: creating makefile fragment in ./frame/1d
configure: creating makefile fragment in ./frame/1d/addd
configure: creating makefile fragment in ./frame/1d/axpyd
configure: creating makefile fragment in ./frame/1d/copyd
configure: creating makefile fragment in ./frame/1d/invertd
configure: creating makefile fragment in ./frame/1d/scal2d
configure: creating makefile fragment in ./frame/1d/scald
configure: creating makefile fragment in ./frame/1d/setd
configure: creating makefile fragment in ./frame/1d/subd
configure: creating makefile fragment in ./frame/1f
configure: creating makefile fragment in ./frame/1f/axpy2v
configure: creating makefile fragment in ./frame/1f/axpyf
configure: creating makefile fragment in ./frame/1f/dotaxpyv
configure: creating makefile fragment in ./frame/1f/dotxaxpyf
configure: creating makefile fragment in ./frame/1f/dotxf
configure: creating makefile fragment in ./frame/1m
configure: creating makefile fragment in ./frame/1m/addm
configure: creating makefile fragment in ./frame/1m/axpym
configure: creating makefile fragment in ./frame/1m/copym
configure: creating makefile fragment in ./frame/1m/packm
configure: creating makefile fragment in ./frame/1m/packm/ukernels
configure: creating makefile fragment in ./frame/1m/scal2m
configure: creating makefile fragment in ./frame/1m/scalm
configure: creating makefile fragment in ./frame/1m/setm
configure: creating makefile fragment in ./frame/1m/subm
configure: creating makefile fragment in ./frame/1m/unpackm
configure: creating makefile fragment in ./frame/1m/unpackm/ukernels
configure: creating makefile fragment in ./frame/2
configure: creating makefile fragment in ./frame/2/gemv
configure: creating makefile fragment in ./frame/2/ger
configure: creating makefile fragment in ./frame/2/hemv
configure: creating makefile fragment in ./frame/2/her
configure: creating makefile fragment in ./frame/2/her2
configure: creating makefile fragment in ./frame/2/symv
configure: creating makefile fragment in ./frame/2/syr
configure: creating makefile fragment in ./frame/2/syr2
configure: creating makefile fragment in ./frame/2/trmv
configure: creating makefile fragment in ./frame/2/trsv
configure: creating makefile fragment in ./frame/3
configure: creating makefile fragment in ./frame/3/gemm
configure: creating makefile fragment in ./frame/3/gemm/3m
configure: creating makefile fragment in ./frame/3/gemm/3m/ukernels
configure: creating makefile fragment in ./frame/3/gemm/4m
configure: creating makefile fragment in ./frame/3/gemm/4m/ukernels
configure: creating makefile fragment in ./frame/3/gemm/ukernels
configure: creating makefile fragment in ./frame/3/hemm
configure: creating makefile fragment in ./frame/3/hemm/3m
configure: creating makefile fragment in ./frame/3/hemm/4m
configure: creating makefile fragment in ./frame/3/her2k
configure: creating makefile fragment in ./frame/3/her2k/3m
configure: creating makefile fragment in ./frame/3/her2k/4m
configure: creating makefile fragment in ./frame/3/herk
configure: creating makefile fragment in ./frame/3/herk/3m
configure: creating makefile fragment in ./frame/3/herk/4m
configure: creating makefile fragment in ./frame/3/symm
configure: creating makefile fragment in ./frame/3/symm/3m
configure: creating makefile fragment in ./frame/3/symm/4m
configure: creating makefile fragment in ./frame/3/syr2k
configure: creating makefile fragment in ./frame/3/syr2k/3m
configure: creating makefile fragment in ./frame/3/syr2k/4m
configure: creating makefile fragment in ./frame/3/syrk
configure: creating makefile fragment in ./frame/3/syrk/3m
configure: creating makefile fragment in ./frame/3/syrk/4m
configure: creating makefile fragment in ./frame/3/trmm
configure: creating makefile fragment in ./frame/3/trmm/3m
configure: creating makefile fragment in ./frame/3/trmm/4m
configure: creating makefile fragment in ./frame/3/trmm3
configure: creating makefile fragment in ./frame/3/trmm3/3m
configure: creating makefile fragment in ./frame/3/trmm3/4m
configure: creating makefile fragment in ./frame/3/trsm
configure: creating makefile fragment in ./frame/3/trsm/3m
configure: creating makefile fragment in ./frame/3/trsm/3m/ukernels
configure: creating makefile fragment in ./frame/3/trsm/4m
configure: creating makefile fragment in ./frame/3/trsm/4m/ukernels
configure: creating makefile fragment in ./frame/3/trsm/ukernels
configure: creating makefile fragment in ./frame/base
configure: creating makefile fragment in ./frame/base/check
configure: creating makefile fragment in ./frame/base/noopt
configure: creating makefile fragment in ./frame/cntl
configure: creating makefile fragment in ./frame/compat
configure: creating makefile fragment in ./frame/compat/check
configure: creating makefile fragment in ./frame/compat/f2c
configure: creating makefile fragment in ./frame/compat/f2c/util
configure: creating makefile fragment in ./frame/include
configure: creating makefile fragment in ./frame/include/level0
configure: creating makefile fragment in ./frame/include/level0/ri
configure: creating makefile fragment in ./frame/include/level0/ri3
configure: creating makefile fragment in ./frame/util
configure: creating makefile fragment in ./frame/util/amaxv
configure: creating makefile fragment in ./frame/util/asumv
configure: creating makefile fragment in ./frame/util/mkherm
configure: creating makefile fragment in ./frame/util/mksymm
configure: creating makefile fragment in ./frame/util/mktrim
configure: creating makefile fragment in ./frame/util/norm1m
configure: creating makefile fragment in ./frame/util/norm1v
configure: creating makefile fragment in ./frame/util/normfm
configure: creating makefile fragment in ./frame/util/normfv
configure: creating makefile fragment in ./frame/util/normim
configure: creating makefile fragment in ./frame/util/normiv
configure: creating makefile fragment in ./frame/util/printm
configure: creating makefile fragment in ./frame/util/printv
configure: creating makefile fragment in ./frame/util/randm
configure: creating makefile fragment in ./frame/util/randv
configure: creating makefile fragment in ./frame/util/sumsqv
configure: configured to build within top-level directory of source distribution.
Compiling config/bgq/kernels/1/bli_axpyv_opt_var1.c (NOTE: using flags for kernels)
Compiling frame/0/unzipsc/bli_unzipsc.c
Compiling frame/0/unzipsc/bli_unzipsc_check.c
Compiling frame/0/unzipsc/bli_unzipsc_unb_var1.c
Compiling frame/0/zipsc/bli_zipsc.c
Compiling frame/0/zipsc/bli_zipsc_check.c
Compiling frame/0/zipsc/bli_zipsc_unb_var1.c
Compiling frame/1/addv/bli_addv.c
Compiling frame/1/addv/bli_addv_check.c
Compiling frame/1/addv/bli_addv_kernel.c
Compiling frame/1/addv/bli_addv_ref.c
Compiling frame/1/axpyv/bli_axpyv.c
Compiling frame/1/axpyv/bli_axpyv_check.c
Compiling frame/1/axpyv/bli_axpyv_kernel.c
Compiling frame/1/axpyv/bli_axpyv_ref.c
Compiling frame/1/copyv/bli_copyv.c
Compiling frame/1/copyv/bli_copyv_check.c
Compiling frame/1/copyv/bli_copyv_kernel.c
Compiling frame/1/copyv/bli_copyv_ref.c
Compiling frame/1/dotv/bli_dotv.c
Compiling frame/1/dotv/bli_dotv_check.c
Compiling frame/1/dotv/bli_dotv_kernel.c
Compiling frame/1/dotv/bli_dotv_ref.c
Compiling frame/1/dotxv/bli_dotxv.c
Compiling frame/1/dotxv/bli_dotxv_check.c
Compiling frame/1/dotxv/bli_dotxv_kernel.c
Compiling frame/1/dotxv/bli_dotxv_ref.c
Compiling frame/1/invertv/bli_invertv.c
Compiling frame/1/invertv/bli_invertv_check.c
Compiling frame/1/invertv/bli_invertv_kernel.c
Compiling frame/1/invertv/bli_invertv_ref.c
Compiling frame/1/packv/bli_packv.c
Compiling frame/1/packv/bli_packv_check.c
Compiling frame/1/packv/bli_packv_cntl.c
Compiling frame/1/packv/bli_packv_init.c
"config/bgq/kernels/1/bli_axpyv_opt_var1.c", line 45.33: 1506-1418 (E) Assignment between restrict pointers "alpha" and "alpha_in" is not allowed. Only outer-to-inner scope assignments between restrict pointers are allowed.
"config/bgq/kernels/1/bli_axpyv_opt_var1.c", line 46.29: 1506-1418 (E) Assignment between restrict pointers "x" and "x_in" is not allowed. Only outer-to-inner scope assignments between restrict pointers are allowed.
"config/bgq/kernels/1/bli_axpyv_opt_var1.c", line 47.29: 1506-1418 (E) Assignment between restrict pointers "y" and "y_in" is not allowed. Only outer-to-inner scope assignments between restrict pointers are allowed.
"config/bgq/kernels/1/bli_axpyv_opt_var1.c", line 68.28: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
make: *** [obj/bgq/config/kernels/1/bli_axpyv_opt_var1.o] Error 1
make: *** Waiting for unfinished jobs....

use generic paths for toolchain in POWER7

[jhammond@ftlogin2 git]$ cat 0002-generic-gcc-path-instead-of-something-at-IBM-Austin.patch 0003-generic-gcc-path-instead-of-something-at-IBM-Austin.patch 
From f02aca90c2c045c3ed7573ff5bb8a82b3e45938b Mon Sep 17 00:00:00 2001
From: Jeff Hammond <[email protected]>
Date: Mon, 31 Mar 2014 21:53:56 +0000
Subject: [PATCH 2/5] generic gcc path instead of something at IBM Austin

---
 kernels/power7/3/test/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernels/power7/3/test/Makefile b/kernels/power7/3/test/Makefile
index 356cde9d..15f27b81 100644
--- a/kernels/power7/3/test/Makefile
+++ b/kernels/power7/3/test/Makefile
@@ -1,5 +1,5 @@

-CC = /opt/at6.0/bin/powerpc64-linux-gcc
+CC = gcc
 TARGET_ARCH = -m64 -mvsx

 TGTS   = exp
-- 
1.9.1

From edd5efef2508cf3623d55dd47bb92159ebb2ee34 Mon Sep 17 00:00:00 2001
From: Jeff Hammond <[email protected]>
Date: Mon, 31 Mar 2014 21:54:15 +0000
Subject: [PATCH 3/5] generic gcc path instead of something at IBM Austin

---
 config/power7/make_defs.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/config/power7/make_defs.mk b/config/power7/make_defs.mk
index df3eb363..de3f05b5 100644
--- a/config/power7/make_defs.mk
+++ b/config/power7/make_defs.mk
@@ -76,7 +76,7 @@ GIT_LOG    := $(GIT) log --decorate
 #

 # --- Determine the C compiler and related flags ---
-CC             := /opt/at6.0/bin/powerpc64-linux-gcc
+CC             := gcc
 # Enable IEEE Standard 1003.1-2004 (POSIX.1d). 
 # NOTE: This is needed to enable posix_memalign().
 CPPROCFLAGS    := -D_POSIX_C_SOURCE=200112L
@@ -96,7 +96,7 @@ CFLAGS_KERNELS := $(CDBGFLAGS) $(CKOPTFLAGS) $(CVECFLAGS) $(CWARNFLAGS) $(CMISCF
 CFLAGS_NOOPT   := $(CDBGFLAGS)                            $(CWARNFLAGS) $(CMISCFLAGS) $(CPPROCFLAGS)

 # --- Determine the archiver and related flags ---
-AR             := /opt/at6.0/bin/powerpc64-linux-ar
+AR             := ar
 ARFLAGS        := cru

 # --- Determine the linker and related flags ---
-- 
1.9.1

OMP problem with Intel compiler

Compiling frame/base/bli_threading_omp.c
frame/base/bli_threading_omp.c: In function 'bli_barrier':
frame/base/bli_threading_omp.c:88: error: expected end of line before 'capture'
frame/base/bli_threading_omp.c:89: error: invalid operator for '#pragma omp atomic' before '=' token
make: *** [obj/sandybridge/frame/base/bli_threading_omp.o] Error 1

I know that earlier Intel compilers did not support the full OMP 3.1 standard, but this is Intel 15.
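
The two errors above suggest that the compiler processing this file did not accept the `capture` clause of `#pragma omp atomic`, which was introduced in OpenMP 3.1. As a minimal sketch (not the actual code in bli_threading_omp.c), a fetch-and-increment can be guarded on the `_OPENMP` version macro and fall back to a critical section on older implementations. The function name `demo_fetch_and_increment` and the critical-section name are hypothetical.

```c
/* Hypothetical helper: atomically returns the old value of *counter and
 * increments it. Uses "atomic capture" only when the compiler advertises
 * OpenMP 3.1 (_OPENMP == 201107) or later. */
int demo_fetch_and_increment(int *counter)
{
    int old;
#if defined(_OPENMP) && (_OPENMP >= 201107)
    /* OpenMP 3.1 or later: structured-block form of atomic capture. */
    #pragma omp atomic capture
    { old = *counter; *counter += 1; }
#elif defined(_OPENMP)
    /* Older OpenMP: fall back to a named critical section. */
    #pragma omp critical (demo_counter)
    { old = *counter; *counter += 1; }
#else
    /* No OpenMP at all: plain sequential update. */
    old = *counter;
    *counter += 1;
#endif
    return old;
}
```

Checking the value of `_OPENMP` that the compiler actually defines (for example by printing it from a small test program) may help confirm which branch a given build takes.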

Configuring maximum number of threads at runtime

A user of Elemental has been running into strange performance issues when building on top of BLIS. The cause appears to be that setting the environment variable OMP_NUM_THREADS to one does not prevent a large performance degradation when the number of independent uses of BLIS times the number of configured threads is larger than the number of cores on the machine.

While they have configured BLIS to use OpenMP with 16 threads, it is often preferred when using the library from within an MPI environment to be able to disable threading at runtime (or at least, decrease the number of threads).

What environment variables should be set to one to have the same effect as exporting OMP_NUM_THREADS=1 for other BLAS libraries? I would humbly suggest either adding support for OMP_NUM_THREADS or documenting on the wiki the full list of variables (including BLIS_IR_NT) that need to be configured to have the same effect.
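
To illustrate the request, here is a minimal sketch of how a library could read a thread count from an environment variable at runtime and fall back to its configured default. This is not BLIS's actual mechanism: the function names, the validation rules, and the upper bound of 1024 are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a thread-count string. Returns `fallback`
 * when the string is missing, empty, malformed, or out of range. */
int demo_parse_thread_count(const char *value, int fallback)
{
    char *end = NULL;
    long  n;

    if (value == NULL || *value == '\0')
        return fallback;

    errno = 0;
    n = strtol(value, &end, 10);

    /* Reject trailing garbage, overflow, and non-positive counts.
     * The limit of 1024 is an arbitrary sanity bound for this sketch. */
    if (errno != 0 || *end != '\0' || n < 1 || n > 1024)
        return fallback;

    return (int)n;
}

/* Hypothetical helper: look up `name` (for example "OMP_NUM_THREADS")
 * and use `fallback` if it is unset or invalid. */
int demo_threads_from_env(const char *name, int fallback)
{
    return demo_parse_thread_count(getenv(name), fallback);
}
```

With such a helper, `demo_threads_from_env("OMP_NUM_THREADS", 16)` would yield 1 under `export OMP_NUM_THREADS=1` and 16 when the variable is unset.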

BLIS should allow simultaneously exporting both 32- and 64-bit variants of BLAS/CBLAS

The de facto standard is that the standard BLAS/CBLAS functions take 32-bit integers in their API. Julia experimented with changing this so that they could use 64-bit integers in their main BLAS wrappers, and this worked great for a little while until they discovered that when people started trying to link in other existing BLAS-using code, this code was assuming that BLAS uses 32-bit integers and was causing segfaults. Their solution was to continue to use a 64-bit integer version of BLAS, but with symbols renamed to avoid collisions (so e.g. dgemm_ uses 32-bit integers, and dgemm_64_ uses 64-bit integers... [edited to get the 64-bit symbol names correct])

As mentioned in #37 (comment), it would be great if a single BLIS library could export both 32- and 64-bit versions of these symbols simultaneously. It doesn't look like this would be too hard, since the BLAS2BLIS interface is already generated using C preprocessor magic and the CBLAS wrapper is already being programmatically patched...
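
As a rough sketch of the preprocessor approach, one macro can stamp out the same Fortran-style wrapper once per integer width, giving each instance its own symbol name. The macro `DEMO_GEN_DSCAL` and the symbols `demo_dscal_` / `demo_dscal_64_` are hypothetical and only mirror the `dgemm_` / `dgemm_64_` naming described above; they are not part of BLIS.

```c
#include <stdint.h>

/* Hypothetical generator: defines a Fortran-style (pass-by-pointer)
 * vector-scaling routine named `fname` whose integer arguments have
 * type `int_type`. */
#define DEMO_GEN_DSCAL(fname, int_type)                          \
void fname(const int_type *n, const double *alpha,               \
           double *x, const int_type *incx)                      \
{                                                                \
    int_type i;                                                  \
    if (*n <= 0 || *incx <= 0) return;                           \
    for (i = 0; i < *n; ++i)                                     \
        x[i * *incx] *= *alpha;                                  \
}

/* Both variants live in the same library without colliding. */
DEMO_GEN_DSCAL(demo_dscal_,    int32_t)  /* 32-bit integer interface */
DEMO_GEN_DSCAL(demo_dscal_64_, int64_t)  /* 64-bit integer interface */
```

Code that assumes 32-bit integers keeps calling the unsuffixed symbol, while callers that want 64-bit integers opt in through the `_64_` name.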

Fork safety

The combination of thread pools + fork is unfortunate: it tends to lead to random freezes. Unfortunately, the best way to achieve high-level parallelism in Python is to use fork (via the multiprocessing module).

Fundamentally the problem is that if you spawn a thread pool, and then fork, then the child ends up thinking that it has a thread pool, and dispatching work to it, but there aren't actually any threads running. This doesn't end well.

When using OMP for threading, dealing with this is the responsibility of the OMP implementation, and out-of-scope for BLIS itself. But when using pthreads, this should be handled by using pthread_atfork to register a pre-fork callback that shuts down the thread pool.

The equivalent issue in OpenBLAS was fixed in OpenMathLib/OpenBLAS#343, specifically with this code (which should probably actually be called openblas_install_fork_handler...):

+void openblas_fork_handler()
+{
+  // This handler shuts down the OpenBLAS-managed PTHREAD pool when OpenBLAS is
+  // built with "make USE_OPENMP=0".
+  // Hanging can still happen when OpenBLAS is built against the libgomp
+  // implementation of OpenMP. The problem is tracked at:
+  //   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035
+  // In the mean time build with USE_OPENMP=0 or link against another
+  // implementation of OpenMP.
+#ifndef OS_WINDOWS
+  int err;
+  err = pthread_atfork (BLASFUNC(blas_thread_shutdown), NULL, NULL);
+  if(err != 0)
+    openblas_warning(0, "OpenBLAS Warning ... cannot install fork handler. You may meet hang after fork.\n");
+#endif
+}

The attached test case can also probably be re-used with trivial tweaks to use the BLIS API instead of cblas: https://github.com/ogrisel/OpenBLAS/blob/49bd98f410369c9604031296f8ff47c5c20052bb/utest/test_fork.c
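
For comparison, a pthreads-based thread pool in another library could register a similar pre-fork callback along the following lines. This is only a sketch under stated assumptions: the pool here is reduced to a single flag, and every `demo_` name is hypothetical rather than an existing BLIS function. Only `pthread_atfork` itself is a real POSIX call.

```c
#include <pthread.h>
#include <stdio.h>

/* Stand-in for real pool state: nonzero while worker threads exist. */
static int demo_pool_running = 0;

void demo_pool_start(void)      { demo_pool_running = 1; }
int  demo_pool_is_running(void) { return demo_pool_running; }

/* Pre-fork callback: a real pool would signal and join its workers here,
 * so that the child process never inherits a pool with no threads. */
void demo_pool_shutdown(void)   { demo_pool_running = 0; }

/* Register the shutdown routine as the "prepare" handler, which runs in
 * the parent just before fork(). Returns 0 on success. */
int demo_install_fork_handler(void)
{
    int err = pthread_atfork(demo_pool_shutdown, NULL, NULL);
    if (err != 0)
        fprintf(stderr, "warning: cannot install fork handler\n");
    return err;
}
```

Because the pool is shut down before every fork, the parent would need to restart it lazily on the next call that requires threads.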

reg-BLAS.R test fails

Not being a programmer, I'd like to ask: what might the problem be?

Type 'q()' to quit R.

PR#4582 %*% with NAs

stopifnot(is.na(NA %*% 0), is.na(0 %*% NA))

depended on the BLAS in use.

found from fallback test in slam 0.1-15

most likely indicates an inaedquate BLAS.

x <- matrix(c(1, 0, NA, 1), 2, 2)
y <- matrix(c(1, 0, 0, 2, 1, 0), 3, 2)
(z <- tcrossprod(x, y))
[,1] [,2] [,3]
[1,] NA NA 0
[2,] 2 1 0
stopifnot(identical(z, x %*% t(y)))
Error: identical(z, x %*% t(y)) is not TRUE

sandybridge segfault with 32-bit dim_t

The culprit seems to be the load of k in the micro-kernel which is explicitly a movq. Changing type of k_iter and k_next to [u?]int64_t should fix it.
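
The suggested fix can be sketched as follows: compute the loop counts into 64-bit locals so that an 8-byte load (such as `movq`) reads a full, well-defined value even when `dim_t` is 32 bits wide. The type and function names, the field name `k_left`, and the unroll factor of 4 are assumptions for illustration, not taken from the actual micro-kernel.

```c
#include <stdint.h>

typedef int32_t demo_dim_t;   /* stand-in for a 32-bit dim_t build */

typedef struct {
    uint64_t k_iter;          /* number of unrolled iterations */
    uint64_t k_left;          /* leftover iterations */
} demo_k_counts_t;

/* Widen k to 64 bits before deriving the counters handed to assembly.
 * If the counters were 32-bit, an 8-byte load would also read 4 bytes
 * of whatever happens to sit next to them in memory. */
demo_k_counts_t demo_split_k(demo_dim_t k)
{
    demo_k_counts_t c;
    uint64_t k64 = (k > 0) ? (uint64_t)k : 0;

    c.k_iter = k64 / 4;       /* unroll factor 4 is assumed */
    c.k_left = k64 % 4;
    return c;
}
```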

Problems about specifying compiler with configure

Hi, I'm trying to compile BLIS on my server, which has a Xeon E5-2620v2 CPU, and I want to use ICC 2016. I used the following command to configure:
./configure CC=icc sandybridge
But when I ran make, it reported:
config/sandybridge/make_defs.mk:84: *** gcc is required for this configuration.. Stop.
It seems that the file in config/sandybridge did not change.
Is there anything wrong with my command? Please help, many thanks!

bli_obj_create_with_attached_buffer mishandling empty matrices

The BLAS interface does not appear to work for dtrsm when one of the input matrices has a zero dimension. It appears to be the result of bli_trsm calling bli_obj_create_with_attached_buffer on the input matrices, which leads to a check that improperly aborts if the corresponding buffer is null, even if the matrix dimensions were zero. I would assume that this bug affects a large number of routines.
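
A check that tolerates this case could look like the sketch below: a null buffer is rejected only when the matrix actually has elements. The function name is hypothetical and this is not the existing BLIS check, just an illustration of the intended rule.

```c
#include <stddef.h>

/* Hypothetical validation: returns 1 if the (buffer, m, n) combination is
 * acceptable and 0 otherwise. An empty matrix (m == 0 or n == 0) never
 * dereferences its buffer, so NULL is allowed in that case. */
int demo_buffer_is_acceptable(const void *buffer, long m, long n)
{
    if (m < 0 || n < 0)
        return 0;             /* negative dimensions are always invalid */
    if (m == 0 || n == 0)
        return 1;             /* empty matrix: buffer is irrelevant */
    return buffer != NULL;    /* non-empty matrix needs real storage */
}
```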

More compilation error on BG/Q

I got another error when attempting to compile the latest version (0.2) for BG/Q.

(1) Undeclared identifier bli_daxpyf_fusefac
Compiling ../config/bgq/kernels/1f/bli_axpyf_opt_var1.c (NOTE: using flags for kernels)
"../config/bgq/kernels/1f/bli_axpyf_opt_var1.c", line 57.20: 1506-045 (S) Undeclared identifier bli_daxpyf_fusefac.
"../config/bgq/kernels/1f/bli_axpyf_opt_var1.c", line 64.39: 1506-098 (E) Missing argument(s).
make: *** [obj/bgq/config/kernels/1f/bli_axpyf_opt_var1.o] Error 1

in file ./config/bgq/kernels/1f/bli_axpyf_opt_var1.c, line 57

if ( b_n < PASTEMAC(d,axpyf_fusefac) || inca != 1 || incx != 1 || incy != 1 || bli_is_unaligned_to( a, 32 ) || bli_is_unaligned_to( y, 32 ) )
use_ref = TRUE;

I cannot find where "axpyf_fusefac" is defined, so I simply commented out this line in order to call the reference DAXPYF function.

(2) More errors in bli_gemm_int_8x8.c
But after I commented out line 57, I got a bunch of error messages in bli_gemm_int_8x8.c:

Compiling ../config/bgq/kernels/3/bli_gemm_int_8x8.c (NOTE: using flags for kernels)
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 133.28: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 134.28: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
...
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 231.28: 1506-045 (S) Undeclared identifier c_z.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 262.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 263.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 264.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 265.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 267.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 268.14: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 302.19: 1506-196 (S) Initialization between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 303.19: 1506-196 (S) Initialization between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 304.18: 1506-196 (S) Initialization between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 305.18: 1506-196 (S) Initialization between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 362.5: 1506-068 (S) Operation between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 362.5: 1506-068 (S) Operation between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 362.5: 1506-068 (S) Operation between types "double" and "struct {...}" is not allowed.
"../config/bgq/kernels/3/bli_gemm_int_8x8.c", line 362.5: 1506-068 (S) Operation between types "double" and "struct {...}" is not allowed.
...

Sandybridge OpenMP Failure

Hi,

I compiled BLIS on my server (CentOS 7.1, gcc 4.8.5, 2*Xeon E5-2620v2), but it seems that BLIS uses only a single thread. I used the commands

./configure sandybridge
make -j
make install

to compile and install. I didn't change any file in config/sandybridge. I checked that

#define BLIS_ENABLE_OPENMP

is uncommented in bli_config.h, so it should be able to use OpenMP. Then I used the makefile from the wiki/BuildSystem#linking-against-blis section and added the -fopenmp flag to compile the test program. However, the program is single-threaded. I also tried the script below to test the program, but it still failed to use OpenMP:

export OMP_NUM_THREADS=12
make -f BLIS-Makefile
./testBLIS.x 

I also tried to use

#define BLIS_ENABLE_PTHREAD

instead of the OpenMP setting, but it still fails.

What should I do to use OpenMP?
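
One quick sanity check, sketched below, is to have the test program report whether it was itself compiled with OpenMP and how many threads the OpenMP runtime would use. This only inspects the application's own build, not what libblis does internally, and the `demo_` function names are hypothetical. `_OPENMP` and `omp_get_max_threads()` are standard OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if this translation unit was compiled with OpenMP enabled
 * (for example via -fopenmp), and 0 otherwise. */
int demo_compiled_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Returns the thread count the OpenMP runtime would use for a parallel
 * region, which reflects OMP_NUM_THREADS; returns 1 without OpenMP. */
int demo_max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

If `demo_max_threads()` reports 12 under `export OMP_NUM_THREADS=12` but the run is still single-threaded, the environment is probably reaching the process and the question shifts to how the library itself chooses its thread counts.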

bli_gemm_8x8.h?

  1. I attempted to build BLIS version 0.2.0 on bgq, but I found this error:
    "./config/bgq/bli_kernel.h", line 174.10: 1506-296 (S) #include file "bli_gemm_8x8.h" not found.

I checked the file "bli_kernel.h" for bgq, and line 174 contains

#include "bli_gemm_8x8.h"

but I cannot locate this header file anywhere in the package.

  2. For my previous ticket "BLIS Test Failure in BlueGene/Q #34", is there any follow-up? It seems that BLIS is still not working correctly for all complex test cases.

  3. Are there any instructions for building LAPACK to work with BLIS?

Thanks!

4mh failures with icc

The following operations fail with the dunnington and sandybridge configurations (and probably haswell too, but not tested) using icc 16.0.1 and compiling with CVECOPTS = '-xSSE4.2' and '-xAVX' respectively:

cgemm4mh
chemm4mh
csymm4mh
csyrk4mh
csyr2k4mh
ctrmm34mh

configure does not obviously fail if non-existent configuration is used

For example, if one runs configure knc instead of configure mic, the result is:

[dam879@stampede knc]$~/src/blis/configure -p `pwd`/install knc
configure: checking whether we need to update the version file.
configure: checking version file '/home1/02742/dam879/src/blis/version'.
configure: starting configuration of BLIS 0.2.0.
configure: manual configuration requested.
configure: configuring with 'knc' configuration sub-directory.
configure: using install prefix '/home1/02742/dam879/build/blis/knc/install'.
configure: debug symbols disabled.
configure: disabling verbose make output, enable with 'make V=1'.
configure: building BLIS as a static library.
configure: threading is disabled.
configure: the CBLAS compatibility layer is disabled.
configure: the BLAS compatibility layer is enabled.
configure: the internal integer size is automatically determined.
configure: the BLAS/CBLAS interface integer size is 32-bit.
configure: creating ./config.mk from /home1/02742/dam879/src/blis/build/config.mk.in
configure: creating ./bli_config.h from /home1/02742/dam879/src/blis/build/bli_config.h.in
configure: creating ./obj/knc
configure: creating ./obj/knc/config
configure: creating ./obj/knc/frame
configure: creating ./obj/knc/testsuite
configure: creating ./lib/knc
configure: mirroring /home1/02742/dam879/src/blis/config/knc to ./obj/knc/config
ls: cannot access /home1/02742/dam879/src/blis/config/knc: No such file or directory
configure: mirroring /home1/02742/dam879/src/blis/frame to ./obj/knc/frame
configure: creating makefile fragment in /home1/02742/dam879/src/blis/config/knc
ls: cannot access /home1/02742/dam879/src/blis/config/knc: No such file or directory
ls: cannot access /home1/02742/dam879/src/blis/config/knc: No such file or directory
/home1/02742/dam879/src/blis/build/gen-make-frags/gen-make-frag.sh: line 230: /home1/02742/dam879/src/blis/config/knc/.fragment.mk: No such file or directory
ls: cannot access /home1/02742/dam879/src/blis/config/knc: No such file or directory
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/0
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/0/copysc
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1/kernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1/packv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1/scalv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1/unpackv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1d
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1f
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1f/kernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m/packm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m/packm/ukernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m/scalm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m/unpackm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/1m/unpackm/ukernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/gemv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/ger
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/hemv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/her
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/her2
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/symv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/syr
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/syr2
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/trmv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/2/trsv
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/gemm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/gemm/ind
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/hemm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/her2k
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/herk
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/symm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/syr2k
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/syrk
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/trmm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/trmm3
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/trsm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/3/ukernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/base
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/base/check
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/base/noopt
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/cntl
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/cblas
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/cblas/f77_sub
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/cblas/src
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/check
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/f2c
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/compat/f2c/util
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/io
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/ri
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/ri3
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/rih
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/ro
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/include/level0/rpi
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/cntx
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/include
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/oapi
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/tapi
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/ukernels
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/ukernels/gemm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/ind/ukernels/trsm
configure: creating makefile fragment in /home1/02742/dam879/src/blis/frame/util
configure: creating symbolic link to Makefile.
configure: creating symbolic link to common.mk.
configure: configured to build outside of source distribution.

Without looking carefully this seems to indicate a successful configuration. Failure in this case should be quick and obvious.

Shared library versions of BLIS?

Is there a simple means of modifying BLIS to build a shared library instead of a static library? It seems to be missing from the configure script.

Constants in bli_const.c "undefined" on OSX

It seems that the linker on OSX is not able to find any of the constants defined in bli_const.c, such as BLIS_ONE. The problem, according to this page, is that uninitialized constants end up as "common" symbols, which are ignored by the OSX linker. The two available solutions seem to be:

  1. Initialize the variables. Since these are obj_t's, default-initialization with ...={} should be OK (and they are initialized for real later).
  2. Compile with -fno-common. This would have to be added to each configuration or in common.mk.

I am not sure when this problem first appeared, since I thought I had successfully compiled after the "big commit", but I am seeing it in my branch based off of cbcd0b7. Using gcc-5.3.0 from Homebrew instead of the system "gcc" may also be a contributing factor.
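
The first option can be sketched as below. The struct is a simplified stand-in for `obj_t` and the `DEMO_` names are hypothetical. Note that `{ 0 }` is used rather than `{}`: an empty initializer list is, to my knowledge, a compiler extension before C23, whereas `{ 0 }` is valid in ISO C99.

```c
#include <stddef.h>

/* Simplified stand-in for an object type such as obj_t. */
typedef struct {
    void *buffer;
    int   datatype;
} demo_obj_t;

/* Tentative definition: with no initializer, some toolchains emit this
 * as a "common" symbol unless -fno-common is in effect. */
demo_obj_t DEMO_TENTATIVE;

/* Explicit initializer: this is a full definition, so the symbol is
 * placed in a regular data section regardless of -fcommon. The real
 * values can still be filled in later at library initialization. */
demo_obj_t DEMO_ONE = { 0 };
```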

pthreads does not compile on gcc/linux

Commands:

 ./configure -t pthreads sandybridge
make

First error is:

In file included from ./frame/base/bli_threading.h:88:0,
                 from ./frame/include/blis.h:73,
                 from config/sandybridge/kernels/3/bli_gemm_asm_d8x4.c:38:
./frame/base/bli_threading_pthreads.h:43:13: error: conflicting types for 'pthread_barrierattr_t'
 typedef int pthread_barrierattr_t;
             ^
In file included from /usr/include/pthread.h:26:0,
                 from ./frame/base/bli_threading_pthreads.h:40,
                 from ./frame/base/bli_threading.h:88,
                 from ./frame/include/blis.h:73,
                 from config/sandybridge/kernels/3/bli_gemm_asm_d8x4.c:38:
/usr/include/x86_64-linux-gnu/bits/pthreadtypes.h:249:3: note: previous declaration of 'pthread_barrierattr_t' was here
 } pthread_barrierattr_t;
   ^

versions:

  • gcc version 5.3.1 20160409 (Debian 5.3.1-14)
  • blis dd62080
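
The error shows a library header defining its own `pthread_barrierattr_t` on a system whose `<pthread.h>` already provides one. One way to avoid this kind of clash, sketched here under the assumption that a fallback barrier is wanted at all, is to give the fallback a library-specific prefix so it can never collide with system types. All `demo_` names are hypothetical, and this is not the code in bli_threading_pthreads.h.

```c
#include <pthread.h>

/* Prefixed barrier type built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    unsigned        n_threads;   /* threads expected at the barrier */
    unsigned        arrived;     /* threads that have arrived so far */
    unsigned        generation;  /* incremented each time the barrier opens */
} demo_barrier_t;

/* Returns 0 on success, -1 on failure or if n_threads is zero. */
int demo_barrier_init(demo_barrier_t *b, unsigned n_threads)
{
    if (n_threads == 0)
        return -1;
    b->n_threads  = n_threads;
    b->arrived    = 0;
    b->generation = 0;
    if (pthread_mutex_init(&b->lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&b->cond, NULL) != 0) {
        pthread_mutex_destroy(&b->lock);
        return -1;
    }
    return 0;
}

/* Blocks until n_threads callers have arrived. Returns 1 in the last
 * thread to arrive and 0 in all others. */
int demo_barrier_wait(demo_barrier_t *b)
{
    int      last = 0;
    unsigned gen;

    pthread_mutex_lock(&b->lock);
    gen = b->generation;
    if (++b->arrived == b->n_threads) {
        b->arrived = 0;
        b->generation++;
        last = 1;
        pthread_cond_broadcast(&b->cond);
    } else {
        /* The generation check guards against spurious wakeups. */
        while (gen == b->generation)
            pthread_cond_wait(&b->cond, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
    return last;
}
```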

multi thread crash

Hello,
It seems libblis is not thread safe. I have the following gemm invocation:

static inline void matmul(cv::Mat &c, cv::Mat &a, cv::Mat b)
{
        float   alphap = 1.0;
        //since beta is zero, we don't need to init c to zero
        float   betap = 0.0;

        cntx_t cntx;
        bli_gemm_cntx_init(&cntx);

        bli_sgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, a.rows, b.cols, a.cols,
                &alphap,
                (float *)a.data, a.cols, 1,
                (float *)b.data, b.cols, 1,
                &betap,
                (float *)c.data, b.cols, 1, &cntx);
        bli_gemm_cntx_finalize(&cntx);
}

Several threads invoke matmul concurrently, and it then crashes as shown below: the parameter p has been changed to 0. If I change the program so that only one thread invokes matmul, everything works.
#0 0x0000000000545abc in bli_spackm_6xk_ref (conja=BLIS_NO_CONJUGATE, n=25, kappa=0x7fffec000dc0, a=0x7fffb8091540, inca=25, lda=1, p=0x0, ldp=6)

at frame/1m/packm/ukernels/bli_packm_cxk_ref.c:414

#1 0x00000000004ee461 in bli_spackm_cxk (conja=BLIS_NO_CONJUGATE, panel_dim=6, panel_len=25, kappa=0x7fffec000dc0, a=0x7fffb8091540, inca=25, lda=1, p=0x0, ldp=6,

cntx=0x7fffd247c190) at frame/1m/packm/bli_packm_cxk.c:216

#2 0x00000000004c4806 in bli_spackm_struc_cxk (strucc=BLIS_GENERAL, diagoffc=0, diagc=BLIS_NONUNIT_DIAG, uploc=BLIS_DENSE, conjc=BLIS_NO_CONJUGATE,

schema=BLIS_PACKED_COL_PANELS, invdiag=0, m_panel=25, n_panel=6, m_panel_max=25, n_panel_max=6, kappa=0x7fffec000dc0, c=0x7fffb8091540, rs_c=1, cs_c=25, p=0x0, 
rs_p=6, cs_p=1, is_p=1, cntx=0x7fffd247c190) at frame/1m/packm/bli_packm_struc_cxk.c:255

#3 0x00000000004bbbe6 in bli_spackm_blk_var1 (strucc=BLIS_GENERAL, diagoffc=0, diagc=BLIS_NONUNIT_DIAG, uploc=BLIS_DENSE, transc=BLIS_NO_TRANSPOSE,

schema=BLIS_PACKED_COL_PANELS, invdiag=0, revifup=0, reviflo=0, m=25, n=1936, m_max=25, n_max=1938, kappa=0x7fffec000dc0, c=0x7fffb8091540, rs_c=1, cs_c=25, p=0x0, 
rs_p=6, cs_p=1, is_p=1, pd_p=6, ps_p=150, packm_ker=0x4c46f9 <bli_spackm_struc_cxk>, cntx=0x7fffd247c190, thread=0x7fffb8007600)
at frame/1m/packm/bli_packm_blk_var1.c:668

#4 0x00000000004bb133 in bli_packm_blk_var1 (c=0x7fffd2479e50, p=0x7fffd24798e0, cntx=0x7fffd247c190, t=0x7fffb8007600) at frame/1m/packm/bli_packm_blk_var1.c:234
#5 0x00000000004aed11 in bli_packm_int (a=0x7fffd2479e50, p=0x7fffd24798e0, cntx=0x7fffd247c190, cntl=0x7fffec002c00, thread=0x7fffb8007600)

at frame/1m/packm/bli_packm_int.c:125

#6 0x00000000004b23ca in bli_gemm_blk_var1f (a=0x7fffd2479d80, b=0x7fffd2479e50, c=0x7fffd2479f20, cntx=0x7fffd247c190, cntl=0x7fffec002d00, thread=0x7fffb8007660)

at frame/3/gemm/bli_gemm_blk_var1f.c:79

#7 0x00000000004488b2 in bli_gemm_int (alpha=0x7c66a0 <BLIS_ONE>, a=0x7fffd247a160, b=0x7fffd247a230, beta=0x7c66a0 <BLIS_ONE>, c=0x7fffd247a090, cntx=0x7fffd247c190,

cntl=0x7fffec002d00, thread=0x7fffb8007660) at frame/3/gemm/bli_gemm_int.c:154

#8 0x00000000004b304b in bli_gemm_blk_var3f (a=0x7fffd247a530, b=0x7fffd247a600, c=0x7fffd247a6d0, cntx=0x7fffd247c190, cntl=0x7fffec002da0, thread=0x7fffb80c0ae0)

at frame/3/gemm/bli_gemm_blk_var3f.c:121

#9 0x00000000004488b2 in bli_gemm_int (alpha=0x7c66a0 <BLIS_ONE>, a=0x7fffd247a840, b=0x7fffd247a910, beta=0x7c66a0 <BLIS_ONE>, c=0x7fffd247a9e0, cntx=0x7fffd247c190,

cntl=0x7fffec002da0, thread=0x7fffb80c0ae0) at frame/3/gemm/bli_gemm_int.c:154

#10 0x00000000004b2b28 in bli_gemm_blk_var2f (a=0x7fffd247ace0, b=0x7fffd247adb0, c=0x7fffd247ae80, cntx=0x7fffd247c190, cntl=0x7fffec002e40, thread=0x7fffb80c0ca0)

at frame/3/gemm/bli_gemm_blk_var2f.c:123

#11 0x00000000004488b2 in bli_gemm_int (alpha=0x7fffd247bcf0, a=0x7fffd247b0c0, b=0x7fffd247b190, beta=0x7fffd247bf60, c=0x7fffd247b260, cntx=0x7fffd247c190,

cntl=0x7fffec002e40, thread=0x7fffb80c0ca0) at frame/3/gemm/bli_gemm_int.c:154

#12 0x0000000000423d60 in bli_level3_thread_decorator (n_threads=1, func=0x447f75 <bli_gemm_int>, alpha=0x7fffd247bcf0, a=0x7fffd247b0c0, b=0x7fffd247b190,

beta=0x7fffd247bf60, c=0x7fffd247b260, cntx=0x7fffd247c190, cntl=0x7fffec002e40, thread=0x7fffb80070c0) at frame/base/bli_threading.c:92

#13 0x0000000000447f5a in bli_gemm_front (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190,

cntl=0x7fffec002e40) at frame/3/gemm/bli_gemm_front.c:86

#14 0x0000000000429be5 in bli_gemmnat (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190)

at frame/ind/oapi/bli_l3_nat_oapi.c:80

#15 0x000000000049242b in bli_gemmind (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190)

at frame/ind/oapi/bli_l3_ind_oapi.c:59

#16 0x000000000044701e in bli_gemm_ex (alpha=0x7fffd247bcf0, a=0x7fffd247bdc0, b=0x7fffd247be90, beta=0x7fffd247bf60, c=0x7fffd247c030, cntx=0x7fffd247c190)

at frame/3/bli_l3_oapi.c:74

#17 0x0000000000419e3c in bli_sgemm (transa=BLIS_NO_TRANSPOSE, transb=BLIS_NO_TRANSPOSE, m=1936, n=16, k=25, alpha=0x7fffd247c188, a=0x7fffb8091540, rs_a=25, cs_a=1,

b=0x7fffb8003790, rs_b=16, cs_b=1, beta=0x7fffd247c18c, c=0x7fffb8040e30, rs_c=16, cs_c=1, cntx=0x7fffd247c190) at frame/3/bli_l3_tapi.c:93

Undefined references to ``GOMP_parallel``

When I try to test for BLIS's support for dgemm_, I see the following link error, which does not seem to be covered by your build system wiki:

"""
/usr/bin/gcc-4.9 -DCHECK_FUNCTION_EXISTS=dgemm_ CMakeFiles/cmTryCompileExec3961259244.dir/CheckFunctionExists.c.o -o cmTryCompileExec3961259244 -rdynamic /home/poulson/Install/lib/libblis.a -lpthread -lm
/home/poulson/Install/lib/libblis.a(bli_init.o): In function `bli_init':
bli_init.c:(.text+0xb4): undefined reference to `GOMP_critical_name_start'
bli_init.c:(.text+0xc6): undefined reference to `GOMP_critical_name_end'
/home/poulson/Install/lib/libblis.a(bli_init.o): In function `bli_finalize':
bli_init.c:(.text+0x1d0): undefined reference to `GOMP_critical_name_start'
bli_init.c:(.text+0x1e2): undefined reference to `GOMP_critical_name_end'
/home/poulson/Install/lib/libblis.a(bli_mem.o): In function `bli_mem_acquire_m':
bli_mem.c:(.text+0x77): undefined reference to `GOMP_critical_name_start'
bli_mem.c:(.text+0x8f): undefined reference to `GOMP_critical_name_end'
/home/poulson/Install/lib/libblis.a(bli_mem.o): In function `bli_mem_release':
bli_mem.c:(.text+0xfa): undefined reference to `GOMP_critical_name_start'
bli_mem.c:(.text+0x112): undefined reference to `GOMP_critical_name_end'
/home/poulson/Install/lib/libblis.a(bli_threading_omp.o): In function `bli_level3_thread_decorator._omp_fn.0':
bli_threading_omp.c:(.text+0x5): undefined reference to `omp_get_thread_num'
/home/poulson/Install/lib/libblis.a(bli_threading_omp.o): In function `bli_level3_thread_decorator':
bli_threading_omp.c:(.text+0x99): undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
"""

uninitialized variable warnings in frame/util/norm1m/bli_norm1m_unb_var1.c

While perhaps innocuous, such compiler warnings are not ideal...

frame/util/norm1m/bli_norm1m_unb_var1.c: In function 'bli_znorm1m_unb_var1':
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'ij0' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'n_shift' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c: In function 'bli_cnorm1m_unb_var1':
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'ij0' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'n_shift' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c: In function 'bli_dnorm1m_unb_var1':
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'ij0' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'n_shift' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c: In function 'bli_snorm1m_unb_var1':
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'ij0' may be used uninitialized in this function
frame/util/norm1m/bli_norm1m_unb_var1.c:230: warning: 'n_shift' may be used uninitialized in this function

packm breaks with 1x1 micro-kernels

packm implicitly assumes that the register blocksizes are both non-unit. While this is not a bad assumption in practice, it would be nice to lift this constraint so that the right thing happens even if MR or NR (or both) happen to be 1. The problem boils down to the definition of the bli_is_row_stored_f() and bli_is_col_stored_f() macros, which only look at the row and column strides [of the packed micro-panel]. Naturally, if both are unit, then a "row-stored" mx1 micro-panel is indistinguishable from a "column-stored" 1xn micro-panel.

Issue with the new blis library

I am running the libflame benchmark test routine (test_libflame.x in the test folder). The test aborts because FLA_Hemv_check(), called from FLA_Hemv_external(), reports "Detecting unequal object datatypes". The data is corrupted before this call, and the culprit could be Trsm from the BLIS library.
Note: With an older version of BLIS, the test suite works just fine.

Test parameters: single precision, row-major format. FLA_Chol_solve() corrupts the data through its call to Trsm_external.

License/copyright headers in need of update

The license/copyright headers at the top of each source file need to be updated. The copyright year needs to be changed to "2016". (I never got around to updating it in 2015 and so it still reads "2014".)

Unfortunately, this change is going to touch virtually every file in the repository. If you have any objections, or this will disrupt your work, please speak up.

Support for building with CMake

Having a CMake build system available makes it a lot easier to build with a variety of compilers in a variety of environments. In particular, it makes it a lot easier to build things on Windows.
Is there interest in this?
I'm considering implementing this myself, though it'll probably take a while since I have several side projects going at the moment.
