
flexfloat's Introduction

FlexFloat

FlexFloat is a C library for the emulation of reduced-precision floating point types.

Building FlexFloat

Required packages:

  • CMake 3.1 or higher
  • GCC 7.1 or higher

To build the library:

  • Create a directory (denoted as "<build_dir>" in this document) where you want to put the generated Makefiles and project files, as well as the object files and output binaries, and enter this location. For example: cd flexfloat && mkdir build

  • Run cmake [<optional configuration parameters>] <path to the FlexFloat source directory> from "<build_dir>". For example: cd build && cmake .. The optional configuration parameters are:

    • -DCMAKE_BUILD_TYPE=Release/Debug (default: Release) - Release mode compiles sources with "-O3 -DNDEBUG" flags, while debug mode uses "-O0 -g3" flags.
    • -DBUILD_TESTS=ON/OFF (default: ON) - Enable unit testing of FlexFloat
    • -DBUILD_EXAMPLES=ON/OFF (default: ON) - Build usage examples
    • -DDISABLE_ROUNDING=ON/OFF (default: OFF) - Disable library support for IEEE rounding modes (truncation is always applied)
    • -DSINGLE_BACKEND=ON/OFF (default: OFF) - Use single-precision type (float) as a backend type instead of double precision
    • -DQUAD_BACKEND=ON/OFF (default: OFF) - Use quad-precision type (_Float128) as a backend type instead of float or double precision
    • -DENABLE_FLAGS=ON/OFF (default: OFF) - Enable support for floating-point exception flags
    • -DENABLE_STATS=ON/OFF (default: OFF) - Enable collection of statistics
    • -DENABLE_TRACKING=ON/OFF (default: OFF) - Enable tracking of error accumulation in program variables
  • In the "<build_dir>" directory execute make

  • [optional] To run the library tests, execute make test (this requires the unit testing feature of the library to be enabled, i.e. -DBUILD_TESTS=ON)

Base usage

To replace a floating-point type with a reduced-precision one, the native types used in the program must be replaced with flexfloat_t. Before its first use, each FlexFloat variable must be given an initial value for its exponent and mantissa bit-widths (two unsigned integers) by invoking ff_init (e.g., 5 bits for the exponent and 10 bits for the mantissa characterize the IEEE 754 half-precision format). Users can also (optionally) specify an initialization value expressed as a native C type using ff_init_float or ff_init_double. Since an initialization value might not be exactly representable in a target type with fewer bits, it is typically rounded to its nearest representable value using the current rounding mode. The FlexFloat API includes a set of functions to perform arithmetic operations involving operands of the same floating-point type, such as ff_add and ff_mul. Arithmetic operations in the original source code must be replaced by function calls that implement equivalent functionality on top of the emulated types. See "flexfloat.h" for further details on the API.
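
A minimal sketch of this workflow follows (compiled as C++ here so the brace-initialized descriptors match the snippets in the issues below; ff_get_double is assumed to be the accessor that converts back to a native double, see "flexfloat.h"):

#include "flexfloat.h"
#include <cstdio>

int main()
{
    flexfloat_t a, b, c;

    // IEEE 754 half precision: 5 exponent bits, 10 mantissa bits.
    ff_init_double(&a, 1.5, {5, 10});
    ff_init_double(&b, 0.1, {5, 10}); // 0.1 is rounded to the nearest representable value
    ff_init(&c, {5, 10});

    ff_add(&c, &a, &b); // c = a + b, emulated in half precision

    // ff_get_double is assumed here as the conversion back to a native double.
    std::printf("%f\n", ff_get_double(&c));
    return 0;
}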

The C++ wrapper provides a generic floating-point type by defining a template class (flexfloat<e, m>, where e and m are the exponent and mantissa bit-widths) and a set of auxiliary functions (useful for debugging and collecting statistics). Users only need to replace the original variable declarations with instantiations of this template class. No other part of the program needs modification, since the class methods include operator overloading. See "flexfloat.hpp" for further details on the class methods.
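
For example, a minimal sketch of the wrapper in use (conversion to a native type via double(...) is also shown in the issues below):

#include "flexfloat.hpp"
#include <iostream>

int main()
{
    // Only the declarations change: 5 exponent bits, 10 mantissa bits (half precision).
    flexfloat<5, 10> a = 1.5, b = 0.1, c;

    c = a + b; // overloaded operators perform the emulated arithmetic
    std::cout << double(c) << std::endl; // cast back to a native type for printing
    return 0;
}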

Examples for the C API and for the C++ wrapper are provided in the "examples" folder (after building, the executables are available in "<build_dir>/examples").

Advanced features

FlexFloat includes basic support for floating-point exception flags. If enabled, operations on FlexFloat variables will raise floating-point exception flags within the floating-point environment. One notable limitation of this feature is that the overflow exception (FE_OVERFLOW) is no longer raised on infinity results after a divide-by-zero condition is detected (FE_DIVBYZERO). Flags can be cleared using feclearexcept(FE_ALL_EXCEPT) to restore correct overflow detection until the next division by zero.
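
A minimal sketch of observing the flags, assuming the library was configured with -DENABLE_FLAGS=ON and that ff_div is the division counterpart of ff_add and ff_mul:

#include "flexfloat.h"
#include <cfenv>
#include <cstdio>

int main()
{
    flexfloat_t num, den, res;
    ff_init_double(&num, 1.0, {5, 10});
    ff_init_double(&den, 0.0, {5, 10});
    ff_init(&res, {5, 10});

    std::feclearexcept(FE_ALL_EXCEPT);
    ff_div(&res, &num, &den); // 1.0 / 0.0 raises FE_DIVBYZERO
    if (std::fetestexcept(FE_DIVBYZERO))
        std::printf("division by zero detected\n");

    // Clear the flags so that a later overflow raises FE_OVERFLOW again
    // (see the limitation described above).
    std::feclearexcept(FE_ALL_EXCEPT);
    return 0;
}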

FlexFloat allows a complete set of execution statistics related to FP types to be collected. The library API includes functions to start, stop and reset the collection of statistics, namely ff_start_stats, ff_stop_stats and ff_clear_stats. A report is generated by calling ff_print_stats; it includes the number of arithmetic operations (grouped by operator name) and the number of casts (grouped by source/destination type pairs). This feature allows evaluating the overhead due to the casts introduced in a transprecision scenario, where the type of computations can be assigned at a very fine-grained level. An example of this feature can be found in "examples/example_stats.c".
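
A minimal sketch of the statistics workflow using the functions named above (ff_cast is assumed here as the explicit cast between two FlexFloat formats; see "flexfloat.h" for the actual interface):

#include "flexfloat.h"

int main()
{
    flexfloat_t a, b, prod, half;
    ff_init_double(&a, 1.0, {8, 23}); // single-precision-like operands
    ff_init_double(&b, 2.0, {8, 23});
    ff_init(&prod, {8, 23});
    ff_init(&half, {5, 10});

    ff_clear_stats(); // reset any previously collected counters
    ff_start_stats(); // begin counting operations and casts

    ff_mul(&prod, &a, &b);          // counted under the "mul" operator
    ff_cast(&half, &prod, {5, 10}); // counted as a cast between the two formats

    ff_stop_stats();
    ff_print_stats(); // report: operation counts and cast counts per type pair
    return 0;
}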

FlexFloat also provides an advanced feature to keep track of error accumulation. With this feature activated, library adopters (programmers or automatic tools) can retrieve the exact value of a computation stored in a variable (calling ff_track_get_exact) or its current error w.r.t. the exact value (calling ff_track_get_error) at any point of the program. In addition, users can attach a callback to a program variable (calling ff_track_callback), that is, a function invoked on every update of the variable. This feature can serve different purposes; for instance, it can be used to determine which internal expression has the greatest impact on result quality, or to study the evolution of the error over time. An example of variable tracking can be found in "examples/example_tracking.c".
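
A minimal sketch, assuming the library was configured with -DENABLE_TRACKING=ON; the ff_track_* signatures below are illustrative assumptions, and the real interface is shown in "examples/example_tracking.c":

#include "flexfloat.h"
#include <cstdio>

int main()
{
    flexfloat_t acc, step;
    ff_init_double(&acc, 0.0, {5, 10});
    ff_init_double(&step, 0.1, {5, 10}); // 0.1 is inexact in half precision, so error accumulates

    for (int i = 0; i < 100; ++i)
        ff_add(&acc, &acc, &step);

    // Assumed return types (native doubles) for illustration only.
    std::printf("exact value: %f\n", ff_track_get_exact(&acc));
    std::printf("error vs. exact: %e\n", ff_track_get_error(&acc));
    return 0;
}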

flexfloat's People

Contributors

cristianomalossi, gtagliavini, haugoug, lucabertaccini, nazavode


flexfloat's Issues

No sqrt, fabs?

Hi there,

I'm using FlexFloat with some numerical kernels that come from finite element calculations.

Mostly I can just replace double with flexfloat<11, 16> or whatever, and compile... this is great! I have a couple of annoyances, though. Firstly, sqrt and fabs don't seem to be supported by the flexfloat type. I.e. I have to manually add a cast to e.g. sqrt(a*b + c) to become sqrt(double(a*b + c)), and similarly with fabs. Am I doing something wrong, or are these operations not supported yet?

Secondly, a statement of the form double = double + flexfloat doesn't seem to work... I have to again add a manual cast double = double + double(flexfloat). Is this expected?

flexfloat_get_bits() gives wrong results

flexfloat_get_bits() gives wrong results when the bit representation is a power of 2 and the flexfloat is denormalized (exp <= 0).

For example

flexfloat_t test;

ff_init(&test, {5, 10});
flexfloat_set_bits(&test, 8);
std::cout << flexfloat_get_bits(&test) << std::endl;

gives the output 18446744073709545472.

This is most likely due to the line else if(exp <= 0 && frac != 0) in

uint_t flexfloat_get_bits(flexfloat_t *a)
{
    int_fast16_t exp = flexfloat_exp(a);
    uint_t frac = flexfloat_frac(a);

    if(exp == INF_EXP) exp = flexfloat_inf_exp(a->desc);
    else if(exp <= 0 && frac != 0) {
        frac = flexfloat_denorm_frac(a, exp);
        exp = 0;
    }

    return ((uint_t)flexfloat_sign(a) << (a->desc.exp_bits + a->desc.frac_bits))
           + ((uint_t)exp << a->desc.frac_bits)
           + frac;
}

What is the && frac != 0 for? Without it, it seems to work (at least for my example).

compressed nan/inf representations used for new fp8? e4m3 e5m2

via https://dblalock.substack.com/p/2022-9-18-arxiv-roundup-reliable

FP8 Formats for Deep Learning
A group of NVIDIA, ARM, and Intel researchers got fp8 training working reliably, with only a tiny accuracy loss compared to fp16.

8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.

reducing the number of NaN/Inf encodings in fp1-4-3 down to just one bitstring

How much accuracy loss does this approach cause? They find that, across a huge array of models and tasks, the consistent answer is: not much, around 0 to 0.3% accuracy/BLEU/perplexity.

The test value_representation_half fails

/usr/ports/math/flexfloat/work/flexfloat-6db869087a12d763a94d53e9b0a9d52def270865/test/value_representation_half.cpp:66: Failure
Expected equality of these values:
  out
    Which is: 9221120237041090560
  out_number
    Which is: 18446686899104907264
[  FAILED  ] TestWithParameters_BF_004/MyTest.TestFormula/0, where GetParam() = (18446690030136066048, 18446686899104907264) (0 ms)
[ RUN      ] TestWithParameters_BF_004/MyTest.TestFormula/1
/usr/ports/math/flexfloat/work/flexfloat-6db869087a12d763a94d53e9b0a9d52def270865/test/value_representation_half.cpp:66: Failure
Expected equality of these values:
  out
    Which is: 9221120237041090560
  out_number
    Which is: 18446686899104907264
[  FAILED  ] TestWithParameters_BF_004/MyTest.TestFormula/1, where GetParam() = (18446690030672936960, 18446686899104907264) (0 ms)
[ RUN      ] TestWithParameters_BF_004/MyTest.TestFormula/2
/usr/ports/math/flexfloat/work/flexfloat-6db869087a12d763a94d53e9b0a9d52def270865/test/value_representation_half.cpp:66: Failure
Expected equality of these values:
  out
    Which is: 9221120237041090560
  out_number
    Which is: 18446686899104907264
[  FAILED  ] TestWithParameters_BF_004/MyTest.TestFormula/2, where GetParam() = (18446690031209807872, 18446686899104907264) (0 ms)

OS: FreeBSD 12.1
Compiler: gcc-9

NaN handling in tests makes too many assumptions

The tests partially assume that NaN bit-patterns emitted by FlexFloat are the same as on the host machine, although this property is implementation-defined.
Namely, one assumption made is expecting a qNaN with negative sign as the output of 0.0/0.0. While true on x86 hardware using SSE, this doesn't have to hold for other implementations and also doesn't hold for FlexFloat outputs.
FlexFloat currently outputs qNaN with positive sign and all payload bits cleared (e.g. 0x7fc00000 for binary32). Most hardware usually produces NaNs with cleared payload bits, but that is implementation defined, too.
#6 aligns the tests with the +qNaN behavior and removes some tests that make explicit assumptions about bit patterns generated by the host machine. However, many tests in test_value_representation_half still rely on exact bit patterns as NaN payloads.

It should be decided whether NaN payloads are worth supporting or whether the current approach with a single canonical output NaN should be kept.
