
YASK--Yet Another Stencil Kit: a domain-specific language and framework to create high-performance stencil code for implementing finite-difference methods and similar applications.

License: Other


yask's Introduction

YASK--Yet Another Stencil Kit

Overview

YASK is a framework to rapidly create high-performance stencil code including optimizations and features such as

  • Support for boundary layers and staggered-grid stencils.
  • Vector-folding to increase data reuse via non-traditional data layout.
  • Multi-level OpenMP parallelism to exploit multiple CPU cores and threads.
  • OpenMP offloading to GPUs.
  • MPI scaling to multiple sockets and nodes with overlapped communication and compute.
  • Spatial tiling with automatically-tuned block sizes.
  • Temporal tiling in multiple dimensions to further increase cache locality.
  • APIs for C++ and Python.

YASK contains a domain-specific compiler to convert stencil-equation specifications to optimized code for Intel(R) Xeon(R) processors, Intel(R) Xeon Phi(TM) processors, and Intel(R) graphics processors.

Supported Platforms and Processors:

  • 64-bit Linux.
  • Intel(R) Xeon(R) processors supporting the AVX, AVX2, or CORE_AVX512 instruction sets.
  • Intel(R) Xeon Phi(TM) x200-family processors supporting the MIC_AVX512 instruction set (KNL).
  • Intel(R) graphics processors supporting UHD graphics, e.g., Intel(R) Data Center GPU Max Series products.

Pre-requisites:

  • Intel(R) oneAPI HPC Toolkit for Linux (toolkit 2023.2 or later recommended); this will install the Intel(R) oneAPI DPC++/C++ Compiler and the Intel(R) MPI Library. See compiler notes below under version 4.00.00 changes.
  • GNU C++ compiler, g++ (8.5.0 or later recommended). Even when using Intel compilers, a g++ installation is required.
  • Linux libraries librt and libnuma.
  • Grep.
  • Perl (v5 or later).
  • Awk.
  • GNU make.
  • Bash shell.
  • Numactl utility if running on more than one CPU socket.
  • Optional utilities and their purposes:
    • The indent or gindent utility, used automatically during the build process to make the generated code easier for humans to read. You'll get a warning when running make if one of these doesn't exist. Everything will still work, but the generated code will be difficult to read. Reading the generated code is only necessary for debug, performance analysis, etc.
    • SWIG (4.0.0 or later): http://www.swig.org, for creating the Python interface.
    • Python 3 (3.6.1 or later): https://www.python.org/downloads, for creating and using the Python interface. Included with Intel(R) oneAPI HPC Toolkit.
    • Python numpy package for running Python interface tests. Included with Intel(R) oneAPI HPC Toolkit.
    • Doxygen (1.9.0 or later): https://www.doxygen.nl, for creating updated API documentation. If you're not changing the API documentation, you can view the existing documentation at the link at the top of this page.
    • Graphviz (2.30.1 or later): http://www.graphviz.org, for rendering stencil diagrams.
    • Intel(R) Software Development Emulator: https://software.intel.com/en-us/articles/intel-software-development-emulator, for functional testing if you don't have native support for the targeted instruction set.

Backward-compatibility notices

Version 4

  • Version 4.05.00 removes the "out-of-band" genetic-algorithm tuning script due to lack of resources for maintenance and testing.
  • Version 4.04.00 deprecates the existing void* {set,get}_elements_in_slice() APIs and provides safer float* and double* versions.
  • Version 4.03.00 is a significant release with the following notices:
    • Each non-scratch stencil equation is now checked to ensure offsets of +/-1 from the step-dimension on the LHS, e.g., A(t+1, x, y) EQUALS B(t, x, y+1). (-1 is used for less-common reverse-time stencils.)
    • The yk_solution::get_var() API now throws an exception if the named var does not exist. (It used to return nullptr.)
    • Vector "clustering" (unrolling by the YASK compiler) is no longer supported.
    • Read-ahead in the inner-loop is no longer supported.
    • APIs for getting OpenMP thread counts were added.
    • Equation "bundles" are now called solution "parts".
  • Version 4.01.00 added several new APIs. The following changes were made to the YASK compiler: the -eq_bundles option was removed, and an exception is now thrown from output_solution() if the format string is unrecognized.
  • Version 4.00.00 was a major release with a number of notices:
    • Support has been added for GPU offloading via the OpenMP device model. Build any YASK stencil kernel with make offload=1 .... This will create a kernel library and executable with an "arch" field containing "offload" and the OpenMP device target name. Use make offload=1 offload_arch=<target> to change the OpenMP target; the default is spir64, for GPUs with Intel(R) Architecture (e.g., Gen12). Use make offload_usm=1 to use the OpenMP Unified Shared Memory model.
    • The default compiler is now the Intel(R) oneAPI C++ compiler, icpx. If you want to use a different compiler, use make YK_CXX=<compiler> ... for the kernel, and/or make YC_CXX=<compiler> ... for the YASK compiler, or make CXX=<compiler> for both. A C++ compiler that supports C++17 is now required.
    • The loop hierarchy has been extended and renamed with (hopefully) more memorable names: version 3's regions, blocks, mini-blocks, and sub-blocks are now mega-blocks, blocks, micro-blocks, and nano-blocks, respectively. Pico-blocks have been added inside nano-blocks. When offloading, the nano-blocks and pico-blocks are executed on the device. The looping behaviors, including any temporal tiling, of mega-blocks, blocks, and micro-blocks are handled by the CPU. The get_region_size() and set_region_size() APIs have been removed. The -r and -sb options, e.g., -rx and -sbx, have also been removed.
    • Regarding CPU threads, "region threads" are now referred to as "outer threads", and "block threads" are now referred to as "inner threads". The option -block_threads is deprecated. The option -thread_divisor has been removed. See the -help documentation for new options -outer_threads and -inner_threads. The -max_threads option remains.
    • Only one thread per core is now used by default on most CPU models. This is done in yask.sh by passing -outer_threads <N> to the executable, where <N> is the number of cores on the node divided by the number of MPI ranks. Consequently, the default number of inner threads is now one (1) to use one core per block. This change was made based on observed performance on newer Intel(R) Xeon(R) Processors. Previous versions used two threads per block by default and used both hyper-threads if they were enabled. To configure two hyper-threads to work cooperatively on each block, use the option -inner_threads 2. These changes do not apply to Intel(R) Xeon Phi(TM) x200-family processors (KNL), which continue to use all 4 hyper-threads per core and 8 inner threads by default (because 2 cores share an L2 cache).
    • Intel(R) Xeon Phi(TM) x100-family processors (KNC) are no longer supported. (Intel(R) Xeon Phi(TM) x200-family processors (KNL) are still supported.)
    • Python v2 is no longer supported.
    • New vector APIs were added to yk_solution and yk_var to allow getting or setting multiple dimensions in one API call.
    • new_relative_var_point() API is deprecated.
    • APIs that were previously deprecated in the yk_var class have been removed.
    • Explicit support for persistent-memory devices has been removed. (Persistent-memory accessible via separate NUMA nodes or other standard Linux mechanisms is supported as with any other special memory types, e.g., high-bandwidth memory.)
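
The version 4.00.00 thread defaults described above come down to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not code taken from yask.sh. The function names are hypothetical, and rounding down with integer division is an assumption. The -outer_threads and -inner_threads option names are the ones given in the note above.

```python
def default_outer_threads(cores_on_node: int, mpi_ranks: int) -> int:
    """Cores on the node divided by the number of MPI ranks.

    Assumption: the division rounds down and never yields fewer than 1.
    """
    if cores_on_node < 1 or mpi_ranks < 1:
        raise ValueError("cores_on_node and mpi_ranks must be positive")
    return max(1, cores_on_node // mpi_ranks)


def kernel_thread_options(cores_on_node: int, mpi_ranks: int,
                          inner_threads: int = 1) -> list:
    """Build the thread-related options for one rank (hypothetical helper).

    inner_threads defaults to 1 (one core per block); pass 2 to let two
    hyper-threads cooperate on each block.
    """
    outer = default_outer_threads(cores_on_node, mpi_ranks)
    return ["-outer_threads", str(outer), "-inner_threads", str(inner_threads)]


if __name__ == "__main__":
    # Example: a 48-core node shared by 2 MPI ranks.
    print(" ".join(kernel_thread_options(48, 2)))
```

These figures do not apply to KNL, which keeps its own defaults as noted above.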

Version 3

  • Version 3.05.00 changed the default setting of -use_shm to true. Use -no-use_shm to disable shared-memory inter-rank communication.
  • Version 3.04.00 changed the terms "pack" and "pass" to "stage", which may affect user-written result parsers. Option auto_tune_each_pass changed to auto_tune_each_stage.
  • Version 3.01.00 moved the -trace and -msg_rank options from the kernel library to the kernel utility, so those options may no longer be set via yk_solution::apply_command_line_options(). APIs to set the corresponding options are now in yk_env. This allows configuring the debug output before a yk_solution is created.
  • Version 3.00.00 was a major release with a number of notices:
    • The old (v1 and v2) internal DSL that used undocumented types such as SolutionBase and GridValue and undocumented macros such as MAKE_GRID was replaced with an expanded version of the documented YASK compiler API. Canonical v2 DSL code should still work using the Soln.hpp backward-compatibility header file. To convert v2 DSL code to v3 format, use the ./utils/bin/convert_v2_stencil.pl utility. Conversion is recommended.
    • For both the compiler and kernel APIs, all uses of the term "grid" were changed to "var". (Historically, early versions of YASK allowed only variables whose elements were points on the domain grid, so the terms were essentially interchangeable. Later, variables became more flexible. They could be defined with a subset of the domain dimensions, include non-domain or "miscellaneous" indices, or even be simple scalar values, so the term "grid" to describe any variable became inaccurate. This change addresses that contradiction.) Again, backward-compatibility features in the API should maintain functionality of v2 DSL and kernel code.
    • The default strings used in the kernel library and filenames to identify the targeted architecture were changed from Intel CPU codenames to [approximate] instruction-set architecture (ISA) names "avx512", "avx2", "avx", "knl", "knc", or "intel64". The YASK targets used in the YASK compiler were updated to be consistent with this list.
    • The "mid" (roughly, median) performance results are now the first ones printed by the utils/bin/yask_log_to_csv.pl script.
    • In general, any old DSL and kernel code or user-written output-parsing scripts that use any undocumented files, data, or types may have to be updated.

Version 2

  • Version 2.22.00 changed the heuristic to determine vector-folding sizes when some sizes are specified. This did not affect the default folding sizes.
  • Version 2.21.02 simplified the example 3-D stencils (3axis, 3plane, etc.) to calculate simple averages like those in the MiniGhost benchmark. This reduced the number of floating-point operations but not the number of points read for each stencil.
  • Version 2.20.00 added checking of the step-dimension index value in the yk_grid::get_element() and similar APIs. Previously, invalid values silently "wrapped" around to valid values. Now, by default, the step index must be valid when reading, and the valid step indices are updated when writing. The old behavior of silent index wrapping may be restored via set_step_wrap(true). The default for all strict_indices API parameters is now true to catch more programming errors and increase consistency of behavior between "set" and "get" APIs. Also, the advanced share_storage() APIs have been replaced with fuse_grids().
  • Version 2.19.01 turned off multi-pass tuning by default. Enable with -auto_tune_each_pass.
  • Version 2.18.03 allowed the default radius to be stencil-specific and changed the name of the example stencil "9axis" to "3axis_with_diags".
  • Version 2.18.00 added the ability to specify the global-domain size, and it will calculate the local-domain sizes from it. There is no longer a default local-domain size. Output changed terms "overall-problem" to "global-domain" and "rank-domain" to "local-domain".
  • Version 2.17.00 determined the host architecture in make and bin/yask.sh and number of MPI ranks in bin/yask.sh. This changed the old behavior of make defaulting to snb architecture and bin/yask.sh requiring -arch and -ranks. Those options are still available to override the host-based default.
  • Version 2.16.03 moved the position of the log-file name to the last column in the CSV output of utils/bin/yask_log_to_csv.pl.
  • Version 2.15.04 required a call to yc_grid::set_dynamic_step_alloc(true) to allow changing the allocation in the step (time) dimension at run-time for grid variables created at YASK compile-time.
  • Version 2.15.02 required all "misc" indices to be yask-compiler-time constants.
  • Version 2.14.05 changed the meaning of temporal sizes so that 0 means never do temporal blocking and 1 allows blocking within a single time-step for multi-pack solutions. The default setting is 0, which keeps the old behavior.
  • Version 2.13.06 changed the default behavior of the performance-test utility (yask.sh) to run trials for a given amount of time instead of a given number of steps. As of version 2.13.08, use the -trial_time option to specify the number of seconds to run. To force a specific number of trials as in previous versions, use the -trial_steps option.
  • Version 2.13.02 required some changes in perf statistics due to step (temporal) conditions. Both the text output and the yk_stats APIs are affected.
  • Version 2.12.00 removed the long-deprecated == operator for asserting equality between a grid point and an equation. Use EQUALS instead.
  • Version 2.11.01 changed the plain-text format of some of the performance data in the test-utility output. Specifically, some leading spaces were added, SI multipliers for values < 1 were added, and the phrase "time in" no longer appears before each time breakdown. This may affect some user programs that parse the output to collect stats.
  • Version 2.10.00 changed the location of temporary files created during the build process. This will not affect most users, although you may need to manually remove old src/compiler/gen and src/kernel/gen directories.
  • Version 2.09.00 changed the location of stencils in the internal DSL from .hpp to .cpp files. See the notes in https://github.com/intel/yask/releases/tag/v2.09.00 if you have any new or modified code in src/stencils.
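
The version 2.20.00 note on step-index checking can be pictured with a toy model. The Python sketch below is only an illustration under stated assumptions: the class, its fields, and the exact rule for updating the valid range on writes are hypothetical, and the real APIs are C++ with different signatures. It shows the difference between silent wrapping and strict checking for a var that keeps a fixed number of steps in memory.

```python
class StepSlots:
    """Toy model of a var that keeps only `num_slots` steps in memory."""

    def __init__(self, num_slots, step_wrap=False):
        self.num_slots = num_slots
        self.step_wrap = step_wrap          # True mimics the old wrapping behavior
        self.slots = [None] * num_slots
        self.first_valid = 0                # assumed initial valid range
        self.last_valid = num_slots - 1

    def get_element(self, t):
        # Strict mode: reading an invalid step is an error.
        if not self.step_wrap and not (self.first_valid <= t <= self.last_valid):
            raise IndexError(f"step {t} is not in "
                             f"[{self.first_valid}, {self.last_valid}]")
        # Wrap mode (or a valid index): map the step onto a slot.
        return self.slots[t % self.num_slots]

    def set_element(self, value, t):
        # Assumption: writing outside the valid range slides the range.
        if t > self.last_valid:
            self.last_valid = t
            self.first_valid = t - self.num_slots + 1
        elif t < self.first_valid:
            self.first_valid = t
            self.last_valid = t + self.num_slots - 1
        self.slots[t % self.num_slots] = value


if __name__ == "__main__":
    strict = StepSlots(2)
    for step in range(3):
        strict.set_element(f"data@{step}", step)
    print(strict.first_valid, strict.last_valid)
```

In this model, after writing steps 0, 1, and 2 into two slots, reading step 0 fails in strict mode, while wrap mode silently returns the data written for step 2.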

yask's People

Contributors

adurang, chuckyount, fabioluporini, herjmoo, jrt54, rjtobin


yask's Issues

Tuning parameters for temporal blocking

I am trying to evaluate the performance of the 3axis stencil in your framework with different radius values on a Xeon Phi 7210F with the MCDRAM configured in the default cache mode flat mode. My compiler is ICC 2018.1. I am trying to improve the performance of this stencil using temporal blocking (-r and -rt switches), since the general consensus is that this stencil is memory-bound, especially at low order, and could benefit from temporal blocking. However, I have not managed to improve performance with temporal blocking except when the input is small (smaller than 128^3). Should I expect any performance improvement from temporal blocking for this stencil with large inputs (512^3 and above), or is the ~140 GFLOPS achieved with the default settings the maximum achievable performance?

Throw exceptions instead of exiting on bad API calls

An existing code sequence like
cerr << "Error: run_solution() called without calling prepare_solution() first.\n"; exit_yask(1);
should instead throw an exception with the formatted string as a payload.
In the kernel code, most of the code to be replaced should be in this cerr...exit_yask(1) form.
In the compiler code, most of the code to be replaced should be in the form cerr...exit(1).
There could be different types of exceptions: parameter-out-of-range, invalid-state, etc., or initially we could just have one generic yask exception.
The program in kernel/yask_main.cpp should catch the exception and call something similar to the existing exit_yask(). Same with compiler/main.cpp.
The fix should include some tests in the C++ and perl API test programs.
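
The proposal above can be sketched in a few lines. The example is in Python for brevity, although the code in question is C++. YaskException, Solution, and main are hypothetical stand-ins rather than the project's actual types. It shows a single generic exception carrying the formatted message and one top-level handler taking over the role of exit_yask(1).

```python
import sys


class YaskException(Exception):
    """Hypothetical generic exception; the message is the payload."""


class Solution:
    """Stand-in for a solution object with a prepare/run life cycle."""

    def __init__(self):
        self._prepared = False

    def prepare_solution(self):
        self._prepared = True

    def run_solution(self):
        if not self._prepared:
            # Previously: print to stderr and exit. Now: raise.
            raise YaskException("Error: run_solution() called without "
                                "calling prepare_solution() first.")
        return "ran"


def main(action) -> int:
    """Top-level driver: the only place that turns an error into an exit code."""
    try:
        action()
    except YaskException as err:
        print(err, file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    soln = Solution()
    soln.prepare_solution()
    sys.exit(main(soln.run_solution))
```

Keeping the exit decision in the driver lets library users catch the error themselves. Subclasses such as parameter-out-of-range or invalid-state exceptions could be added later without changing callers that catch the generic type.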

Failed to add IF conditions to kernel

Hello,

It fails when I try to add an IF condition in the kernel, such as grid(t+1, x, y, z) IS_EQUIV_TO v IF (v == constNum(1.0)). However, it works with grid(t+1, x, y, z) IS_EQUIV_TO v IF (z == last_index(z)). I think both IF conditions return the same type. Do you have any ideas about this problem? Thank you so much.

Create strong-scaling mode

Automatically distribute size among ranks as a default.
For strong and weak-scaling, provide a heuristic to set rank topology.
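
One possible heuristic for the rank topology is sketched below in Python. It is an illustration only, and nothing here is taken from YASK: all function names are hypothetical, and the cost model (choose the factorization of the rank count that minimizes the surface area of each rank's local domain, as a proxy for halo-exchange volume) is an assumption.

```python
from math import ceil


def rank_topologies(num_ranks, num_dims):
    """Yield every tuple of num_dims positive integers whose product is num_ranks."""
    if num_dims == 1:
        yield (num_ranks,)
        return
    for f in range(1, num_ranks + 1):
        if num_ranks % f == 0:
            for rest in rank_topologies(num_ranks // f, num_dims - 1):
                yield (f,) + rest


def local_sizes(global_sizes, topology):
    """Split the global domain evenly among ranks in each dim, rounding up."""
    return tuple(ceil(g / r) for g, r in zip(global_sizes, topology))


def surface(sizes):
    """Total face area of a box with the given edge lengths."""
    total = 0
    for i in range(len(sizes)):
        area = 1
        for j, s in enumerate(sizes):
            if j != i:
                area *= s
        total += 2 * area
    return total


def choose_topology(global_sizes, num_ranks):
    """Pick the topology whose local domain has the smallest surface."""
    return min(rank_topologies(num_ranks, len(global_sizes)),
               key=lambda topo: surface(local_sizes(global_sizes, topo)))


if __name__ == "__main__":
    topo = choose_topology((1024, 1024, 1024), 8)
    print(topo, local_sizes((1024, 1024, 1024), topo))
```

For strong scaling, the global size stays fixed and local_sizes() shrinks as ranks are added. For weak scaling, the same topology choice could be applied with the local size held constant.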

How to initialize a 2D GRID

Hello,

It fails when I try to use INIT_GRID_2D to design a kernel, even though I found INIT_GRID_2D defined in src/foldBuilder/Expr.hpp. Is there any way to use INIT_GRID_2D? Thanks for your help.

Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.

Hello,

I got the error "Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions." when I ran YASK.

Following is the command:
mpirun -n 2 -ppn 2 ./stencil-run.sh -arch hsw---nr 1 -nrx 2 -d 1024 -dx 512 -b 64 -bz 96

CPU : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
MIC : Intel Xeon Phi coprocessors 5110P

Thanks for your help.

Fix printing of cmd-line syntax errors when MPI enabled

Example:

  • make clean; make -j stencil=3axis arch=snb
  • bin/yask.sh -ranks 2 -stencil 3axis -arch snb -bad-option

Can only see the following:

YASK Kernel: .
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
YASK Kernel: .
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Rewrite yask_main.cpp to work with APIs only

Will require adding APIs, removing functionality, and/or moving code from kernel/lib to yask_main.cpp.
This is not critical, but it would showcase the API if we actually used it!

Change "any" to "many" in first paragraph of README

Sorry for the pettiness of this issue, but I believe that good descriptions are helpful: the first paragraph of the README reads: "YASK--Yet Another Stencil Kernel: A framework to facilitate exploration of the HPC stencil-performance design space, including any optimizations such as...". Should it not be, instead "including many optimizations" ?

I hope this is useful. Thank you for your wonderful work!

Boundary condition for 3axis

Can you kindly elaborate on how the boundary condition for the 3axis stencil is handled? By boundary condition I mean the neighbors of cells on the grid boundaries that fall outside of the grid. From the WOLFHPC16 slides it seems that the allocated grid is (2 * radius) points bigger than the problem domain in each dimension, allowing such neighbors to be read from a valid memory address. Is this correct? If yes, am I correct to say that the input size that we set in YASK is the actual problem size, but the size of the grid that is allocated in memory is bigger due to these out-of-bound neighbors?
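
The reading described in this question can be made concrete with a small Python sketch. It assumes the asker's interpretation (a halo of radius points on each side of every dimension) and, purely for illustration, a zero-filled halo. Neither assumption is confirmed here, and the function names are hypothetical.

```python
def allocated_sizes(domain_sizes, radius):
    """Domain size plus `radius` halo points on both sides of each dim."""
    return tuple(d + 2 * radius for d in domain_sizes)


def average_1d(values, radius):
    """Symmetric 1-D averaging stencil over the domain points only.

    Neighbors outside the domain are read from the halo, so no
    bounds checks are needed inside the loop.
    """
    padded = [0.0] * radius + list(values) + [0.0] * radius  # assumed zero halo
    width = 2 * radius + 1
    return [sum(padded[i:i + width]) / width for i in range(len(values))]


if __name__ == "__main__":
    print(allocated_sizes((512, 512, 512), 8))
    print(average_1d([3.0, 3.0, 3.0], 1))
```

Under this model the size given on the command line is the problem size, and the extra points exist only so that reads at the domain edge land in valid memory.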

Large performance difference between running from yask.sh and executable

Hi,

I am evaluating your framework on a machine with a socketed Xeon Phi 7210F and the MCDRAM configured in the default cache mode. I compiled the 3axis stencil by using the following command:

make clean; make stencil=3axis arch=knl

If I run the stencil using yask.sh, I get the following results:

bin/yask.sh -stencil 3axis -arch knl -d 512 -t 1:

best-throughput (est-FLOPS):       138.865G

However, running the same stencil directly using the executable with the same settings results in significantly lower performance:

bin/yask_kernel.3axis.knl.exe -d 512 -t 1:

best-throughput (est-FLOPS):       31.4167G

Is there a specific reason for this? Should I always run from the sh file?

Add ability to define scratch-pad grids in the DSL.

Temps would be similar to grids, but data would not be available outside of computation.
Stencil compiler should determine required "halo" size based on accesses in stencil.
Actual size would be based on block-size plus these halos.
Would need a set of temps for each thread team. These would be reused across blocks.
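
The sizing rule proposed above (block size plus the halos implied by the stencil's accesses) might look like the following Python sketch. It is not the compiler's implementation. The function names are hypothetical, and accesses are assumed to be given as per-dimension integer offsets.

```python
def scratch_halos(access_offsets, num_dims):
    """Derive (left, right) halo widths per dim from the offsets read.

    access_offsets: iterable of tuples, one integer offset per dim,
    e.g. (-2, 0) for a read at (x-2, y).
    """
    offsets = list(access_offsets)
    halos = []
    for d in range(num_dims):
        left = max(0, -min(off[d] for off in offsets))
        right = max(0, max(off[d] for off in offsets))
        halos.append((left, right))
    return halos


def scratch_sizes(block_sizes, halos):
    """Scratch allocation per dim: block size plus both halos."""
    return tuple(b + left + right
                 for b, (left, right) in zip(block_sizes, halos))


if __name__ == "__main__":
    halos = scratch_halos([(-2, 0), (1, 0), (0, 3)], num_dims=2)
    print(halos, scratch_sizes((64, 32), halos))
```

Because the size depends only on the block size and the stencil's offsets, one such buffer per thread team could be allocated once and reused across blocks, as the issue suggests.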

make realclean

make realclean leaves some useless files under /lib, such as devito_ctx0_yk_hook0.py, devito_ctx0_yk_hook0.pyc, ...

How to add an input

Hello Charles,

I am very interested in this work and am learning about this framework. I am trying to add a new stencil to this tool.
I have a question about how the input data is determined. I checked the existing stencils, such as AveStencile and Iso3dfdStencil, but I still have no idea where the input data comes from.

Thanks for your help.

share_storage is broken

location: src/kernel/lib/realv_grids.cpp:void RealVecGridBase::share_storage(yk_grid_ptr source)

in the implementation, get_num_dims() may return 4 if the time dimension is present, but then get_rank_domain_size(dname) is called, which will fail if dname == 't' (as t isn't a domain dimension)

Improve perf of arbitrary domain sizes

Performance is reduced by several factors when the domain size is not a multiple of the vector-cluster size in each dim. Need to add remainder loops to handle partial vectors.
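
The remainder-loop idea can be illustrated in one dimension with the Python sketch below. It only models the control flow: a main loop over whole vectors followed by a scalar loop over the leftover elements. It does not reflect YASK's generated code, and the function names are hypothetical.

```python
def split_for_vectors(domain_size, vec_len):
    """Return (number of full vectors, number of leftover elements)."""
    return divmod(domain_size, vec_len)


def add_with_remainder(a, b, vec_len):
    """Element-wise a + b using a full-vector loop plus a remainder loop."""
    n = len(a)
    full_end = n - n % vec_len
    out = []
    # Main loop: each iteration handles one whole vector of vec_len elements.
    for i in range(0, full_end, vec_len):
        out.extend(x + y for x, y in zip(a[i:i + vec_len], b[i:i + vec_len]))
    # Remainder loop: handles the partial vector one element at a time.
    for i in range(full_end, n):
        out.append(a[i] + b[i])
    return out


if __name__ == "__main__":
    print(split_for_vectors(10, 4))
    print(add_with_remainder([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 4))
```

With this structure a domain size that is not a multiple of the vector length costs only the few scalar iterations at the end.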

Improve dependency calculations

Currently, the dependency-calculation code in the compiler is pretty primitive. It is especially poor between sub-domains.

MPI now needed

Seems mpi=1 is required even for single node builds.

icpc -V
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.0.098 Build 20160721

make arch=hsw stencil=iso3dfd
Header files generated.
icpc -g -O3 -std=c++11 -Wall -xCORE-AVX2 -debug extended -Fa -restrict -ansi-alias -fno-alias -fimf-precision=low -fast-transcendentals -no-prec-sqrt -no-prec-div -fp-model fast=2 -fno-protect-parens -rcd -ftz -fma -fimf-domain-exclusion=none -qopt-assume-safe-padding -qopt-report=5 -qopt-report-phase=VEC,PAR,OPENMP,IPO,LOOP -no-diag-message-catalog -DMAX_EXCH_DIST=1 -DUSE_INTRIN256 -DREAL_BYTES=4 -DLAYOUT_3D=Layout_123 -DLAYOUT_4D=Layout_2314 -DTIME_DIM_SIZE=2 -DDEF_RANK_SIZE=1024 -DDEF_BLOCK_SIZE=64 -DDEF_BLOCK_THREADS=2 -DDEF_THREAD_FACTOR=1 -DDEF_PAD=1 -DARCH_HSW -DNO_STORE_INTRINSICS -DUSE_STREAMING_STORE -fopenmp -c -o src/stencil_main.hsw.o src/stencil_main.cpp
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
icpc -g -O3 -std=c++11 -Wall -xCORE-AVX2 -debug extended -Fa -restrict -ansi-alias -fno-alias -fimf-precision=low -fast-transcendentals -no-prec-sqrt -no-prec-div -fp-model fast=2 -fno-protect-parens -rcd -ftz -fma -fimf-domain-exclusion=none -qopt-assume-safe-padding -qopt-report=5 -qopt-report-phase=VEC,PAR,OPENMP,IPO,LOOP -no-diag-message-catalog -DMAX_EXCH_DIST=1 -DUSE_INTRIN256 -DREAL_BYTES=4 -DLAYOUT_3D=Layout_123 -DLAYOUT_4D=Layout_2314 -DTIME_DIM_SIZE=2 -DDEF_RANK_SIZE=1024 -DDEF_BLOCK_SIZE=64 -DDEF_BLOCK_THREADS=2 -DDEF_THREAD_FACTOR=1 -DDEF_PAD=1 -DARCH_HSW -DNO_STORE_INTRINSICS -DUSE_STREAMING_STORE -fopenmp -c -o src/stencil_calc.hsw.o src/stencil_calc.cpp
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
src/stencil_calc.cpp(701): error: identifier "MPI_INTEGER8" is undefined
MPI_Bcast(&coords[rn][0], num_dims, MPI_INTEGER8,
^

src/stencil_calc.cpp(701): error: identifier "MPI_Bcast" is undefined
MPI_Bcast(&coords[rn][0], num_dims, MPI_INTEGER8,
^

compilation aborted for src/stencil_calc.cpp (code 2)
make: *** [src/stencil_calc.hsw.o] Error 2
