
Parallel IO in CICE (access-om3, open, 18 comments)

cosima commented on September 6, 2024
Parallel IO in CICE


Comments (18)

micaeljtoliveira commented on September 6, 2024

@anton-seaice I think PIO support in the COSIMA fork of CICE5 and in CICE6 was developed independently, so they might not provide exactly the same features. Still, the existing PIO support in CICE6 is very likely good enough for our needs, although that needs to be tested.


micaeljtoliveira commented on September 6, 2024

CICE6 has the option to perform IO using parallelio. This is implemented here:

https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/infrastructure/io/io_pio2

My understanding is that, when using it, it replaces the serial IO entirely, which is probably why this is not obvious in ice_history.F90.

Note that, currently, the default build option in OM3 is to use PIO (see here).


anton-seaice commented on September 6, 2024

Thanks Micael

Maybe I misunderstood the changes made to CICE5, and COSIMA/cice5@e9575cd is just about adding the chunking features and some other improvements? But the parallel IO was already working?

@aekiss - Can you confirm?


anton-seaice commented on September 6, 2024

Using the config from COSIMA/MOM6-CICE6#17, ice.log gives these times:

Timer   1:     Total     173.07 seconds
Timer  13:   History      43.67 seconds

It's not clear to me if that is a problem (the timers are not mutually exclusive), and we might not know until we try the higher resolutions.

There are a couple of other issues though:

Monthly output in OM2 was ~17 MB:

-rw-r-----+ 1 rmh561 ik11 7.6M May 11 2022 /g/data/ik11/outputs/access-om2/1deg_era5_ryf/output000/ice/OUTPUT/iceh.1900-01.nc

But the OM3 output is ~69 MB:
-rwxrwx--x 1 as2285 tm70 69M Nov 3 14:22 GMOM_JRA.cice.h.0001-01.nc

The history output is not chunked, and @dougie pointed out it is being written in "64-bit offset" format, which is a very dated way to write output that predates NetCDF-4.
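For reference, the format and chunking of a history file can be checked with ncdump (filename as in the listing above):

ncdump -k GMOM_JRA.cice.h.0001-01.nc      # prints the file kind, e.g. "64-bit offset" or "netCDF-4"
ncdump -hs GMOM_JRA.cice.h.0001-01.nc     # -s adds special virtual attributes such as _Storage, _ChunkSizes and _DeflateLevel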


anton-seaice commented on September 6, 2024

It looks like we need to set pio_typename = netcdf4p in nuopc.runconfig to turn this on (per med_io_mod).
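For context, the PIO attributes sit in the component modelio blocks of nuopc.runconfig; a rough sketch only (the group name and the numeric values here are illustrative, not the actual OM3 settings):

ICE_modelio::
     pio_typename = netcdf4p      # netcdf (serial), netcdf4p (parallel NetCDF-4/HDF5), netcdf4c, pnetcdf
     pio_numiotasks = 12          # number of PIO IO tasks
     pio_stride = 4               # spacing between IO tasks
     pio_root = 0
     pio_rearranger = 1           # 1 = box, 2 = subset
::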

But when I do this, I get this error in access-om3.err:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Obtained 10 stack frames.
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(print_trace+0x29) [0x147f3a88eff9]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(piodie+0x42) [0x147f3a88d082]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(check_netcdf2+0x1b9) [0x147f3a88d019]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile_retry+0x855) [0x147f3a88d9f5]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile+0x16) [0x147f3a8887e6]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpiof.so(piolib_mod_mp_pio_openfile_+0x21f) [0x147f3a61dacf]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x4082508]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x408b56f]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x42544bd]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x40589e5]

The "No data available" is curious. I think it's trying to open the restart file (which works fine if pio_typename = netcdf). This implies it could be missing dependencies - are we including both the HDF5 and PnetCDF libraries? More importantly, where would I find out?
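One quick way to check which IO libraries the executable is dynamically linked against is to list its shared-library dependencies (the placeholder below stands for the work-directory executable shown in the stack trace above):

ldd <path-to-access-om3-executable> | grep -iE 'hdf5|pnetcdf|netcdf'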


micaeljtoliveira commented on September 6, 2024

The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.

HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.
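If pnetcdf support were wanted, something along these lines should preview what a rebuilt spec would pull in (a sketch only; it assumes the spack parallelio package exposes a pnetcdf variant, which the comment above implies):

spack spec parallelio+pnetcdf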


aekiss commented on September 6, 2024

Possibly relevant:
COSIMA/access-om2#166


anton-seaice commented on September 6, 2024

The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.

HDF5 with MPI support is included by default when compiling netCDF with spack , while pnetcdf is off when building parallelio. If you want I can try to rebuild parallelio with pnetcdf support.

Thanks - this sounds ok. HDF5 is the one we want, and the ParallelIO library should be backward compatible without pnetcdf.

I am still getting the "NetCDF: Error initializing for parallel access" error when reading files (although I can generate netcdf4 files ok). The error text comes from the NetCDF library, but it looks like it could be an error from the HDF5 library. I can't see any error logs from the HDF5 library though. I wonder if building HDF5 in Build Mode: 'Debug' rather than release would generate error messages (or at least line numbers in the stack trace)?
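A sketch of what a debug HDF5 build might look like with spack (assuming the hdf5 package accepts the generic CMake build_type variant; untested here):

spack install hdf5+mpi build_type=Debug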


access-hive-bot commented on September 6, 2024

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1


anton-seaice commented on September 6, 2024

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1

I was way off on a tangent. The ParallelIO library doesn't like using a symlink to the initial conditions file, and this gives the get_stripe failed error.


anton-seaice commented on September 6, 2024

I raised an issue for the code changes needed for chunking and compression:
CICE-Consortium/CICE#914


anton-seaice commented on September 6, 2024

For anyone reading later, Dale Roberts and the OpenMPI developers both suggested setting the MPI IO library to romio321 instead of ompio (the default).

(i.e. mpirun --mca io romio321 ./cice)

This works and opens files through the symlink, but there is a significant performance hit. Monthly runs (with some daily output) have history timers in the ice.log of approximately double (99 seconds vs 54 seconds, 48 cores, 12 pio tasks, pio_type=netcdf4p).

It looks like ompio was deliberately chosen in OM2 (see https://cosima.org.au/index.php/category/minutes/ and COSIMA/cice5#34 (comment)), but the details are pretty minimal, so this doesn't seem like a good fix.

There is an open issue with OpenMPI still: open-mpi/ompi#12141


dsroberts commented on September 6, 2024

Hi @anton-seaice. I was going to email the following to you, but thought I'd put it here:

In my experience ROMIO is very sensitive to tuning parameters. If your lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns. For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400MB/s to 2.5-3GB/s sustained to a single file.

If your pio tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer. There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.
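As an illustration of this kind of tuning (the hint values below are placeholders, not recommendations), ROMIO hints can be supplied via a hints file pointed to by the ROMIO_HINTS environment variable:

# romio_hints.txt -- illustrative values only
cb_nodes 12
cb_buffer_size 16777216
romio_cb_write enable
striping_factor 1

export ROMIO_HINTS=/path/to/romio_hints.txt
mpirun --mca io romio321 ./cice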


anton-seaice commented on September 6, 2024

Hi @anton-seaice. I was going to email the following to you, but thought I'd put it here: In my experience ROMIO is very sensitive to tuning parameters. If your lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns. For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400MB/s to 2.5-3GB/s sustained to a single file. If your pio tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer. There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.

Thanks Dale.

The other big caveat here is that we only have the 1 degree resolution at this point, and in OM2 performance was worse with parallel IO (than without) at 1 degree but better at 0.25 degree. So it may be hard to really get into the details at this point.

The lustre stripe count is 1 (the files are <100 MB), but I couldn't figure out an easy way to check cb_nodes.
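For what it's worth, the stripe count of an existing file can be read with lfs, and ROMIO can (I believe) be asked to print the hints it actually used, including cb_nodes:

lfs getstripe -c GMOM_JRA.cice.h.0001-01.nc     # lustre stripe count of the history file
export ROMIO_PRINT_HINTS=1                      # if set, ROMIO prints the hints it uses when a file is opened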

CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

That said, it looks like using 1 PIO IO task (there are 48 PEs) and the box rearranger is fastest. With 1 PIO task, the box rearranger and ompio, the reported history time is ~12 seconds (vs about 15 seconds with romio321).

(For reference: config tested)


anton-seaice commented on September 6, 2024

OpenMPI will fix the bug, so the plan of action is:


aekiss commented on September 6, 2024

Could also be worth discussing with Rui Yang (NCI) - he has a lot of experience with parallel IO.


aekiss commented on September 6, 2024

CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

Would efficient parallel io also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Also (as in OM2) we'll probably use different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - guess this makes the role of the rearranger more important?
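For reference, these are namelist settings in the domain_nml group of ice_in; an illustrative sketch only (not the actual OM2/OM3 values):

&domain_nml
    processor_shape   = 'square-ice'
    distribution_type = 'roundrobin'
    distribution_wght = 'block'
/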


anton-seaice commented on September 6, 2024

Would efficient parallel io also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Possibly - we will have to revisit when the chunking is working, although with neatly organised data (i.e. in 1 degree where blocks are adjacent) it might not matter. If we stick with the box rearranger, then 1 chunk per IO task is worth trying. Of course we need to be mindful of read patterns just as much as write speed though.

Also (as in OM2) we'll probably use different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - guess this makes the role of the rearranger more important?

Using the box rearranger - this would send all data from one compute task to one IO task - but then the data blocks would be non-contiguous in the output and need multiple calls to the netcdf library. (Presumably set netcdf chunk size = block size)

Using the subset rearranger - the data from compute tasks would be spread among multiple IO tasks - but then the data blocks would be contiguous for each IO task and require only one call to the netcdf library. (Presumably set netcdf chunk size = 1 chunk per IO task)

Box would have more IO operations and subset would have more network operations. I don't know how they would balance out (and I would also guess the results are different depending on whether the tasks are spread across multiple NUMA nodes / real nodes etc.).
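To make the trade-off concrete, a rough worked example (assuming the nominal 360 x 300 1 degree grid and the 48 PEs / 12 IO tasks mentioned above; the block shapes are purely illustrative):

360 x 300 points / 48 PEs        -> ~2250 points per PE, e.g. ten 15 x 15 blocks
box rearranger, 12 IO tasks      -> each IO task gathers the blocks of 4 PEs: ~40 small writes, scattered in latitude
subset rearranger, 12 IO tasks   -> each IO task holds one contiguous 360 x 25 band: ~1 large write per task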

NB: The TWG minutes talk about this a lot. The suggestion is actually that one chunk per node will be best!

