Comments (26)

MichaelSchulzMETNO commented on September 27, 2024

@j34ni
This should certainly be tracked down. I wonder:

  • Is it a CAM or NorESM problem? Does it also happen in an AMIP configuration?
  • Can it be circumvented by manipulating the emissions without adding a file, e.g. by setting the SO2 emissions in one of the existing emission files to zero, or by adding the Pinatubo emissions in month 6 to the SO2 fluxes in an existing file?
  • Does this happen on both Vilje and Fram?
  • Is the output already bit-different from the first month?

j34ni commented on September 27, 2024
  • Is it a CAM or NorESM problem: this is hard to say. I only made short runs, and the problem was easy to spot after only a few months because there were already significant differences, for instance in the sea ice (which do not occur with prescribed SSTs). I have asked Ada about what happened for CMIP and how they diagnosed the issues at the time.

  • Can it be circumvented: probably, but would it not be more sensible to find a more permanent solution, since this bug is likely to have other consequences that are not yet understood?

  • Only tried on Fram, since I did not have CPU time on Vilje.

  • Not bit-identical from the first month: correct, and some differences are already clearly visible (sea-ice fraction, and probably other variables too).

DirkOlivie commented on September 27, 2024

Here are some comments:

  • The problems we encountered with emission files (such as (1) crashes or (2) the loss of bit-for-bit identity) often started in the middle of the month. The atm.log file is a place where one can follow the state of the model at every time step (and see when two simulations start to diverge).
  • Not all combinations of compsets and machines have been tested. However, a few results are:
    (1) The problem only appeared on Fram, not on Vilje.
    (2) On Fram, it happened for the fully coupled compsets when using 30 nodes (+/- the standard setup).
    (3) On Fram, it happened for the fixed-SST compsets when using 32 nodes, but not when using 16 nodes.
  • With the "frc2"-type compsets (which use fewer emission files), we avoided the crashes and the loss of bit-for-bit identity. Maybe it is an option to do the Pinatubo tests with the N1850frc2 compset.

MichaelSchulzMETNO commented on September 27, 2024

Suggestions from Thomas's email: @tto061 @j34ni @DirkOlivie

  1. Run a parallel test with prescribed SSTs and sea ice (e.g. the NF2000climo compset).
  2. Run a parallel test with CESM CAM (e.g. the F2000climo compset, assuming you can adjust your input to suit MAM; if not, please ignore).
  3. Run a parallel test without the land component (a QP compset; you'd need to reset all your inputs manually; I can probably help you with that).

j34ni commented on September 27, 2024

I ran an NF2000climo compset with and without additional zero-emission files, and the results differ as well!

tto061 commented on September 27, 2024

OK, thanks Jean. So we've ruled out sea ice. Do you think you can try test #2? Also, could you share your NorESM case directories and point to your NorESM root directory for these tests on Fram?

j34ni commented on September 27, 2024

I have not done this particular test on Fram, but on a virtual machine with the same run-time environment (same compiler version, same libraries, etc.), without a batch system or queuing time (and also with fewer computational resources).

Let me know if you want to look at particular files and I will put them somewhere accessible to you.

j34ni commented on September 27, 2024

As for CESM and the F2000climo compset, I ran it several times in similar conditions (f19 resolution) and it never crashed. It also does not give different results when other emission files with zeros are added.

j34ni commented on September 27, 2024

I forgot to mention that I did all the CESM tests with the latest release (cesm2.1.3); is it worth trying older versions, or should we focus on NorESM?

MichaelSchulzMETNO commented on September 27, 2024

I believe we should just test the newest NorESM-CAM6-Nor without "coupling" to the other components. My suspicion is that it is related to the emissions read in CAM, in combination with some other feature of the aerosol or CAM-Nor code.

A test could be to see if NF2000climo (CAM6-Nor) can be run with the MAM4 aerosol scheme. @DirkOlivie, is that possible? It would be interesting anyway.

j34ni commented on September 27, 2024

It seems to me that there are several problems which may or may not be related: (i) intermittent NorESM crashes (occurrence of NaNs and INFs), (ii) non-bit-for-bit reproducibility, and (iii) issues when reading the emission files.

DirkOlivie commented on September 27, 2024

The NF2000climo compset and the more recent CAM6-Nor compsets impose the use of the CAM-Oslo aerosol scheme (an essential part of the compset definition).
Have the frc2 compsets been tested in this context? Has a test been done without any emissions?

oyvindseland commented on September 27, 2024

If no one else does, I can check whether adding zeros when reading in existing files matters, i.e. before the numbers are scattered to the chunks.

oyvindseland commented on September 27, 2024

Hi,

I can elaborate a bit more on my comment above. The check that can be done is to add an extra input sector at the point where the input files are read in, but instead of reading in a zero file, just define the input array to be zero. The purpose would be to check whether it is the read-in process itself that causes the problem, or whether it is the definition of new sectors (see the sketch below).
If the results still differ, the addition of zeros can be done further down in the physics structure.
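
A minimal sketch of that check, assuming a hypothetical read-in hook; the names add_zero_sector, sector_data and n_sectors are illustrative only, not actual NorESM identifiers:

   ! Hypothetical sketch: append one extra emission sector whose data are
   ! zeroed directly in memory, bypassing the file read-in path entirely.
   subroutine add_zero_sector(sector_data, n_sectors)
      implicit none
      real(8), allocatable, intent(inout) :: sector_data(:,:)  ! (ncols, n_sectors)
      integer,              intent(inout) :: n_sectors
      real(8), allocatable :: tmp(:,:)

      ! Grow the sector array by one and zero the new slot; no file is read.
      allocate(tmp(size(sector_data,1), n_sectors+1))
      tmp(:,1:n_sectors) = sector_data
      tmp(:,n_sectors+1) = 0.0d0
      call move_alloc(tmp, sector_data)
      n_sectors = n_sectors + 1
   end subroutine add_zero_sector

If adding such an in-memory zero sector already breaks bit-for-bit identity, the read-in itself is exonerated and the problem lies in how new sectors are defined and scattered.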

monsieuralok commented on September 27, 2024

Update from 22/10/2019:
When I was executing the compset NFPTAERO60 with grid f19_f19_mg17, I was getting strange values for some field names, which causes a crash, at least when I compile with MPI+OpenMP.

I added the print statement in the following block from file ndrop.F90, around line 2172:

#ifdef OSLO_AERO
   ! Loop over all aerosol modes and their species and write the column
   ! tendencies once per tracer; the print statement is the added diagnostic.
   tendencyCounted(:) = .FALSE.
   do m = 1, ntot_amode
      do l = 1, nspec_amode(m)
         mm   = mam_idx(m,l)
         lptr = getTracerIndex(m,l,.false.)
         if (.NOT. tendencyCounted(lptr)) then
            print*, mm, fieldname(mm), 'ndrop'
            call outfld(fieldname(mm), coltend(:,lptr), pcols, lchnk)
            call outfld(fieldname_cw(mm), coltend_cw(:,lptr), pcols, lchnk)
            tendencyCounted(lptr) = .TRUE.
         endif
      end do
   end do
#endif

I get:

        8 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
      12 BC_AI_mixnuc1           ndrop
      13 OM_AI_mixnuc1           ndrop
      15 SO4_A2_mixnuc1          ndrop
      18 SO4_PR_mixnuc1          ndrop
      19 BC_AC_mixnuc1           ndrop
      20 OM_AC_mixnuc1           ndrop
      22 SO4_AC_mixnuc1          ndrop
      26 DST_A2_mixnuc1          ndrop
      34 DST_A3_mixnuc1          ndrop
      35 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop

When I printed fieldname(mm) for mm=8,14,35 at line 263 in ndrop.F90, it appears it is never assigned any value or initialized.

Second, it could be that the loop should not run over these indices at all. Please could you check and update? (A standalone sketch of a defensive check follows below.)
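
As a minimal standalone sketch (hypothetical code, not taken from NorESM), blank-initializing the name table would turn the garbage seen above into a condition that can be caught explicitly before calling outfld:

   ! Standalone illustration: unset entries become blanks instead of
   ! uninitialized memory, so they can be detected explicitly.
   program check_fieldnames
      implicit none
      integer, parameter :: nfields = 35
      character(len=24)  :: fieldname(nfields)
      integer            :: mm

      fieldname(:)  = ' '              ! blank-init the whole table up front
      fieldname(12) = 'BC_AI_mixnuc1'  ! set some entries; 8, 14, 35 left unset on purpose

      do mm = 1, nfields
         if (len_trim(fieldname(mm)) == 0) then
            ! In the model one would abort here (e.g. via CAM's endrun)
            ! instead of just printing.
            print *, 'fieldname not set for index ', mm
         end if
      end do
   end program check_fieldnames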

tto061 commented on September 27, 2024

Further adding to the picture: as far as I can tell, none of my integrations on Tetralith, including the NFHIST cases for CMIP6, are reproducible bit-for-bit with the default compiler options (i.e. -O2 for Fortran), either from existing restarts or from the default initial conditions. I have no reproducibility test with the -O0 option.

j34ni commented on September 27, 2024

I am investigating the bug with different tools (like the Intel Inspector) for memory and thread checking and debugging.
I think I am getting there.
That now seems to work on a virtual machine.

DirkOlivie commented on September 27, 2024

@j34ni A temporary solution might be to use a single 3D SO2 emission file, which would contain the standard 3D emissions plus the Pinatubo explosion emissions. Would you like me to create such a file?

j34ni commented on September 27, 2024

@DirkOlivie We can give it a go.

j34ni commented on September 27, 2024

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 I eventually got NorESM working in the Conda environment (with a GNU compiler) and have not managed to make it crash yet!

There may be something very wrong with the Intel 2018 compiler, as was already the case when I was running the variable-resolution CESM (for which I ended up using Intel 2019).

MichaelSchulzMETNO commented on September 27, 2024

@j34ni Did you / could you explain how one can run NorESM in a Conda environment? Is that in the NorESM2 documentation already? (I mean, that's really interesting to have!)

oyvindseland commented on September 27, 2024

@j34ni Really great news that you can run NorESM in a Conda environment. It is going to be interesting to see scaling results.

j34ni commented on September 27, 2024

@MichaelSchulzMETNO At the moment this has not been documented much; it is still work in progress, building on the "conda cesm" recipe. That was mainly used for teaching purposes (to learn how to run an ESM on Galaxy) and for development (without having to wait in a queue). However, a proper "conda noresm" will be made available soon, which will allow a simple installation and contain everything needed to run the model (including configuration files, the Math Kernel Library instead of BLAS/LAPACK, etc.), on generic platforms first and after that on HPC systems.

j34ni commented on September 27, 2024

@oyvindseland Yes, we will have to evaluate the scalability on an HPC system (so far it has only been used for small configurations on virtual machines with a single node); Betzy comes at the perfect time...

j34ni commented on September 27, 2024

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061
Some of the problems occur at the very beginning of a run: initialization issues (obviously) but also non-BFB reproducibility and even crashes due to NaNs or INFs.

To test that quickly:

  • create a new case (for example original_N1850_f19_tn14) and run the simulation for 1 day;
  • create a 1st branch from the original case (this needs a copy of the restart files & rpointers from the original case in the run directory) and continue the run for a single time step;
  • create a 2nd branch from the original case, add a couple of zero-emission files (again copying the restarts into the run directory), and run it for 1 time step;
  • continue the original simulation for 1 time step (CONTINUE_RUN=TRUE);
  • compare the original case with the 1st and 2nd branches.

I did that many times with CESM, and the 3 simulations systematically produce identical results.

So far with NorESM this has only worked with the GNU (for instance 9.3.0) and Intel (2019.5) compilers, not with Intel 2018, whether Alok's SourceMods are used or not.

That is not meant to replace a long run, but it is a much faster way to evaluate the effect of various fixes: if the 3 simulations do not give identical results after one time step, there is no need to waste more resources. However, even if they do give identical results, the simulation can still fail later.
