
geoschem / gchp_legacy


Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.

Home Page: http://wiki.geos-chem.org/GEOS-Chem_HP

License: Other

Fortran 55.00% Makefile 1.41% TeX 3.61% C 18.21% C++ 13.91% Shell 0.11% Perl 0.92% Assembly 0.01% XSLT 0.01% HTML 0.80% Python 1.14% Pawn 0.18% Awk 0.01% CMake 0.04% PostScript 3.78% Jupyter Notebook 0.86% Pascal 0.01% NASL 0.01%
atmospheric-chemistry atmospheric-composition atmospheric-modelling aws cloud-computing geos-chem scientific-computing

gchp_legacy's People

Contributors

jiaweizhuang, joeylamcy, lizziel, michael-s-long, msulprizio, sdeastham, spacemouse, yantosca


gchp_legacy's Issues

[BUG/ISSUE] Duration in HISTORY.rc ignored in dev/12.5.0; files written at duration = frequency

I am having trouble getting the output I would like with the frequency and duration settings in HISTORY.rc (set through runConfig.sh). I interpret the wiki documentation as saying that files are supposed to be written out every [duration], with data for every [frequency], so each file would have a time dimension of [duration]/[frequency]. Instead, what I'm finding with dev branch 12.5.0 is that it writes files at [frequency] and seems to ignore duration.
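
For concreteness, here is a sketch of the relevant runConfig.sh settings (the variable names are the ones used elsewhere in the run directory's runConfig.sh; the values are just an example of what I would expect to produce one file per day with hourly records):

common_freq="010000"   # [frequency]: one data record every hour
common_dur="240000"    # [duration]: expected to start a new output file once per day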

[BUG/ISSUE] Suggested fix for C-preprocessor error when using ifort and gcc5+

(I am using my Intel student license on AWS; this allows me to skip #15 for now and test high-resolution multi-node runs as quickly as possible.)

Problem

I got this compile error when using ifort18 + gcc5.4.0 on Ubuntu 16.04:

<stdin>:2046:22: error: C++ style comments are not allowed in ISO C90
<stdin>:2046:22: error: (this will be reported only once per input file)
/home/ubuntu/tutorial/gchp_standard/CodeDir/GCHP/Shared/Config/ESMA_base.mk:367: recipe for target 'ESMFL_Mod.o' failed
gmake[8]: *** [ESMFL_Mod.o] Error 1
gmake[8]: *** Waiting for unfinished jobs....
<stdin>:4653:54: error: C++ style comments are not allowed in ISO C90
<stdin>:4653:54: error: (this will be reported only once per input file)
/home/ubuntu/tutorial/gchp_standard/CodeDir/GCHP/Shared/Config/ESMA_base.mk:367: recipe for target 'MAPL_IO.o' failed
gmake[8]: *** [MAPL_IO.o] Error 1
gmake[8]: Leaving directory '/home/ubuntu/tutorial/Code.GCHP/GCHP/Shared/MAPL_Base'
GNUmakefile:65: recipe for target 'install' failed
gmake[7]: *** [install] Error 2
gmake[7]: Leaving directory '/home/ubuntu/tutorial/Code.GCHP/GCHP/Shared/MAPL_Base'

Full log: compile_ifort_icc_bug.log

Suggested fix

fe5e863 fixes the same issue for gfortran, but the intel block remains unchanged:

https://github.com/geoschem/gchp/blob/7a4589c276876b6674800f4e4137b575e4def4f5/Shared/Config/ESMA_base.mk#L301-L313

I made the same change for intel, and then the compile error was gone:

 ifeq ("$(ESMF_COMPILER)","intel")
   FREAL8      = -r8
   FREE        =
-  CPPANSIX    = -ansi -DANSI_CPP
+  CPPANSIX    = -std=gnu11 -nostdinc -C

Note that it doesn't matter whether icc or gcc is set as $CC. The preprocessor is hard-coded as gcc:

https://github.com/geoschem/gchp/blob/7a4589c276876b6674800f4e4137b575e4def4f5/ESMF/build/common.mk#L611

After changing CPPANSIX, GCHP compiles correctly with either CC=icc or CC=gcc (both failed originally).

On Odyssey, the default gcc version is 4.8.5, so there is no such problem there. I expect that all gcc versions 5 and later will have this issue.

Additional info

I found an old email from Daniel Rothenberg, which explained this issue well:

MAPL_Base was really tough to compile because of MAPL_IO.P90 and ESMFL_Mod.P90. The issue is the intermediate step of using cpp to pre-process them before Fortran compilation; a flag “-ansi” is hard-coded into this step in the Config/ESMA_base.mk makefile, which my local gcc (v6.3.1) really didn’t like. It kept failing to process the P90 files, throwing an error

C++ style comments are not allowed in ISO C90

but disabling these warnings didn’t work. The issue is that the P90 files contain Fortran string concatenation operators (“//“) which cpp can’t handle elegantly. Totally disabling comment removal with the “-C” flag worked, but produced license comments that ifort then choked on.

The only solution I came up with was to replace the pre-processing rule and instead force it to use Intel’s fpp, but for this to work correctly I had to hard-code it into the rule.
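
A quick way to see the comment-stripping behavior described above (a sketch; assumes GNU cpp is on the PATH and uses a throwaway test.P90 file):

# The Fortran concatenation operator "//" looks like the start of a C++ comment to cpp.
echo "msg = 'foo' // 'bar'" > test.P90

# With the old flags, cpp may strip everything after "//", corrupting the line:
cpp -ansi -DANSI_CPP test.P90

# With the suggested flags, -C preserves comments, so the text after "//" survives:
cpp -std=gnu11 -nostdinc -C test.P90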

[BUG/ISSUE] Large memory requirement at compile time prevents automated build

I am able to build the GCHP Docker image on a large EC2 instance (>10 GB RAM), but the automated build on Docker Hub fails because of Docker Hub's 2 GB RAM restriction.

Here's the full build log:
https://hub.docker.com/r/zhuangjw/gchp_model/builds/b4bvaupogcmwvy5dcc9nzdw/

Any idea why GCHP needs so much memory at compile time?

The workaround is to build Docker images locally (e.g. on AWS) and upload them to Docker Hub.
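
For reference, the manual workflow looks roughly like this (a sketch; the tag is illustrative):

# Build on a machine with enough RAM, then push the image by hand:
docker build -t zhuangjw/gchp_model:manual .
docker push zhuangjw/gchp_model:manual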

Alternatively I can try building Docker images on TravisCI. Travis has 7.5 GB RAM and should probably work.

[BUG/ISSUE] CMake's Fortran dependency scanner's preprocessing limitation

Hi everyone,

Sorry for the confusing name... I don't really know how to title this because it's a bit obscure.

After yesterday's update to feature/MAPL_v1.0.0 (compatibility with dev/12.4.0), I tried running GCHP's CMake build (which is based on feature/MAPL_v1.0.0) and I got an error I hadn't seen before. I've tracked down the problem—it's a limitation of CMake—and the solution is easy, but I think this is something the GCHP devs should know about.

tl;dr: When CMake generates Makefiles, a limitation of its dependency scanner is that it can't handle #if defined( xxxx ) statements in module declarations. Could we change the #if defined( MODEL_GEOS ) statement in Chem_GridCompMod.F90 to #ifdef MODEL_GEOS, and recommend developers only use #ifdef statements in module declarations in the future?

@lizziel: this relates to the preprocessor conditionals that facilitate GEOS and GEOS-Chem shared files (specifically, 122faa1)

The problem (explained)

When I tried to build GCHP I got the following error

Scanning dependencies of target GIGC 
[ 98%] Building Fortran object GCHP/CMakeFiles/GIGC.dir/GEOS_ctmEnvGridComp.F90.o 
[ 98%] Building Fortran object GCHP/CMakeFiles/GIGC.dir/gigc_historyexports_mod.F90.o 
[ 98%] Building Fortran object GCHP/CMakeFiles/GIGC.dir/gigc_providerservices_mod.F90.o
[ 98%] Building Fortran object GCHP/CMakeFiles/GIGC.dir/gigc_chunk_mod.F90.o 
[ 98%] Building Fortran object GCHP/CMakeFiles/GIGC.dir/Chem_GridCompMod.F90.o 
Error copying Fortran module "mod/geoschemchem_gridcompmod".  Tried "mod/GEOSCHEMCHEM_GRIDCOMPMOD.mod" and "mod/geoschemchem_gridcompmod.mod". 
GCHP/CMakeFiles/GIGC.dir/depend.make:58: recipe for target 'GCHP/CMakeFiles/GIGC.dir/Chem_GridCompMod.F90.o.provides.build' failed 
make[3]: *** [GCHP/CMakeFiles/GIGC.dir/Chem_GridCompMod.F90.o.provides.build] Error 1
GCHP/CMakeFiles/GIGC.dir/build.make:78: recipe for target 'GCHP/CMakeFiles/GIGC.dir/Chem_GridCompMod.F90.o.provides' failed
make[2]: *** [GCHP/CMakeFiles/GIGC.dir/Chem_GridCompMod.F90.o.provides] Error 2
CMakeFiles/Makefile2:121: recipe for target 'GCHP/CMakeFiles/GIGC.dir/all' failed
make[1]: *** [GCHP/CMakeFiles/GIGC.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I got this for CMake version 3.5 (GEOS-Chem's minimum requirement) and version 3.15 (latest release).

This error looked weird because Chem_GridCompMod.F90 declares the GEOSCHEMchem_GridCompMod module iff MODEL_GEOS is defined, which it isn't (I checked the verbose compile log).

https://github.com/geoschem/gchp/blob/71543ab21f7420f013fa62a4fa40108e5201e976/Chem_GridCompMod.F90#L44-L48

So, I dug around online and found a number of issues related to how CMake's Makefile generator engine does preprocessing for dependency scanning for Fortran files. IIUC when CMake is generating Makefiles, it does an "approximate" preprocessing [1,2] before scanning USE statements and MODULE declarations to create a dependency tree.

It looks like the problem is that this "approximate preprocessor" can't handle ()-enclosed conditions [1], and it looks like this is how Chem_GridCompMod.F90 is shared between GEOS and GEOS-Chem (see 122faa1).

What the "approximate preprocessor" does support (and what is tested in CMake's unit tests) are #ifdef statements.

Note: #ifdef xxxx and #if defined( xxxx ) are equivalent as long as there is only one definition.

References:

A solution

Unfortunately, because this is a limitation of CMake, it seems like a small change to Chem_GridCompMod.F90 is the easiest solution.

This change would be replacing the #if defined( MODEL_GEOS ) statement in the module declaration in Chem_GridCompMod.F90 with #ifdef MODEL_GEOS, and, in the future, recommending that developers only use #ifdef conditionals in module declarations.

In Chem_GridCompMod.F90:

-   44  #if defined( MODEL_GEOS )
+   44  #ifdef MODEL_GEOS
    45  MODULE GEOSCHEMchem_GridCompMod
    46  #else
    47  MODULE Chem_GridCompMod
    48  #endif

- 7604  #if defined( MODEL_GEOS )
+ 7604  #ifdef MODEL_GEOS
  7605  END MODULE GEOSCHEMchem_GridCompMod
  7606  #else
  7607  END MODULE Chem_GridCompMod
  7608  #endif

This fixes building GCHP with CMake. I believe #ifdef xxxx and #if defined( xxxx ) are equivalent as long as there's only one definition that's being checked.

Alternatives (if changing the code isn't an option)

Because this is shared code, I realize there could be reasons that changing the code isn't possible. If that's the case, here are two alternatives:

  • Use a different build system. The Ninja build system does full preprocessing before generating the dependency tree. GCHP builds fine with Ninja (rather than Make) without any modifications (I tried building with Ninja this morning); see the sketch after this list.
  • I could probably come up with a CMake-based workaround using regexes, but I'm very hesitant about this.
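
A minimal sketch of the Ninja route (assuming CMake and Ninja are installed; the source path is illustrative):

# Generate Ninja build files instead of Makefiles, then build:
cmake -G Ninja /path/to/CodeDir
ninja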

[BUG/ISSUE] GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run

I tried running a GCHP C48 run on Odyssey but the job hung right after printing out the GIGCenv timer results.

AGCM Date: 2016/07/01  Time: 01:00:00
 
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:      OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4

  Times for GIGCenv
TOTAL                   :       1.069
INITIALIZE              :       0.000
RUN                     :       0.418
etc.

HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc

The script I used to submit the job is:
gchp.run.txt

And here is the full log:
gchp.log.txt

A similar run (done by @lizziel) with ifort 17.0.4 instead of gfortran 8.2 finished OK. I am wondering if the gfortran compiler is not totally compatible with MAPL (or at least it seems to produce issues that we don't see when using ifort).

[FEATURE REQUEST] Allow createRunDir.sh to run silently (no interactive questions)

Problem

createRunDir.sh uses interactive questions. Although this is instructive for first-time users, it makes the installation process hard to automate. Automation is crucial in many scenarios: continuous integration (#36), creating new cloud images and containers, and allowing Spack installation and Conda installation (geoschem/geos-chem#47). Spack has a thread on the difficulty of handling interactive questions: spack/spack#7983. In general, removing all interactive parts of the installation process will make any kind of automatic deployment much easier.

So far I've been using a hacky script to automate the process of building GCHP. To handle the questions from createRunDir.sh, I simply pipe the answers to it:

rm $HOME/.geoschem/config
printf "$HOME/ExtData \n 2 \n 1 \n $HOME/tutorial \n gchp_standard \n n" | ./createRunDir.sh

Each item corresponds to: input data path, simulation type (2 = standard), metfield type (1 = GEOSFP), rundir parent path, rundir name, whether to version-control rundir.
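
The same answers can also be fed through a here-document, which is slightly easier to read (a sketch; the underlying fragility is unchanged):

./createRunDir.sh <<EOF
$HOME/ExtData
2
1
$HOME/tutorial
gchp_standard
n
EOF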

There are multiple problems with this:

  • The printf body is difficult to understand, as it only shows the answers, not the questions.
  • It is difficult to track the changes in the questions. For example 8a64556 adds the path to gFTL, so the new answer string should be changed to "$HOME/ExtData \n $HOME/gFTL/install ..."
  • The config file only answers part of the questions (file paths, but not simulation types), and the existence of this file will change the total number of questions asked.

Suggestions

Ideally, the script will have strictly two modes: interactive and silent, and nothing in between (no partial answers, to avoid incomplete & stale config files).

The config file should describe all the questions including simulation types, etc. Then, the rundir creation is simply:

./createRunDir.sh --silent=config

The config file may be created manually, or generated by running the script interactively.
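
Purely as an illustration (the keys below are hypothetical, not an existing format), such a config file could look like:

# ~/.geoschem/rundir.conf (hypothetical)
EXTDATA_DIR="$HOME/ExtData"
SIMULATION="standard"          # answer 2 in the interactive menu
MET_FIELD="geosfp"             # answer 1 in the interactive menu
RUNDIR_PARENT="$HOME/tutorial"
RUNDIR_NAME="gchp_standard"
TRACK_WITH_GIT="no"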

As a reference, the Intel compilers also use an interactive installation process by default, but they support Silent Installation as well.

[BUG/ISSUE] GCHP finishes successfully but drops a core file at the end of the run

GCHP 12.1.1 runs normally and prints out all timing information at the end, but nevertheless drops a core file at the end of the run. This might be particular to the libraries on our Odyssey cluster.

We are using

  • CentOS Linux release 7.4.1708 (Core)
  • gfortran/gcc 8.2.0
  • openmpi 3.1.1
  • netCDF 4.1.3
  • slurm 17.11.12

Not a huge deal but if anyone has seen the same issue with similar libraries, then please let us know. Have a hunch this was caused by a local update to SLURM.

[BUG/ISSUE] Mass not conserved when FV advection is off

In GEOS-Chem Classic the species concentration is scaled by the ratio of current and previous delta dry pressures across the box each timestep to conserve mass. In GCHP the scaling is applied by the finite volume cube-sphere dynamical core (FV3) and therefore is not necessary. However, if advection is turned off, the scaling is required.

There used to be functionality that the scaling was applied if (1) not the first timestep and (2) advection was off. The first timestep was skipped because the previous delta dry air pressure (from the last timestep) was not available to use in the scaling. This worked fine for single runs but caused discrepancies between multiple consecutive runs versus a single long run over the same time period. For this reason we turned off the scaling temporarily.

A better solution is to turn on the scaling if (1) advection is off, it is the first timestep, and the previous dry air pressure is in the restart file, or (2) it is not the first timestep and advection is off. This solution requires saving out the delta dry air pressures to the restart file, done by adding it to the MAPL internal state.

[BUG/ISSUE] Potentially incorrect warning about species initial values

I notice this warning message even if the restart file initial_GEOSChem_rst.c24_standard.nc exists:

 Initialized species from INTERNAL state: NO
    WARNING: using background values from species database
 Initialized species from INTERNAL state: O3
    WARNING: using background values from species database
 Initialized species from INTERNAL state: PAN
    WARNING: using background values from species database
 Initialized species from INTERNAL state: CO
    WARNING: using background values from species database

Is this a wrong warning? From the output files it seems like correct initial conditions are used.

See, for example, the log file ubuntu_timing_1day.log in #6

[BUG/ISSUE] Olson landmap data is being read in as all zeroes for GCHP simulations with gfortran (but not ifort)

Long story short, the Olson land map data seems to be coming in as all zeroes from the Import State for GCHP simulations that use gfortran. This issue is most certainly the root cause of previously-mentioned issues #13 and #14.

My GC-classic and GCHP code were at the following commits:

GEOS-Chem repository
   Path        : /local/ryantosca/GCHP/gf82/Code.12.1.1
   Branch      : bugfix/GCHP_issues
   Last commit : Add a better error trap in Compute_Olson_Landmap_GCHP
   Date        : Fri Dec 21 12:03:57 2018 -0500
   User        : Bob Yantosca
   Hash        : 7db80f35
   Git status  : 
GCHP repository
   Path        : /local/ryantosca/GCHP/gf82/Code.12.1.1/GCHP
   Branch      : bugfix/GCHP_issues
   Last commit : Now exit if Compute_Olson_Landmap_GCHP returns with failure
   Date        : Fri Dec 21 12:07:09 2018 -0500
   User        : Bob Yantosca
   Hash        : 7bd7110
   Git status  : M Chem_GridCompMod.F90

Note that I modified the code to put in an error trap that will exit if all elements of State_Met%LandTypeFrac are zero (or more precisely, when the variable maxFracInd is zero).

I have narrowed down the issue to this code section of GCHP/Chem_GridCompMod.F90, where the Olson land map data is obtained from the Import State and copied into State_Met%LandTypeFrac.

       IF ( FIRST ) THEN
       
          ! Set Olson fractional land type from import (ewl)
          If (am_I_Root) Write(6,'(a)') 'Initializing land type ' // &
                           'fractions from Olson imports'
          Ptr2d => NULL()
          DO T = 1, NSURFTYPE
       
             ! Create two-char string for land type
             landTypeInt = T-1
             IF ( landTypeInt < 10 ) THEN
                WRITE ( landTypeStr, "(A1,I1)" ) '0', landTypeInt
             ELSE
                WRITE ( landTypeStr, "(I2)" ) landTypeInt  
             ENDIF
             importName = 'OLSON' // TRIM(landTypeStr)
       
             ! Get pointer and set populate State_Met variable
             CALL MAPL_GetPointer ( IMPORT, Ptr2D, TRIM(importName),  &
                                    notFoundOK=.TRUE., __RC__ )
             If ( Associated(Ptr2D) ) Then
                If (am_I_Root) Write(6,*)                                &
                     ' ### Reading ' // TRIM(importName) // ' from imports'
                State_Met%LandTypeFrac(:,:,T) = Ptr2D(:,:)
             ELSE
                WRITE(6,*) TRIM(importName) // ' pointer is not associated'
             ENDIF
       
             ! Nullify pointer
             Ptr2D => NULL()
          ENDDO
          
          ! Compute State_Met variables IREG, ILAND, IUSE, and FRCLND
          CALL Compute_Olson_Landmap_GCHP( am_I_Root, State_Met, RC=STATUS )
          VERIFY_(STATUS)    ! <--- I added this error trap
       ENDIF

Also not shown above are some debug print statements.

I compiled and ran a C24 GCHP Rn-Pb-Be simulation with gfortran 8.2, using 6 cores of Odyssey. The modules were:

Currently Loaded Modules:
  1) git/2.17.0-fasrc01   5) gcc/8.2.0-fasrc01       9) hdf5/1.8.12-fasrc12
  2) gmp/6.1.2-fasrc01    6) openmpi/3.1.1-fasrc01  10) netcdf/4.1.3-fasrc02
  3) mpfr/3.1.5-fasrc01   7) zlib/1.2.8-fasrc07
  4) mpc/1.0.3-fasrc06    8) szip/2.1-fasrc02

And I got the following output in the gchp.log file:

  ### Reading OLSON01 from imports
 %%%LTF:            0           5          25   0.0000000000000000        0.0000000000000000     
  ### Reading OLSON02 from imports
  ### Reading OLSON03 from imports
 %%%LTF:            0           5          25   0.0000000000000000        0.0000000000000000     
 ... etc...
  ### Reading OLSON72 from imports
 %%%LTF:            0           5          25   0.0000000000000000        0.0000000000000000     
===============================================================================
GEOS-Chem ERROR: Invalid value of maxFracInd: 0!  This can indicate a problem 
reading Olson data from the Import State, and that State_Met%LandTypeFrac 
array has all zeroes.
 -> at Compute_Olson_Landmap_GCHP (in module GeosCore/olson_landmap_mod.F90)
===============================================================================

GIGCchem::Run_                                1863
GIGCchem::Run2                                1281
GCHP::Run                                      420
MAPL_Cap                                       792
===> Run ended at Fri Dec 21 12:25:57 EST 2018

The error message is from the new error trap that I committed to the bugfix/GCHP_issues branches of the gchp and geos-chem repos on Github. The numbers in each line beginning with "%%%LTF" indicate the core number, I & J value, and the sum of State_Met%LandTypeFrac for that I,J and Olson type. As you can see all of the Olson values are coming into State_Met%LandTypeFrac as zeroes.

It seems that the issue is happening somewhere in MAPL, as reading in the Olson data uses a new feature of MAPL to return the fraction of the grid box that is covered by land type N, where N is an integer.

In MAPL_ExtDataGridCompMod.F90 there is this snippet, where there are calls down to MAPL_CFIORead:

     if (present(field)) then
        if (trans == MAPL_HorzTransOrderBilinear) then
           call MAPL_CFIORead(name, file, time, field, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                __RC__) 
        else if (trans == MAPL_HorzTransOrderBinning) then
           call MAPL_CFIORead(name, file, time, field, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., __RC__) 
        else if (trans == MAPL_HorzTransOrderSample) then
           call MAPL_CFIORead(name, file, time, field, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., Voting = .true., __RC__) 
        else if (trans == MAPL_HorzTransOrderFraction) then
           call MAPL_CFIORead(name, file, time, field, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., getFrac = val, __RC__) 
        end if

     else if (present(bundle)) then

        if (trans == MAPL_HorzTransOrderBilinear) then
           call MAPL_CFIORead(file, time, bundle, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                __RC__)
        else if (trans == MAPL_HorzTransOrderBinning) then
           call MAPL_CFIORead(file, time, bundle, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., __RC__)
        else if (trans == MAPL_HorzTransOrderSample) then
           call MAPL_CFIORead(file, time, bundle, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., Voting = .true., __RC__)
        else if (trans == MAPL_HorzTransOrderFraction) then
           call MAPL_CFIORead(file, time, bundle, &
                time_is_cyclic=.false., time_interp=.false., ignoreCase = ignoreCase, &
                Conservative = .true., getFrac = val, __RC__)
        end if

     end if

The relevant calls are the ones where trans == MAPL_HorzTransOrderFraction. In MAPL_CFIO, there are further calls to MAPL_HorzTransformRun, which is where I suspect the error may be happening. MAPL_HorzTransformRun is an overloaded interface for several other module procedures.

Unfortunately, at this point my knowledge of the innards of MAPL is not very comprehensive. If anyone has any other suggestions to try, then please let me know. My guess is that deep into MAPL there is some code that gfortran isn't parsing properly, or for which an unexpected side-effect is occurring.

NOTE: This could potentially be caused by the ESMF version which is 5.2. ESMF 5.2 pre-dates the newest versions of gfortran, so there could conceivably be some incompatibility. But who knows.

THE UPSHOT: Until we find & fix this issue, we should not use gfortran for GCHP simulations. While GCHP can run on the AWS cloud in tutorial mode, the error is still present and you will get erroneous output.

I verified that compiling and running GCHP with ifort 17 correctly reads the Olson land map values from the Import State into State_Met%LandTypeFrac. So this issue only happens with GNU Fortran.

Also, I will mark #13 and #14 as closed, as this issue is the root cause.

[BUG/ISSUE] Fullchem run failure in 12.7.0+ at c180+ due to reduced timesteps

This issue is the same as geoschem/geos-chem#219 for the GEOS-Chem repository. I traced that HEMCO issue to a GCHP issue that has gone unnoticed since versions prior to 12.7.0. An update that went into 12.7.0 brought the problem to light since it resulted in run crash.

The problem is that logical isChemTime is false in the warm GEOS-Chem restart phase at the beginning of the run if the default timesteps are reduced. This now happens automatically in GCHP starting at c180. isChemTime is used to determine what GEOS-Chem components are turned on for the run per timestep. If it is false then several components do not run, including emissions, and this causes a new update in 12.7.0 that uses emissions year from the HcoState%Clock object to fail.

This can be diagnosed by looking at the log for a c24 run, comparing a run with default timesteps with one with reduced timesteps. For c24 with default timesteps (20 min chem, 10 min dyn):

Doing warm GEOS-Chem restart
GEOS-Chem phase           -1 :
DoConv   :  T
DoDryDep :  T
DoEmis   :  T
DoTend   :  F
DoTurb   :  T
DoChem   :  T
DoWetDep :  T

However, at c24 with lowered timesteps (10 min chem, 5 min dyn) you get this instead. The ones that are false here but true above are all because of IsChemTime.

Doing warm GEOS-Chem restart
GEOS-Chem phase           -1 :
DoConv   :  T
DoDryDep :  F
DoEmis   :  F
DoTend   :  F
DoTurb   :  T
DoChem   :  F
DoWetDep :  T

IsChemTime is set using an ESMF alarm and here is the code (with out-of-date comments that need updating! The files referenced are for GEOS):

    ! Query the chemistry alarm.
    ! This checks if it's time to do chemistry, based on the time step
    ! set in AGCM.rc (GEOSCHEMCHEM_DT:). If the GEOS-Chem time step is not
    ! specified in AGCM.rc, the heartbeat will be taken (set in MAPL.rc).
    ! ----------------------------------------------------------------------
    CALL MAPL_Get(STATE, RUNALARM=ALARM, __RC__)
    IsChemTime = ESMF_AlarmIsRinging(ALARM, __RC__)

The timesteps in the run logs all look right, so unless there is a new timestep parameter in the config needed for new MAPL this is probably not the culprit:

GCHP.rc:

<   HEARTBEAT_DT: 300
---
>   HEARTBEAT_DT: 600
25,29c25,29
< SOLAR_DT: 300
< IRRAD_DT: 300
< RUN_DT:   300
< GIGCchem_DT: 600
< DYNAMICS_DT: 300
---
> SOLAR_DT: 600
> IRRAD_DT: 600
> RUN_DT:   600
> GIGCchem_DT: 1200
> DYNAMICS_DT: 600

CAP.rc:

< HEARTBEAT_DT:  300
---
> HEARTBEAT_DT:  600

I am going to continue to look at this. If anyone has ideas, please put your thoughts here.

[DISCUSSION] Stretched grid or nested grid for local mesh refinement

Had a conversation with @LiamBindle and just want to put together my thoughts on whether to choose a stretched grid or a nested grid. I am still not fully convinced that grid stretching is the right choice. It is indeed the easiest solution from a software engineering perspective. But since the potential primary developer @LiamBindle is good at software engineering, implementing the alternative (a nested grid) might not be a super big challenge. More details below.

Major references

FV3 main website:

The original papers on nested and stretched grids:

Cubed-sphere and stretched grid illustration:

Pros and Cons of the two approaches

Reasons to favor the stretched cubed-sphere:

  • Easy to implement. It requires minimal modification to the software architecture (GMAO MAPL). The MPI domain decomposition would work exactly the same as on the normal cubed-sphere grid. On the contrary, the nested version requires more fundamental changes to MAPL, to allow two different domains to run at the same time (also some load-balancing issues).
  • It automatically enables a "two-way" feedback between the coarse and the fine grids. The two-way nesting requires more careful tuning to get the dynamics right (reflecting waves, etc.). I don't think offline advection will have those problems, though.

Reasons to favor the nested cubed-sphere:

  • Fewer regridding problems. I don't think we will have a native stretched metfield archive, so we will probably need to regrid the metfield from the normal cubed-sphere grid (or even the current lat-lon grid) to the stretched cubed-sphere. I am seriously concerned about how the wind field (and mass fluxes in the future) is processed. Recall the early wind regridding & transport problem around the poles, due to lat-lon to cube regridding. The nested grid doesn't have this problem because both the inner and the outer regions are still normal cubed-sphere grids, and the wind regridding does not change the direction of the vector basis. Vector regridding is not a very well studied problem (for recent work see Pletzer and Hayek, 2019), and gets even trickier when coupled with advection.
  • More flexible time stepping. The stretched grid requires a uniform time step for all regions. The CFL is basically limited by the smallest cell. The time step on the coarser side will be unnecessarily short. This wastes computing resources. On the contrary, a nested domain can use a different time step.
  • Easier to handle resolution-dependent parameterizations. With grid nesting you still get a chance to set different parameters for the global and the nested domains. With grid stretching you almost have to use the same scheme everywhere. This might be less a problem with the recent implementation of the (somewhat) grid-independent convection and emission. But still worth pointing out.
  • It is a more flexible grid refinement. For example, to study the US-China transport problem, you can place two nested domains over the two countries. With stretching you can only zoom-in to one country; the other side has to be coarse.

My take-aways

Overall I feel that the stretched grid will involve more verification on the numerical scheme but less software engineering disasters, while the nested grid will involve more software engineering but less numerical troubles. For scientific uses I feel that grid nesting is a bit more flexible.

I recommend talking to the GFDL-FV3 team before starting any real work. They will have a much better idea on how the two methods compare...

It would also be useful to collect opinions from users, to see which one is more in-demand.

[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off

Have been looking at this issue.

Ran on the Amazon cloud in r5.2xlarge with AMI ID: GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66)

In HISTORY.rc I turned on only these collections
(1) SpeciesConc_avg : only archived SpeciesConc_NO
(2) SpeciesConc_inst : only archived SpeciesConc_NO
(3) StateMet_avg : only archived Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: only archived Met_AD

This run (1 hour) on 6 cores finished with all timing information:

GIGCenv: total 0.346
GIGCchem total: 123.970
Dynamics total: 18.741
GCHP total: 140.931
HIST total: 0.264
EXTDATA total: 133.351

So I am wondering if this is a memory issue. If we select less than a certain amount of diagnostics the run seems to finish fine. Maybe this is OK for the GCHP tutorial but there doesn't seem to be too much rhyme or reason as to why requesting more diagnostics fails. Maybe the memory limits in the instance? I don't know.

This AMI was built with mpich2 MPI. Maybe worth trying with OpenMPI on the cloud?

Also note: This run finished w/o dropping a core file (as currently happens on Odyssey). So this appears to be an Odyssey-specific environment problem.

But if I run with no diagnostics turned on then the run dies at 10 minutes

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     - RDAER: Using online SO4 NH4 NIT!
     - RDAER: Using online BCPI OCPI BCPO OCPO!
     - RDAER: Using online SALA SALC
     - DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!
  
 Setting history variable pointers to GC and Export States:
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.638E+03  4.409E+03      2.223E+03  2.601E+03  3.258E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.823E+04  0.000E+00
MAPL_ExtDataInterpField                       3300
EXTDATA::Run_                                 1471
MAPL_Cap                                       777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0

From the traceback it looks as if it's hanging in interpolating a field in ExtData.

[BUG/ISSUE] Slow memory leak in GCHP 12.5.0

GCHP 12.5.0 currently has a slow memory leak due to updates to MAPL ExtData. NASA is aware of the issue and will be developing a fix soon. The memory leak is more pronounced for long simulations and can result in a simulation crashing due to exceeding memory available. Breaking up long runs into smaller consecutive runs avoids the issue but we find that doing so changes the output. Lizzie Lundgren (GCST) is investigating differences in GCHP output that happen when restarting a simulation, which may or may not be a problem with GEOS-Chem Classic as well. We are documenting all single versus multi-run differences at http://ftp.as.harvard.edu/gcgrid/geos-chem/validation/multi_vs_single_run/. Look there to assess if the scale of the differences would be a problem for you.

[BUG/ISSUE] Error during compile: /usr/bin/ld: cannot find -lmpi_cxx. What is libmpi_cxx.so?

Hi,

I'm trying to compile GCHP in a singularity container that I made (similar to @JiaweiZhuang's singularity container but with Open MPI 2.1.2 instead of MPICH) and I am running into trouble during my "make compile_clean". Specifically, during compilation in <GEOS-Chem source code>/GCHP/ESMF/src/apps/ESMF_Info, I receive the following error:

$  CODE_DIR=<abs path to my CodeDir symlink>
$  mpif90  -fno-second-underscore -m64 -mcmodel=small -pthread -L$CODE_DIR/GCHP/ESMF/Linux/lib  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/ -Wl,-rpath,$CODE_DIR/GCHP/ESMF/Linux/lib  -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.8.5/ -o $CODE_DIR/GCHP/ESMF/Linux/bin/binO/Linux.gfortran.64.openmpi.default/ESMF_Info $CODE_DIR/GCHP/ESMF/obj/objO/Linux.gfortran.64.openmpi.default/src/apps/ESMF_Info/ESMF_Info.o -lesmf  -lmpi_cxx -lrt -lstdc++ -ldl
/usr/bin/ld: cannot find -lmpi_cxx
collect2: error: ld returned 1 exit status

My Open MPI install

I have installed Open MPI 2.1.2 to /usr/local (which is my $MPI_ROOT) and my /usr/local/lib has the following libraries:

$  ls /usr/local/lib
libmca_common_sm.la          libmpi_usempi.la          libopen-rte.la
libmca_common_sm.so          libmpi_usempi.so          libopen-rte.so
libmca_common_sm.so.20       libmpi_usempi.so.20       libopen-rte.so.20
libmca_common_sm.so.20.10.1  libmpi_usempi.so.20.10.1  libopen-rte.so.20.10.2
libmpi.la                    libompitrace.la           liboshmem.la
libmpi_mpifh.la              libompitrace.so           liboshmem.so
libmpi_mpifh.so              libompitrace.so.20        liboshmem.so.20
libmpi_mpifh.so.20           libompitrace.so.20.10.0   liboshmem.so.20.10.2
libmpi_mpifh.so.20.11.1      libopen-pal.la            mpi.mod
libmpi.so                    libopen-pal.so            openmpi
libmpi.so.20                 libopen-pal.so.20         pkgconfig
libmpi.so.20.10.2                 libopen-pal.so.20.10.2

My Question

Do you know what libmpi_cxx.so is or where I should be able to find it? Is it simply an MPI runtime library that I could create a symlink to?

The reason I am posting this question to GCHP, rather than Open MPI, is because I noticed that changing $ESMF_COMM to the generic "mpi" changes the library in question to "libmpic++.so". This makes me think that there is some difference between what GCHP's Makefile is expecting, and what my system actually looks like.

I have attached my Singularity file for reference.

Thanks in advance,

Liam

[FEATURE REQUEST] Move GCHP run directory creation to GEOS-Chem

It is cumbersome to maintain GEOS-Chem Classic and GCHP run directory files in separate places. The GEOS-Chem run directory files will move to the GEOS-Chem repository in an upcoming version. Maintaining the run directories per GEOS-Chem version would therefore be easiest if the GCHP run directory files also moved to the same location. GEOS-Chem Classic and GCHP run directory updates would then be trackable together alongside GEOS-Chem source code version updates.

[BUG/ISSUE] Non-species Internal and GEOS-Chem state var precisions differ

Differences are introduced in multiple consecutive runs versus one single run due to low precision of internal state variables set from Chem_Registry.rc.

Background: Non-species internal state variables in GCHP are created using the Chem_Registry.rc file in the GCHP/Registry directory. This file is parsed using GCHP/Shared/Config/bin/mapl_acg.pl, a perl script that generates a header file with the calls to MAPL_AddInternalSpec for each of the variables listed as internal state variables in Chem_Registry.rc. This header file, along with those created for imports and exports, is included in Chem_GridCompMod.F90.

The way GCHP is currently written, variables defined in Chem_Registry.rc are given precision REAL4. This causes discrepancies across consecutive runs if the GEOS-Chem state object is REAL8.

This problem is demonstrated by the addition of DELPDRY in dev/12.7.0, and the other variables need to be looked at to determine if they have the same issue.

One fix is to explicitly add code for the call to MAPL_AddInternalSpec to Chem_GridCompMod.F90 rather than auto-generate it using mapl_acg.pl. To reduce the file length, and to allow different options such as for GEOS, this could be put in a .H file. Alternatively, the call to mapl_acg.pl may be able to be altered, or precision may be able to be added to Chem_Registry.rc. More investigation is needed before deciding on the ultimate fix.

[QUESTION] Do not read emission data when emission is turned off

I remember that in early days, ExtData would skip emission data files if emission is set to F. But now it seems that emission data are always being read:

Two runs take the same time. For short simulations (1~2 days) ExtData takes nearly half of the total time. Is there a way to skip emission files without messing around with ExtData.rc? (I remember that setting the data path to /dev/null would effectively skip a file, but that's a cumbersome approach.) This would make short benchmarking & debugging much more efficient...

[BUG/ISSUE] HDF5 error due to too many open files

Describe the bug

I got this HDF5 error after 5 days of a C180 simulation with 288 cores:

 AGCM Date: 2016/07/06  Time: 00:00:00

 Writing:   6975 Slices (  3 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.Emissions.20160706_0000z.nc4
 Writing:  11736 Slices (  4 Nodes,  4 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc.20160706_0000z.nc4
 Writing:    439 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160706_0000z.nc4
 Writing:    439 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160706_0000z.nc4
There are 248958 HDF5 objects open!

Report: open objects on 72057594037930872
Type = File(72057594037927938) name='/'Type = File(72057594037927939) name='/'Type = File(72057594037927940) name='/'Type = File(72057594037927941) name='/'Type = File(72057594037927942) name='/'Type = File(72057594037927943) name='/'Type = File(72057594037927944) name='/'Type = File(72057594037927945) name='/'Type = File(72057594037927946) name='/'Type = File(72057594037927947) name='/'Type = File(72057594037927948) name='/'Type = File(72057594037927949) name='/'Type = File(72057594037927950) name='/'Type = File(72057594037927951) name='/'Type = File(72057594037927952) name='/'Type = File(72057594037927953) name='/'Type = File(72057594037927954) name='/'Type = File(72057594037927955) name='/'Type = File(7
...

Here's the complete log: run_c180_7days_N8n288_hdf5_error.log

Given that the error occurs after 5 days of simulation, I suspect that GCHP keeps opening new files without closing previous ones.
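
A way to check whether file handles really are accumulating during the run (a sketch; assumes Linux /proc, with <PID> being one of the gchp process IDs on a node):

ulimit -n                   # per-process open-file limit on the node
ls /proc/<PID>/fd | wc -l   # number of files that process currently holds open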

To Reproduce
Steps to reproduce the behavior:

  1. Same setup as #37
  2. Apply #20 to avoid writing huge checkpoints
  3. Run simulation with 8 c5n.18xlarge EC2 nodes. In runConfig.sh:
NUM_NODES=8
NUM_CORES_PER_NODE=36
NY=48
NX=6
  4. Use default diagnostics containing 4 collections: Emissions, SpeciesConc, StateMet_avg, StateMet_inst. But change the frequency to one-write-per-day:
common_freq="240000"
common_dur="240000"
common_mode="'instantaneous'"

Required information

  • GEOS-Chem/GCHP version: 12.3.2
  • Compiler version that you are using: ifort 19.0.4
  • MPI library and version: OpenMPI 3.1.4
  • netCDF 4.7.0, netCDF-Fortran 4.4.5
  • Computational environment: AWS cloud, CentOS7
  • Are you using "out of the box" code: no changes

[FEATURE REQUEST] Script to extract important error messages from compile log

I end up using this script a lot to extract important error messages from GCHP's extremely dense compile log:

#!/bin/bash

if [[ -f "$1" ]]; then
  logfile=$1
else
  echo "target file does not exist"
  exit 1
fi

# Declare an array of string: https://linuxhint.com/bash_loop_list_strings/
declare -a StringArray=( "Fatal error" "No rule to make target" "No such file or directory" "cannot find")

for str in "${StringArray[@]}"; do
  echo 'lines containing': \'$str\':
  grep --color -i -n "$str" "$logfile"
  echo '================'
done

Just put it here in case that's useful for anyone.

For example, below are what you get by running the script on the compile logs from geoschem/GCHP#17 (comment)

$ ./find_errors.sh compile_gchp12.5.0_fail_at_using_eta.log
lines containing: 'Fatal error':
14115:Fatal Error: Can't open module file 'm_set_eta.mod' for reading at (1): No such file or directory
================
lines containing: 'No rule to make target':
================
lines containing: 'No such file or directory':
502:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
512:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
516:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
5990:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
6217:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
14115:Fatal Error: Can't open module file 'm_set_eta.mod' for reading at (1): No such file or directory
================
lines containing: 'cannot find':
6335:/usr/bin/ld: cannot find -lblas
6336:/usr/bin/ld: cannot find -llapack
6338:/usr/bin/ld: cannot find -lblas
6339:/usr/bin/ld: cannot find -llapack
6344:/usr/bin/ld: cannot find -lblas
6345:/usr/bin/ld: cannot find -llapack
6348:/usr/bin/ld: cannot find -lblas
6349:/usr/bin/ld: cannot find -llapack
6352:/usr/bin/ld: cannot find -lblas
6353:/usr/bin/ld: cannot find -llapack
6355:/usr/bin/ld: cannot find -lblas
6356:/usr/bin/ld: cannot find -llapack
6360:/usr/bin/ld: cannot find -lblas
6361:/usr/bin/ld: cannot find -llapack
6364:/usr/bin/ld: cannot find -lblas
6365:/usr/bin/ld: cannot find -llapack
================

From the output I can quickly find that there is an error with using m_set_eta.mod, as well as an error with linking -lblas. The true error message is often very far from the end of the compile log (shown by the line numbers at the beginning of each line).

Note that even a successful compile log can still have error messages (hopefully they are harmless?):

$ ./find_errors.sh ./compile_gchp12.5.0_success.log
lines containing: 'Fatal error':
================
lines containing: 'No rule to make target':
================
lines containing: 'No such file or directory':
502:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
512:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
516:cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory
5990:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
6217:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
6447:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
6456:cat: /gchp/Code.GCHP/GCHP/Shared/Linux/etc/CVSTAG: No such file or directory
================
lines containing: 'cannot find':
================

[BUG/ISSUE] MAPL needs several dirty fixes (changing $VPATH, $CPATH) to compile

This is just to document several dirty fixes to get around MAPL compile errors. I originally thought that they were only needed for gfortran, but it turns out that ifort needs exactly the same fixes (tested on Ubuntu 16.04/18.04). No idea why such fixes are not needed on Odyssey.

export GC_CODE_DIR=$HOME/tutorial/Code.GCHP  # change this as needed

# --- Fixes for Makefile ---
# Dependency files *.d in MAPL are messed up in that
# 1. they directly specify library headers like mpi.h as dependencies, which should have been handled by an MPI wrapper like mpicc;
# 2. they cannot find *.mod files that are in another directory.
# VPATH allows make to search additional directories, although this is not a good practice.
# This prevents the error "No rule to make target"
export VPATH=/usr/include/x86_64-linux-gnu:$VPATH # sys/resource.h
export VPATH=$GC_CODE_DIR/GCHP/Shared/MAPL_Base:$VPATH # mapl_*.mod
export VPATH=$GC_CODE_DIR/GCHP/ESMF/Linux/mod:$VPATH # esmf.mod

# --- Fixes for header files ("#include <xxx.h>" statement) ---
# This prevents the error "cannot find..."
export CPATH=$GC_CODE_DIR/GCHP/Shared/MAPL_Base:$CPATH # MAPL_Generic.h
export CPATH=$GC_CODE_DIR/GCHP/Shared/GFDL_fms/shared/include:$CPATH # fms_platform.h

# --- Fixes for module files ("USE xxx" statement) ---
# zonal.f is looking for mapl_constantsmod.mod in its own directory. Even /usr/include/ is not searched
ln -sf $GC_CODE_DIR/GCHP/Shared/MAPL_Base/mapl_constantsmod.mod $GC_CODE_DIR/GCHP/Shared/GEOS_Util/plots/

Maybe the new MAPL will handle these better...

[BUG/ISSUE] f2py fails while building GFIO_ sources if GC_F_* are defined [feature/MAPL_v1.0.0]

In the MAPL v1 feature branch, an error will be thrown while compiling if the environment variable GC_F_INCLUDE (and possibly the other GC_F_* variables) is defined. Specifically, I get the following shortly after the "building "GFIO_" extension" line in the compile log:

error: unknown file type '' (from '[path/to/include/dir]')

(with my include directory, naturally). This is "fixed" if the environment variable is unset, but that will cause issues when compiling GEOS-Chem if the NetCDF include directories are different for NetCDF-Fortran and NetCDF-C. Note: this error does not appear to be fatal to building GCHP.

[BUG/ISSUE] Run crashes in MAPL when running full chemistry simulation at c360

@lizziel @yantosca
When I use 900 or more cores for a c360 simulation, it shows this error:

ERROR: cannot create ESMF regridder for var in file. This may be because source grid is too coarse for number of cores. Try reducing number of cores for the simulation.
./MainDataDir/IODINE/v2017-09/CH3I_monthly_emissions_Ordonez_2012_COARDS.nc

I tried reducing the number of cores (e.g. to 720), but then the run is cut off, which I think may be a memory issue.
Is there a suggested number of cores for running at c360?
Or could a finer-resolution source file be offered, so that it does not limit the number of cores for the simulation?

[BUG/ISSUE] Incorrect regridding if file latitude data ends in +/- 90

Describe the bug
If data is read from a file where the first or last latitude is +/- 90 degrees (as listed in the file's "lat" variable), then the data can be regridded badly. This is clearest when running at very high resolutions.

To Reproduce
Steps to reproduce the behavior:

  1. Run GCHP at C180 with MERRA-2 input data
  2. Wait for failure in wet deposition, due to a negative box height (resulting from bad met read-in). Specifically, conditions at the poles end up outside of the possible range of temperatures (over 400 K) and pressures (over 10,000 hPa)

Expected behavior
MERRA-2 meteorological data should be read correctly.

Required information

  • GEOS-Chem/GCHP version 12.6.0
  • gfortran 8.2.0
  • OpenMPI 3.1.1
  • NetCDF 4.1.3
  • Running on Odyssey (huce_intel)
  • Error produced with both 16 nodes or 20 nodes (24 cores each)

Additional context
I suspect this error will affect any lat/lon input file which has latitudes including +/- 90.
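
A quick way to check whether an input file's latitude axis ends at the poles (a sketch; ncdump ships with netCDF, and the filename here is only illustrative):

# The first and last values printed for "lat" show whether the grid includes +/- 90:
ncdump -v lat MERRA2.20160701.A1.05x0625.nc4 | tail -n 5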

[BUG/ISSUE] $ARCH environment variable breaks ESMF compilation

I suggest changing this code

https://github.com/geoschem/gchp/blob/07f4130cdee6a784698b885339a54ba65ede2717/GIGC.mk#L84-L87

to: (no conditional)

ARCH := $(shell uname -s) 

or simply: (since we won't compile GCHP on non-Linux anyways)

ARCH := Linux

On some systems (e.g. the Amazon Linux AMI), ARCH is set to amd64, so ESMF/build/common.mk will look for the nonexistent amd64.gfortran.default in ESMF/build_config/. Forcing ARCH to Linux ensures that Linux.gfortran.default is always used.
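
A quick way to confirm the mismatch on an affected system (a sketch):

echo $ARCH   # e.g. amd64, inherited from the environment on Amazon Linux
uname -s     # Linux, which is what the ESMF build_config directories expect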

The full log with the incorrect ARCH, for the record: compile_error.log

[BUG/ISSUE] "-lhdf5_hl -lhdf5" flags break compilation

Describe the bug

During compilation, some commands contain -lhdf5_hl -lhdf5 but don't specify the HDF5 library path via -L:

mpif90 -L/home/centos/tutorial/Code.GCHP/GCHP/Shared/Linux/lib -L/home/centos/tutorial/Code.GCHP/GCHP/Shared/Linux/lib  -o sst_sic_EIGTHdeg.x libNSIDC-OSTIA_SST-ICE_blend.a -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-fortran-4.4.5-uv5xocaoik42r6odukzzhjixymhytovx/lib -lnetcdff -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5fkurucr6jdlwztewiqmqkxen4vvm7xa/lib -lnetcdf -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5fkurucr6jdlwztewiqmqkxen4vvm7xa/lib -lnetcdf -ldl -lc -lpthread -lrt  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lstdc++
mpif90 -L/home/centos/tutorial/Code.GCHP/GCHP/Shared/Linux/lib -L/home/centos/tutorial/Code.GCHP/GCHP/Shared/Linux/lib  -o sst_sic_QUARTdeg.x libNSIDC-OSTIA_SST-ICE_blend.a -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-fortran-4.4.5-uv5xocaoik42r6odukzzhjixymhytovx/lib -lnetcdff -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5fkurucr6jdlwztewiqmqkxen4vvm7xa/lib -lnetcdf -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5fkurucr6jdlwztewiqmqkxen4vvm7xa/lib -lnetcdf -ldl -lc -lpthread -lrt  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lstdc++
ld: cannot find -lhdf5_hl
ld: cannot find -lhdf5
ld: cannot find -lhdf5_hl
ld: cannot find -lhdf5
gmake[13]: *** [sst_sic_EIGTHdeg.x] Error 1
gmake[13]: *** Waiting for unfinished jobs....
gmake[13]: *** [sst_sic_QUARTdeg.x] Error 1

Full compile log: compile_hdf5_error.log

This can be solved by a dirty fix that symlinks the HDF5 libraries into the NetCDF lib directory:

ln -s $(spack location -i hdf5)/lib/* $(spack location -i netcdf)/lib/

Successful compile log: compile_success.log

To Reproduce

  • Install libraries via spack -v install netcdf-fortran %intel ^hdf5+fortran+hl ^intel-mpi
  • Grab GCHP 12.3.2 code, apply #35, and run make build_all

The environment config is:

source $(spack location -i intel)/bin/compilervars.sh -arch intel64  # enable icc/ifort
module load intelmpi  # enable mpicc/mpifort

export I_MPI_CC=icc
export I_MPI_CXX=icpc
export I_MPI_FC=ifort
export I_MPI_F77=ifort
export I_MPI_F90=ifort

export CC=icc
export CXX=icpc
export FC=ifort
export F77=$FC
export F90=$FC

export OMPI_CC=$CC
export OMPI_CXX=$CXX
export OMPI_FC=$FC
export COMPILER=$FC
export ESMF_COMPILER=intel

export ESMF_COMM=intelmpi
export MPI_ROOT=/opt/intel/compilers_and_libraries/linux/mpi/intel64

export NETCDF_HOME=$(spack location -i netcdf)
export NETCDF_FORTRAN_HOME=$(spack location -i netcdf-fortran)

export GC_BIN="$NETCDF_HOME/bin"
export GC_INCLUDE="$NETCDF_HOME/include"
export GC_LIB="$NETCDF_HOME/lib"

export GC_F_BIN="$NETCDF_FORTRAN_HOME/bin"
export GC_F_INCLUDE="$NETCDF_FORTRAN_HOME/include"
export GC_F_LIB="$NETCDF_FORTRAN_HOME/lib"

export PATH=${NETCDF_HOME}/bin:$PATH
export PATH=${NETCDF_FORTRAN_HOME}/bin:$PATH

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${NETCDF_HOME}/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${NETCDF_FORTRAN_HOME}/lib

export ESMF_BOPT=O

Environment

  • GEOS-Chem/GCHP version: 12.3.2
  • Compiler version: ifort 19.0.4
  • MPI library and version: Intel MPI 2019.4
  • netCDF and netCDF-Fortran library version: 4.7.0 and 4.4.5
  • Computational environment: AWS cloud, CentOS 7
  • Code modifications: No change to GC-classic part; apply #35 to GCHP part

[BUG/ISSUE] H2O2AfterChem vertically flipped in restart

The internal state variable H2O2AfterChem is incorrectly updated with the same vertical orientation as H2O2AfterChem stored in State_Chm. This will have implications for consecutive full chemistry simulations since H2O2AfterChem in the GCHP restart is used to populate State_Chm%H2O2AfterChem at the start of the run if present. This bug has been present since 12.2.0 when this field was added to the internal state.

[DISCUSSION] Multi-node MPI runs with Singularity or Charliecloud containers

Killian Murphy from the University of York (@kilicomu) is interested in testing multi-node container runs. If that works it will greatly simplify GCHP installation on HPC cluster environments -- a wonderful achievement!

Here I am putting together all the information I know.

Major references

Official docs:

Papers:

Deployment at institutions

Status of multi-node MPI runs

None of the above deployments have documented how to run multi-node MPI jobs. The Pleiades doc even warns that

WARNING: Running MPI applications across multiple Pleiades nodes with containers is not currently recommended, as the tests performed by NAS staff are not always successful. If you want to experiment with it, be aware that you need to have some MPI implementations available outside of the containers.

One major issue is MPI compatibility, as noted by the Singularity 2.5 docs:

Another result of the Singularity architecture is the ability to properly integrate with the Message Passing Interface (MPI). Work has already been done for out of the box compatibility with Open MPI (both in Open MPI v2.1.x as well as part of Singularity).

What are supported Open MPI Version(s)? To achieve proper container’ized Open MPI support, you should use Open MPI version 2.1. There are however three caveats:

  1. Open MPI 1.10.x may work but we expect you will need exactly matching version of PMI and Open MPI on both host and container (the 2.1 series should relax this requirement)
  2. Open MPI 2.1.0 has a bug affecting compilation of libraries for some interfaces (particularly Mellanox interfaces using libmxm are known to fail). If your in this situation you should use the master branch of Open MPI rather than the release.
  3. Using Open MPI 2.1 does not magically allow your container to connect to networking fabric libraries in the host. If your cluster has, for example, an infiniband network you still need to install OFED libraries into the container. Alternatively you could bind mount both Open MPI and networking libraries into the container, but this could run afoul of glib compatibility issues (its generally OK if the container glibc is more recent than the host, but not the other way around)

Singularity 3.0+ even completely removes the docs on multi-node MPI runs.

I've also heard that NVIDIA is doing some multi-node container development: https://info.nvidia.com/emea-gpu-accelerated-multi-node-hpc-workloads-regpage.html
But I haven't found more publicly available information.

Where to start

Before testing GCHP at all, one should run OSU or Intel MPI benchmarks. We can effectively predict GCHP performance just from MPI latency/bandwidth. I am particularly curious about the latency overhead of containers.
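As a concrete starting point, a minimal sketch of such a test with Singularity and the OSU micro-benchmarks might look like the following (the image name, benchmark path, and hostnames are hypothetical placeholders, and the host-side mpirun must be ABI-compatible with the MPI inside the container):

# Hypothetical two-node latency test: the host mpirun launches the containerized benchmark.
# Assumes an image gchp.sif that contains the OSU micro-benchmarks built against the container's MPI.
mpirun -np 2 --host node1,node2 \
    singularity exec gchp.sif \
    /opt/osu-micro-benchmarks/mpi/pt2pt/osu_latency

If the measured latency and bandwidth are close to bare-metal numbers, multi-node GCHP runs should in principle be feasible; if not, there is little point going further.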

This is definitely at an early stage. Any successful tests and reproducible examples will be valuable.

[QUESTION] A question about running GEOS-Chem with PBS

#! /bin/csh -f

#PBS -N testgeoschem
#PBS -q batch
#PBS -l walltime=8:00:00
#PBS -l nodes=2:ppn=40
#PBS -o /public/cmaq/standard/geoschem/UT/runs/4x5_standard
#PBS -e /public/cmaq/standard/geoschem/UT/runs/4x5_standard
#PBS -V
mpiexec.hydra -f machinefile -np 80 ./geos
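If the cluster populates $PBS_NODEFILE (most PBS/Torque setups do), a common variant is to pass that file directly instead of maintaining a machinefile by hand; a sketch, assuming Intel MPI's mpiexec.hydra as in the script above:

# Hypothetical variant of the launch line: use the node list PBS provides for this job.
mpiexec.hydra -f $PBS_NODEFILE -np 80 ./geos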

[BUG/ISSUE] "ar: memuse.o: No such file or directory" error when building GCHP 12.5.0 with gcc9.2.0

Describe the bug

I got this compile error when building GCHP 12.5.0 with GCC 9.2.0 on Ubuntu (will test other environments later):

ar cr libGFDL_fms_r4.a amip_interp.o astronomy.o axis_utils.o column_diagnostics.o constants.o atmos_ocean_fluxes.o coupler_types.o ensemble_manager.o data_override.o diag_axis.o diag_data.o diag_grid.o diag_manager.o diag_output.o diag_util.o cloud_interpolator.o drifters_comm.o drifters_core.o drifters.o drifters_input.o drifters_io.o quicksort.o stock_constants.o xgrid.o fft99.o fft.o field_manager.o fm_util.o fms.o fms_io.o test_fms_io.o horiz_interp_bicubic.o horiz_interp_bilinear.o horiz_interp_conserve.o horiz_interp.o horiz_interp_spherical.o horiz_interp_type.o memuse.o memutils.o create_xgrid.o gradient_c2l.o gradient.o grid.o interp.o mosaic.o mosaic_util.o read_mosaic.o mpp_data.o mpp_domains.o mpp.o mpp_io.o mpp_memutils.o mpp_parameter.o mpp_pset.o mpp_utilities.o mpp_efp.o nsclock.o test_mpp_domains.o test_mpp.o test_mpp_io.o test_mpp_pset.o threadloc.o oda_core.o oda_types.o write_ocean_data.o xbt_drop_rate_adjust.o platform.o MersenneTwister.o random_numbers.o sat_vapor_pres.o sat_vapor_pres_k.o station_data.o time_interp_external.o time_interp.o get_cal_time.o time_manager.o gaussian_topog.o topography.o tracer_manager.o tridiagonal.o test_xgrid.o diag_table.o diag_manifest.o 
ar: memuse.o: No such file or directory
../GNUmakefile_r4r8:148: recipe for target 'libGFDL_fms_r4.a' failed
gmake[9]: *** [libGFDL_fms_r4.a] Error 1
gmake[9]: Leaving directory '/home/ubuntu/gchp_12.5.0/Code.GCHP/GCHP/Shared/GFDL_fms/r4'
gmake[9]: Entering directory '/home/ubuntu/gchp_12.5.0/Code.GCHP/GCHP/Shared/GFDL_fms/r8'

Full log: compile_ubuntu_gcc_openmpi_fail.log

memuse.o (and its source memuse.c) is inside GCHP/Shared/MAPL_Base; but the above compile command is executed inside GCHP/Shared/GFDL_fms/r4, which contains all *.o files needed by the ar command except the memuse.o file.
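As a possible interim workaround (an untested sketch; it assumes the memuse.c under MAPL_Base is the intended source and that it compiles standalone), the missing object could be built in place before re-running the build:

# Untested workaround sketch: compile the missing object into the r4 directory, then re-run make.
cd Code.GCHP/GCHP/Shared/GFDL_fms/r4
gcc -c ../../MAPL_Base/memuse.c -o memuse.o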

To Reproduce

(I can build a Docker container to show the error if that helps.)

  1. Install libraries via:
spack -v install gcc@9.2.0
spack compiler add $(spack location -i gcc@9.2.0)
spack -v install netcdf-fortran %gcc@9.2.0 ^netcdf~mpi ^hdf5~mpi+fortran+hl
spack install openmpi+pmi schedulers=slurm %gcc@9.2.0
  2. Apply the HDF5 fix from #37: ln -s $(spack location -i hdf5)/lib/* $(spack location -i netcdf)/lib/

  3. Build GCHP 12.5.0 code. The environment is:

export PATH=$(spack location -i gcc)/bin:$PATH
export PATH=$(spack location -i openmpi)/bin:$PATH
export gFTL=$HOME/gFTL/install/

export CC=gcc
export OMPI_CC=$CC
export CXX=g++
export OMPI_CXX=$CXX
export FC=gfortran
export F77=$FC
export F90=$FC
export OMPI_FC=$FC
export COMPILER=$FC
export ESMF_COMPILER=gfortran
export ESMF_COMM=openmpi
export MPI_ROOT=$(spack location -i openmpi)

export NETCDF_HOME=$(spack location -i netcdf)
export NETCDF_FORTRAN_HOME=$(spack location -i netcdf-fortran)

export GC_BIN="$NETCDF_HOME/bin"
export GC_INCLUDE="$NETCDF_HOME/include"
export GC_LIB="$NETCDF_HOME/lib"

export GC_F_BIN="$NETCDF_FORTRAN_HOME/bin"
export GC_F_INCLUDE="$NETCDF_FORTRAN_HOME/include"
export GC_F_LIB="$NETCDF_FORTRAN_HOME/lib"

export PATH=${NETCDF_HOME}/bin:$PATH
export PATH=${NETCDF_FORTRAN_HOME}/bin:$PATH

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${NETCDF_HOME}/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${NETCDF_FORTRAN_HOME}/lib

export ESMF_BOPT=O

Required information

  • GEOS-Chem/GCHP version: 12.5.0
  • Compiler version: gcc 9.2.0
  • MPI library and version: OpenMPI 3.1.4 (the problem should be independent of MPI)
  • netCDF and netCDF-Fortran library version: 4.7.0 and 4.4.5
  • Computational environment: AWS cloud, Ubuntu 16.04
  • Code modifications: No

[BUG/ISSUE] Not printing the missing HEMCO data file that causes model crash

Description

I am testing two different scripts to download HEMCO data for GCHP 12.3.2, for my on-going paper https://github.com/JiaweiZhuang/cloud-gchp-paper.

The original HEMCO.sh downloads most of the HEMCO directories, using --exclude only to skip very large ones. The new HEMCO_small.sh downloads only the minimum set of exact files, using the log parser script at geoschem/geos-chem-cloud#25 (comment).

The model runs successfully with the original, large dataset, but crashes with the new streamlined, smaller dataset, without printing relevant error messages. I set DEBUG_LEVEL: 5 in CAP.rc and Verbose: 3 in HEMCO_Config.rc, but the model still doesn't print which data file is missing. DEBUG_LEVEL: 20 doesn't dump more useful error messages.

Log files and error messages

Complete log files:
Success run, with original script / large HEMCO data:

Failed run, with new script / small HEMCO data:

The error occurs at:

 Opened shortcut bracket: GFAS
   - Skip content of this bracket:  T
 Closed shortcut bracket: GFAS
   - Skip following lines:  F
===============================================================================
GEOS-Chem ERROR: Error encountered in "HCOX_Init"!
 -> at HCOI_GC_Init (in module GeosCore/hcoi_gc_main_mod.F90)

THIS ERROR ORIGINATED IN HEMCO!  Please check the HEMCO log file for 
additional error messages!
===============================================================================

The successful one will proceed to:

 Opened shortcut bracket: GFAS
   - Skip content of this bracket:  T
 Closed shortcut bracket: GFAS
   - Skip following lines:  F
   --> Isoprene to SOA-Precursor  1.500000000000000E-002
   --> Isoprene direct to SOA (Simple)  1.500000000000000E-002
   --> Monoterpene to SOA-Precursor  4.409171294611898E-002
   --> Monoterpene direct to SOA (Simple)  4.409171294611898E-002
   --> Othrterpene to SOA-Precursor  5.000000000000000E-002
   --> Othrterpene direct to SOA (Simple)  5.000000000000000E-002

File list

Here's the HEMCO-small directory, with the total size of 50G:

ACET   ALD2          BCOC_BOND  C2H6_2010    DUST_DEAD  EMEP   GMI     MEGAN    NEI2011   OFFLINE_LNOX   POET   SOILNOX  TIMEZONES  VOLCANO
AEIC   AnnualScalar  BIOFUEL    DICE_Africa  EDGARv42   GEIA   IODINE  MIX      NH3       OFFLINE_SSALT  RETRO  STRAT    TOMS_SBUV
AFCID  APEI          BROMINE    DMS          EDGARv43   GFED4  MASKS   NEI2005  NOAA_GMD  OMOC           SOA    STREETS  UVALBEDO

Here's the HEMCO-large directory, with the total size of 168G:

ACET          BIOBURN    CORBETT_SHIP  FINN         kgyr_to_kgm2s.sh  NEI2005            OFFLINE_LNOX   raw_data    STRAT      VISTAS
AEIC          BIOFUEL    COUNTRY_ID    GEIA         LIGHTNOX          NEI2011            OFFLINE_SFLUX  RCP         STREETS    VOLCANO
AFCID         BRAVO      DICE_Africa   GFED2        MACCITY           NEI2011_ag_only    OFFLINE_SSALT  README      TAGGED_CO  WEEKSCALE
ALD2          BROMINE    DMS           GFED3        MAP_A2A           NEI2011ek          OH             RETRO       TAGGED_O3  XIAO
AnnualScalar  C2H6_2010  DUST_DEAD     GFED4        MASAGE_NH3        NEI99              OLSON_MAP      RONO2       TIMEZONES  Yuan_XLAI
APEI          CAC        DUST_GINOUX   GMI          MASKS             NH3                OMOC           RRTMG       TNO
ARCTAS_SHIP   CEDS       EDGAR         grids        MEGAN             NOAA_GMD           OXIDANTS       SAMPLE_BCs  TOMS_SBUV
BB4CMIP6      CH3I       EDGARv42      HTAP         MERCURY           O3                 PARANOX        SF6         TrashEmis
BCOC_BOND     CHLA       EDGARv43      ICOADS_SHIP  MIX               OFFLINE_AEROSOL    POET           SOA         UCX
BCOC_COOKE    CO2        EMEP          IODINE       MODIS_XLAI        OFFLINE_LIGHTNING  POPs           SOILNOX     UVALBEDO

Is there a way to find out which directory is missing in the HEMCO-small one?
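For reference, one quick way to list the top-level directories present in HEMCO-large but absent from HEMCO-small (directory names assumed to match the listings above):

# List top-level entries that exist only in the large dataset.
comm -13 <(ls HEMCO-small | sort) <(ls HEMCO-large | sort)

This only narrows things down to the directory level; the actual missing file would still need to be matched against the enabled entries in HEMCO_Config.rc.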

[BUG/ISSUE] GCHP terminating when reading NEI99 seasonal scaling

Hi GCST.

I have attached my runConfig.sh, HEMCO_Config.rc, and model run logs for both a c48 and c360 run.

At both C48 and C360, running between 2015-12-01 00:00:00 and 2015-12-07 00:00:00, GCHP terminates, seemingly because it is unable to read information about the NEI99 seasonal scaling.

The relevant log entries are:

>> Reading  CO from ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
 DEBUG: Scanning fixed file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc for side L
 DEBUG: Opening file: ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
WARNING: Requested sample not found in file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc

and:

  >> >> Reading times from ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
  >> >> File timing info: 0019990101 0000000000 0012
 DEBUG: GetBracketTimeOnSingleFile called for ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
  >> >> Reading times from fixed (F) file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
 >> >> >> File start    : 1999-01-01 00:00:00
 >> >> >> File end      : 1999-12-01 00:00:00
 >> >> >> Time requested: 2015-12-02 00:00:00
 DEBUG: Extrapolation flags (0) are F F F T for file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
 DEBUG: Requested time is after or on last available sample in file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
 DEBUG: Extrapolation flags (2) are F F F F for file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
WARNING: Requested sample not found in file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc
 ERROR: Bracket timing request failed on fixed file ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc for side L

At both resolutions, after these messages appear in the log, an MPI_ABORT is issued from one process and this cascades to the others. At a glance, it looks as though the extrapolation flags aren't being handled in the same way as for other datasets.
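For reference, the time axis that HEMCO sees in this file can be inspected directly (assuming ncdump from the netCDF utilities is available):

# Dump the coordinate values, with time converted to calendar dates (-t).
ncdump -c -t ./MainDataDir/NEI2005/v2014-09/scaling/NEI99.season.geos.1x1.nc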

Apologies in advance if I have not applied some config that I obviously need to apply!

NEI99_ISSUE_REPORT.tar.gz

[BUG/ISSUE] Timing results are incomplete

I've successfully run GCHP 12.1.0 on AWS, with both Ubuntu (gcc 7.3.0) and CentOS (gcc 7.3.1).
CentOS was already working before; I am glad that Ubuntu now also works (again with lots of dirty fixes). I strongly prefer Ubuntu as it has a lot more pre-packaged libraries and the environment is a lot faster to build (no need to compile libraries from source).

Full logs for record:

However, for both OSes, the log files only show the timing for GIGCenv and no other components.

  Times for GIGCenv
TOTAL                   :       6.834
INITIALIZE              :       0.000
RUN                     :       6.832
GenInitTot              :       0.003
--GenInitMine           :       0.003
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.000
--GenFinalMine          :       0.000
GenRecordTot            :       0.001
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000
  
HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 12335 RUNNING AT ip-172-31-80-21
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Does the same issue happen on Odyssey?

[BUG/ISSUE] Crashes when writing both StateMet_avg and StateMet_inst

After the fix geoschem/GCHP#6 (comment) there remains one issue: with both StateMet_avg and StateMet_inst turned on, the run crashes at 01:00 when writing the first diagnostics.

Tested with GC version 12.1.1 and GCHP branch bugfix/GCHP_issues, and OpenMPI3.

Those cases can finish and print full timing info:

Those cases crash:

Not a critical problem, as I can just turn off StateMet for either tutorial or benchmark purposes. Will proceed to make the tutorial AMI.

[BUG/ISSUE] Invalid compile command for hemco_standalone.x on Ubuntu

At the very end of compilation, hemco_standalone.x is compiled by:

mpifort -cpp -w -std=legacy -fautomatic -fno-align-commons -fconvert=native -fno-range-check -O3 -funroll-loops -mcmodel=medium -fbacktrace -DLINUX_GFORTRAN -DEXTERNAL_GRID -DNC_DIAG -DGEOS_FP -DNC_HAS_COMPRESSION -DESMF_ -DUSE_REAL8 hcoi_esmf_mod.o hemco_standalone.o hcoi_standalone_mod.o -L../../lib -lHCOI -lHCOX -lHCO -lGeosUtil -lHeaders -lNcUtils -L/usr/lib -lnetcdff -L/usr/lib -lnetcdf -lnetcdf -L/usr/lib -lnetcdf -L/home/ec2-user/tutorial/gchp_standard/CodeDir/GCHP/Shared/Linux/lib -lMAPL_Base -lMAPL_cfio -lGMAO_mpeu -lGMAO_pilgrim -L/home/ec2-user/tutorial/gchp_standard/CodeDir/GCHP/Shared/Linux/lib -lFVdycoreCubed_GridComp -lfvdycore -lGFDL_fms -lGEOS_Shared -lGMAO_hermes -lrt /home/ec2-user/tutorial/gchp_standard/CodeDir/GCHP/ESMF/Linux/lib/libesmf.so -L/usr/local/mpich/bin/../lib64 -lmpich -lmpichf90 -o hemco_standalone.x

This works on CentOS but fails on Ubuntu (the extremely common undefined reference to `__netcdf_MOD_nf90_create' error), because Ubuntu's linker requires -lnetcdff -lnetcdf to appear after the objects that use them (https://stackoverflow.com/a/13953322/8729698).

This can be fixed by adding -lnetcdf -lnetcdff to the very end of the command:

mpifort -I/usr/include -cpp -w -std=legacy -fautomatic -fno-align-commons -fconvert=native -fno-range-check -O3 -funroll-loops -mcmodel=medium -fbacktrace -DLINUX_GFORTRAN -DEXTERNAL_GRID -DNC_DIAG -DBPCH_TPBC -DGEOS_FP -DNC_HAS_COMPRESSION -DESMF_ -DUSE_REAL8 hcoi_esmf_mod.o hemco_standalone.o hcoi_standalone_mod.o -L../../lib -lHCOI -lHCOX -lHCO -lGeosUtil -lHeaders -lNcUtils -L/usr/lib -lnetcdff -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -lnetcdf -lnetcdf -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/hdf5/serial -lhdf5_hl -lhdf5 -lpthread -lsz -lz -ldl -lm -lcurl -L/home/ubuntu/tutorial/gchp_standard/CodeDir/GCHP/Shared/Linux/lib -lMAPL_Base -lMAPL_cfio -lGMAO_mpeu -lGMAO_pilgrim -L/home/ubuntu/tutorial/gchp_standard/CodeDir/GCHP/Shared/Linux/lib -lFVdycoreCubed_GridComp -lfvdycore -lGFDL_fms -lGEOS_Shared -lGMAO_hermes -lrt /home/ubuntu/tutorial/gchp_standard/CodeDir/GCHP/ESMF/Linux/lib/libesmf.so -L/usr/bin/../lib64 -lmpich -lmpichf90 -lnetcdf -lnetcdff -o hemco_standalone.x

But figuring out where to add this in the Makefile gets a bit tricky.

A simpler way is to skip the compilation of hemco_standalone.x, since it is not used in GCHP anyway (does it even work?).

This can be done by changing the dependency in HEMCO/Interfaces/Makefile
from

all:
        @${MAKE} lib
        @${MAKE} exe

to

all:
        @${MAKE} lib

My questions are:

  1. Can we skip hemco_standalone.x by default?
  2. How can we do this only for GCHP without affecting GC-Classic?

[BUG/ISSUE] Early termination at different points depending on diagnostics configuration

Trying to summarize different behaviors in #6 #8 #9 as things are getting messy...

1. Crashes at the first time step (at 00:10)

  • No output: comment out everything in HISTORY.rc
  • Too little output: only one collection SpeciesConc_inst with only two species, SpeciesConc_NO and SpeciesConc_O3, in it.

Typical error message is

 Setting history variable pointers to GC and Export States:
 SpeciesConc_NO
 SpeciesConc_O3
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.723E+03  4.494E+03  2.306E+03  2.684E+03  3.260E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.852E+04  0.000E+00
 offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3                     54
DYNAMICSRun                                    703
GCHP::Run                                      407
MAPL_Cap                                       792

2. Crashes when writing the first diagnostics file (at 01:00)

  • Too much output: the default 4 collections SpeciesConc_avg SpeciesConc_inst StateMet_avg StateMet_inst with hundreds of variables.

Typical error message is

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
 Writing:    510 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
 Writing:    510 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4
                MAPL_CFIOWriteBundlePost      1908
HistoryRun                                    2947
MAPL_Cap                                       833
application called MPI_Abort(MPI_COMM_WORLD, 21944) - process 0

This doesn't seem to be a memory problem; it still happens on r5.4xlarge with 128 GB RAM.

3. Crashes right before printing timing information

  • Just-enough output: one collection SpeciesConc_inst with hundreds of species
  • Two collections SpeciesConc_inst and SpeciesConc_avg with hundreds of default species
  • Two collections StateMet_avg and StateMet_inst with default variables. This means each single collection won't cause the no. 2 problem.
  • Default 4-collection output but reading an incorrect v2018-11 restart file.
Times for GIGCenv
...
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3126 RUNNING AT ip-172-31-0-74
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

By tweaking the amount of output, I can get more timing info printed, but the run still ends with BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES:

  • With two collections SpeciesConc_inst and SpeciesConc_avg, each with two species, the model can print both Times for GIGCenv and Times for GIGCchem.
  • With two collections StateMet_avg and StateMet_inst and default variables, the model can print Times for GIGCenv, Times for GIGCchem, Times for DYNAMICS, and Times for GCHP. But Times for HIST and Times for EXTDATA are still missing.

4. Run to the end with the full timing information printed

This means no error message occurs. The model prints all the way down to Times for EXTDATA and creates the new cap_restart files.

I can only make this happen with very tricky configurations:

  • Two collections SpeciesConc_inst and SpeciesConc_avg with hundreds of default species, and emissions turned off in input.geos. (Note that turning off transport using runConfig.sh has no impact on the error.)

Log file for this only successful run so far: run_two_collections_emission_off.log

Environment

All tests are performed with MPICH 3.3 and gcc 7.3.0 on Ubuntu 18.04 (ami-0a5973f14aad7413a).

I also have OpenMPI 2.1 working (scripts).

  • With no diagnostics, it does not crash at 00:10 (fixes the no. 1 error).
  • The timing info is still incomplete (the no. 3 error still exists, geoschem/GCHP#6 (comment))
  • With any diagnostics, it simply hangs when creating the first diagnostics (similar to the no. 2 error):
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
[hangs forever]

I consider this even worse because basically no diagnostics can be archived. With MPICH at least it is functioning in some cases.
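When a write hangs like this, one way to see where it is stuck is to attach a debugger to one of the ranks; a minimal sketch, assuming gdb is installed and the geos processes are still alive:

# Attach to one hung rank and dump backtraces for all of its threads.
pid=$(pgrep -f './geos' | head -n 1)
gdb -p "$pid" -batch -ex 'thread apply all bt'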

[BUG/ISSUE] Run failure in transport tracers simulation with 12.6.2

An update in 12.6.2 that changed where the State_Met albedo array is set inadvertently broke the transport tracers simulation. The code was moved from Chem_GridCompMod.F90 to Includes_Before_Run.H and in the process was slimmed down to remove the handling for the case where the UV_ALBEDO MAPL pointer is not found. The UV albedo is not used in the transport tracers simulation, so the call to get the pointer now results in an error.

[BUG/ISSUE] MAPL pFIO failure at high core counts

GCHP 12.5.0 uses online ESMF regridding weights rather than external tile files for regridding. Due to ESMF domain decomposition rules, this can result in a MAPL error if an input grid is too coarse for a run's configured core count. We have seen this problem for 4°x5° input files when using >600 cores. GMAO has fixed this problem in a more recent version of MAPL. All 4°x5° input files have been replaced with higher-resolution files in GCHP 12.5.0 to avoid this issue. However, users may still run into problems when running with thousands of cores.

[BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour

I ran a GCHP c48 run on the AWS cloud using

AMI         : container_geoschem_tutorial_2018121
Machine     : r4.4xlarge
Diagnostics : SpeciesConc_avg and SpeciesConc_inst 

and it died after an hour.

In runConfig.sh:

# Make sure your settings here match the resources you request on your
# cluster in your run script!!!
NUM_NODES=1
NUM_CORES_PER_NODE=12
NY=12
NX=1

# MAPL shared memory option (0: off, 1: on). Keep off unless you know what
# you are doing. Contact GCST for more information if you have memory
# problems you are unable to fix.
USE_SHMEM=0

#------------------------------------------------
#   Internal Cubed Sphere Resolution
#------------------------------------------------
CS_RES=48    # 24 ~ 4x5, 48 ~ 2x2.5, 90 ~ 1x1.25, 180 ~ 1/2 deg, 360 ~ 1/4 deg

...
Start_Time="20160701 000000"
End_Time="20160701 010000"
Duration="00000000 010000"
....
common_freq="010000"
common_dur="010000"
common_mode="'time-averaged'"

The Docker commands were:

docker pull geoschem/gchp_model
docker run --rm -it -v $HOME/ExtData:/ExtData -v $HOME/OutputDir:/OutputDir geoschem/gchp_model
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.log.c48

Tail end of log file:

 AGCM Date: 2016/07/01  Time: 00:10:00

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 22262d174fea exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I then commented out SpeciesConc_avg from the HISTORY.rc file and re-ran.
Now, the only diagnostic active was SpeciesConc_inst. This also died at 1 hour:

AGCM Date: 2016/07/01  Time: 01:00:00

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
free(): invalid next size (normal)

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0  0x7efd1c1dd2da in ???
#1  0x7efd1c1dc503 in ???
..etc..

Times for GIGCenv
TOTAL                   :       0.726
INITIALIZE              :       0.000
RUN                     :       0.723
...etc...
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node 22262d174fea exited on signal 6 (Aborted).
--------------------------------------------------------------------------

This message:

free(): invalid next size (normal)

might be indicative of an out-of-bounds error, perhaps where we deallocate arrays (or fields of State_* objects).
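If so, one way to narrow it down would be to rerun a short case under a memory checker; a rough sketch, assuming valgrind is installed inside the container (expect a large slowdown):

# Run each MPI rank under valgrind to catch invalid writes near the failing free().
mpirun -np 12 -oversubscribe --allow-run-as-root \
    valgrind --error-limit=no --log-file=valgrind.%p.log ./geos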

[BUG/ISSUE] Change in MAPL vertical flip rules impacting mesospheric chemistry

Starting in version 12.5.0 there are differences in certain halogens in the stratosphere between GEOS-Chem Classic and GCHP. Investigation of the differences shows that the problem originates in the mesosphere, as shown here:

[Figure: HBr_zonal]

The problem can be traced to a change in the MAPL rules for flipping the vertical axis of imports introduced in 12.5.0. We are working on a fix.

[FEATURE REQUEST] ESMF v8 public release

Seb Eastham (MIT) reported that using the ESMF v8.0.0 public release in the GCHP 12.6 series allows usage of OpenMPI 4. ESMF v8 is scheduled to be the recommended version starting in GEOS-Chem 13.0.0, when it will be used as an externally built library. However, since it is readily compatible with earlier versions of GCHP (12.5+), we will update the ESMF version included in 12.7.0 to the version 8 public release.

[BUG/ISSUE] Do not write checkpoint file at the very beginning

I am aware of this config:

https://github.com/geoschem/gchp/blob/7a4589c276876b6674800f4e4137b575e4def4f5/Run/runConfig.sh_template#L55-L66

It effectively sets GCHP.rc to:

# Settings for production of restart files
#---------------------------------------------------------------
# Record frequency (HHMMSS) : Frequency of restart file write
#                             Can exceed 24 hours (e.g. 1680000 for 7 days)
# Record ref date (YYYYMMDD): Reference date; set to before sim start date
# Record ref time (HHMMSS)  : Reference time
RECORD_FREQUENCY: 100000000
RECORD_REF_DATE: 20160701
RECORD_REF_TIME: 000000

However, GCHP still writes out a checkpoint file at the very beginning:

...
CFIO: Reading ./MainDataDir/MASKS/v2018-09/AF_LANDMASK.geos.05x0666.global.nc at 19850101 000000
 NOT using buffer I/O for file: TileFiles/DC0540xPC0361_CF0024x6C.bin
CFIO: Reading ./MainDataDir/MASKS/v2018-09/China_mask.generic.1x1.nc at 19850101 000000
CFIO: Reading ./MainDataDir/MASKS/v2018-09/India_mask.generic.1x1.nc at 19850101 000000
   Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
 Using parallel NetCDF for file: 
 gcchem_internal_checkpoint_c24.nc.20160701_0000z.bin
 offline_tracer_advection
 Initialized species from INTERNAL state: NO
...

For a C180 run, this file is 27GB (!!) and takes a long time to write:

$ du -sh gcchem_internal_checkpoint_c180.nc.20160701_0000z.bin
27G	gcchem_internal_checkpoint_c180.nc.20160701_0000z.bin

Is there an option to turn it off?

[BUG/ISSUE] MODIS LAI not properly updated at correct time

When comparing results from a 2-day run versus 2 consecutive 1-day runs with only dry deposition on (emissions, chemistry, wet deposition, mixing, convection, and advection off), there are differences when there should be none. The pattern of differences indicates a possible problem with leaf area index (LAI).

[Figure: drydep_sfc_before]

MODIS LAI should be time-interpolated with daily frequency, so a 2-day run would include an LAI update. Fixing the values to be constant across the 2-day runs removes the differences, as shown below. However, this is not an acceptable fix since LAI should be updated daily.

[Figure: drydep_sfc_after]

[FEATURE REQUEST] Move GCHP files relevant to GEOS to GEOS-Chem

Most of the files included in the GCHP repository are not needed within the NASA GEOS model, but including GEOS-Chem within GEOS requires including GCHP. This can be avoided by moving certain high-level GCHP files to the GEOS-Chem repository. Then the GEOS-Chem repository alone can be dropped into GEOS.

[BUG/ISSUE] "-lblas -llapack" breaks compilation with ifort/gfortran (GCHP 12.5.0)

Describe the bug

The new MAPL uses -lblas -llapack in Shared/Config/math.mk (0f050c1). This breaks the build when using the Intel compiler:

/home/centos/gchp_12.5.0/Code.GCHP/GCHP/Shared/Linux/lib/libGMAO_gfio.a /home/centos/gchp_12.5.0/Code.GCHP/GCHP/Shared/Linux/lib/libGMAO_eu.a -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-fortran-4.4.5-feoqwqgpgo6l3wyozlqj4leesuhnddjw/lib -lnetcdff -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5xnzracbnatiq32kh5vukohqckxqtacj/lib -lnetcdf -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -L/home/centos/spack/opt/spack/linux-centos7-x86_64/intel-19.0.4/netcdf-4.7.0-5xnzracbnatiq32kh5vukohqckxqtacj/lib -lnetcdf -lblas -llapack -ldl -lc -lpthread -lrt  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lstdc++
ld: cannot find -llapack
ld: cannot find -llapack
ld: cannot find -llapack
ld: cannot find -llapack

Full logs (the error is independent of MPI):

With ifort, the Makefile should in principle pick up MKL instead of BLAS/LAPACK:
https://github.com/geoschem/gchp/blob/90f24d9aeab58c988da1f88263ce7b471f799367/Shared/Config/math.mk#L18-L21

But in fact it is still linking LAPACK:
https://github.com/geoschem/gchp/blob/90f24d9aeab58c988da1f88263ce7b471f799367/Shared/Config/math.mk#L39-L45

To Reproduce
Steps to reproduce the behavior:

  1. Install Intel compiler 19.0.4 following Spack documentation
  2. Install NetCDF via Spack by spack spec netcdf-fortran %intel ^netcdf~mpi ^hdf5~mpi+fortran+hl (independent of MPI installation). OpenMPI can be installed via spack install openmpi+pmi schedulers=slurm % intel. Intel MPI is pre-installed on the AWS cluster.
  3. Build GCHP 12.5.0. The environment config is the same as #37, plus export gFTL=....

Environment

  • GEOS-Chem/GCHP version: 12.5.0
  • Compiler version: ifort 19.0.4
  • MPI library and version: OpenMPI 3.1.4 and Intel MPI 19.0.4
  • netCDF and netCDF-Fortran library version: 4.7.0 and 4.4.5
  • Computational environment: AWS cloud, CentOS 7
  • Code modifications: No change to GC-classic part; apply #35 to GCHP part
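Until math.mk properly selects MKL for ifort, a possible interim workaround (an untested sketch mirroring the #37 symlink trick; the netlib-lapack library directory may be lib or lib64 depending on the system) is to install the reference BLAS/LAPACK via Spack and symlink the libraries into a directory that is already on the link line:

# Untested workaround sketch: provide libblas/liblapack on an existing -L path.
spack install netlib-lapack %intel
ln -s $(spack location -i netlib-lapack)/lib64/lib*.so* $(spack location -i netcdf)/lib/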
