Comments (22)
Pushed the fix to deallocate State_Diag to the bugfix/GCHP_issues branch. This will be merged into the next GCHP version.
from gchp_legacy.
I discovered that we were not deallocating State_Diag at the end of a GCHP run. I now pass State_Diag from Chem_GridCompMod to routine GIGC_CHUNK_FINAL, and within GIGC_CHUNK_FINAL we call Cleanup_State_Diag. The run now prints out all timing info, but it still drops core at the end, as in #11.
Here is a snippet of the log file at the last timestep of the run:
AGCM Date: 2016/07/01 Time: 03:00:00
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0230z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0300z.nc4
Times for GIGCenv
TOTAL : 1.013
INITIALIZE : 0.000
RUN : 1.009
GenInitTot : 0.003
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 254.818
INITIALIZE : 6.647
RUN : 244.785
FINALIZE : 0.000
DO_CHEM : 240.695
CP_BFRE : 0.048
CP_AFTR : 1.741
GC_CONV : 15.002
GC_EMIS : 12.660
GC_DRYDEP : 0.042
GC_FLUXES : 0.000
GC_TURB : 8.096
GC_CHEM : 193.611
GC_WETDEP : 10.119
GC_DIAGN : 0.000
GenInitTot : 2.440
--GenInitMine : 2.440
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.744
--GenFinalMine : 0.744
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 506.1147 0.0000 506.1147
COMM_TOTAL 1.1536 0.0000 1.1536
COMM_TRAC 1.1536 0.0000 1.1536
FV_TP_2D 40.6695 0.0000 40.6695
Times for DYNAMICS
TOTAL : 56.285
INITIALIZE : 2.143
RUN : 56.273
FINALIZE : 0.000
GenInitTot : 0.011
--GenInitMine : 0.011
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.001
--GenRecordMine : 0.001
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for GCHP
TOTAL : 310.187
RUN : 301.969
GIGCenv : 1.015
GIGCchem : 252.394
DYNAMICS : 58.406
GenInitTot : 7.447
--GenInitMine : 0.429
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.775
--GenFinalMine : 0.001
GenRecordTot : 0.002
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for HIST
TOTAL : 61.884
Initialize : 0.268
Finalize : 0.002
Run : 61.613
--Couplers : 0.611
--I/O : 61.092
----IO Create : 1.741
----IO Write : 61.092
-----IO Post : 0.722
-----IO Wait : 8.456
-----IO Write : 52.216
-ParserRun : 0.000
GenInitTot : 0.000
--GenInitMine : 0.000
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.002
--GenFinalMine : 0.002
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for EXTDATA
TOTAL : 139.762
Initialize : 7.516
Run : 132.240
-Read_Loop : 131.591
--CheckUpd : 0.103
--Read : 121.072
--Swap : 0.005
--Bracket : 15.497
-Interpolate : 0.649
GenInitTot : 0.005
--GenInitMine : 0.005
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.004
--GenFinalMine : 0.004
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
NOT using buffer I/O for file: cap_restart
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f74b41fd2da in ???
#1 0x7f74b41fc503 in ???
#2 0x7f74b3639f1f in ???
#3 0x7f74b20ff6b4 in ???
#4 0x7f74b20c2151 in ???
#5 0x7f74b2053016 in ???
#6 0x7f74b1fa76e1 in ???
#7 0x7f74b20512af in ???
#8 0x7f74b20ff982 in ???
#9 0x7f74b2051c10 in ???
#10 0x7f74b1fa3ff4 in ???
#11 0x7f74b1f3cacc in ???
#12 0x7f74b363e040 in ???
#13 0x7f74b363e139 in ???
#14 0x7f74b438e083 in ???
#15 0x561eb68d3cef in ???
#16 0x561eb66991be in ???
#17 0x7f74b361cb96 in ???
#18 0x561eb66991e9 in ???
#19 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node ip-172-31-88-121 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.
from gchp_legacy.
We have run into this problem in the past. It is an issue with the cleanup of state_met_mod (State_Met Finalize) and it only occurs when the model was built with gfortran 6+. What version of gfortran are you using? I will test with it on Odyssey in our dev branch.
from gchp_legacy.
I have tested with gfortran 7.2.1/7.3.0/7.3.1; all have the same issue.
from gchp_legacy.
I successfully compiled and ran to completion with gfortran 7.1.0 on Odyssey. Are you sure you are using 12.1.0 and not 12.0.0? There was an issue in 12.0.0 that caused this symptom and it was fixed in 12.0.1 with geoschem/geos-chem@8084db0.
from gchp_legacy.
Yes I am using 12.1.0. I will build a public GCHP AMI today and you can use it for debugging. Does this sound good?
from gchp_legacy.
Thanks, one of us will take a look. For what it's worth, I also tried gfortran 8.2.0 on Odyssey with success.
from gchp_legacy.
OK, I've made a public AMI: ami-0a5973f14aad7413a, in region US East (N. Virginia). Use at least an r5.2xlarge instance to provide enough memory. After login, simply run:
$ cd ~/tutorial/gchp_standard
$ mpirun -np 6 ./geos | tee run_gchp.log
I am still writing formal documentation, but this should be enough for testing.
from gchp_legacy.
Interestingly, after I fix the restart file issue by geoschem/GCHP#8 (comment), even "Times for GIGCenv" is not printed. Now the run terminates even earlier:
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4
MAPL_CFIOWriteBundlePost 1908
HistoryRun 2947
MAPL_Cap 833
Full log files (they only differ in the last few lines). Use ami-0f44e999c80ef6e66 to reproduce this issue.
Update: this is not just about incomplete timing info. The run simply crashes at hour 1. See geoschem/GCHP#8 (comment)
from gchp_legacy.
This new issue (crash after a 1-hour run) will be addressed in geoschem/GCHP#8. To restrict the problem to the Cleanup_State_Met issue, you can turn off all diagnostic collections in HISTORY.rc.
from gchp_legacy.
I have been looking at this issue.
I ran on the Amazon cloud on an r5.2xlarge instance with AMI GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66).
In HISTORY.rc I turned on only these collections:
(1) SpeciesConc_avg: only archived SpeciesConc_NO
(2) SpeciesConc_inst: only archived SpeciesConc_NO
(3) StateMet_avg: only archived Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: only archived Met_AD
This run (1 hour) on 6 cores finished with all timing information:
GIGCenv total: 0.346
GIGCchem total: 123.970
DYNAMICS total: 18.741
GCHP total: 140.931
HIST total: 0.264
EXTDATA total: 133.351
So I am wondering if this is a memory issue: if we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe it is the memory limit of the instance? I don't know.
This AMI was built with MPICH2. It may be worth trying with OpenMPI on the cloud as well.
Also note: this run finished without dropping a core file (as currently happens on Odyssey), so this appears to be an Odyssey-specific environment problem.
from gchp_legacy.
Also, if I run with no diagnostics turned on, then the run dies at 10 minutes:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- RDAER: Using online SO4 NH4 NIT!
- RDAER: Using online BCPI OCPI BCPO OCPO!
- RDAER: Using online SALA SALC
- DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
- LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.638E+03 4.409E+03 2.223E+03 2.601E+03 3.258E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.823E+04 0.000E+00
MAPL_ExtDataInterpField 3300
EXTDATA::Run_ 1471
MAPL_Cap 777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0
From the traceback, it looks as if it's hanging while interpolating a field in ExtData.
from gchp_legacy.
The run also crashes at 00:10 if I only save one collection, SpeciesConc_inst, with only two species (SpeciesConc_NO and SpeciesConc_O3) in it:
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
SpeciesConc_NO
SpeciesConc_O3
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 2.306E+03 2.684E+03 3.260E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.852E+04 0.000E+00
offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3 54
DYNAMICSRun 703
GCHP::Run 407
MAPL_Cap 792
But with two collections, SpeciesConc_avg and SpeciesConc_inst, each with only the two species SpeciesConc_NO and SpeciesConc_O3 in it, the run is able to finish and print full timing information:
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4
Times for GIGCenv
TOTAL : 2.252
INITIALIZE : 0.000
RUN : 2.250
GenInitTot : 0.004
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.000
--GenFinalMine : 0.000
GenRecordTot : 0.001
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 505.760
INITIALIZE : 3.617
RUN : 498.376
FINALIZE : 0.000
DO_CHEM : 488.864
CP_BFRE : 0.121
CP_AFTR : 4.080
GC_CONV : 36.070
GC_EMIS : 0.000
GC_DRYDEP : 0.119
GC_FLUXES : 0.000
GC_TURB : 17.966
GC_CHEM : 403.528
GC_WETDEP : 19.443
GC_DIAGN : 0.000
GenInitTot : 2.719
--GenInitMine : 2.719
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.963
--GenFinalMine : 0.963
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 815.4433 0.0000 815.4433
COMM_TOTAL 3.3098 0.0000 3.3098
COMM_TRAC 3.3097 0.0000 3.3097
FV_TP_2D 90.1448 0.0000 90.1448
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3126 RUNNING AT ip-172-31-0-74
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
from gchp_legacy.
Based on these reports, I think the issue of the timing report not completing is not related to cleaning up the State_Met arrays. I sometimes, but not always, ran into similar symptoms on the Harvard Odyssey cluster when there was an issue writing out diagnostic files. I found that if I reduced the total memory across all diagnostics, then the run went to completion. Sometimes when the problem was encountered, the log also appeared to have a problem writing the timing.
This issue started on the Odyssey cluster after the operating system upgrade from CentOS6 to CentOS7. We were using OpenMPI 1.10.3 when the switch happened. Upgrading to OpenMPI 2 did not fix the issue, but upgrading to OpenMPI 3 (either 3.0 or 3.1) corrected the problem for unknown reasons. I therefore think this issue is related to OS and MPI version compatibility, although we never figured out the cause, so this is just a theory.
from gchp_legacy.
Tested with OpenMPI 2.2.1; it still cannot print the full timing (#10).
run_openmpi_3hour_zero_collection.log
from gchp_legacy.
This is choking in Cleanup_State_Met.
from gchp_legacy.
I suggest seeing if this issue goes away after upgrading to OpenMPI 3.
If it does not, adding print statements throughout Cleanup_State_Met to see where it dies should give insight into whether cleanup of a specific met field is the problem. Many State_Met fields point to MAPL imports, so proper deallocation/nullification can be an issue due to the ordering of State_Met and import cleanup.
from gchp_legacy.
Tested with gcc 7.3.0 + OpenMPI 3.1.3; the timing info is still incomplete. Again it stops at "Times for GIGCenv".
run_openmpi3_default_diags.log
This can be reproduced in ami-01074a30392daa0f9.
from gchp_legacy.
I cannot use the tricks that worked for MPICH3 (saving SpeciesConc_inst and SpeciesConc_avg, and turning off emissions, #10) to get full timing with OpenMPI3. It always terminates at "Times for GIGCenv".
from gchp_legacy.
It seems that the code is exiting the Finalize_ routine of Chem_GridCompMod.F90 normally, since the end of PET00000.GEOSCHEMchem.log is written:
### GIGCchem::Finalize_ | Cleanup on PET # 00000
I also put some debug output in GIGC_GridCompMod::Finalize_, and it appears to be calling GIGC_CHUNK_FINAL and destroying the state objects properly. So wherever the core dump happens, it is maybe a level or two higher.
In the meantime, I'll push my edits for deallocating State_Diag to a bugfix branch of GCHP so that we can use it going forward.
from gchp_legacy.
Thanks for the fix. So the cloud should now have exactly the same behavior as Odyssey?
from gchp_legacy.
Yes, I think the cloud will behave the same.
One clarification: I updated the GCC and GCHP folders to 12.1.1, then rebuilt with "make clean_gc; make compile_standard".
from gchp_legacy.