Comments (23)

JesusTorrado avatar JesusTorrado commented on August 30, 2024

Hi Lukas,

Let's see if the other bug is related to this one.

from cobaya.

lukashergt avatar lukashergt commented on August 30, 2024

Update on this.

As guessed in my first post, the huge number of repeats causes cobaya to crash early on. When manually setting the number of repeats (as described in #35), it doesn't crash as fast, which allowed me to monitor the memory.

Running on only 1 core shows the memory rising steadily (it went beyond 100 GB within the first 15 min) until the run eventually crashes. Running on 16 cores, the memory accumulates on each core, so it crashes much sooner. I've varied the following parameters, with the same result each time:

  • number of cores
  • n_repeats (via speed blocking)
  • n_live
  • experiment: I've tried Planck 15 TTlowTEB (my target) and SN Pantheon

Note, this accumulation of memory happens when using PolyChord together with CLASS. It doesn't happen when switching the sampler to MCMC nor when switching the theory to CAMB.
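
The kind of steady growth described above can be tracked from inside the Python process as well as with external tools. Below is a minimal sketch, assuming a Unix system where the standard-library `resource` module is available; `track_memory` and its `evaluate` argument are hypothetical names used for illustration and are not part of cobaya.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_memory(evaluate, points, report_every=100):
    """Call evaluate(point) for each point and record peak RSS periodically."""
    history = []
    for i, point in enumerate(points, start=1):
        evaluate(point)
        if i % report_every == 0:
            history.append((i, peak_rss_mb()))
    return history
```

A peak that keeps climbing across many evaluations of the same kind of point is a hint of a leak. Because this records the peak rather than the current usage, it can show growth but not memory being returned.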

@vivianmiranda: You seem to have done some PolyChord runs with cobaya. Did you ever use it alongside CLASS?

JesusTorrado avatar JesusTorrado commented on August 30, 2024

@lukashergt Have you, by any chance, moved the CLASS folder (or the modules folder containing it) after compiling it? That actually causes a memory leak.

vivianmiranda avatar vivianmiranda commented on August 30, 2024

lukashergt avatar lukashergt commented on August 30, 2024

> @lukashergt Have you, by any chance, moved the CLASS folder (or the modules folder containing it) after compiling it? That actually causes a memory leak.

No, I haven't. To make sure, I tried it both with a manual installation of CLASS and by re-installing it with cobaya. In either case I still get a memory leak.

Do you not get a memory leak when using both CLASS and PolyChord?

JesusTorrado avatar JesusTorrado commented on August 30, 2024

I am not sure I ever tried that combination. I'll take a look at it as soon as I am able (a little busy updating with Planck 2018 right now, and on leave)

JesusTorrado avatar JesusTorrado commented on August 30, 2024

Didn't have time to check it yet, sorry! We are doing a quick release, and I have added a note in the documentation (https://cobaya.readthedocs.io/en/latest/cosmo_troubleshooting.html#running-out-of-memory).

JesusTorrado avatar JesusTorrado commented on August 30, 2024

@lukashergt Could I ask you to perform a little test? Would you please greatly reduce the prior boundaries around some point at which you are sure CLASS is not failing? (with debug=True if possible, and please attach the last lines)

I have the feeling that the error has to do with CLASS not releasing memory when it fails (errors at extreme points are ignored and assigned zero likelihood). PolyChord's thorough exploration of the prior region produces a lot of those errors, which may explain why it fills up the memory but MCMC does not.
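
The requested test could be set up along the following lines. This is only a sketch of the `params` and `debug` parts of an input, written as a Python dict; the reference values, the 5% width and the helper `narrow_prior` are illustrative assumptions, and a complete input would also need its theory, likelihood and sampler blocks.

```python
def narrow_prior(center, half_width):
    """Return a flat prior block (min/max) centred on a known-good value."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return {"prior": {"min": center - half_width, "max": center + half_width}}


# Illustrative values only; substitute a point where CLASS is known to succeed.
reference_point = {"omega_b": 0.0222, "omega_cdm": 0.120, "n_s": 0.965}
fractional_width = 0.05  # hypothetical choice: +/- 5% around each value

info = {
    "params": {
        name: narrow_prior(value, fractional_width * abs(value))
        for name, value in reference_point.items()
    },
    "debug": True,  # as requested above, so the last log lines can be attached
}
```

If memory stops accumulating with such tight bounds, that would support the idea that the failed evaluations at extreme points are what leak.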

lukashergt avatar lukashergt commented on August 30, 2024

Indeed, the memory leak seems to be connected to regions of parameter space for which classy fails.


With priors that are too wide I get a lot of "[classy] Computation of cosmological products failed" and "[Likelihood] Theory code computation failed" messages, along with memory accumulation.

Here is one block of the debug output:

2019-08-28 15:12:47,288 [Model] Posterior to be computed for parameters {'gal545_A_217': 156.3685628417924, 'ps_A_143_217': 217.29285808198276, 'ps_A_143_143': 344.89545216705295, 'ps_A_100_100': 281.97719964606586, 'gal545_A_143': 15.630630705699392, 'theta_s_1e2': 5.267436541731474, 'xi_sz_cib': 0.44208128821227216, 'A_sz': 2.168155952112577, 'calib_100T': 0.9985359258023604, 'tau_reio': 0.325731929042888, 'n_s': 1.152197234248646, 'calib_217T': 1.0008927619614418, 'A_planck': 1.000470736481457, 'ps_A_217_217': 320.88956340879645, 'A_cib_217': 66.86002599605827, 'logA': 3.8903936091455673, 'gal545_A_100': 7.759427716677548, 'omega_b': 0.08559348417245202, 'omega_cdm': 0.1820260029398179, 'gal545_A_143_217': 61.25608221477475, 'ksz_norm': 9.014416933242488}
 2019-08-28 15:12:47,289 [prior] Evaluating prior at array([3.89039361e+00, 1.15219723e+00, 5.26743654e+00, 8.55934842e-02,
       1.82026003e-01, 3.25731929e-01, 1.00047074e+00, 6.68600260e+01,
       4.42081288e-01, 2.16815595e+00, 2.81977200e+02, 3.44895452e+02,
       2.17292858e+02, 3.20889563e+02, 9.01441693e+00, 7.75942772e+00,
       1.56306307e+01, 6.12560822e+01, 1.56368563e+02, 9.98535926e-01,
       1.00089276e+00])
 2019-08-28 15:12:47,290 [prior] Got logpriors = [-55.67649702298467, -2.512054827305687]
 2019-08-28 15:12:47,291 [Likelihood] Got input parameters: OrderedDict([('A_s', 4.89301420853194e-09), ('n_s', 1.152197234248646), ('100*theta_s', 5.267436541731474), ('omega_b', 0.08559348417245202), ('omega_cdm', 0.1820260029398179), ('m_ncdm', 0.06), ('tau_reio', 0.325731929042888), ('A_planck', 1.000470736481457), ('cib_index', -1.3), ('A_cib_217', 66.86002599605827), ('xi_sz_cib', 0.44208128821227216), ('A_sz', 2.168155952112577), ('ps_A_100_100', 281.97719964606586), ('ps_A_143_143', 344.89545216705295), ('ps_A_143_217', 217.29285808198276), ('ps_A_217_217', 320.88956340879645), ('ksz_norm', 9.014416933242488), ('gal545_A_100', 7.759427716677548), ('gal545_A_143', 15.630630705699392), ('gal545_A_143_217', 61.25608221477475), ('gal545_A_217', 156.3685628417924), ('calib_100T', 0.9985359258023604), ('calib_217T', 1.0008927619614418)])
 2019-08-28 15:12:47,292 [classy] Computing (state 1)
 2019-08-28 15:12:47,292 [classy] Setting parameters: {'A_s': 4.89301420853194e-09, 'N_ur': 2.0328, 'N_ncdm': 1, 'tau_reio': 0.325731929042888, 'n_s': 1.152197234248646, 'l_max_scalars': 2508, 'lensing': 'yes', 'm_ncdm': 0.06, '100*theta_s': 5.267436541731474, 'output': 'tCl lCl pCl', 'omega_b': 0.08559348417245202, 'omega_cdm': 0.1820260029398179}
 2019-08-28 15:12:49,173 [classy] Computation of cosmological products failed. Assigning 0 likelihood and going on.
 2019-08-28 15:12:49,174 [Likelihood] Theory code computation failed. Not computing likelihood.
 2019-08-28 15:12:49,178 [Model] Computed derived parameters: {'A': 4.89301420853194, 'clamp': 2.5506408927077397, 'A_s': 4.89301420853194e-09, 'rs_drag': nan, 'Omega_Lambda': nan, 'H0': nan, 'YHe': nan, 'omegamh2': nan, 'Omega_m': nan, 'chi2__CMB': nan, 'z_reio': nan, 'age': nan}

Using a narrower prior range on the input parameters seems to work (I don't have a finished run yet, but the memory is no longer accumulating).

> I have the feeling that the error has to do with CLASS not releasing memory when it fails (errors at extreme points are ignored and assigned zero likelihood). PolyChord's thorough exploration of the prior region produces a lot of those errors, which may explain why it fills up the memory but MCMC does not.

I agree. Does this point to a missing cosmo.struct_cleanup() or similar in the event where zero likelihood is assigned?
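
To make the question concrete, here is a minimal sketch of the pattern being asked about: releasing the theory code's memory whether or not the computation succeeds. `FakeBoltzmannCode` is a hypothetical stand-in written for this example; it is not the classy API and says nothing about how cobaya actually calls CLASS.

```python
class FakeBoltzmannCode:
    """Hypothetical stand-in for a Boltzmann-code wrapper (not classy)."""

    def __init__(self):
        self.allocated = 0  # counts blocks "held" by the mock

    def compute(self, params):
        self.allocated += 1  # pretend an internal structure was allocated
        if params["omega_b"] > 0.05:  # illustrative failure threshold
            raise RuntimeError("computation failed at extreme point")
        return -0.5 * (params["omega_b"] - 0.022) ** 2

    def cleanup(self):
        self.allocated = 0  # release whatever the last call allocated


def safe_loglike(code, params):
    """Evaluate the mock, releasing its memory on success and on failure."""
    try:
        return code.compute(params)
    except RuntimeError:
        return float("-inf")  # failed points are assigned zero likelihood
    finally:
        code.cleanup()  # runs whether or not compute() raised
```

With the cleanup in a `finally` clause, a failed point cannot leave the wrapper holding memory, at least as far as the wrapper itself is concerned. A leak inside the compiled code would not be cured by this.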

vivianmiranda avatar vivianmiranda commented on August 30, 2024

lukashergt avatar lukashergt commented on August 30, 2024

I am still running on cobaya 1.2.1, where I don't get memory problems with Cobaya+CAMB+PolyChord. When the prior range is set too wide, the code speeds through, assigning zero likelihood to "out of bounds parameters", and continues without memory accumulating.

Thus, I'd say the memory leak that @vivianmiranda experiences with cobaya 2.0 seems unrelated to this issue.

JesusTorrado avatar JesusTorrado commented on August 30, 2024

@lukashergt struct_cleanup is called just before setting parameters, so that it is guaranteed to be called once per error. I have the feeling that it's a CLASS problem, as in struct_cleanup missing something if CLASS initialization has not been completed. Since using the narrower priors is an effective workaround, let's leave it here. I'll leave this issue open until I have more time to investigate it (I am on parental leave right now), and will pass it on to the CLASS developers if applicable.

@vivianmiranda Weird! When CAMB crashes (before your fix), does it raise a CAMBError? Could you check whether it still leaks memory if you change your modification to raise a CAMBError in the lensing module (so that Cobaya gets to know about the failure to compute observables)?

JesusTorrado avatar JesusTorrado commented on August 30, 2024

@lukashergt The CLASS problem is being dealt with, and should be fixed in the next release, happening in a few days, and installed by default by the next Cobaya sub-version release. I will leave this open until then.

@vivianmiranda Any update on this?

lukashergt avatar lukashergt commented on August 30, 2024

In addition to the original issue:
The memory leak happens for curvature runs even for fairly narrow priors [-0.15, 0.15], i.e. narrowing the prior doesn't work as a temporary fix in this case.

@JesusTorrado
Do you happen to know when the CLASS update might happen?
Is there a development branch that I could already use now?

lukashergt avatar lukashergt commented on August 30, 2024

CLASS v2.8 came out last week. Unfortunately it does not fix this issue.

It actually got even worse. Where before (v2.7.2) I was able to run for flat universes with sufficiently narrow priors and only got the leak for curved universes or when the prior ranges were too big, now (v2.8.1) I actually have a memory leak even for flat universes with the same (narrow) priors that previously worked.

JesusTorrado avatar JesusTorrado commented on August 30, 2024

Hi Lukas,

Please try CLASS 2.8.2 and let me know if it fixes the memory leaks (I've tested it with broad priors and including free omega_k; it leaks a tiny bit of memory, like a few tens of MB, in some corner cases only). It should also solve lesgourg/class_public#299.

There is a small chance it will not work in cobaya/master. If that's the case, feel free to use the devel branch instead (to be merged soon).

lukashergt avatar lukashergt commented on August 30, 2024

Currently trying things out.

I too still notice a bit of leak, but possibly this won't prevent a run from finishing.

I get a cryptic "python3.6 terminated with signal 11" error when running with Planck18 and TTTEEE (and with curvature). With TT only I don't get that error. Did you stumble onto something similar? Might try the devel branch for this.

JesusTorrado avatar JesusTorrado commented on August 30, 2024

I didn't notice the segfault, but I was testing in devel with TTTEEE. If you do get it in devel, please let me know, along with the specific parameters with which CLASS was called just before the segfault, as printed in the debug output. Planck's clik is the same in both branches, actually.

lukashergt avatar lukashergt commented on August 30, 2024

Hi Jesus,

I've tried both master and devel branch. The flat case works fine again, but I didn't manage to get a curvature run to finish.

> I've tested it with broad priors and including free omega_k; it leaks a tiny bit of memory, like a few tens of MB, in some corner cases only

Did you get a PolyChord run to finish including omega_k? I manage to get it started, but it crashes after a couple of hours. What prior on omega_k did you use? Could you maybe send me the corresponding .yaml file?

JesusTorrado avatar JesusTorrado commented on August 30, 2024

I haven't tried letting it finish: just generated a large set of initial samples for PolyChord. My input file is attached (extension changed to .log because otherwise github doesn't take it).
classk.log

Can you give me some more info on the crash? Could you run again with debug on, please?

lukashergt avatar lukashergt commented on August 30, 2024

Hi Jesus,
sorry for taking so long to get back to you!

> Can you give me some more info on the crash? Could you run again with debug on, please?

I have retried things with the devel branch and CLASS version 2.8.2. The yaml file that you posted only tests lowl.TT and lensing. I have found that lowl.TT and highl.TTTEEE seem to work fine: there is a little memory accumulation, but not enough to stop the run.

However, lowl.EE causes the problem. (Maybe related to #43?) This happens faster when running lowl.EE alone and takes longer when running with other likelihoods. Also, unlike in #43, the error does not happen at the first call to lowl.EE. Does this mean a certain part of parameter space causes the problem for the lowl.EE likelihood?

This does not seem to be a memory issue anymore. When running lowl.EE only, the error happens within the first 20 min. I monitored the memory, which was fine.

Debug output and error file are attached:
test_omegak_EE_18378439.out.log
test_omegak_EE_18378439.err.log
test_omegak_EE.interactive.log


error overview

  • When sending the job off via a slurm script, it crashes after a while with a "signal 11" error (see the .err file):
python3.6:262768 terminated with signal 11 at PC=2b89e4d4205e SP=7ffe97c0de90. 
  • When I run it interactively, it crashes with an Intel error (see the .interactive file):
PolyChord: Next Generation Nested Sampling
copyright: Will Handley, Mike Hobson & Anthony Lasenby
  version: 1.16
  release: 1st March 2019
    email: [email protected]

Run Settings
nlive    :      64
nDims    :       8
nDerived :      13
Doing Clustering
Generating equally weighted posteriors
Generating weighted posteriors
Clustering on posteriors
Writing a resume file tochains/test/test_omegak_EE/raw_polychord_output/test_omegak_EE.resume

generating live points


all live points generated

number of repeats:           14          40
started sampling


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 123210 RUNNING AT cpu-e-1056
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
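
Signal 11 is a segmentation fault (SIGSEGV), which normally ends the process without any Python traceback. One way to at least find out which Python call was active when the compiled code crashed is the standard-library `faulthandler` module. The sketch below assumes it is called at the top of the run script; the helper name and the log path are hypothetical, and under MPI each process would presumably need its own file.

```python
import faulthandler


def enable_crash_traceback(path):
    """Dump the Python traceback to `path` if the process gets a fatal signal."""
    log_file = open(path, "w")
    # all_threads=True also dumps the stacks of non-main threads.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file  # keep a reference: the file must stay open until exit
```

For example, `handle = enable_crash_traceback("crash_traceback.log")` before starting the sampler. The traceback only covers Python frames, so it narrows down the failing call rather than pinpointing the fault inside the compiled library.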

JesusTorrado avatar JesusTorrado commented on August 30, 2024

Hi @lukashergt

I have been doing some more testing on this. In particular, I ran your input file for a few hours last evening (enough to get all initial live samples, and some more) and found no crash. The kind of errors that appear in your logs and that I find too (extreme amplitudes in the clik EE likelihood) look like they are being properly caught. So my only reasonable hypothesis right now is that it's ifort-related (I run with GNU). Perhaps reducing logzero to -1e30 would solve it, as in the bug you reported at handley-lab/anesthetic#70? (In any case, -1e30 will be the new default in devel.)
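
As a rough illustration of what a finite `logzero` means, the sketch below maps non-finite or extremely negative log-likelihoods to a fixed floor, so that no NaN or -inf is handed on to compiled code. This shows the general idea only and is not a description of how Cobaya or PolyChord handle it internally; `floor_loglike` is a hypothetical helper, and the `logzero` key in the sampler block is assumed to be the option referred to above (`nlive` matches the 64 shown in the log).

```python
import math

LOGZERO = -1e30  # value suggested in the comment above


def floor_loglike(logp, logzero=LOGZERO):
    """Map NaN, -inf or too-small log-likelihoods to a finite floor."""
    if math.isnan(logp) or logp < logzero:
        return logzero
    return logp


# Fragment of an input written as a Python dict (assumed option names).
sampler_info = {"sampler": {"polychord": {"nlive": 64, "logzero": LOGZERO}}}
```

Ordinary values pass through unchanged, so the floor only affects points that would otherwise be reported as NaN or negative infinity.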

JesusTorrado avatar JesusTorrado commented on August 30, 2024

Since this should be solved by the last fix to logzero, I am closing this.
