Comments (23)
Hi Lukas,
Let's see if the other bug is related to this one.
Update on this.
As guessed in my first post, the huge number of repeats causes cobaya to crash early on. When manually setting the number of repeats (as described in #35) it doesn't crash as quickly, which allowed me to monitor the memory.
Running on only 1 core shows the memory steadily rising (it went beyond 100 GB within the first 15 min) before the run eventually crashes. Running on 16 cores the memory accumulates on each core, so it crashes much more quickly. I've played around with the following parameters, with the same result:
- number of cores
- n_repeats (via speed blocking)
- n_live
- experiment: I've tried Planck 15 TTlowTEB (my target) and SN Pantheon
Note that this accumulation of memory happens when using PolyChord together with CLASS. It doesn't happen when switching the sampler to MCMC, nor when switching the theory code to CAMB.
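For reference, here is a minimal sketch of how the per-process memory monitoring described above can be done from within Python with only the standard library. The function and the loop are illustrative assumptions, not part of the actual run; `ru_maxrss` units differ between Linux (KB) and macOS (bytes), which the sketch accounts for.

```python
import resource
import sys

def peak_memory_mb():
    """Return this process's peak resident set size in MB.

    On Linux ru_maxrss is reported in KB; on macOS it is in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform != "darwin" else rss / (1024 * 1024)

if __name__ == "__main__":
    # Example: log peak RSS periodically to spot steady accumulation (a leak
    # shows up as a monotonically growing value across likelihood calls).
    for call in range(1, 4):
        # ... a likelihood evaluation would go here ...
        print(f"call {call}: peak RSS ~ {peak_memory_mb():.1f} MB")
```

Printing this every few hundred likelihood calls makes a leak visible long before the job is killed by the scheduler.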
@vivianmiranda: You seem to have done some PolyChord runs with cobaya. Did you ever use it alongside CLASS?
@lukashergt Have you, by any chance, moved the CLASS folder (or the modules folder containing it) after compiling it? That actually causes a memory leak.
> @lukashergt Have you, by any chance, moved the CLASS folder (or the modules folder containing it) after compiling it? That actually causes a memory leak.
No, I haven't. And to make sure, I tried it with a manual installation of CLASS and with a re-install through cobaya. In either case I still get a memory leak.
Do you not get a memory leak when using both CLASS and PolyChord?
I am not sure I ever tried that combination. I'll take a look at it as soon as I am able (a little busy updating with Planck 2018 right now, and on leave)
Didn't have time to check it yet, sorry! We are doing a quick release, and I have added a note in the documentation (https://cobaya.readthedocs.io/en/latest/cosmo_troubleshooting.html#running-out-of-memory).
@lukashergt Could I ask you to perform a little test? Would you please greatly reduce the prior boundaries around some point at which you are sure CLASS is not failing? (with debug=True if possible, and please attach the last lines)
I have the feeling that the error has to do with CLASS not releasing memory when it fails (errors at extreme points are ignored and assigned zero likelihood). PolyChord's thorough exploration of the prior region produces a lot of those errors, which may explain why it fills up the memory but MCMC does not.
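As an illustration of that test, the priors in the input yaml can be shrunk to a small interval around a point where CLASS is known to succeed. The parameter names follow the usual cobaya/classy convention; the exact numbers below are placeholders for this sketch, not recommended values:

```yaml
params:
  omega_b:
    prior:
      min: 0.0215   # narrow band around a fiducial value where CLASS succeeds
      max: 0.0230
  omega_cdm:
    prior:
      min: 0.115
      max: 0.125
debug: True
```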
Indeed, the memory leak seems to be connected to regions of parameter space for which classy fails.
With too wide priors I get a lot of `[classy] Computation of cosmological products failed` and `[Likelihood] Theory code computation failed` messages, together with memory accumulation.
Here is one block of the debug output:
```
2019-08-28 15:12:47,288 [Model] Posterior to be computed for parameters {'gal545_A_217': 156.3685628417924, 'ps_A_143_217': 217.29285808198276, 'ps_A_143_143': 344.89545216705295, 'ps_A_100_100': 281.97719964606586, 'gal545_A_143': 15.630630705699392, 'theta_s_1e2': 5.267436541731474, 'xi_sz_cib': 0.44208128821227216, 'A_sz': 2.168155952112577, 'calib_100T': 0.9985359258023604, 'tau_reio': 0.325731929042888, 'n_s': 1.152197234248646, 'calib_217T': 1.0008927619614418, 'A_planck': 1.000470736481457, 'ps_A_217_217': 320.88956340879645, 'A_cib_217': 66.86002599605827, 'logA': 3.8903936091455673, 'gal545_A_100': 7.759427716677548, 'omega_b': 0.08559348417245202, 'omega_cdm': 0.1820260029398179, 'gal545_A_143_217': 61.25608221477475, 'ksz_norm': 9.014416933242488}
2019-08-28 15:12:47,289 [prior] Evaluating prior at array([3.89039361e+00, 1.15219723e+00, 5.26743654e+00, 8.55934842e-02,
1.82026003e-01, 3.25731929e-01, 1.00047074e+00, 6.68600260e+01,
4.42081288e-01, 2.16815595e+00, 2.81977200e+02, 3.44895452e+02,
2.17292858e+02, 3.20889563e+02, 9.01441693e+00, 7.75942772e+00,
1.56306307e+01, 6.12560822e+01, 1.56368563e+02, 9.98535926e-01,
1.00089276e+00])
2019-08-28 15:12:47,290 [prior] Got logpriors = [-55.67649702298467, -2.512054827305687]
2019-08-28 15:12:47,291 [Likelihood] Got input parameters: OrderedDict([('A_s', 4.89301420853194e-09), ('n_s', 1.152197234248646), ('100*theta_s', 5.267436541731474), ('omega_b', 0.08559348417245202), ('omega_cdm', 0.1820260029398179), ('m_ncdm', 0.06), ('tau_reio', 0.325731929042888), ('A_planck', 1.000470736481457), ('cib_index', -1.3), ('A_cib_217', 66.86002599605827), ('xi_sz_cib', 0.44208128821227216), ('A_sz', 2.168155952112577), ('ps_A_100_100', 281.97719964606586), ('ps_A_143_143', 344.89545216705295), ('ps_A_143_217', 217.29285808198276), ('ps_A_217_217', 320.88956340879645), ('ksz_norm', 9.014416933242488), ('gal545_A_100', 7.759427716677548), ('gal545_A_143', 15.630630705699392), ('gal545_A_143_217', 61.25608221477475), ('gal545_A_217', 156.3685628417924), ('calib_100T', 0.9985359258023604), ('calib_217T', 1.0008927619614418)])
2019-08-28 15:12:47,292 [classy] Computing (state 1)
2019-08-28 15:12:47,292 [classy] Setting parameters: {'A_s': 4.89301420853194e-09, 'N_ur': 2.0328, 'N_ncdm': 1, 'tau_reio': 0.325731929042888, 'n_s': 1.152197234248646, 'l_max_scalars': 2508, 'lensing': 'yes', 'm_ncdm': 0.06, '100*theta_s': 5.267436541731474, 'output': 'tCl lCl pCl', 'omega_b': 0.08559348417245202, 'omega_cdm': 0.1820260029398179}
2019-08-28 15:12:49,173 [classy] Computation of cosmological products failed. Assigning 0 likelihood and going on.
2019-08-28 15:12:49,174 [Likelihood] Theory code computation failed. Not computing likelihood.
2019-08-28 15:12:49,178 [Model] Computed derived parameters: {'A': 4.89301420853194, 'clamp': 2.5506408927077397, 'A_s': 4.89301420853194e-09, 'rs_drag': nan, 'Omega_Lambda': nan, 'H0': nan, 'YHe': nan, 'omegamh2': nan, 'Omega_m': nan, 'chi2__CMB': nan, 'z_reio': nan, 'age': nan}
```
Using a narrower prior range on the input parameters seems to work (I don't have a finished run yet, but the memory is not accumulating anymore...).
> I have the feeling that the error has to do with CLASS not releasing memory when it fails (errors at extreme points are ignored and assigned zero likelihood). PolyChord's thorough exploration of the prior region produces a lot of those errors, which may explain why it fills up the memory but MCMC does not.
I agree. Does this point to a missing `cosmo.struct_cleanup()` or similar in the event where zero likelihood is assigned?
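The cleanup-on-failure pattern being discussed can be sketched with a stand-in theory object. In the real code the object would be a classy `Class` instance, whose `compute()` raises an exception on failure and whose `struct_cleanup()` frees the C-level structures; the mock class and the failure condition below are hypothetical, only to show the control flow:

```python
class MockTheory:
    """Hypothetical stand-in for a classy.Class instance."""

    def __init__(self):
        self.allocated = False

    def compute(self, params):
        self.allocated = True  # C-level structures would be allocated here
        if params.get("omega_b", 0) > 0.05:  # pretend extreme values fail
            raise RuntimeError("Computation of cosmological products failed")

    def struct_cleanup(self):
        self.allocated = False  # free the C-level structures


def loglike(theory, params):
    """Assign zero likelihood (log-likelihood -inf) on failure, always cleaning up."""
    try:
        theory.compute(params)
        return 0.0  # placeholder log-likelihood for a successful point
    except RuntimeError:
        return float("-inf")
    finally:
        # Without this, every failed point would leak its allocations.
        theory.struct_cleanup()


theory = MockTheory()
print(loglike(theory, {"omega_b": 0.08}))  # failing point -> -inf
print(theory.allocated)  # False: structures released despite the failure
```

If the cleanup in the `finally` branch is skipped (or is incomplete because initialization itself failed part-way), each rejected point leaves its allocations behind, which matches the steady accumulation reported above.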
I am still running on cobaya 1.2.1, where I don't get memory problems with Cobaya+CAMB+PolyChord. When setting a too wide prior range, the code speeds through, assigning zero likelihood to "out of bounds parameters" and continuing without memory accumulating.
Thus, I'd say the memory leak that @vivianmiranda experiences with cobaya 2.0 seems unrelated to this issue.
@lukashergt `struct_cleanup` is called just before setting parameters, so it is guaranteed to be called once per error. I have the feeling that it's a CLASS problem, as in `struct_cleanup` missing something if CLASS initialization has not been completed. Since using the narrower priors is an effective workaround, let's leave it here. I'll leave this issue open until I have more time to investigate it (I am on parental leave right now), and will pass it on to the CLASS developers if applicable.
@vivianmiranda Weird! When CAMB crashes (before your fix), does it raise a `CAMBError`? Could you check whether, after changing the modification to raise a `CAMBError` in the lensing module (so that Cobaya gets to know about the failure to compute observables), it still leaks memory?
@lukashergt The CLASS problem is being dealt with and should be fixed in the next release, due in a few days, which will be installed by default by the next Cobaya sub-version release. I will leave this open until then.
@vivianmiranda Any update on this?
In addition to the original issue: the memory leak happens for curvature runs even for fairly narrow priors `[-0.15, 0.15]`, i.e. narrowing the prior doesn't work as a temporary fix in this case.
@JesusTorrado Do you happen to know when the CLASS update might happen? Is there a development branch that I could already use now?
CLASS v2.8 came out last week. Unfortunately it does not fix this issue.
It actually got even worse: where before (v2.7.2) I was able to run flat universes with sufficiently narrow priors, and only got the leak for curved universes or when the prior ranges were too big, now (v2.8.1) I get a memory leak even for flat universes with the same (narrow) priors that previously worked.
Hi Lukas,
Please try CLASS 2.8.2 and let me know if it fixes the memory leaks (I've tested it with broad priors and including free omega_k; it leaks a tiny bit of memory, a few tens of MB, in some corner cases only). It should also solve lesgourg/class_public#299
There is a small chance it will not work in cobaya `master`. If that's the case, feel free to use the `devel` branch instead (to be merged soon).
Currently trying things out.
I too still notice a bit of a leak, but possibly this won't prevent a run from finishing.
I get a cryptic "python3.6 terminated with signal 11" error when running with Planck18 and TTTEEE (and with curvature). With TT only I don't get that error. Did you stumble on something similar? I might try the devel branch for this.
I didn't notice the segfault, but I was testing in `devel` with TTTEEE. If you get the same in `devel`, please let me know, along with the specific parameters with which CLASS was called just before the segfault, printed in the debug output. Planck's `clik` is the same in both branches, actually.
Hi Jesus,
I've tried both the `master` and `devel` branches. The flat case works fine again, but I didn't manage to get a curvature run to finish.
> I've tested it with broad priors and including free omega_k; it leaks a tiny bit of memory, like a few 10's of Mb, in some corner cases only
Did you get a PolyChord run to finish including omega_k? I manage to get it started, but it crashes after a couple of hours. What prior on omega_k did you use? Could you maybe send me the corresponding `.yaml` file?
I haven't tried letting it finish: I just generated a large set of initial samples for PolyChord. My input file is attached (extension changed to .log because otherwise GitHub doesn't accept it).
classk.log
Can you give me some more info on the crash? Could you run again with debug on, please?
Hi Jesus,
sorry for taking so long to get back to you!
> Can you give me some more info on the crash? Could you run again with debug on, please?
I have retried things with the `devel` branch and CLASS version 2.8.2. The yaml file that you posted only tests `lowl.TT` and `lensing`. I have found that `lowl.TT` and `highl.TTTEEE` seem to work fine: a little bit of memory accumulation, but not so much as to stop the run.
However, `lowl.EE` causes the problem (maybe related to #43?). This happens faster when only running `lowl.EE` and takes longer when running with other likelihoods. Also, differently from #43, the error does not happen at the first call to `lowl.EE`. Does this mean a certain part of parameter space causes the problem for the `lowl.EE` likelihood?
This does not seem to be a memory issue anymore. When running `lowl.EE` only, the error happens within the first 20 min. I monitored the memory, which was fine.
Debug output and error file are attached:
test_omegak_EE_18378439.out.log
test_omegak_EE_18378439.err.log
test_omegak_EE.interactive.log
Error overview:
- When sending the job off via a slurm script, it crashes after a while with a "signal 11" error (see the `.err` file):
```
python3.6:262768 terminated with signal 11 at PC=2b89e4d4205e SP=7ffe97c0de90.
```
- When I run it interactively, it crashes with an Intel MPI error (see the `.interactive` file):
```
PolyChord: Next Generation Nested Sampling
copyright: Will Handley, Mike Hobson & Anthony Lasenby
version: 1.16
release: 1st March 2019
email: [email protected]
Run Settings
nlive : 64
nDims : 8
nDerived : 13
Doing Clustering
Generating equally weighted posteriors
Generating weighted posteriors
Clustering on posteriors
Writing a resume file to chains/test/test_omegak_EE/raw_polychord_output/test_omegak_EE.resume
generating live points
all live points generated
number of repeats: 14 40
started sampling
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 123210 RUNNING AT cpu-e-1056
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
```
Hi @lukashergt,
I have been doing some more testing on this. In particular, I ran your input file for a few hours last evening (enough to get all initial live samples, and some more) and found no crash. The kind of errors that appear in your logs, and that I find too (extreme amplitudes in the clik EE likelihood), look like they are being properly caught. So my only reasonable hypothesis right now is that it's `ifort`-related (I run with GNU). Maybe reducing `logzero` to -1e30 may solve it, as in the bug you reported at handley-lab/anesthetic#70? (In any case, -1e30 will be the new default in `devel`.)
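A hedged sketch of where that setting would go in a cobaya input file; the `logzero` option of the PolyChord sampler block and the value are taken from this discussion, so check the sampler documentation for your cobaya version before relying on it:

```yaml
sampler:
  polychord:
    # Avoid compiler-dependent overflow issues with the default "minus infinity":
    logzero: -1e30
```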
Since this should be solved by the last fix to logzero, I am closing this.