Comments (20)
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
@attamatti Do you know if the original issue is resolved by the changes Dari mentioned?
from relion.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi Özkan Yildiz, the behavior you're describing is different from that shown in the original issue, so I've moved it to the new Issue #53.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
We have now repeated the same 2D classification with v2.0.b10, after first doing it with v2.0.b9, where we observed hang-ups of the GPU cards at the end of the 7th iteration. The processes stop at exactly the same iteration (7) and at the same time, and one GPU keeps running at 100% while the other drops to 0%.
We therefore observed the behavior of the two GPU cards during this iteration.
It seems that shortly before one GPU is lost, the memory consumption (and 3% does not seem to be much), which was equal on both GPU cards in previous iterations, shifts during iteration 7 to only one GPU, at double the amount (see below). After a while, the second, zero-memory-consuming GPU goes down (see below) after idling for some time at 100%. So it looks like one GPU card took over the whole workload of the second card.
The iteration with the hang-up takes longer than it should in theory, and the progress bar hangs just before it reaches its end. We noticed that after around 30 min (assuming this would be the time needed for the whole iteration) the estimated time for the expectation step starts rising to about 50 min, and then one of the cards no longer consumes any memory.
Maybe there is something wrong with the parallelisation?
The temperatures of both GPU cards seem to be OK, and there is no sign that either card has a temperature problem. We are doing 3D classifications on the same GPU cards and they work perfectly fine. We are also running the same 2D classification on CPUs only, and it runs fine.
Now we are going to try the same 2D classification on only one GPU card (with half the number of CPUs) in order to see if the same thing happens.
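The stall signature visible in the dmon traces (SM pegged near 100% while memory-controller activity sits at 0) can also be picked out mechanically. A minimal illustrative sketch in Python (the function is mine, not part of RELION), assuming the default `nvidia-smi dmon` column order (gpu, pwr, temp, sm, mem, ...) shown in the logs:

```python
def find_stall(dmon_lines):
    """Return (line_index, gpu_id) of the first sample where a GPU shows
    full SM load but zero memory activity, or None if no stall is seen."""
    for i, line in enumerate(dmon_lines):
        if line.lstrip().startswith("#"):
            continue  # skip the two dmon header lines
        fields = line.split()
        gpu, sm, mem = int(fields[0]), int(fields[3]), int(fields[4])
        if sm >= 99 and mem == 0:
            return i, gpu
    return None
```

Run over a captured dmon log, this flags the moment one card stops doing memory work while still reporting ~100% SM utilisation, matching the traces that follow.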
Typical output of nvidia-smi dmon during iteration 6:
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 149 79 99 4 0 0 3304 1177
1 166 65 99 3 0 0 3304 1366
0 151 79 99 3 0 0 3304 1177
1 162 65 99 4 0 0 3304 1366
0 148 79 99 4 0 0 3304 1177
1 167 65 99 4 0 0 3304 1366
0 154 79 99 3 0 0 3304 1177
1 163 65 99 4 0 0 3304 1366
0 145 79 99 4 0 0 3304 1177
1 164 65 99 4 0 0 3304 1366
0 152 79 99 4 0 0 3304 1177
1 158 65 99 4 0 0 3304 1366
0 159 79 99 4 0 0 3304 1177
1 165 65 99 4 0 0 3304 1366
0 150 79 99 3 0 0 3304 1177
1 160 65 99 4 0 0 3304 1366
Output of nvidia-smi dmon when iteration 7 hangs and one GPU goes down:
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 95 68 100 0 0 0 3304 1189
1 117 60 99 6 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 120 60 99 6 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 123 60 99 6 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 121 60 99 6 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 116 60 99 6 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 145 64 99 4 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 137 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 141 64 99 4 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 140 64 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 136 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 136 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 133 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 139 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 142 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 141 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 130 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 135 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 136 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 135 63 99 5 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 132 63 68 2 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 93 62 1 0 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 95 62 3 0 0 0 3304 1366
0 95 68 100 0 0 0 3304 1189
1 80 61 5 0 0 0 3304 1163
0 95 68 100 0 0 0 3304 1189
1 76 60 0 0 0 0 3304 1163
0 95 68 100 0 0 0 3304 1189
1 76 60 0 0 0 0 3304 1163
0 95 68 100 0 0 0 3304 1189
1 76 60 0 0 0 0 3304 1163
0 95 68 100 0 0 0 3304 1189
1 76 60 0 0 0 0 3304 1163
Command line:
#!unix
mpirun -n 35 relion_refine_mpi --o Class2D/job074/run --i Extract/job071/particles.star --dont_combine_weights_via_disc --pool 50 --ctf --iter 30 --tau2_fudge 2 --particle_diameter 190 --K 30 --flatten_solvent --zero_mask --strict_highres_exp 15 --oversampling 1 --psi_step 12 --offset_range 3 --offset_step 2 --norm --scale --j 1 --gpu
Output of iterations 6 and 7:
Auto-refine: Estimated accuracy angles= 14.1 degrees; offsets= 6.9 pixels
CurrentResolution= 11.52 Angstroms, which requires orientationSampling of at least 6.92308 degrees for a particle of diameter 190 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 8640
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 2 NrTranslations= 9
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 276480
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 1 NrTranslations= 36
=============================
Estimated memory for expectation step > 0.316617 Gb.
Estimated memory for maximization step > 0.000545025 Gb.
Expectation iteration 6 of 30
36.85/36.85 min ............................................................~~(,_,">
Maximization ...
1/ 1 sec ............................................................~~(,_,">
Estimating accuracies in the orientational assignment ...
1/ 1 sec ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 9.1 degrees; offsets= 5.45 pixels
CurrentResolution= 10.9964 Angstroms, which requires orientationSampling of at least 6.54545 degrees for a particle of diameter 190 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 8640
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 2 NrTranslations= 9
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 276480
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 1 NrTranslations= 36
=============================
Estimated memory for expectation step > 0.324049 Gb.
Estimated memory for maximization step > 0.000579759 Gb.
Expectation iteration 7 of 30
49.70/50.05 min ...........................................................~~(,_,">
It stops at this point, and no error message is shown.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
A similar issue has been addressed in commit e5d5e0f (beta version v2.0.b10). Since I haven't been able to reproduce this particular error, I cannot know for sure that it has been fixed. Please reopen this issue if it still persists.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
It is unlikely that the error is an effect of this particular data alone, but rather of its combination with the specific RELION run settings up to the point where the error was encountered. Hence we are more likely to reproduce the error if we start off from a point as close as possible to just before it. To do this we'll specifically need the file 15jul18a_b_00007gr_00008sq_v01_00002hl_00002en.frames_b_a.mrcs and the output file run_it003_classes.mrcs.
If you could send us these, we'll have the issue fixed in no time. Thanks!
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
Do you have an MD5 sum for it? Just to make sure we're not getting errors from corrupted files.
I'm downloading it now, so as soon as we know we have it intact, I'll let you know and you can take down the data. You can also edit your post to remove the link if you don't want it permanently visible here.
Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):
@bforsbe I'm happy to send you the data that are causing this crash, but it's too big (8.6 GB) to post here.
Here it is: https://drive.google.com/open?id=0B1Q2t9VcshTFNHFObktqZC15V1U
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
To fix this issue we'll need to be able to reproduce the error on our nodes with the proper debugging tools.
The files we'll need are the output files from the iteration previous to the one that crashed (iteration 7), including the data, model, optimiser and sampling STAR files and the class .mrcs files. Also, the single .mrcs file that contains the erroneous particle:
Extract/job018/frames/15jul18a_b_00007gr_00008sq_v01_00002hl_00002en.frames_b_a.mrcs
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
I can't make sense of why the run crashed based on the attached files, unfortunately. In order to diagnose it I need data so that I can run it myself and see the error in a reproducible manner.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
Sure, just fork the repo, commit to your fork and submit a pull-request! Thanks!
Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):
OK, but it's phase plate data, so you'll probably need to fix #29 first to reproduce it. Or I can push my local fix if you give me write access.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
Sounds like a different issue, albeit with similar errors being reported. I would prefer a separate issue for your error (and, if possible, a reproducible test case which continues from the problematic iteration and contains only the particle which can't be reconciled).
Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):
I think I'm having the same issue during 3D refinement. The particles it complains about look perfectly fine visually, with no abnormal values in either real or Fourier space. After I remove those particles, the current iteration runs through without complaints, but then during the next iteration different particles cause the same problem.
Do you need more data, or do you already have an idea what might be causing this?
Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):
Here they are for the crashed dataset (~25k particles) and the smaller one that worked.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
The _model.star and _optimiser.star files might be good as well, mostly from iteration 3; iteration 25 is not that important.
Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):
I ran it again, this time using only the first 100 micrographs (~2,500 particles), and it worked without issues.
Here are the sampling files from iteration 3 of the full dataset (~25,000 particles), which locks up, and from iterations 3 and 25 of the truncated dataset, which worked.
I can send along the data files later, but I'm on wifi now.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
Matt's observation makes sense, in that you find non-zero differences (in his case min_diff2 = 5087.93), but that once these are converted to weights, all weights are zero. One step in the conversion is the subtraction of the smallest difference. That is, if all differences are identical, they all get set to zero, and thus the sum of all weights is also 0. This obviously can't be normalised, sorted or ranked in any way, so problems ensue. Having blank classes as input would look exactly like this.
So I think there might be an unfortunate set of circumstances causing the previous iteration to not write the classes correctly.
Is it possible to post the output from this refinement for the last few iterations before you saw the issue? I probably don't need the .mrcs files with the classes; I just need to know whether they are blank. But the _sampling.star, _data.star and similar files could be useful for fixing the bug.
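The degenerate case described here can be sketched numerically. The following is an illustrative Python model of a difference-to-weight conversion, not RELION's actual code: with min_diff2 around 5088 (as in Matt's log), a naive exp(-diff2) underflows to zero in double precision, which is exactly why the smallest difference is subtracted first; and if the class contributions are all zero (e.g. blank class averages), the combined weights sum to zero regardless, so normalisation fails.

```python
import numpy as np

diff2 = np.array([5087.93, 5090.10, 5102.70])  # squared differences, like min_diff2 in the log

# Naive conversion underflows: exp(-5088) is far below the double-precision range.
naive = np.exp(-diff2)
assert naive.sum() == 0.0

# Subtracting the minimum first keeps the largest weight at exp(0) = 1.
stable = np.exp(-(diff2 - diff2.min()))
assert stable.sum() > 0.0

# But with all-zero class priors (blank classes), every combined weight is
# zero anyway, so the sum cannot be normalised and no weight survives the
# significance filter -- consistent with "filteredSize == 0" in the error dump.
prior = np.zeros_like(diff2)
assert (prior * stable).sum() == 0.0
```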
Original comment by craigyk (Bitbucket: craigyk, GitHub: craigyk):
I'll have to find exactly which particle set caused the error, to hunt it down for you; I have been working with a slightly different particle set and haven't run into the problem again.
Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):
I just had a similar error. It happens at the beginning of iteration 4 every time I run it.
#!unix
exp_fn_img= 000001@Extract/job004/micrographs/micrograph00001.mrcs
ipart= 0 adaptive_fraction= 0.999
min_diff2= 5087.93
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
filteredSize == 0
File: /fbs/emsoftware2/LINUX/fbsmi/relion2-beta/src/gpu_utils/cuda_ml_optimiser.cu line: 1552
In thread 0
It freezes up and leaves the job on the GPU, which has to be cleared manually.
I deleted the offending particle from the star file, and it just errors out at the same place (the beginning of iteration 4) on the next particle:
#!unix
exp_fn_img= 000002@Extract/job004/micrographs/micrograph00001.mrcs
ipart= 0 adaptive_fraction= 0.999
min_diff2= 4826.25
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
filteredSize == 0
File: /fbs/emsoftware2/LINUX/fbsmi/relion2-beta/src/gpu_utils/cuda_ml_optimiser.cu line: 1552
In thread 0
Also, the class averages from iteration 3 are just blank (black). The class averages from iteration 2 look as expected.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
It looks like it occurred at the very start of an iteration; does it behave similarly if you run again, continuing from the last completed iteration? If so, would it be possible for you to make a star file with these 10 particles and construct a very small set of files that can reproduce the error? If you provide this, I can have a look fairly easily.
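Constructing such a small test case mostly means pruning a data STAR file down to the handful of offending particles. A rough, hypothetical helper (the function name and the assumption of a single loop with an _rlnImageName column are mine, matching the 2.x STAR layout):

```python
def filter_star(in_path, out_path, keep_images):
    """Copy a RELION data .star file, keeping only data rows whose
    _rlnImageName value is in keep_images. All header lines are kept."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        col = None   # 1-based column index of _rlnImageName
        ncols = 0    # number of loop columns seen so far
        for line in fin:
            tokens = line.split()
            if tokens and tokens[0].startswith("_rln"):
                ncols += 1
                if tokens[0] == "_rlnImageName":
                    col = int(tokens[-1].lstrip("#"))
                fout.write(line)
            elif col is not None and ncols > 0 and len(tokens) >= ncols:
                if tokens[col - 1] in keep_images:   # data row: filter
                    fout.write(line)
            else:
                fout.write(line)                     # data_/loop_/blank lines
```

The remaining image stacks referenced by the kept rows would then be the only files needed alongside the pruned star file.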