
Comments (20)

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


@attamatti Do you know if the original issue is resolved by the changes Dari mentioned?

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):


Hi Özkan Yildiz, the behavior you're describing is different from the one shown in the original issue, hence it has been moved to the new Issue #53.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):


We have now repeated the same 2D classification with v2.0.b10, after doing it with v2.0.b9, where we observed hang-ups of the GPU cards at the end of the 7th iteration. The processes stop at exactly the same iteration 7 and at the same time, and one GPU keeps running at 100 % while the other goes to 0 %.
We therefore observed the behavior of the two GPU cards during this iteration.
It seems that before one GPU is lost, the memory consumption (and 3 % does not seem to be much), which was equal on both of our GPU cards in previous iterations, shifts during iteration 7 to only one GPU at double the amount (see below). After a while the second GPU, now consuming no memory, goes down (see below) after idling for some time at 100 %. So it looks like one GPU card has taken over the whole work of the second card.
The iteration with the hang-up takes longer than it should in theory, and the progress output hangs just before the mouse reaches the end of the bar. We noticed that after around 30 min (assuming this would be the time needed for the whole iteration) the estimated time for the expectation step starts rising to about 50 min, and then one of the cards no longer consumes any memory.
Maybe there is something wrong with parallelisation?

The temperatures of both GPU cards seem to be OK, and there is no sign that one GPU card has a temperature problem. We are doing 3D classifications on the same GPU cards and they work perfectly fine. We are also doing the same 2D classification on CPUs only and it runs fine.

Now we are going to try the same 2D classification on only one GPU card (with half the number of CPUs) in order to see if the same thing happens.

Usual output of nvidia-smi dmon during iteration 6

# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0   149    79    99     4     0     0  3304  1177
    1   166    65    99     3     0     0  3304  1366
    0   151    79    99     3     0     0  3304  1177
    1   162    65    99     4     0     0  3304  1366
    0   148    79    99     4     0     0  3304  1177
    1   167    65    99     4     0     0  3304  1366
    0   154    79    99     3     0     0  3304  1177
    1   163    65    99     4     0     0  3304  1366
    0   145    79    99     4     0     0  3304  1177
    1   164    65    99     4     0     0  3304  1366
    0   152    79    99     4     0     0  3304  1177
    1   158    65    99     4     0     0  3304  1366
    0   159    79    99     4     0     0  3304  1177
    1   165    65    99     4     0     0  3304  1366
    0   150    79    99     3     0     0  3304  1177
    1   160    65    99     4     0     0  3304  1366


Output of nvidia-smi dmon when iteration 7 hangs and one GPU goes down

# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0    95    68   100     0     0     0  3304  1189
    1   117    60    99     6     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   120    60    99     6     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   123    60    99     6     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   121    60    99     6     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   116    60    99     6     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   145    64    99     4     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   137    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   141    64    99     4     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   140    64    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   136    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   136    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   133    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   139    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   142    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   141    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   130    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   135    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   136    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   135    63    99     5     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1   132    63    68     2     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1    93    62     1     0     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1    95    62     3     0     0     0  3304  1366
    0    95    68   100     0     0     0  3304  1189
    1    80    61     5     0     0     0  3304  1163
    0    95    68   100     0     0     0  3304  1189
    1    76    60     0     0     0     0  3304  1163
    0    95    68   100     0     0     0  3304  1189
    1    76    60     0     0     0     0  3304  1163
    0    95    68   100     0     0     0  3304  1189
    1    76    60     0     0     0     0  3304  1163
    0    95    68   100     0     0     0  3304  1189
    1    76    60     0     0     0     0  3304  1163
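
As an illustrative sketch only (assuming Python 3 and that nvidia-smi is on the PATH), a dmon trace like the ones above could be captured with wall-clock timestamps, so that the moment one GPU drops to 0 % can be lined up against the RELION iteration log:

import subprocess
import time

# Run "nvidia-smi dmon" and prefix every line it prints with a timestamp.
# Stop with Ctrl-C once the hang has been captured.
proc = subprocess.Popen(["nvidia-smi", "dmon"], stdout=subprocess.PIPE, text=True)
with open("gpu_dmon.log", "w") as log:
    for line in proc.stdout:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write(f"{stamp}  {line}")
        log.flush()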

Commandline

mpirun -n 35 relion_refine_mpi --o Class2D/job074/run --i Extract/job071/particles.star --dont_combine_weights_via_disc --pool 50 --ctf --iter 30 --tau2_fudge 2 --particle_diameter 190 --K 30 --flatten_solvent --zero_mask --strict_highres_exp 15 --oversampling 1 --psi_step 12 --offset_range 3 --offset_step 2 --norm --scale --j 1 --gpu

Output of iterations 6 and 7

 Auto-refine: Estimated accuracy angles= 14.1 degrees; offsets= 6.9 pixels
 CurrentResolution= 11.52 Angstroms, which requires orientationSampling of at least 6.92308 degrees for a particle of diameter 190 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 8640
 OrientationalSampling= 11.25 NrOrientations= 32
 TranslationalSampling= 2 NrTranslations= 9
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 276480
 OrientationalSampling= 5.625 NrOrientations= 256
 TranslationalSampling= 1 NrTranslations= 36
=============================
 Estimated memory for expectation  step > 0.316617 Gb.
 Estimated memory for maximization step > 0.000545025 Gb.
 Expectation iteration 6 of 30
36.85/36.85 min ............................................................~~(,_,">
 Maximization ...
   1/   1 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ... 
   1/   1 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 9.1 degrees; offsets= 5.45 pixels
 CurrentResolution= 10.9964 Angstroms, which requires orientationSampling of at least 6.54545 degrees for a particle of diameter 190 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 8640
 OrientationalSampling= 11.25 NrOrientations= 32
 TranslationalSampling= 2 NrTranslations= 9
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 276480
 OrientationalSampling= 5.625 NrOrientations= 256
 TranslationalSampling= 1 NrTranslations= 36
=============================
 Estimated memory for expectation  step > 0.324049 Gb.
 Estimated memory for maximization step > 0.000579759 Gb.
 Expectation iteration 7 of 30
49.70/50.05 min ...........................................................~~(,_,">

It stops at this point, and no error message is shown.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):


A similar issue has been addressed in commit e5d5e0f (beta version v2.0.b10). Since I haven't been able to reproduce this particular error, I cannot know for sure that it has been fixed. Please reopen this issue if it still persists.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):


It is unlikely that the error is an effect of this particular data alone, but rather of its combination with the specific Relion run settings up to the point where the error was encountered. Hence it is more likely that we can reproduce the error if we start off from a point as close as possible to just before the error. To do this we'll specifically need the file 15jul18a_b_00007gr_00008sq_v01_00002hl_00002en.frames_b_a.mrcs and the output file run_it003_classes.mrcs.

If you could send us these we'll have the issue fixed in no time. Thanks!

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Do you have an MD5 sum for it? Just to make sure we're not getting errors from corrupted files.

I'm downloading it now, so as soon as we know we have it intact, I'll let you know and you can take down the data. You can edit your post to remove the link if you don't want it permanently visible here as well.
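
For what it's worth, a minimal sketch of one way to compute such a checksum (reading in chunks so an 8.6 GB file never has to fit in memory; the plain md5sum command-line tool does the same job, and the file name below is just the one mentioned elsewhere in this thread):

import hashlib

def md5sum(path, chunk_size=8 * 1024 * 1024):
    # Compute the MD5 checksum of a (possibly very large) file in chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("15jul18a_b_00007gr_00008sq_v01_00002hl_00002en.frames_b_a.mrcs"))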

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):


@bforsbe I'm happy to send you the data that is causing this crash, but it's too big (8.6 GB) to post here.
Here it is: https://drive.google.com/open?id=0B1Q2t9VcshTFNHFObktqZC15V1U

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):


To fix this issue we'll need to be able to reproduce the error on our nodes with the proper debugging tools.
The files we'll need are the output files from the iteration previous to the one that crashed (iteration 7), including the data, model, optimiser and sampling star files and the class mrc files. Also, the single mrcs file that contains the erroneous particle:

Extract/job018/frames/15jul18a_b_00007gr_00008sq_v01_00002hl_00002en.frames_b_a.mrcs

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


@attamatti

I can't make sense of why the run crashed based on the attached files, unfortunately. In order to diagnose it I need data so that I can run it myself and see the error in a reproducible manner.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Sure, just fork the repo, commit to your fork and submit a pull-request! Thanks!

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):


OK, but it's phase plate data, so you'll probably need to fix #29 first to reproduce it. Or I can push my local fix if you give me write access.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Sounds like a different issue, albeit with similar errors being reported. I would prefer a separate issue for your error (and, if possible, a reproducible test case which continues from the problematic iteration and contains only the particle which can't be reconciled).

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Dimitry Tegunov (Bitbucket: DTegunov, GitHub: DTegunov):


I think I'm having the same issue during 3D refinement. The particles it complains about look perfectly fine visually, with no abnormal values in either real or Fourier space. After I remove the particles, the current iteration runs through without complaints, but then during the next iteration different particles cause the same problem.

Do you need more data, or do you already have an idea what might be causing this?

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):


Here they are for the crashed dataset (~25K particles) and the smaller one that worked.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


The _model.star and _optimiser.star files might be good as well, mostly from iteration 3; iteration 25 is not that important.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):


I ran it again, this time using only the first 100 micrographs (~2500 particles), and it worked without issues.
Here are the sampling files from iteration 003 of the full dataset (~25000 particles), which locks up, and from iterations 003 and 025 of the truncated dataset, which worked.

I can send along the data files later, but I'm on wifi now.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Matt's observation makes sense, in that you find non-zero differences (in his case min_diff = 5087.93), but once these are converted to weights, all weights are zero. One step in the conversion is the subtraction of the smallest difference. That is, if all differences are identical, then all differences get set to zero and thus the sum of all weights is also 0. This obviously can't be normalised, sorted or ranked in any way, so problems ensue. Having blank classes as input would look exactly like this.
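
As a toy illustration only (this is not RELION's GPU code; it just sketches the failure mode described above, using the adaptive_fraction = 0.999 value that appears in the error dumps): once the total weight is zero, an adaptive significance filter has nothing it can keep, which is exactly what the filteredSize == 0 assertion reports.

import numpy as np

def significant_weights(weights, adaptive_fraction=0.999):
    # Toy adaptive significance filter: keep the largest weights whose
    # cumulative sum reaches adaptive_fraction of the total weight.
    total = weights.sum()
    if total <= 0.0:
        # Degenerate case described above: all weights are zero, so nothing
        # can be normalised, sorted or ranked -> empty selection.
        return np.array([], dtype=int)
    order = np.argsort(weights)[::-1]              # largest weights first
    cumulative = np.cumsum(weights[order]) / total
    n_keep = int(np.searchsorted(cumulative, adaptive_fraction)) + 1
    return order[:n_keep]

# Normal case: a couple of weights dominate, only those are kept.
print(len(significant_weights(np.array([1000.0, 10.0, 1.0, 0.001, 0.001]))))  # 2

# Blank-class case: identical differences -> all weights zero -> nothing kept.
print(len(significant_weights(np.zeros(256))))                                # 0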

So I think there might be an unfortunate set of circumstances causing the previous iteration to not write classes correctly.

Is it possible to post the output from this refinement for the last few iterations before you saw the issue? I probably don't need the mrcs files with the classes, I just need to know if they are blank. But the _sampling.star, _data.star and so on, these files could be useful for fixing the bug.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by craigyk (Bitbucket: craigyk, GitHub: craigyk):


I'll have to find exactly which particle set caused the error in order to hunt it down for you. I have been working with a slightly different particle set and haven't run into the problem again.

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Matt Iadanza (Bitbucket: attamatti, GitHub: attamatti):


I just had a similar error. It happens at the beginning of iteration 4 every time I run it.

exp_fn_img= 000001@Extract/job004/micrographs/micrograph00001.mrcs

 ipart= 0 adaptive_fraction= 0.999
 min_diff2= 5087.93
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
filteredSize == 0
File: /fbs/emsoftware2/LINUX/fbsmi/relion2-beta/src/gpu_utils/cuda_ml_optimiser.cu line: 1552

In thread 0

It freezes up and leaves the job on the GPU, which has to be cleared manually.

I deleted the offending particle from the star file, and it just errors out at the same place (beginning of iteration 4) on the next particle.

 exp_fn_img= 000002@Extract/job004/micrographs/micrograph00001.mrcs

 ipart= 0 adaptive_fraction= 0.999
 min_diff2= 4826.25
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
filteredSize == 0
File: /fbs/emsoftware2/LINUX/fbsmi/relion2-beta/src/gpu_utils/cuda_ml_optimiser.cu line: 1552

In thread 0

Also, the class averages from iteration 3 are just blank (black). The class averages from iteration 2 look as expected.
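
One way to check programmatically that the averages really are blank (a sketch only; it uses the third-party mrcfile Python package, which the thread does not itself mention, and a hypothetical path that should point at the actual _classes.mrcs file of the iteration in question):

import mrcfile
import numpy as np

# Hypothetical path; point this at e.g. run_it003_classes.mrcs of the 2D job.
with mrcfile.open("Class2D/job004/run_it003_classes.mrcs", permissive=True) as mrc:
    classes = np.asarray(mrc.data)
classes = classes.reshape((-1,) + classes.shape[-2:])   # stack of class averages

for i, cls in enumerate(classes, start=1):
    if np.ptp(cls) == 0:   # min == max -> constant (blank) image
        print(f"class {i:3d}: blank, constant value {cls.flat[0]}")
    else:
        print(f"class {i:3d}: min={cls.min():.4g}  max={cls.max():.4g}")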

from relion.

bforsbe avatar bforsbe commented on May 13, 2024

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


It looks like it occurred at the very start of an iteration. Does it behave similarly if you run again, continuing from the last completed iteration? If so, would it be possible for you to make a star file with these 10 particles and construct a very small set of files that can reproduce the error? If you provide this I can have a look fairly easily.
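
In case it helps, a rough sketch (not a full STAR parser) of how such a small test case could be cut out of a RELION 2.x particles star file; the file names and the two image names below are only examples, the image names being taken from the error dumps above:

# Keep only selected particles from a particles star file.
wanted = {
    "000001@Extract/job004/micrographs/micrograph00001.mrcs",
    "000002@Extract/job004/micrographs/micrograph00001.mrcs",
}

image_col = None
with open("particles.star") as src, open("particles_subset.star", "w") as dst:
    for line in src:
        stripped = line.strip()
        if stripped.startswith("_rlnImageName"):
            # Header line looks like "_rlnImageName #6" -> zero-based column 5.
            image_col = int(stripped.split("#")[1]) - 1
            dst.write(line)
        elif stripped == "" or stripped.startswith(("data_", "loop_", "_")):
            dst.write(line)   # copy the rest of the header unchanged
        elif image_col is not None and stripped.split()[image_col] in wanted:
            dst.write(line)   # keep only the wanted particle rows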

from relion.
