empty files in SplitVCF/ (relernn, CLOSED)

kr-colab commented on May 27, 2024
empty files in SplitVCF/

from relernn.

Comments (10)

haotianzh commented on May 27, 2024

Thanks for the clear explanation, I really appreciate it. I continued training the model. I double-checked your code, and I believe training can be done without the files in splitVCFs/; it only takes the files in train/, test/, and vali/ as input. A trained model finally came out! Then I ran ReLERNN_PREDICT and got predictions on my dataset. It fits well! Like this:

chrom	start	end	nSites	recombRate
2L	0	12000	1279	7.011262966537199e-09
2L	12000	24000	1228	6.794673033929348e-09
2L	24000	36000	1300	7.005089610367522e-09
2L	36000	48000	1248	5.483054833838868e-09
2L	48000	60000	1242	7.830751937096838e-09
2L	60000	72000	1326	7.366499054189659e-09
2L	72000	84000	1308	6.544700230987467e-09
2L	84000	96000	1272	6.22089463822116e-09

As for the problem I mentioned above, I agree with you: something must be overwriting the files in splitVCFs/.

jradrion commented on May 27, 2024

Hmm, that's odd. Can you provide the full output from just the ReLERNN_SIMULATE part of example_pipeline.sh?

haotianzh commented on May 27, 2024

Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.

Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible

Simulating with window size = 211000 bp.
Training set:
Simulate...
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.

Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible

Simulating with window size = 211000 bp.
Training set:
Simulate...
Validation set:
Simulate...
Test set:
Simulate...

SIMULATIONS FINISHED!

SANITY CHECK
====================
numSegSites Min Mean Max
Simulated: 367 1033 2232
InputVCF 2L:0-840000: 238 909 1741
InputVCF 2R:0-1669000: 411 1000 1754
InputVCF 3L:0-742000: 143 909 1777
InputVCF 3R:0-1963000: 358 1000 1759
InputVCF X:0-1250000: 127 1000 1720

***ReLERNN_SIMULATE.py FINISHED!***

I got this log when running "bash example_pipeline.sh >> log.out" (Windows 10 64-bit, Nvidia RTX 2080 Super with CUDA 10.2, TensorFlow 2.2.0, cuDNN 7.6.5).

Here are the files listed in splitVCFs/:

$ ls -l example_output/splitVCFs/
total 0
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_2L
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_2R
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_3L
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_3R
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_X
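The same check can be scripted so it is easy to repeat right after each pipeline step. The following is a minimal sketch, not part of ReLERNN: the helper name `find_empty_files` is hypothetical, and the directory path is simply the one from the listing above.

```python
from pathlib import Path


def find_empty_files(directory):
    """Return the sorted names of zero-byte regular files in `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Path taken from the `ls` listing above; adjust it to your own output directory.
target = Path("example_output/splitVCFs")
if target.is_dir():
    empty = find_empty_files(target)
    print(f"{len(empty)} empty file(s) in {target}: {empty}")
else:
    print(f"{target} not found")
```

Running this once immediately after the simulation step and again after later steps would show at which point the files become empty.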

jradrion commented on May 27, 2024

Can you try removing the example_output/ directory and running example_pipeline.sh again? The following part of your output does not make sense:

Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.

Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible

Simulating with window size = 211000 bp.
Training set:
Simulate...
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.

The line "Warning: no demographic history file found. All training data will be simulated under demographic equilibrium." should only ever be printed once. It looks like the code might be running simultaneously as multiple processes. Are you running this on a cluster or a machine where you might somehow be executing ReLERNN_SIMULATE more than once? I'm not sure what could be causing your output.
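One way to confirm how often a given line appears in a captured log is to count it directly. This is a small sketch under stated assumptions: `repeated_lines` is a hypothetical helper, and the sample text only reuses lines quoted in this thread rather than reading a real log file.

```python
from collections import Counter


def repeated_lines(log_text, watch):
    """Count how often each watched line appears in the log text."""
    counts = Counter(line.strip() for line in log_text.splitlines())
    return {w: counts[w] for w in watch}


WARNING = (
    "Warning: no demographic history file found. "
    "All training data will be simulated under demographic equilibrium."
)

# Sample built from the lines quoted above; replace it with the contents of
# your own log, e.g. open("log.out").read().
sample = "\n".join(
    [WARNING, "Training set:", "Simulate...", WARNING, "Training set:", "Simulate..."]
)
print(repeated_lines(sample, [WARNING, "Training set:"]))
```

Note that a log captured with `>>` is appended to rather than overwritten, so lines left over from an earlier or interrupted run could also show up as repeats. Deleting the log file before rerunning would rule that out.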

jradrion commented on May 27, 2024

Also, can you provide the full command you are executing in the console?

andrewkern commented on May 27, 2024

Could it also be multiple threads writing to the same output file?

haotianzh commented on May 27, 2024

Here is another log output.

Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.

Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible

Simulating with window size = 211000 bp.
Training set:
Simulate...
Validation set:
Simulate...
Test set:
Simulate...

SIMULATIONS FINISHED!

SANITY CHECK
====================
numSegSites			Min	Mean	Max
Simulated:			367	1033	2232
InputVCF 2L:0-840000:		238	909	1741
InputVCF 2R:0-1669000:		411	1000	1754
InputVCF 3L:0-742000:		143	909	1777
InputVCF 3R:0-1963000:		358	1000	1759
InputVCF X:0-1250000:		127	1000	1720


***ReLERNN_SIMULATE.py FINISHED!***

I used a single CPU for that, not multiprocessing, and I just followed your instructions like this:

cd example
bash example_pipeline.sh

Everything seems to run correctly, and I do have the simulated files in train/, test/, and vali/. But I am not sure whether it is normal for all the files in splitVCFs/ to be empty, or whether that will affect the training process. For now I am ignoring it and using ReLERNN_TRAIN to train a model on my own dataset, but I don't know if I can still get accurate results.
Do you have any pretrained model that can be used directly to make predictions without training?
By the way, what does the loss mean? For example, does loss=0.11 mean that the difference between the predicted and true recombination rate is 0.11? What is the unit: something like 1e-8, or something else?

jradrion commented on May 27, 2024

Hmm, this is definitely odd. For example, in order to calculate the numbers that are printed below

SANITY CHECK
====================
numSegSites			Min	Mean	Max
Simulated:			367	1033	2232
InputVCF 2L:0-840000:		238	909	1741

the code must first read the files in splitVCFs/ that you are saying are empty. So they must be being overwritten by another process after the fact.

My only suggestion at this point would be to try running ReLERNN_SIMULATE using --nCPU 1 to see if you get the same result.

jradrion commented on May 27, 2024

"I don't know if I can still get accurate results."

No, if the files in splitVCFs/ are empty you will not be able to run ReLERNN_PREDICT, as the predictions are made on those files.

"Do you have any pretrained model that can be use directly to make a prediction without training ?"

No, you have to train on simulated data before making predictions.

"btw, what does loss mean? For example, loss=0.11 does that mean difference between prediction and true recombination rate is 0.11, what is the unit of that. like 1e-8? or something else?"

No, the loss reported during training cannot be directly interpreted as a recombination rate, as the raw rates are normalized before training. You have to run ReLERNN_PREDICT to get the actual recombination rate predictions. The following blog post should better explain loss and loss functions: https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
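To illustrate why a loss computed on normalized targets has no direct unit in terms of recombination rate, here is a minimal sketch. It assumes z-score normalization and a mean squared error loss purely for illustration; the thread does not say which normalization or loss ReLERNN uses, and every number below is hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalize values to zero mean and unit variance (assumed scheme)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


# Hypothetical raw recombination rates, for illustration only.
raw_true = [2e-9, 5e-9, 8e-9, 1.1e-8]
norm_true, mu, sigma = zscore(raw_true)

# Hypothetical predictions made in normalized space.
norm_pred = [z + 0.3 for z in norm_true]
# Undo the normalization to get back to raw rates.
raw_pred = [z * sigma + mu for z in norm_pred]

loss_norm = mse(norm_pred, norm_true)  # what a training log would report
loss_raw = mse(raw_pred, raw_true)     # the same error in raw units
print(loss_norm, loss_raw)
```

Under these assumptions the raw-scale error equals the normalized loss multiplied by `sigma ** 2`, so the number in the training log says nothing about raw rates unless the normalization constants are known.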

jradrion commented on May 27, 2024

OK, sounds good. I'm glad it's working for you. I'm going to go ahead and close this issue now.
