Comments (10)
Thanks for the explicit explanation; I really appreciate it. I continued training the model and double-checked your code, and I believe training can be done without the files in splitVCFs/; it only takes the files in train/, test/, and vali/ as input. A trained model finally came out! Then I ran ReLERNN_PREDICT and got predictions on my dataset. They fit well, like this:
chrom start end nSites recombRate
2L 0 12000 1279 7.011262966537199e-09
2L 12000 24000 1228 6.794673033929348e-09
2L 24000 36000 1300 7.005089610367522e-09
2L 36000 48000 1248 5.483054833838868e-09
2L 48000 60000 1242 7.830751937096838e-09
2L 60000 72000 1326 7.366499054189659e-09
2L 72000 84000 1308 6.544700230987467e-09
2L 84000 96000 1272 6.22089463822116e-09
As for the problem I mentioned above, I have the same idea as you: something must be overwriting the files in splitVCFs/.
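For anyone who wants to work with a prediction table like the one above, here is a minimal sketch that parses the whitespace-delimited rows and computes a window-length-weighted mean rate. The column names come from the header shown in this comment; the helper names `read_predictions` and `mean_rate` are hypothetical and not part of ReLERNN.

```python
import io

# Three rows copied from the prediction output above (whitespace-delimited).
SAMPLE = """chrom start end nSites recombRate
2L 0 12000 1279 7.011262966537199e-09
2L 12000 24000 1228 6.794673033929348e-09
2L 24000 36000 1300 7.005089610367522e-09
"""


def read_predictions(handle):
    """Parse a header line plus data rows into a list of typed dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rec = dict(zip(header, fields))
        rows.append({
            "chrom": rec["chrom"],
            "start": int(rec["start"]),
            "end": int(rec["end"]),
            "nSites": int(rec["nSites"]),
            "recombRate": float(rec["recombRate"]),
        })
    return rows


def mean_rate(rows):
    """Mean recombination rate, weighted by window length in bp."""
    total_len = sum(r["end"] - r["start"] for r in rows)
    weighted = sum(r["recombRate"] * (r["end"] - r["start"]) for r in rows)
    return weighted / total_len


rows = read_predictions(io.StringIO(SAMPLE))
print(len(rows), "windows, mean rate:", mean_rate(rows))
```

To read a real output file, pass an open file handle instead of the `io.StringIO` wrapper.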
from relernn.
Hmm, that's odd. Can you provide the full output from just the ReLERNN_SIMULATE part of example_pipeline.sh?
from relernn.
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.
Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible
Simulating with window size = 211000 bp.
Training set:
Simulate...
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.
Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible
Simulating with window size = 211000 bp.
Training set:
Simulate...
Validation set:
Simulate...
Test set:
Simulate...
SIMULATIONS FINISHED!
SANITY CHECK
====================
numSegSites Min Mean Max
Simulated: 367 1033 2232
InputVCF 2L:0-840000: 238 909 1741
InputVCF 2R:0-1669000: 411 1000 1754
InputVCF 3L:0-742000: 143 909 1777
InputVCF 3R:0-1963000: 358 1000 1759
InputVCF X:0-1250000: 127 1000 1720
***ReLERNN_SIMULATE.py FINISHED!***
I got this log when running "bash example_pipeline.sh >> log.out" (Windows 10 64-bit, NVIDIA RTX 2080 Super with cuda/10.2, tensorflow/2.2.0, cudnn/7.6.5).
Here are the files listed in splitVCFs/:
$ ls -l example_output/splitVCFs/
total 0
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_2L
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_2R
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_3L
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_3R
-rw-r--r-- 1 zht 197121 0 10/15 21:30 example_X
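The same check as the listing above (every file has size 0) can be scripted. This is only a sketch: the helper name `empty_files` is hypothetical, and the directory path is the one shown in this thread.

```python
from pathlib import Path


def empty_files(directory):
    """Return the sorted names of zero-byte regular files in `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Path taken from the listing above; nothing is printed if it does not exist.
target = Path("example_output/splitVCFs")
if target.is_dir():
    for name in empty_files(target):
        print("empty:", name)
```

Running it right after ReLERNN_SIMULATE and again after the later pipeline steps would show at which point the files become empty.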
from relernn.
Can you try removing the example_output/ directory and running example_pipeline.sh again? The following part of your output does not make sense:
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.
Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible
Simulating with window size = 211000 bp.
Training set:
Simulate...
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.
The line "Warning: no demographic history file found. All training data will be simulated under demographic equilibrium." should only ever be printed once. It looks like the code might be running simultaneously as multiple processes. Are you running this on a cluster or a machine where you might somehow be executing ReLERNN_SIMULATE more than once? I'm not sure what could be causing your output.
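A quick way to confirm how many times that warning appears in a captured log is to count the lines that start with it. The sketch below uses a shortened sample of the output quoted above; the helper name `count_marker` is hypothetical. One thing worth ruling out when reading the count: the log in this thread was captured with `>>`, which appends to log.out, so output left over from an earlier or interrupted run would also show up as a repeat.

```python
MARKER = "Warning: no demographic history file found."


def count_marker(log_text, marker=MARKER):
    """Count the log lines that begin with `marker`."""
    return sum(1 for line in log_text.splitlines() if line.startswith(marker))


# Shortened sample of the duplicated output quoted in this thread.
sample_log = "\n".join([
    "Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.",
    "Simulating with window size = 211000 bp.",
    "Training set:",
    "Simulate...",
    "Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.",
    "Simulating with window size = 211000 bp.",
])

runs = count_marker(sample_log)
print(f"marker printed {runs} time(s)")
```

A count above 1 in a log from a single fresh run would support the idea that ReLERNN_SIMULATE is being started more than once.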
from relernn.
Also, can you provide the full command you are executing in the console?
from relernn.
Could it also be multiple threads writing to the same output file?
from relernn.
Here is another log output.
Warning: no demographic history file found. All training data will be simulated under demographic equilibrium.
Accessibility mask found: calculating the proportion of the genome that is masked...
1.3% of genome inaccessible
Simulating with window size = 211000 bp.
Training set:
Simulate...
Validation set:
Simulate...
Test set:
Simulate...
SIMULATIONS FINISHED!
SANITY CHECK
====================
numSegSites Min Mean Max
Simulated: 367 1033 2232
InputVCF 2L:0-840000: 238 909 1741
InputVCF 2R:0-1669000: 411 1000 1754
InputVCF 3L:0-742000: 143 909 1777
InputVCF 3R:0-1963000: 358 1000 1759
InputVCF X:0-1250000: 127 1000 1720
***ReLERNN_SIMULATE.py FINISHED!***
I used a single CPU for that, not multiprocessing, and I just followed your instructions like this:
cd example
bash example_pipeline.sh
Everything seems to be running correctly, and I do have the simulated files in train/, test/, and vali/. But I am not sure whether it is normal for all the files in splitVCFs/ to be empty, or whether that will affect the training process. For now, I am ignoring that and using ReLERNN_TRAIN to train a model on my own dataset, but I don't know if I can still get accurate results.
Do you have any pretrained model that can be used directly to make predictions without training?
By the way, what does loss mean? For example, does loss=0.11 mean that the difference between the predicted and true recombination rate is 0.11? What is the unit of that, like 1e-8 or something else?
from relernn.
Hmm, this is definitely odd. For example, in order to calculate the numbers that are printed below
SANITY CHECK
====================
numSegSites Min Mean Max
Simulated: 367 1033 2232
InputVCF 2L:0-840000: 238 909 1741
the code must first read the files in splitVCFs/ that you are saying are empty. So they must be getting overwritten by another process after the fact.
My only suggestion at this point would be to try running ReLERNN_SIMULATE using --nCPU 1 to see if you get the same result.
from relernn.
I don't know if I can still get accurate results.
No, if the files in splitVCFs/ are empty you will not be able to run ReLERNN_PREDICT, as the predictions are made on those files.
Do you have any pretrained model that can be used directly to make predictions without training?
No, you have to train on simulated data before making predictions.
By the way, what does loss mean? For example, does loss=0.11 mean that the difference between the predicted and true recombination rate is 0.11? What is the unit of that, like 1e-8 or something else?
No, the loss reported during training cannot be directly interpreted as a recombination rate, as the raw rates are normalized before training. You have to run ReLERNN_PREDICT to get the actual recombination rate predictions. The following blog post should better explain loss and loss functions: https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
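To illustrate why a loss computed on normalized targets carries no recombination-rate unit, here is a small sketch. It assumes z-score normalization and a mean-squared-error loss purely for illustration; the thread does not say which normalization or loss ReLERNN uses, and the rates and prediction offsets below are made up.

```python
import statistics

# Hypothetical raw recombination rates (per bp per generation).
raw_rates = [7.0e-09, 6.8e-09, 5.5e-09, 7.8e-09]

rate_mean = statistics.fmean(raw_rates)
rate_std = statistics.pstdev(raw_rates)


def normalize(rate):
    """Z-score a raw rate (an assumed scheme, for illustration only)."""
    return (rate - rate_mean) / rate_std


def denormalize(z):
    """Map a normalized value back to raw rate units."""
    return z * rate_std + rate_mean


targets = [normalize(r) for r in raw_rates]

# Hypothetical network outputs: each off by 0.1 on the normalized scale.
preds = [t + 0.1 for t in targets]

# Mean squared error on the normalized scale: unitless, about 0.01 here.
mse = statistics.fmean((p - t) ** 2 for p, t in zip(preds, targets))

# The same error expressed in raw units is 0.1 standard deviations.
raw_error = abs(denormalize(preds[0]) - raw_rates[0])

print("normalized MSE:", mse)
print("raw error:", raw_error, "=", raw_error / rate_std, "standard deviations")
```

Under these assumptions, the loss value only says how far predictions are from targets in normalized units; converting back to raw rates requires the mean and standard deviation used for normalization.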
from relernn.
OK, sounds good. I'm glad it's working for you. I'm going to go ahead and close this issue now.
from relernn.