Comments (12)
Hey there -- if you are getting the library error, I would guess something is wrong in your NVIDIA setup. How did you install TF and the CUDA tools?
Also, I'm not sure whether ReLERNN is ready for TF 2.1, but I do know @jradrion has it working on TF 2...
from relernn.
Hi Andrew,
I installed TF 2.1 with pip. According to the README, ReLERNN is tested on TF 2.1, so I thought I'd give it a go and tried to match the dependencies as closely as possible.
Do you think the library error is related to memory usage?
So this memory leak seems to be on the TF side, but the error you report has to do with the NVIDIA tools -- how was CUDA installed on your system?
Not sure; it loads with Python when run on a GPU node.
Okay, so one question for your admins is what happened to nvinfer. It seems to be either installed somewhere TF can't find it, or not installed at all.
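A quick way to check this from the cluster yourself, before going to the admins, is to ask Python's ctypes which NVIDIA libraries the dynamic loader can actually resolve. A minimal sketch (the name stems below are the libraries TF 2.1's startup warnings usually mention; on a machine without CUDA they will all report NOT FOUND):

```python
import ctypes.util

# Ask the dynamic loader (via ldconfig/gcc on Linux) whether each NVIDIA
# library is resolvable. A NOT FOUND here corresponds to a
# "Could not load dynamic library" warning from TensorFlow at startup.
for stem in ("cudart", "cublas", "cudnn", "nvinfer"):
    path = ctypes.util.find_library(stem)
    print(f"lib{stem}: {path or 'NOT FOUND'}")
```

Run this in the same environment (and on the same GPU node) where ReLERNN_TRAIN fails, since the loader path can differ between login and compute nodes.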
I will ask them what's going on.
So, generally, would you say to rather use ReLERNN with TF 1.x for now?
@LZeitler I had not noticed a memory leak issue in my testing of ReLERNN with TF 2.1. However, I tested by running only ~10 epochs for speed, and our machine has a fairly large amount of memory, so it's possible I'll hit this issue when training for more epochs. I will be testing this ASAP.
As for the warning you first describe, I also get it at every epoch. I had seen a comment in this thread about it being spurious and temporarily ignored it, since everything else appeared to be working. However, users are now saying it is connected to the memory leak issue.
The warning Could not load dynamic library 'libnvinfer.so.6' appears to be connected to not having TensorRT installed. TensorRT should not be necessary, but the warning did go away once we installed it.
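For anyone who wants to confirm whether the exact library from that warning is loadable before (or after) installing TensorRT, here is a small check. This is just a sketch; the SONAME is the one TF 2.1 warns about, and on a system without TensorRT the script simply reports failure rather than crashing:

```python
import ctypes

# Try to dlopen the exact SONAME that TF 2.1 complains about. This only
# succeeds if TensorRT's runtime library is on the loader's search path.
try:
    ctypes.CDLL("libnvinfer.so.6")
    print("libnvinfer.so.6: loadable (TensorRT runtime found)")
except OSError as err:
    print(f"libnvinfer.so.6: not loadable ({err})")
```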
I'll do some more testing and report back.
@LZeitler I have not forgotten about you. I'm still debugging a number of issues related to this move to tf2.
@LZeitler I removed multiprocessing from model.fit, and I was able to run a full-sized dataset for >400 epochs without running into any memory issues. Could you pull these changes and reinstall? Please let me know if you are still having issues.
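For readers following along, the shape of that fix looks roughly like this. This is a minimal sketch with a toy model and toy data, not ReLERNN's actual network or generator; the point is only that fit() is driven by an in-process Sequence, with no use_multiprocessing/workers arguments:

```python
import numpy as np
import tensorflow as tf

class ToyBatches(tf.keras.utils.Sequence):
    """Tiny in-process batch generator (a stand-in for ReLERNN's real one)."""
    def __init__(self, n=32, batch=8):
        super().__init__()
        self.x = np.random.rand(n, 4).astype("float32")
        self.y = np.random.rand(n, 1).astype("float32")
        self.batch = batch

    def __len__(self):
        return len(self.x) // self.batch

    def __getitem__(self, i):
        s = slice(i * self.batch, (i + 1) * self.batch)
        return self.x[s], self.y[s]

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# Batches are produced in the main process -- no worker processes, which
# is what sidesteps the memory growth reported in this thread.
history = model.fit(ToyBatches(), epochs=1, verbose=0)
print("loss" in history.history)
```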
@jradrion I pulled and reinstalled.
For now I'm ignoring the warnings related to TensorRT.
However, when testing with the example pipeline, I am now getting some other warnings:
2020-03-02 15:39:32.734313: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_2]]
2020-03-02 15:39:32.734285: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 200 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
Traceback (most recent call last):
  File "/cluster/home/zeitlerl/.local/bin/ReLERNN_TRAIN", line 117, in <module>
    main()
  File "/cluster/home/zeitlerl/.local/bin/ReLERNN_TRAIN", line 107, in main
    gpuID=args.gpuID)
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/ReLERNN/helpers.py", line 371, in runModels
    model.load_weights(network[1])
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 234, in load_weights
    return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 1222, in load_weights
    hdf5_format.load_weights_from_hdf5_group(f, self.layers)
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 699, in load_weights_from_hdf5_group
    K.batch_set_value(weight_value_tuples)
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3323, in batch_set_value
    x.assign(np.asarray(value, dtype=dtype(x)))
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 819, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 1110, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (2610, 256) and (2036, 256) are incompatible
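The Out of range: End of sequence warnings mean the input pipeline was exhausted before the requested steps_per_epoch * epochs batches were consumed. As an illustration of the repeat() suggestion in the warning, here is the pure-Python analogue with toy data (a sketch only; as the next comment explains, the actual fix applied to the example pipeline was increasing the number of training simulations):

```python
from itertools import cycle, islice

def batches(data, size):
    """Yield the dataset once, batch by batch (a finite generator)."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

data = list(range(10))           # toy stand-in for training examples
steps_per_epoch, epochs = 4, 5   # training wants 4 * 5 = 20 batches

# One pass over `data` yields only 4 batches, so training would run dry.
# Cycling the generator (like tf.data's dataset.repeat()) never does:
endless = cycle(batches(data, 3))
drawn = list(islice(endless, steps_per_epoch * epochs))
print(len(drawn))  # 20 batches, enough for every epoch
```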
@jradrion Running on the big dataset works now! Thanks for the fix! The memory issue is also resolved!
Hi @LZeitler, thanks for bringing the issue with the example scripts to my attention. I had to bump up the number of training simulations to avoid this error with a fixed number of epochs. ReLERNN/examples/example_pipeline.sh should be working now. Glad to see you are no longer having memory issues with your big dataset.
I'm going to go ahead and close this issue. Please let me know if you come across any other problems.
Best,
Jeff