Comments (7)
Hi @han-x !
Could you give me the information below, so I can consider this from various perspectives?
- server OS (Ubuntu? OSX? Windows?)
- machine specs (#CPUs, #GPUs, RAM)
- PyTorch version
Thanks!
- server OS: Ubuntu 16.04.7 LTS
- CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
- RAM: 64GB
- GPU: 2080 * 2
- PyTorch version: 1.7.0
from handyrl.
It seems that when running `python main.py --worker` and `python main.py --train-server` on the same machine, the worker process takes up so much memory that this error occurs.
In my last run, the problem appeared after 129400 episodes; the worker process was using about 50GB of memory and the train-server about 10GB.
```
128900 129000 129100
epoch 396
win rate = 0.999 (124.8 / 125)
generation stats = 0.000 +- 0.742
loss = p:-0.000 v:0.012 ent:0.324 total:0.012
updated model(54158)
129200 129300 129400
epoch 397
win rate = 0.991 (125.8 / 127)
generation stats = -0.000 +- 0.742
loss = p:-0.000 v:0.012 ent:0.329 total:0.012
updated model(54282)
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
    data, cnt = conn.recv()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
```
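For context on what this traceback means: `Connection.recv()` raises `EOFError` when the process on the other end of a `multiprocessing` pipe has closed its end (for example, after being killed by the OS when memory runs out). A minimal sketch, unrelated to HandyRL's actual code, that reproduces the mechanism:

```python
import multiprocessing as mp


def _peer(conn):
    conn.send("hello")
    conn.close()  # simulate the peer process closing its end / dying


def demo_eoferror():
    """Return True if recv() raises EOFError once the peer has gone away."""
    parent, child = mp.Pipe()
    p = mp.Process(target=_peer, args=(child,))
    p.start()
    child.close()  # the parent must close its own copy of the child end
    msg = parent.recv()  # receives "hello" normally
    p.join()
    try:
        parent.recv()  # peer is gone -> EOFError, as in the traceback above
        return False
    except EOFError:
        return msg == "hello"
```

So the `EOFError` itself is only a symptom: the interesting question is why the peer process died, which is consistent with the memory growth described above.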
from handyrl.
Thank you very much for your report.
You are training a big neural network, aren't you?
In the current implementation, each Gather process stores all models after the workers start.
Rewriting worker.py so that it stores only the latest model is currently the best way to reduce memory usage.
We are considering making the training scheme selectable, to avoid storing unused models.
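As a rough illustration of the suggested workaround (hypothetical names, not HandyRL's actual worker.py), the idea is to replace an ever-growing model store with a bounded cache that evicts old models:

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` models; evict the oldest on overflow.

    Illustrative sketch only: the point is that memory stays bounded
    instead of growing with every model the process receives.
    """

    def __init__(self, capacity=1):
        self.capacity = capacity
        self._models = OrderedDict()  # model_id -> model object

    def put(self, model_id, model):
        self._models[model_id] = model
        while len(self._models) > self.capacity:
            self._models.popitem(last=False)  # drop the oldest entry

    def get(self, model_id):
        # fall back to the latest model if an evicted id is requested
        if model_id in self._models:
            return self._models[model_id]
        return next(reversed(self._models.values()))
```

With `capacity=1` this keeps only the latest model, which is the behavior the comment above suggests patching into worker.py.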
from handyrl.
Thanks for your reply! But I just used the original GeeseNet...
I will try the method you suggested, thanks again LoL
from handyrl.
Thanks. That's a strange case...
Since the original GeeseNet is about 500KB, storing it for 200 epochs would occupy only about 100MB.
There is nothing wrong with 129k episodes occupying 10GB in the trainer process.
You can increase `compress_steps` if you want to save memory, although it will slow down batch creation and can therefore slow down training as a result.
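If I understand the trade-off correctly, it can be sketched like this (illustrative code with hypothetical helper names, not HandyRL's implementation): episode steps are serialized and compressed in chunks, so larger chunks compress better and use less memory, but reading a single step then requires decompressing a whole chunk, which is the batch-making slowdown mentioned above.

```python
import bz2
import pickle


def compress_episode(steps, compress_steps):
    """Compress episode steps in chunks of `compress_steps` (sketch only)."""
    return [
        bz2.compress(pickle.dumps(steps[i:i + compress_steps]))
        for i in range(0, len(steps), compress_steps)
    ]


def fetch_step(chunks, index, compress_steps):
    """Reading one step decompresses its whole chunk: bigger chunks mean a
    smaller memory footprint but more decompression work per sample."""
    chunk = pickle.loads(bz2.decompress(chunks[index // compress_steps]))
    return chunk[index % compress_steps]
```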
from handyrl.
I have noticed a possible cause from your stack traces. Are you using the code from the current master branch? I think there are some differences between your script and the script on the master branch.
A similar error happened before, and we solved it in #145.
Could you check, and update your code if you are using an old version? Thanks.
from handyrl.