I'm trying to train a DenseNet, initialized with this package, on the COVID-19 dataset. The error below implies I'm running out of shared memory, but my batch size is only 4 and I set num_workers to 1, so the workers shouldn't need much. Oddly, it also fails right at the beginning of training, before a single batch completes.
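For reference, here's a minimal sketch of how the DataLoader is set up. The dataset class below is a self-contained placeholder standing in for my real COVID-19 image dataset, but batch_size and num_workers match the values above:

import torch
from torch.utils.data import DataLoader, Dataset

class CovidDataset(Dataset):
    """Placeholder for my real COVID-19 image dataset."""
    def __len__(self):
        return 64  # 16 batches of 4, matching the progress bar below

    def __getitem__(self, idx):
        # The real dataset returns a preprocessed image tensor and a label.
        return torch.zeros(3, 224, 224), 0

loader = DataLoader(
    CovidDataset(),
    batch_size=4,   # the batch size mentioned above
    num_workers=1,  # the single worker process that gets killed
    shuffle=True,
)

And here's the full output I get: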
Epoch 1: 0%| | 0/16 [00:00<?, ?it/s]Begin training...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 814) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/ssd001/home/ptorabi/dev/dlt/dlt/experiments/baseline1.py", line 112, in <module>
    fit_function_kwargs={}
  File "/scratch/ssd001/home/ptorabi/dev/dlt/dlt/commons/train.py", line 124, in fit
    for batch_index, batch in enumerate(dataloader):
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/h/ptorabi/.anaconda3/envs/torch/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 814) exited unexpectedly
Epoch 1: 0%| | 0/16 [00:00<?, ?it/s]
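Since the error itself points at shared memory, here's a sketch of the two sanity checks I plan to run (reusing the placeholder CovidDataset from the sketch above; I haven't confirmed either explains the crash): check how much space /dev/shm actually has on this node, and rerun with num_workers=0, which loads batches in the main process and avoids worker shared-memory IPC entirely.

import shutil
from torch.utils.data import DataLoader

# How much space does /dev/shm actually have on this node?
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")

# Does the error go away with no worker processes at all?
loader = DataLoader(CovidDataset(), batch_size=4, num_workers=0)

Is raising the shared memory limit really the right fix here, or can something else cause a bus error in a worker on the very first batch?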