hqucms / weaver-core



Streamlined neural network training.

License: MIT License

Python 100.00%

weaver-core's People

Contributors

Stargazers

Watchers

Forkers

rgerosa neoh1 de-cristo pkufudawei zeusmail sohambhattacharya ekoenig4 selvaggi stephenchao matteomalucchi doloresgarcia colizz lgray lyazj richa2710 gouskos sscruz alexedmundmay lhprojects swuchterl dickychant akanugan ryanliu30 matt-komm senphy sunwayihep jialin-guo1 chengjiang123 ucsd-hep-ex nikoladze yuxdpku zyt1024 nnu-liang komaltauqeer qibin2020 sam-cal abhigyan-ach elodhi hafizabc77 celia-lo akobert yuvalkay nswood zichunhao

weaver-core's Issues

Unexpected error raised while using weaver DataLoader

I am using the weaver DataLoader to load the JetClass dataset (I used the train_load function in train.py). However, the following error was raised when I launched a new run after a previous run had finished in the same notebook:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f20f854e4d7 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f20f851836b in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f20f85f2fa8 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9c37 (0x7f207a1b2c37 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccec6 (0x7f20f8aeaec6 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f20f8533e77 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f20f852c69e in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f20f852c7b9 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x752478 (0x7f20f8d70478 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f20f8d70805 in /home/ryanliu/.conda/envs/weaver/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12a067 (0x5580c8b99067 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #11: <unknown function> + 0x18be85 (0x5580c8bfae85 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #12: <unknown function> + 0x120928 (0x5580c8b8f928 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #13: <unknown function> + 0x1d1b3e (0x5580c8c40b3e in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #14: _PyObject_GC_NewVar + 0x245 (0x5580c8b86445 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #15: PyTuple_New + 0x117 (0x5580c8b8bfd7 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #16: <unknown function> + 0x12b11b (0x5580c8b9a11b in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #17: <unknown function> + 0x12ae17 (0x5580c8b99e17 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #18: <unknown function> + 0x12b183 (0x5580c8b9a183 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #19: <unknown function> + 0x12adec (0x5580c8b99dec in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #20: <unknown function> + 0x12b233 (0x5580c8b9a233 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #21: <unknown function> + 0x12adec (0x5580c8b99dec in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #22: <unknown function> + 0x1d53c8 (0x5580c8c443c8 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #23: <unknown function> + 0x1e88c2 (0x5580c8c578c2 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #24: <unknown function> + 0x13cd4b (0x5580c8babd4b in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x4d1d (0x5580c8ba127d in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #28: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #30: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #32: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #34: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #36: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #37: <unknown function> + 0x13cee4 (0x5580c8babee4 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #38: _PyObject_CallMethodIdObjArgs + 0x16f (0x5580c8bba7af in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #39: PyImport_ImportModuleLevelObject + 0x551 (0x5580c8bb9ab1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x3981 (0x5580c8b9fee1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #41: <unknown function> + 0x1d5852 (0x5580c8c44852 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #42: PyEval_EvalCode + 0x87 (0x5580c8c44797 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #43: <unknown function> + 0x1dcde0 (0x5580c8c4bde0 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #44: <unknown function> + 0x13d934 (0x5580c8bac934 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x5c81 (0x5580c8ba21e1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #46: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x4d1d (0x5580c8ba127d in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #48: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x735 (0x5580c8b9cc95 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #50: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #52: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x332 (0x5580c8b9c892 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #54: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #55: <unknown function> + 0x13cee4 (0x5580c8babee4 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x16f (0x5580c8bba7af in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #57: PyImport_ImportModuleLevelObject + 0x551 (0x5580c8bb9ab1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3981 (0x5580c8b9fee1 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #59: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #61: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x13d0 (0x5580c8b9d930 in /home/ryanliu/.conda/envs/weaver/bin/python)
frame #63: _PyFunction_Vectorcall + 0x6f (0x5580c8bac73f in /home/ryanliu/.conda/envs/weaver/bin/python)

The same error message was repeated eight times, which matches the number of workers I was using, so I suspect it is related to multiprocessing.

dot symbol in ROOT variable names

Hello Huilin,

In my ROOT files, some features are stored in a custom class named hits, so ROOT automatically names the variables belonging to this class hits.xxx (see screenshot). The period seems to cause problems when I want to filter events or define new variables from them. Is there an option to escape the . from parsing and pass the name as is?

Here is part of my data configuration file and the error message.

[2023-06-07 06:37:20,319] ERROR: When reading file ../data/fe_13.root:
[2023-06-07 06:37:20,320] ERROR: Traceback (most recent call last):
  File "/Users/xpzhang/IHEPBox/Work/code/ml/ParNet_tuto/weaver/utils/data/fileio.py", line 76, in _read_files
    a = _read_root(filepath, branches, load_range=load_range, treename=kwargs.get('treename', None))
  File "/Users/xpzhang/IHEPBox/Work/code/ml/ParNet_tuto/weaver/utils/data/fileio.py", line 45, in _read_root
    outputs = tree.arrays(branches, namedecode='utf-8', entrystart=start, entrystop=stop)
  File "/Users/xpzhang/mambaforge/envs/weaver/lib/python3.7/site-packages/uproot3/tree.py", line 537, in arrays
    branches = list(self._normalize_branches(branches, awkward0))
  File "/Users/xpzhang/mambaforge/envs/weaver/lib/python3.7/site-packages/uproot3/tree.py", line 895, in _normalize_branches
    raise ValueError("cannot interpret branch {0} as a Python type\n in file: {1}".format(repr(branch.name), self._context.sourcepath))
ValueError: cannot interpret branch b'hits' as a Python type
 in file: ../data/fe_13.root
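The dot causes trouble because weaver's selection and new-variable definitions appear to be written as Python-like expressions. In Python syntax, hits.x means "attribute x of a variable named hits", not a variable literally called hits.x. The traceback above fits this reading: only hits seems to have reached uproot3 as a branch name. Below is a minimal sketch of one possible workaround. It assumes the dotted variables can be exposed under underscore aliases before any expression is evaluated. sanitize_names and the hits.* names are hypothetical illustrations, not an existing weaver option.

```python
import ast
import re


def sanitize_names(expr, names):
    """Rewrite each dotted variable name in expr to an underscore alias.

    Returns the rewritten expression and a mapping {original: alias}.
    """
    mapping = {name: name.replace(".", "_") for name in names}
    for name in names:
        # Match the whole name only: no word character or dot directly before
        # or after it, so 'hits.n' is not rewritten inside 'hits.nhits'.
        pattern = r"(?<![\w.])" + re.escape(name) + r"(?![\w.])"
        expr = re.sub(pattern, mapping[name], expr)
    return expr, mapping


# Without renaming, the parser treats 'hits.x' as attribute access on 'hits'.
tree = ast.parse("hits.x > 0", mode="eval")
print(type(tree.body.left).__name__)  # Attribute

expr, mapping = sanitize_names("(hits.x > 0) & (hits.nhits >= 3)", ["hits.x", "hits.nhits"])
print(expr)  # (hits_x > 0) & (hits_nhits >= 3)

# Evaluate against the renamed variables (plain scalars here for simplicity).
values = {"hits.x": 1.5, "hits.nhits": 4}
env = {mapping[k]: v for k, v in values.items()}
print(eval(expr, {}, env))  # True
```

The lookbehind and lookahead match whole names only, so names that share a prefix do not interfere with each other.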

A problem running the weaver on M2 Max

Hi,

I'm trying to test training the Particle Transformer (ParT) on an M2 Max. I've seen that recent versions of weaver support the M1, but I don't know whether the M2 should work too. Here's what I get when running the training with the GPU enabled ('--gpus 0'):

[2023-07-11 17:44:45,693] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:44:45,693] INFO: Number of parameters: 2.14 M
[2023-07-11 17:44:45,693] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:44:45,756] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174444
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 748, in _main
    model = orig_model.to(dev)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

real 0m1.443s
user 0m2.215s
sys 0m4.826s

But when I switch the GPU off with '--gpus ""', I get a different error:

[2023-07-11 17:40:09,773] INFO: Computational complexity: 632.51 MMac
[2023-07-11 17:40:09,773] INFO: Number of parameters: 2.14 M
[2023-07-11 17:40:09,773] INFO: Using loss function CrossEntropyLoss() with options {}
[2023-07-11 17:40:09,820] INFO: Create Tensorboard summary writer with comment test_ParT_Wtag_v1_run1_20230711_174008
[2023-07-11 17:40:09,872] INFO: Optimizer options: {}
[2023-07-11 17:40:09,874] INFO: --------------------------------------------------
[2023-07-11 17:40:09,874] INFO: Epoch #0 training
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/weaver/bin/weaver", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 931, in main
    _main(args)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/train.py", line 784, in _main
    train(model, loss_func, opt, scheduler, train_loader, dev, epoch,
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/nn/tools.py", line 45, in train_classification
    for X, y, _ in tq:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/homebrew/anaconda3/envs/weaver/lib/python3.10/site-packages/weaver/utils/data/config.py", line 204, in __getattr__
    return self.options[name]
KeyError: '__getstate__'

real 0m1.322s
user 0m2.492s
sys 0m4.570s
The same code works fine on lxslc7 at the IHEP cluster. Has anyone come across a similar error before? Can someone help with this? Thanks.

Cheers,
Abdualazem.
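The two tracebacks point to two separate problems. The first is expected: the macOS build of PyTorch has no CUDA support. Apple GPUs are exposed through PyTorch's "mps" backend instead.

The second comes from how the DataLoader workers are started. On macOS, Python's multiprocessing defaults to the "spawn" start method (since Python 3.8), so the dataset, including its data config object, has to be pickled. On Python 3.10, pickle looks up __getstate__ through ordinary attribute access, and that lookup falls through to a custom __getattr__. If that method raises KeyError instead of AttributeError, pickling fails exactly as shown. The lxslc7 run presumably uses the Linux default "fork", which does not pickle the dataset, so the problem stays hidden there.

The minimal sketch below reproduces the pattern. BrokenConfig and FixedConfig are hypothetical classes that only mimic the __getattr__ visible in the traceback; this is not weaver's actual DataConfig code.

```python
import pickle


class BrokenConfig:
    """Same __getattr__ shape as in the traceback: unknown names raise KeyError."""

    def __init__(self, **options):
        self.options = options

    def __getattr__(self, name):
        return self.options[name]


class FixedConfig:
    """Unknown names raise AttributeError, which pickle and getattr() expect."""

    def __init__(self, **options):
        self.options = options

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails. Reading 'options' via
        # __dict__ avoids re-entering __getattr__ on a half-built instance
        # (during unpickling, 'options' is restored only after __setstate__ is looked up).
        options = self.__dict__.get("options", {})
        try:
            return options[name]
        except KeyError:
            raise AttributeError(name) from None


def roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


cfg = roundtrip(FixedConfig(treename="Events"))
print(cfg.treename)  # Events

try:
    roundtrip(BrokenConfig(treename="Events"))
except (KeyError, RecursionError) as exc:
    # Python 3.10 fails in dumps with KeyError('__getstate__'); newer versions
    # define object.__getstate__ but can then recurse in loads on 'options'.
    print("pickling failed:", type(exc).__name__)
```

Raising AttributeError for unknown names lets pickle fall back to copying the instance __dict__, so the object can be sent to spawned worker processes.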

Question about the "wrap" padding mode

Dear Huilin,

Sorry to bother you and thanks a lot for this handy package!

I have a question about the "wrap" padding mode in the data loader. If I understand correctly, given a jagged array such as:
array = [ [1,2,3], [4,5] ]

Applying "wrap" padding to bring the array to shape (2, 6), i.e. 6 elements per row, should give:
array_pad = [ [1,2,3,1,2,3], [4,5,4,5,4,5] ]

In the code, I think the "wrap" padding is performed by the _repeat_pad function, and padding the array above with it gives:
array_pad = [ [1,2,3,4,5,1], [4,5,4,5,1,2] ]

Have I misunderstood the concept of "wrap" padding, or is this a bug in the implementation?

Thank you in advance and I look forward to your reply!

Best regards,
Ang
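Both behaviours in the question can be reproduced with plain lists.

- wrap_pad is a minimal sketch of the per-row "wrap" semantics the question expects: each row cycles through its own values.
- observed_repeat_pad is a hypothetical reconstruction that yields exactly the _repeat_pad output quoted above. It tiles the flattened values of all rows across the padded block, then puts each row's own values back at the front. It is inferred from that output alone and is not weaver's actual code.

```python
def wrap_pad(rows, maxlen, fill=0):
    """Pad (or truncate) each row to maxlen by cycling through that row's own values."""
    out = []
    for row in rows:
        if not row:
            out.append([fill] * maxlen)  # nothing to wrap: fall back to a constant
        else:
            out.append([row[i % len(row)] for i in range(maxlen)])
    return out


def observed_repeat_pad(rows, maxlen, fill=0):
    """Hypothetical reconstruction matching the _repeat_pad output quoted in the issue."""
    flat = [x for row in rows for x in row]
    if not flat:
        return [[fill] * maxlen for _ in rows]
    # Tile the values of *all* rows over the whole (len(rows), maxlen) block ...
    tiled = [flat[i % len(flat)] for i in range(len(rows) * maxlen)]
    out = []
    for r, row in enumerate(rows):
        padded = tiled[r * maxlen:(r + 1) * maxlen]
        # ... then put each row's own values back at the front.
        kept = row[:maxlen]
        padded[:len(kept)] = kept
        out.append(padded)
    return out


array = [[1, 2, 3], [4, 5]]
print(wrap_pad(array, 6))             # [[1, 2, 3, 1, 2, 3], [4, 5, 4, 5, 4, 5]]
print(observed_repeat_pad(array, 6))  # [[1, 2, 3, 4, 5, 1], [4, 5, 4, 5, 1, 2]]
```

The key difference is where the padding values come from. In the reconstruction they come from the flattened values of all rows, so padded entries of one jet can be copied from another jet. In wrap_pad they only ever come from the same row.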

Dataloader stuck with uproot>=5.2.0 and num-workers>0

Possible data process in dataset before cut by length limit

Hi, Huilin

As I understand it, the number of points fed to the training is limited by pf_points.length or pf_features.length, and any points beyond these limits are discarded. In some applications, however, the original data are stored in a specific order that makes a direct cut unsuitable, so it may be desirable to preprocess the points first to remove this bias. Two possible approaches are:

Shuffle the order of the data points (coordinates, features, masks, etc.);

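import torch

def shuffle_data(data):
    # data has shape (batch, features, points); one random permutation of the
    # points axis is drawn and applied to every event and feature in the batch
    random_indices = torch.randperm(data.size(2), device=data.device)
    data = data[:, :, random_indices]
    return data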

Sort the data based on the value of a particular feature variable (ascending or descending).

def sort_by_feature(data, i, descending=False):
    # data has shape (batch, features, points); i is the index of the feature to sort by
    sorted_indices = torch.argsort(data[:, i, :], dim=1, descending=descending)
    # gather along the points axis (dim=2) so that every feature is reordered consistently
    data = torch.gather(data, 2, sorted_indices.unsqueeze(1).expand(-1, data.size(1), -1))
    return data

It would be nice if options for this preprocessing could be added to the data config file. I'm not sure whether I've made myself clear, or whether this feature is easy to implement. Thanks a lot.
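Whichever option is chosen, the same reordering has to be applied to every per-point array of an event (coordinates, features, masks) before the length cut. Otherwise the points stop lining up across arrays.

A minimal per-event sketch in plain Python follows. It assumes each event is a dict of equally long per-point lists. reorder_and_truncate and the part_* variable names are hypothetical and are not part of weaver's data config.

```python
import random


def reorder_and_truncate(event, length, sort_by=None, descending=True, rng=None):
    """Reorder all per-point arrays of one event with a single index order, then keep `length` points.

    event:   dict mapping variable name -> list of per-point values (all the same size)
    sort_by: variable to sort on; if None, the points are shuffled instead
    rng:     optional random.Random instance for reproducible shuffling
    """
    sizes = {len(values) for values in event.values()}
    if len(sizes) != 1:
        raise ValueError("all per-point arrays must have the same, non-empty set of sizes")
    n = sizes.pop()
    if sort_by is None:
        order = list(range(n))
        (rng or random).shuffle(order)
    else:
        key = event[sort_by]
        order = sorted(range(n), key=lambda i: key[i], reverse=descending)
    order = order[:length]  # the length cut happens only after reordering
    # Apply the same order to every variable so the points stay aligned.
    return {name: [values[i] for i in order] for name, values in event.items()}


event = {"part_pt": [1.0, 5.0, 3.0, 2.0], "part_mask": [1, 1, 1, 1], "part_id": [10, 50, 30, 20]}
print(reorder_and_truncate(event, 2, sort_by="part_pt"))
# {'part_pt': [5.0, 3.0], 'part_mask': [1, 1], 'part_id': [50, 30]}
```

Sorting makes the cut keep the highest (or lowest) values of the chosen variable. Shuffling makes it keep a random subset.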

Problem in _clip

Hi!

I am using weaver-core to run on a sample of flat ntuples (no point clouds), so my data are scalars, not vectors.

weaver crashes when using automatic scaling. The reason is that the values are numpy arrays dressed up as awkward arrays: the function _clip in utils/data/tools.py calls flatten on the awkward array, which does not work for 1-dim arrays and raises an exception.

One could argue that flatten should be a no-op on 1-dim arrays, but that's not the case.

I suggest adding a.ndim == 1 to the if statement in _clip, since flattening is not needed in that case:

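import awkward as ak
import numpy as np

def _clip(a, a_min, a_max):
    if isinstance(a, np.ndarray) or a.ndim == 1:  # <====== proposed change
        return np.clip(a, a_min, a_max)
    else:
        # jagged array: clip the flattened values, then restore the per-event structure
        return ak.unflatten(np.clip(ak.to_numpy(ak.flatten(a)), a_min, a_max), ak.num(a))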

If you prefer, I could also send you a PR.

Cheers, Dietrich
