alvinyh / faster-voxelpose
Official implementation of Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection
License: MIT License
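Dear author, thank you for your contribution to voxel-based 3D human pose estimation. I have a question:
After installing Python 3.8 and the packages in requirements (some pinned versions were incompatible, so I replaced them), I ran train.py. After printing "the bounding box isn't large sufficiently" more than 11 times, the program exited and training stopped. Stepping through with breakpoints, I found that it stops in the dataloader. Do you know what the problem might be? My GPU is a 2080 Ti 12G.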
The problem is here:
Epoch: 10
Epoch: [10][0/13] Time: 0.613s (0.613s) Speed: 39.1 samples/s Data: 0.420s (0.420s) Loss: nan (nan) Loss_2d: 0.0019247 (0.0019247) Loss_1d: nan (nan) Loss_bbox: 0.027619 (0.027619) Loss_joint: nan (nan) Memory 0.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
Every training epoch looks like this. How should I deal with it?
I hope to get an answer. Thank you!
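When a log like the one above is all there is to go on, it can help to see which loss terms are non-finite and which are still healthy. The helper below is a small sketch that parses one log line in the format shown in this issue; the function name and regular expression are illustrative and not part of the repository.

```python
import math
import re

# Matches entries such as "Loss_1d: nan (nan)" -> (name, current value, running average).
LOSS_PATTERN = re.compile(r"(Loss\w*):\s*([^\s()]+)\s*\(([^\s()]+)\)")


def non_finite_losses(log_line):
    """Return the names of loss terms whose current value is NaN or Inf."""
    bad = []
    for name, current, _average in LOSS_PATTERN.findall(log_line):
        if not math.isfinite(float(current)):
            bad.append(name)
    return bad


line = ("Epoch: [10][0/13] Time: 0.613s (0.613s) Speed: 39.1 samples/s "
        "Data: 0.420s (0.420s) Loss: nan (nan) Loss_2d: 0.0019247 (0.0019247) "
        "Loss_1d: nan (nan) Loss_bbox: 0.027619 (0.027619) "
        "Loss_joint: nan (nan) Memory 0.0")
bad_terms = non_finite_losses(line)
print(bad_terms)  # ['Loss', 'Loss_1d', 'Loss_joint']
```

For this line, Loss_2d and Loss_bbox are finite while Loss_1d and Loss_joint are not, which narrows where to look but does not by itself identify the cause.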
Thank you for sharing such beautiful code. I have a question. In the README I can find how to use train, which trains the model, and evaluate, which evaluates it. But how do I run prediction? For example, once I have trained the model, I would like to feed three new Campus-style photos through it to obtain a fused 3D result. How should I do this? Thank you!
My GPU is a 3090, so I cannot use the Torch 1.4.0 + CUDA 10.1 environment mentioned by the author. I am using Torch 1.11.0 + CUDA 11.3 instead, which is the lowest version that supports the 30 series. I set gpus to "0" in the configuration file and num_image to 2K, and at runtime I hit the problem shown in the figure above. It is triggered by the save_debug_2d_images function, and the error message points to a location inside a library.
I hope someone can help me solve this problem. Thank you very much.
Hello authors,
Thank you for sharing the code of this amazing work. I am using your framework to solve a multi-person, multi-camera problem in a real-time environment. I wanted to ask whether we need to train the model again on our own data after setting up the environment, or whether we can use the pre-trained model and test it on the new data.
It would be really helpful if you could help me out in this aspect.
Thank you for your time.
Thank you very much for your perfect work.
When I tried to test your code, I could not find model_best.pth.tar. Could you please provide the model for testing?
Thanks.
Hi, thank you for this great work.
I'm trying to run your model on my own data, but I cannot run the Shelf/Campus pretrained model because there is no pretrained backbone for it. The backbone for Panoptic has 15 keypoints but Shelf/Campus has 17 keypoints, so I get an error if I try to load it under the Shelf/Campus config. Could you upload it, please?
How long does it need for training?
Thanks!
Hi, I have been trying to train the model.
At first I tried the repo as is and I was getting this bug.
Then I removed the part that saves the images (note that it works fine when evaluating), and the process was killed for running out of RAM with the default 10k samples. I reduced that to 1-5k and managed to load them successfully. If I understand correctly, in the synthetic data experiments the whole dataset is loaded into RAM.
After that I was getting CUDA OOM. The culprit seemed to be
Faster-VoxelPose/lib/core/function.py
Lines 69 to 70 in 4daaeda
so I removed that part.
The model seemed to train at first, but then I started getting NaN tensors and the accumulated loss was NaN as well.
During validation, the debug_save_imgs function actually works and I can see the predicted 3D poses and their projections. However, when I print the final_fused_poses variable, each joint is always [0,0,0,-1..].
The error for each actor is stuck at 0.0.
I have tried training both on an RTX 3080 (where I can't use torch 1.4.0, since CUDA 11 is the minimum version supported by the RTX 30xx series) and on Colab, where I installed the requirements from the requirements.txt file.
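Thank you for sharing such elegant code. I have a question: the README explains how to use train to train a model and evaluate to assess it, but where do I go to run prediction? For example, I have already trained a model and now want to use three Campus photos to produce a fused 3D result with it. How should I proceed? Thanks!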
Dear authors:
I am grateful for your paper and code. When I run this project to reproduce the results in your paper, my result is about 2 mm worse. Could you explain why?
Does your code correspond to this setting: [5 views; mask; weights;]?
My conda environment is shown in the picture.
My GPU is an RTX 3090, with CUDA 11.3 and torch 1.11.0.
I ran training by following the README (data preparation, preprocessing, using the same config file, etc.).
However, I found some performance degradation on all datasets.
I wonder if it's because I missed something in the process (for example, should data augmentation be set to true?).
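I have some questions about the Campus data:
After downloading the original data, the rotation matrix of each camera's extrinsics matches the one from the link in this repository, but the translation vectors differ. The extrinsic translation vectors of the three cameras are [[-1.787557, 1.361094, 5.226973], [4.9229, 1.1614, 6.6849], [-4.9013, 0.5299, 11.2024]].
The translation vectors in the camera parameters downloaded from the repository, however, are [[1774.8953318252247, -5051.695948238737, 1923.3559877015355], [-6240.579909342256, 5247.348264374987, 1947.3802148598609], [11943.56106545541, -1803.8527374133198, 1973.3939116534714]]. How can this be explained?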
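One observation that may help, although the authors do not confirm it in this thread: the two sets of vectors have the same lengths up to a factor of about 1000. That would be consistent with the repository file storing a camera position in millimetres (for example T = -Rᵀt) rather than the raw translation t in metres. The sketch below only compares the norms of the numbers quoted above; it does not use the rotation matrices, so it cannot prove which convention is used.

```python
import math

# Translation vectors quoted in the question: raw Campus calibration vs. the repo's camera file.
raw_t = [
    [-1.787557, 1.361094, 5.226973],
    [4.9229, 1.1614, 6.6849],
    [-4.9013, 0.5299, 11.2024],
]
repo_T = [
    [1774.8953318252247, -5051.695948238737, 1923.3559877015355],
    [-6240.579909342256, 5247.348264374987, 1947.3802148598609],
    [11943.56106545541, -1803.8527374133198, 1973.3939116534714],
]


def norm(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(x * x for x in v))


# A rotation preserves length, so |-R^T t| == |t|; only a unit change rescales it.
ratios = [norm(T) / norm(t) for t, T in zip(raw_t, repo_T)]
for cam, ratio in enumerate(ratios):
    print(f"camera {cam}: |T_repo| / |t_raw| = {ratio:.3f}")
```

If the ratio is close to 1000 for every camera, converting with the rotation matrices from the same files is the natural next check.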
Hello, how should I run the model in real time with a camera? Thank you.
I'm using Faster-VoxelPose for a project, and I'm experiencing issues during the visualization stage. I suspect these issues might be related to incorrect camera calibration. The results I'm getting are not as expected, especially when using the Shelf and Campus datasets.
Could you please advise on the correct process for camera calibration in Faster-VoxelPose to ensure accurate visualization? Are there any specific parameters or calibration techniques that I should be aware of?
Any tips or common mistakes to avoid in this process would also be highly appreciated.
Thank you for your support.
Thanks for your work! When I run "python run/train.py --cfg configs/panoptic/jln64.yaml" with GPUS: '0,1' set in jln64.yaml, I get the following RuntimeError:
Exception has occurred: RuntimeError
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/newerDisk/zzh/anaconda3/envs/fvpose/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/newerDisk/zzh/anaconda3/envs/fvpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "run/../lib/models/voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "/newerDisk/zzh/anaconda3/envs/fvpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "run/../lib/models/human_detection_net.py", line 81, in forward
feature_cubes = self.project_layer(heatmaps, meta, cameras, resize_transform)
File "/newerDisk/zzh/anaconda3/envs/fvpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "run/../lib/models/project_whole.py", line 80, in forward
sample_grids[c] = self.project_grid(cameras[curr_seq][c], w, h, nbins, resize_transform, device).squeeze(0)
File "run/../lib/models/project_whole.py", line 52, in project_grid
xy = do_transform(xy, resize_transform)
File "run/../lib/utils/transforms.py", line 62, in affine_transform_pts_cuda
out = torch.mm(t, torch.t(pts_homo))
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:277
File "/newerDisk/zzh/python-projects/Faster-VoxelPose/lib/core/function.py", line 42, in train_3d
cameras=cameras, resize_transform=resize_transform)
File "/newerDisk/zzh/python-projects/Faster-VoxelPose/run/train.py", line 140, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/newerDisk/zzh/python-projects/Faster-VoxelPose/run/train.py", line 170, in <module>
main()
RuntimeError: Caught RuntimeError in replica 1 on device 1.
The GPUs used are two GTX 1080 Ti cards. The OS is Ubuntu 16.04.7 LTS, the CUDA version is 10.2, and the PyTorch version is 1.4.0.
There is no problem when running on a single GPU; the error above only happens with multiple GPUs.
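A possible workaround, assuming the failure comes from resize_transform staying on one GPU while the points of replica 1 live on another (the traceback points that way, but the thread does not confirm it), is to move the matrix to the device of the points before multiplying. The function below is a hypothetical stand-in for affine_transform_pts_cuda; only the torch.mm(t, torch.t(pts_homo)) line is taken from the traceback, and the rest is an assumption about its shape conventions.

```python
import torch


def affine_transform_pts(pts, t):
    """Apply a 2x3 affine matrix `t` to an (N, 2) tensor of points.

    Hypothetical stand-in for the repo helper; the relevant line is the `.to(...)` call.
    """
    # Put the matrix on whatever device (and dtype) this replica's points use.
    t = t.to(device=pts.device, dtype=pts.dtype)
    ones = torch.ones(pts.shape[0], 1, device=pts.device, dtype=pts.dtype)
    pts_homo = torch.cat([pts, ones], dim=1)   # (N, 3) homogeneous coordinates
    out = torch.mm(t, torch.t(pts_homo))       # (2, N)
    return torch.t(out[:2, :])                 # back to (N, 2)


# CPU-only check with a scale-and-shift transform.
pts = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
t = torch.tensor([[2.0, 0.0, 1.0], [0.0, 2.0, -1.0]])
moved = affine_transform_pts(pts, t)
print(moved.tolist())  # [[3.0, 3.0], [7.0, 7.0]]
```

The same idea applies to any other tensor that is created once outside forward and then shared across DataParallel replicas.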
Dear author, thank you for the source code. After installing Python 3.8, I installed the required libraries according to requirements.txt. But when I execute:
python run/train.py --cfg configs/campus/jln64.yaml
the process is killed before it completes a single epoch.
The visualization goes wrong at the line camera[curr_seq][c]: curr_seq and c both take the values 0, 1, 2, 3, 4, but the second parameter should be a letter such as k for the Shelf camera settings.
When I train the model on the Panoptic dataset, this error occurs.
Traceback (most recent call last):
File "/home/vis/projects/Faster-VoxelPose/run/train.py", line 168, in <module>
main()
File "/home/vis/projects/Faster-VoxelPose/run/train.py", line 138, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/home/vis/projects/Faster-VoxelPose/run/../lib/core/function.py", line 66, in train_3d
accu_loss.backward()
File "/home/vis/anaconda3/envs/pose/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/vis/anaconda3/envs/pose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
"RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
[torch.cuda.FloatTensor [1, 32, 1]] is at version 8; expected version 6 instead. Hint: the backtrace further above shows the operation
that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!"
I think the problem is the code below:
if loss_joint > 0:
    optimizer.zero_grad()
    loss_joint.backward()
    optimizer.step()
if accu_loss > 0 and (i + 1) % accumulation_steps == 0:
    optimizer.zero_grad()
    accu_loss.backward()
    optimizer.step()
    accu_loss = 0.0
else:
    accu_loss += (loss_2d + loss_1d + loss_bbox) / accumulation_steps
There are two losses that call backward().
If I use only one of them, it works well, but using both produces the error above.
Has anyone solved this problem?
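The failure is consistent with how autograd tracks parameter versions: optimizer.step() after loss_joint.backward() updates the weights in place, while accu_loss still holds graphs from earlier iterations that expect the old weights. One way around it, sketched below under toy assumptions, is to call backward() on every iteration so that gradients (not graphs) accumulate, and to step only once per window. The model, data and loss stand-ins are hypothetical, and this changes the update schedule compared with the original loop, so treat it as a starting point and not as the authors' fix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_steps(model, optimizer, batches, accumulation_steps=4):
    """Accumulate gradients across iterations and step once per window."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        pred = model(x)
        # Toy stand-ins for the repo's loss terms (hypothetical).
        loss_det = F.mse_loss(pred, y)    # role of loss_2d + loss_1d + loss_bbox
        loss_joint = F.l1_loss(pred, y)   # role of loss_joint
        # Backward right away: gradients add up in .grad and the graph is freed,
        # so no stale graph survives a later optimizer.step().
        (loss_det + loss_joint).div(accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()


torch.manual_seed(0)
model = nn.Linear(3, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
batches = [(torch.randn(5, 3), torch.randn(5, 2)) for _ in range(8)]

before = model.weight.detach().clone()
train_steps(model, optimizer, batches)
after = model.weight.detach().clone()
```

Because every graph is consumed by backward() before any step() runs, the version check that raises the RuntimeError has nothing stale to complain about.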
Doesn't it look like Best.pth.tar in repo?
Hello! How many cameras did you use in the context of the article?
How do I make predictions on my own dataset?
Hello,
Thank you for putting together such a brilliant repository.
I want to further train the backbone as well, which I believe is from here: https://github.com/HRNet/HRNet-Human-Pose-Estimation
However, FasterVoxelPose's ResNet seems to be incompatible with HRNet because of output dimensions.
Could you also include the code you used for this part in the readme?
"(ResNet-50 pretrained on COCO dataset and finetuned jointly on Panoptic dataset and MPII)"
Alternatively, could you quickly explain how you modified it? I want to make sure that I don't run into any incompatibility issues.
Thank you again.
Dear authors,
I conducted several tests and can't reproduce the FPS metric you mention in "Table 3: Comparison with SOTA on Panoptic".
My specs:
Python 3.6.10
torch.__version__: '1.4.0'
torch.version.cuda: '10.1'
GPU: Quadro RTX 6000
Batch size: 1
Example cameras, scene and frame:
Cameras list: ['00_03', '00_06', '00_12', '00_13']
Scene: 160906_pizza1
Frame: 00000347
Config: only difference: CAMERA_NUM: 4
The test I do is without data loading, just inference in the loop:
from imutils.video import FPS
fps = FPS().start()
for i in range(120):
    final_poses, poses, proposal_centers, _, input_heatmap = model(views=inputs,
                                                                   meta=meta,
                                                                   cameras=cameras,
                                                                   resize_transform=resize_transform)
    fps.update()
fps.stop()
print("[INFO] elasped time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
[INFO] elasped time: 9.80
[INFO] approx. FPS: 12.24
The FPS I get is ~12, so considerably lower than ~30 that is claimed in the paper.
Am I missing something here?
Thank you!
EDIT:
I did the same test on another machine:
Python 3.6.13
torch.__version__: '1.7.1+cu110'
GPU: Tesla V100
Other parameters were the same
[INFO] elasped time: 11.99
[INFO] approx. FPS: 10.01
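A few measurement details can move a number like this, none of which is confirmed as the cause of the gap here: whether warm-up iterations are excluded, whether the model runs in eval mode under torch.no_grad(), and whether queued GPU work is synchronized before the clock is read. The helper below is a framework-agnostic sketch; measure_fps and its parameters are illustrative names, not part of the repository.

```python
import time


def measure_fps(step, iters=120, warmup=10, sync=None):
    """Return iterations per second for `step()`, excluding warm-up runs.

    `sync` is an optional callable that blocks until pending device work is done,
    for example torch.cuda.synchronize when timing a CUDA model.
    """
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return iters / elapsed


# CPU-only stand-in workload so the sketch runs anywhere.
fps = measure_fps(lambda: sum(range(1000)), iters=50, warmup=5)
print("approx. FPS: {:.2f}".format(fps))
```

With the model from the post, step would wrap the model(views=inputs, meta=meta, cameras=cameras, resize_transform=resize_transform) call and sync would be torch.cuda.synchronize.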
Hello, I am trying to use this model in a real-world setting.
Could you please explain how the axes should be defined during calibration? (I have them as x, z, y, with z being the height dimension.)
It appears that each camera predicts its own pose, and these poses end up in completely different positions. Nonetheless, the projection back to a camera works fine when the prediction comes from that same camera.
Dear Authors,
I tried to train the network, but the memory usage on GPU keeps increasing with every iteration.
My specs:
torch: 1.4.0
torchvision: 0.5.0
cuda: 10.1
Steps to reproduce:
I tried to pinpoint the exact spot where the memory leak occurs, but it seems to happen in different places in the network. The major increase seems to happen:
I use GPUtil to check the memory, with print statements placed at the following points:
`if views is not None:
    input_heatmaps = torch.stack([self.backbone(view) for view in views], dim=0)
else:
    input_heatmaps = torch.stack(input_heatmaps, dim=0)
print("after self.backbone", GPUtil.getGPUs()[0].memoryUsed)
batch_size = input_heatmaps[0].shape[0]
# human detection network
proposal_heatmaps_2d, proposal_heatmaps_1d, proposal_centers, \
    bbox_preds = self.pose_net(input_heatmaps, meta)
print("after self.pose_net", GPUtil.getGPUs()[0].memoryUsed)
mask = (proposal_centers[:, :, 3] >= 0)
# joint localization network
fused_poses, poses = self.joint_net(meta, input_heatmaps, proposal_centers.detach(), mask)
print("after self.joint_net", GPUtil.getGPUs()[0].memoryUsed)
# compute the training loss
if self.training:
    assert targets is not None, 'proposal ground truth not set'
    proposal2gt = proposal_centers[:, :, 3]
    proposal2gt = torch.where(proposal2gt >= 0, proposal2gt, torch.zeros_like(proposal2gt))
    # compute 2d loss of proposal heatmaps
    loss_2d = F.mse_loss(proposal_heatmaps_2d[:, 0], targets['2d_heatmaps'], reduction='mean')
    # unravel the 1d gt heatmaps and compute 1d loss
    matched_heatmaps_1d = torch.gather(targets['1d_heatmaps'], dim=1, index=proposal2gt.long()\
        .unsqueeze(2).repeat(1, 1, proposal_heatmaps_1d.shape[2]))
    loss_1d = F.mse_loss(proposal_heatmaps_1d[mask], matched_heatmaps_1d[mask], reduction='mean')
    # compute the loss of bbox regression, only apply supervision on gt positions
    bbox_preds = torch.gather(bbox_preds, 1, targets['index'].long().view(batch_size, -1, 1).repeat(1, 1, 2))
    loss_bbox = F.l1_loss(bbox_preds[targets['mask']], targets['bbox'][targets['mask']], reduction='mean')
    del proposal_heatmaps_2d, proposal_heatmaps_1d, bbox_preds
    # weighted L1 loss of joint localization
    joints_3d = torch.gather(meta[0]['joints_3d'].float(), dim=1, index=proposal2gt.long().view\
        (batch_size, -1, 1, 1).repeat(1, 1, self.num_joints, 3))[mask]
    joints_vis = torch.gather(meta[0]['joints_3d_vis'].float(), dim=1, index=proposal2gt.long().view\
        (batch_size, -1, 1).repeat(1, 1, self.num_joints))[mask].unsqueeze(2)
    loss_joint = F.l1_loss(poses[0][mask] * joints_vis, joints_3d[:, :, :2] * joints_vis, reduction="mean") +\
        F.l1_loss(poses[1][mask] * joints_vis, joints_3d[:, :, ::2] * joints_vis, reduction="mean") +\
        F.l1_loss(poses[2][mask] * joints_vis, joints_3d[:, :, 1:] * joints_vis, reduction="mean") +\
        2 * F.l1_loss(fused_poses[mask] * joints_vis, joints_3d * joints_vis, reduction="mean")
    loss_dict = {
        "2d_heatmaps": loss_2d,
        "1d_heatmaps": loss_1d,
        "bbox": 0.1 * loss_bbox,
        "joint": loss_joint,
        "total": loss_2d + loss_1d + 0.1 * loss_bbox + loss_joint
    }
else:
    loss_dict = None
print("after lossess block", GPUtil.getGPUs()[0].memoryUsed)
# confidence score
fused_poses = torch.cat([fused_poses, proposal_centers[:, :, 3:5].reshape(batch_size,\
    -1, 1, 2).repeat(1, 1, self.num_joints, 1)], dim=3)
return fused_poses, poses, proposal_centers.detach(), loss_dict, input_heatmaps`
Readings are as follows.
1st iteration
after self.backbone 2274.0
after self.pose_net 1680.0
after self.joint_net 1680.0
after lossess block 1680.0
2nd iteration
after self.backbone 2630.0
after self.pose_net 2640.0
after self.joint_net 2640.0
after lossess block 2640.0
3rd iteration
after self.backbone 2880.0
after self.pose_net 2890.0
after self.joint_net 2890.0
after lossess block 2890.0
4th iteration
after self.backbone 3132.0
after self.pose_net 3140.0
after self.joint_net 3140.0
after lossess block 3142.0
And it goes on like this until OOM.
Did you experience similar problems?
Thank you for your help.
When I trained the model on the Panoptic dataset, I ran into the problem below. I am using torch 1.13 and CUDA 11.8.
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 41, in train_3d
final_poses, poses, proposal_centers, loss_dict, input_heatmap = model(views=inputs, meta=meta, targets=targets,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/human_detection_net.py", line 94, in forward
proposal_heatmaps_1d = self.c2c_net(torch.flatten(feature_1d, 0, 1)).view(batch_size, self.max_people, -1)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/cnns_1d.py", line 131, in forward
hm = self.output_hm(x)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 71, in train_3d
accu_loss.backward()
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1]] is at version 7; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Hi, could we use this model on a previously unseen video?
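My training environment is as follows:
python 3.7
torch 1.4
GPU: RTX 3080 Ti with 12 GB of memory.
To save GPU memory, I set the batch size to 1 and the SYNTHETIC NUM_DATA to 1000.
When running the train.py provided by the author, an out-of-memory error is raised shortly after epoch 0 starts.
The error message is as follows: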
`Epoch: 0
Save the sampling grid in HDN for sequence synthetic
Epoch: [0][0/1000] Time: 563.691s (563.691s) Speed: 0.0 samples/s Data: 6.174s (6.174s) Loss: nan (nan) Loss_2d: 0.0008510 (0.0008510) Loss_1d: nan (nan) Loss_bbox: 0.012933 (0.012933) Loss_joint: nan (nan) Memory 292969472.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000000
Save the sampling grid in JLN for sequence synthetic
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000100
Epoch: [0][100/1000] Time: 0.078s (5.717s) Speed: 12.8 samples/s Data: 0.000s (0.064s) Loss: nan (nan) Loss_2d: 0.0008510 (nan) Loss_1d: nan (nan) Loss_bbox: 0.015552 (383651026093232355625853440229376.000000) Loss_joint: nan (nan) Memory 2886614528.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000200
Epoch: [0][200/1000] Time: 0.077s (2.913s) Speed: 13.0 samples/s Data: 0.000s (0.034s) Loss: nan (nan) Loss_2d: 0.0050989 (nan) Loss_1d: nan (nan) Loss_bbox: 0.039250 (1752363751526759131174129536860160.000000) Loss_joint: nan (nan) Memory 5246115328.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000300
Epoch: [0][300/1000] Time: 0.079s (1.973s) Speed: 12.7 samples/s Data: 0.000s (0.024s) Loss: nan (nan) Loss_2d: inf (nan) Loss_1d: nan (nan) Loss_bbox: 0.025086 (1170183103178998798672389530976256.000000) Loss_joint: nan (nan) Memory 7605616128.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
Epoch: [0][400/1000] Time: 0.078s (1.502s) Speed: 12.9 samples/s Data: 0.000s (0.019s) Loss: nan (nan) Loss_2d: 0.0042343 (nan) Loss_1d: nan (nan) Loss_bbox: 0.034341 (878474771048066240862024315699200.000000) Loss_joint: nan (nan) Memory 9965116928.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000400
Traceback (most recent call last):
File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 171, in <module>
main()
File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 140, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\core\function.py", line 45, in train_3d
cameras=cameras, resize_transform=resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\parallel\data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\human_detection_net.py", line 81, in forward
feature_cubes = self.project_layer(heatmaps, meta, cameras, resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\project_whole.py", line 84, in forward
cubes[i] = torch.mean(F.grid_sample(heatmaps[i], shared_sample_grid, align_corners=True), dim=0).squeeze(0)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\functional.py", line 2711, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 12.00 GiB total capacity; 11.21 GiB already allocated; 0 bytes free; 11.26 GiB reserved in total by PyTorch)
Process finished with exit code 1
`
Could anyone suggest other ways to reduce GPU memory usage during training, so that it fits within 12 GB?
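Before tuning batch size or resolution, it may be worth looking at how the Memory field in the log above evolves. The arithmetic below uses only the logged values and assumes they are in bytes, which the log does not state explicitly. It shows the same increase between every pair of later log lines, a pattern that suggests something is retained on each iteration and that the per-batch footprint alone is not the problem.

```python
# "Memory" values logged at iterations 0, 100, 200, 300 and 400 (assumed to be bytes).
memory_bytes = [292969472, 2886614528, 5246115328, 7605616128, 9965116928]
iters_between_logs = 100

# Growth between consecutive log lines, and the implied growth per iteration in MiB.
deltas = [b - a for a, b in zip(memory_bytes, memory_bytes[1:])]
per_iter_mib = [d / iters_between_logs / 2**20 for d in deltas]

for start, delta, mib in zip(range(0, 400, 100), deltas, per_iter_mib):
    print(f"iters {start}-{start + 100}: +{delta} bytes, about {mib:.1f} MiB per iteration")
```

A common cause of this pattern in PyTorch, which the thread does not confirm for this code, is keeping loss tensors (and therefore their autograd graphs) alive across iterations, for example while the loss is NaN and the accumulated value is never reset.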
Dear author, I recently tried to run your code on my server. However, I get 'NAN or INF found in tensor' as soon as training starts.
I used torch==1.4.0, cudatoolkit==10.1, torchvision==0.5.0, and everything else is the same as in requirements.txt.
I changed GPUS: '0,1' to GPUS: '0' and NUM_DATA: 1000 to NUM_DATA: 500.
I trained the model for 30 epochs, but it still shows 'NAN or INF found in tensor'.
In other people's issues, I saw that you seem to have already solved this problem. Is the current code the new version that fixes it?
Dear author! When I try to debug this problem, it seems to come from the PyTorch dataloader. How can I solve it?
Hello!
I ran into some problems and failed to get the dataset.
I tried to visit domedb.perception.cs.cmu.edu/, but it redirects me to https://domedb.perception.cs.cmu.edu:5001/, which requires a login.
I also tried to get the dataset with the script ./scripts/getData.sh 171204_pose1_sample, and got the result below:
Connecting to domedb.perception.cs.cmu.edu (domedb.perception.cs.cmu.edu)|128.2.220.8|:5001... connected.
ERROR: cannot verify domedb.perception.cs.cmu.edu's certificate, issued by '[email protected],CN=Synology Inc. CA,OU=Certificate Authority,O=Synology Inc.,L=Taipei,ST=Taiwan,C=TW':
Unable to locally verify the issuer's aut
Thank you for your perfect work. I have a question: how can I visualize the results for multiple people?
Thank you very much.
Dear Authors,
Thank you for publishing your inspiring work.
In the article you mention that you "deployed our model to a basketball court and a retail store". Did this model need any kind of retraining or fine-tuning to account for the new camera setup or calibration?
Thank you