brummi / behindthescenes Goto Github PK

Official implementation of the paper: Behind the Scenes: Density Fields for Single View Reconstruction (CVPR 2023)

Python 99.76% Shell 0.24%

3d-reconstruction depth-estimation depth-prediction kitti kitti-360 kitti-dataset nerf self-supervised self-supervised-learning cvpr

behindthescenes's Issues

Does a larger MLP affect the final results?

Hi Brummi,

Thanks a lot for the awesome code! I noticed you use a quite small MLP to render the density field, e.g., ResnetFC with small hidden dimension channels (64) and without any ResNet blocks (0).

I wonder whether, in your previous experiments, a larger MLP with 1) larger hidden dimensions or 2) more layers led to worse results.

Thanks a lot for the information :)

Inferior scores of both provided models and trained models

Hi Brummi,

I evaluated your provided models on the KITTI-raw and KITTI-360 datasets, and both yielded results below those reported in the paper.

KITTI-360

Test images: the unzipped PNG images (without preprocessing)
My evaluated results: o_acc: 0.944 | ie_acc: 0.771 | ie_rec: 0.439
Results in the paper: o_acc: 0.95 | ie_acc: 0.82 | ie_rec: 0.47

KITTI-raw

Test images: KITTI-raw images (converted to .jpg as in monodepth2)
My evaluated results: abs_rel: 0.102 | rmse: 4.409 | a1: 0.881
Results in the paper: abs_rel: 0.102 | rmse: 4.407 | a1: 0.882

Even with your provided model, there is a large evaluation gap on KITTI-360: for ie_acc, I get 0.771 vs. 0.82 in the paper. The KITTI-raw scores differ only slightly from yours, but the numbers are still not exactly the same. I would like to check:

Should I use the preprocessed images for the KITTI-360 evaluation?
Could some Python environment settings influence the scores? Currently, I use PyTorch 2.0.

I also observed a further performance decline with my own trained models: for KITTI-raw, abs_rel: 0.104 | rmse: 4.554 | a1: 0.874; for KITTI-360, o_acc: 0.948 | ie_acc: 0.784 | **ie_rec: 0.369**. Can you provide some suggestions for faithfully reproducing your results?

Thank you for your information!
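
For readers comparing numbers like these, a minimal sketch of the usual definitions of the KITTI depth metrics (abs_rel, sq_rel, rmse, rmse_log, a1) may help. It assumes ground-truth pixels with value 0 are invalid and applies no depth clamping or scaling; the repository's own evaluation may differ in those details, so small gaps can come from such choices. The function name `depth_metrics` is hypothetical.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Common monocular-depth metrics, computed over pixels where gt > 0."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    valid = gt > 0                      # assumption: 0 marks missing ground truth
    gt, pred = gt[valid], pred[valid]
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),  # share of pixels within 25% of gt
    }
```

Because a1 uses a hard threshold, even tiny changes in the input (for example, JPEG vs. PNG decoding) can flip pixels across it and shift the third decimal.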

How to generate the training data?

Hi,
When I launch the training process, I get the error below.
I think the training needs preprocessed data, not the raw data.

  File "/rockywin.wang/NeRF/BehindTheScenes/datasets/kitti_360/kitti_360_dataset.py", line 571, in __getitem__
    imgs_p_left, imgs_f_left, imgs_p_right, imgs_f_right = self.load_images(sequence, img_ids, load_left, load_right, img_ids_fish=img_ids_fish)
  File "/rockywin.wang/NeRF/BehindTheScenes/datasets/kitti_360/kitti_360_dataset.py", line 442, in load_images
    img_perspective = cv2.cvtColor(cv2.imread(os.path.join(self.data_path, "data_2d_raw", seq, "image_00", self._perspective_folder, f"{id:010d}.png")), cv2.COLOR_BGR2RGB).astype(np.float32) / 255
cv2.error: OpenCV(4.5.3) /tmp/pip-req-build-afu9cjzs/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'
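
The `!_src.empty()` assertion typically means `cv2.imread` could not read the file and returned nothing, so a quick check of which frames are missing on disk can narrow things down. The sketch below rebuilds the path pattern visible in the traceback; the helper names are hypothetical, and the value of the perspective folder (`self._perspective_folder` in the traceback) is not shown in this thread, so it is passed in as a parameter.

```python
import os

def perspective_image_path(data_path, seq, perspective_folder, frame_id):
    """Rebuild the image path used in the traceback for one frame."""
    return os.path.join(
        data_path, "data_2d_raw", seq, "image_00",
        perspective_folder, f"{frame_id:010d}.png",
    )

def find_missing_frames(data_path, seq, perspective_folder, frame_ids):
    """Return the frame ids whose perspective image does not exist on disk."""
    return [
        i for i in frame_ids
        if not os.path.isfile(
            perspective_image_path(data_path, seq, perspective_folder, i)
        )
    ]
```

If every frame is reported missing, the folder name expected by the loader probably differs from the one in the raw download, which would be consistent with a preprocessing step being required.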

Experiment settings of other works mentioned in the paper

Thanks for your great work! I wonder what the experiment settings of these 3 works are: mono, stereo, or with fisheye?

Wrong depth projection of kitti360 dataset

Hi, I noticed that the load_depth function in kitti_360_dataset.py appears to be wrong because a filter such as
points = points[points[:, 0] >= 0, :] is missing. As a result, points behind the sensor can mask out points in front of it in the subsequent projection.
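
The effect described here can be reproduced with a small pinhole projection sketch. It is not the repository's load_depth; it works in the camera frame and filters on z > 0, whereas the filter quoted above works in the LiDAR frame, where x is assumed to point forward. The function name and the intrinsics in the usage note are hypothetical.

```python
import numpy as np

def project_points(points_cam, K, keep_front_only=True):
    """Project Nx3 camera-frame points to pixel coordinates.

    With keep_front_only=False, a point behind the camera lands on the same
    pixel as its mirror image in front of it, because the perspective divide
    cancels the sign.
    """
    points_cam = np.asarray(points_cam, dtype=np.float64)
    if keep_front_only:
        points_cam = points_cam[points_cam[:, 2] > 0, :]
    uvw = points_cam @ np.asarray(K, dtype=np.float64).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return uv, points_cam[:, 2]
```

For example, with K = [[100, 0, 50], [0, 100, 50], [0, 0, 1]], the points (1, 0.5, 10) and (-1, -0.5, -10) both project to pixel (60, 55) when the filter is off, so the one behind the camera could overwrite the valid depth.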

Is the result of this method on KITTI-Raw based on stereo cameras?

Is the result of this method on KITTI-Raw based on stereo cameras? I would like to know whether the performance remains competitive with only a monocular camera or with surround cameras.

Some details I feel confused about

Nice work! There are some details I am confused about, such as the code below. Why do you define two evaluators? They appear identical except for when they are called. There might be some engineering strategy at play, and I would appreciate the opportunity to learn from it.

# We define two evaluators as they wont have exactly similar roles:
# - `evaluator` will save the best model based on validation score
evaluator = create_evaluator(model, metrics=eval_metrics, criterion=criterion if loss_during_validation else None, config=config)

if vis_loader is not None:
    visualizer = create_evaluator(model, metrics=eval_metrics, criterion=criterion if loss_during_validation else None, config=config)
else:
    visualizer = None

Will the evaluation code be released?

I'm asking because there haven't been any commits for a few months and the evaluation code has not been released yet.

Details about the key parameter ray_batch_size: 4096

Hey Brummi, another amazing piece of work after MonoRec!
I have a question about ray_batch_size: is this an empirically chosen parameter?
If I train BTS on a custom dataset, does this parameter have a big impact?

"FileNotFoundError: [Errno 2] No such file or directory: 'scripts/videos/trajectories/simple_movement.npy'"

Hi, @Brummi
Can you show me example code for generating the camera trajectories,
so that I can set up camera trajectories for our own dataset?
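
The layout of simple_movement.npy is not described in this thread, so the following is only a sketch under a loud assumption: that a trajectory is an array of 4x4 camera poses relative to the input view, with shape (N, 4, 4). The function name, the axis convention (x to the right, z forward), and the motion pattern are all hypothetical and should be checked against how the video scripts load the file.

```python
import numpy as np

def make_forward_sway_trajectory(n_frames=60, forward=4.0, sway=0.5):
    """Build (n_frames, 4, 4) poses that move forward while swaying sideways.

    Assumed convention: identity rotation, x to the right, z forward.
    """
    t = np.linspace(0.0, 1.0, n_frames)
    poses = np.tile(np.eye(4), (n_frames, 1, 1))
    poses[:, 0, 3] = sway * np.sin(2.0 * np.pi * t)  # sideways oscillation
    poses[:, 2, 3] = forward * t                      # steady forward motion
    return poses
```

If the assumption holds, the array could be written with np.save("my_trajectory.npy", make_forward_sway_trajectory()) and the script pointed at that file.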

question about visualizing ground truth depth

Hi, I'm currently testing on the KITTI-360 dataset and trying to visualize the ground truth depth map of a data sample (the output returned by the load_depth function of Kitti360Dataset). However, it doesn't show the proper depth profile of the corresponding scene image, as shown below:
real image:

visualized gt_depth map:

Is there any additional processing I need to apply to the depth maps in order to visualize them properly? Thanks

How to visualize Fig.4 in your paper?

Thanks for your great work! I am interested in visualizing occupancies behind the scenes as depicted in Fig. 4 of your paper. I succeeded in training on KITTI-360 and generating depth sequences, but the transition to BEV is not very successful. Do you use the script python scripts/videos/gen_vid_transition.py to get results similar to Fig. 4?

Question about the log file

Thanks for your great work! The log file confuses me a little. Take epoch 2 as an example: it appears 5 times. Could you please explain why?
Epoch 1 - Evaluation time (seconds): 4.40 - Vis metrics:
abs_rel: 0.16088220118284555
sq_rel: 1.6475946958800654
rmse: 6.0004214998035215
rmse_log: 0.2652074425737867
a1: 0.8132480978965759
a2: 0.9111361503601074
a3: 0.9498652219772339
2023-07-08 16:39:02,464 kitti_raw INFO: Epoch[1] Complete. Time taken: 00:46:03.668
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[2023-07-08 16:39:19,277][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|███████████████████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 16:39:23,725][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.238
[2023-07-08 16:39:23,725][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.447
2023-07-08 16:39:23,809 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 4.45 - Vis metrics:
abs_rel: 0.23017341934595395
sq_rel: 3.088896086703511
rmse: 7.271316281467014
rmse_log: 0.3078563333555099
a1: 0.7813498377799988
a2: 0.8981010913848877
a3: 0.9393996000289917
[2023-07-08 16:47:56,868][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|███████████████████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 16:48:01,295][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.211
[2023-07-08 16:48:01,296][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.427
2023-07-08 16:48:01,401 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 4.43 - Vis metrics:
abs_rel: 0.1650394400605743
sq_rel: 1.998698890271272
rmse: 5.943043702966046
rmse_log: 0.2682881054646016
a1: 0.8516638278961182
a2: 0.9178416728973389
a3: 0.9528733491897583
[2023-07-08 16:56:36,516][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|███████████████████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 16:56:40,894][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.135
[2023-07-08 16:56:40,895][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.378
2023-07-08 16:56:40,991 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 4.38 - Vis metrics:
abs_rel: 0.14595142936506972
sq_rel: 1.4979753871574057
rmse: 5.749837607629393
rmse_log: 0.26670149436806356
a1: 0.8354953527450562
a2: 0.9211630821228027
a3: 0.9507426023483276
conda activate base
[2023-07-08 17:05:13,192][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[2023-07-08 17:08:11,160][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:02:57.756
[2023-07-08 17:08:11,161][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:02:57.969
2023-07-08 17:08:11,282 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 177.97 - Test metrics:
abs_rel: 0.11900233822874062
sq_rel: 0.8671512578064269
rmse: 4.559270895221629
rmse_log: 0.199316740456192
a1: 0.8598688319325447
a2: 0.9558506403118372
a3: 0.9797724287491292
[2023-07-08 17:08:11,282][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|███████████████████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 17:08:15,634][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.166
[2023-07-08 17:08:15,635][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.353
2023-07-08 17:08:15,740 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 4.35 - Vis metrics:
abs_rel: 0.13582281319264153
sq_rel: 1.4082434631998875
rmse: 6.013943428623624
rmse_log: 0.28424850291586035
a1: 0.8557999134063721
a2: 0.923481822013855
a3: 0.9515572786331177
[2023-07-08 17:16:50,828][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|███████████████████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 17:16:55,211][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.195
[2023-07-08 17:16:55,212][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.384
2023-07-08 17:16:55,296 kitti_raw INFO:
Epoch 2 - Evaluation time (seconds): 4.38 - Vis metrics:
abs_rel: 0.17579734838730143
sq_rel: 2.5830991379591985
rmse: 6.899043454084597
rmse_log: 0.30531200850440265
a1: 0.8277871608734131
a2: 0.9161496162414551
a3: 0.941655695438385
2023-07-08 17:25:02,273 kitti_raw INFO: Epoch[2] Complete. Time taken: 00:45:59.807
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[2023-07-08 17:25:31,259][ignite.engine.engine.Engine][INFO] - Engine run starting with max_epochs=1.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Evaluation (val): [1/1] 100%|████████████████████████████████████████████████████████████████████████ [00:00<?]Visualizing
[2023-07-08 17:25:35,633][ignite.engine.engine.Engine][INFO] - Epoch[1] Complete. Time taken: 00:00:04.175
[2023-07-08 17:25:35,634][ignite.engine.engine.Engine][INFO] - Engine run complete. Time taken: 00:00:04.374
2023-07-08 17:25:35,720 kitti_raw INFO:
Epoch 3 - Evaluation time (seconds): 4.37 - Vis metrics:
abs_rel: 0.15186456882564142
sq_rel: 2.0138081328306496
rmse: 6.890541053109251
rmse_log: 0.2912098378889751
a1: 0.846023678779602
a2: 0.9128282070159912
a3: 0.944538414478302

How to add depth supervised loss?

Hi,
I added the depth-supervised loss in the config file.
However, I found that the depth loss starts small and then oscillates upward.
The visualization of the depth map is also poor.

Some problems with this architecture

Hi, thanks for your work. I tried your model architecture on my custom data; here are some problems I ran into and my observations.

1. There are two key advantages of this model:
(1) It is self-supervised and needs only continuous video sequences with the relative [R|T] between adjacent frames. This greatly reduces manual labeling effort, and a high-precision localization algorithm or device can be used to obtain the camera poses automatically.
(2) It is not computationally intensive if we only want a 3D occupancy grid rather than rendered depth. For example, with x range [-8, 8], y range [-0.4, 2.2], z range [1, 21], and a voxel resolution of 0.2, there are about 100,000 sample points (80 × 13 × 100 = 104,000), and a single MLP inference with input [1, 100000, 103] is enough. The grid sample op can also be implemented easily as a CUDA kernel, and there are no 3D conv ops.
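
The point count in (2) can be checked with a short sketch that builds voxel centers for the ranges quoted above. The function name is hypothetical, and sampling at voxel centers (rather than corners) is an assumption; only the ranges and the resolution come from the post.

```python
import numpy as np

def make_voxel_centers(x_range, y_range, z_range, resolution):
    """Return an (N, 3) array of voxel-center coordinates for a regular grid."""
    axes = []
    for lo, hi in (x_range, y_range, z_range):
        n = int(round((hi - lo) / resolution))          # voxels along this axis
        axes.append(lo + (np.arange(n) + 0.5) * resolution)
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

pts = make_voxel_centers((-8, 8), (-0.4, 2.2), (1, 21), 0.2)
print(pts.shape)  # (104000, 3)
```

All points can then be pushed through the density MLP in a single batch, which is the "one inference" the post refers to.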

2. But to be honest, there are some inevitable problems associated with the model:
(1) The training signal depends on both the image quality and the [R|T] precision:
-- First, if the image contains reflections on the ground, as in an underground parking lot, then no matter how precise the [R|T] matrix is, the training signal will be vague and weak in these regions, because there is no way to tune the predicted density along the camera ray so that it finds the best stereo-matching point on the epipolar line of the rendered image.

-- Second, if the [R|T] matrix is not precise, the training signal will also not be clear enough to give good results. For example, with a monocular video sequence, road surface vibration while driving can make the [R|T] matrix imprecise, which may lead to suboptimal training results. With a stereo camera, however, the [R|T] between the left and right cameras is very precise and barely changes while driving, so the training signal is much clearer and the results are much better than with monocular data. So to get very good results, it seems the data collection car needs a stereo camera with a larger baseline when applied to outdoor driving scenarios.

(2) The generalization ability is weak
Even when trained with a stereo camera or with very precise [R|T] between adjacent frames, and with very good image quality free of reflections or artifacts, the generalization ability is not impressive. For example, a model trained on KITTI-raw or KITTI-360 performs very badly on a custom dataset without fine-tuning (zero-shot). The depth map is far from precise, especially in the road surface region. When fine-tuned on a custom monocular dataset, the model performs better but is still far from precise, especially in the road surface region, and texture-copy artifacts appear in the rendered depth map.

The problem cannot be solved even with a larger dataset, and I think it is an intrinsic limitation of the model. The key problem may lie in how the point feature in 3D space is constructed. In your paper, the 3D point feature consists of three components: 1. the image feature sampled at the projected pixel using interpolation; 2. a positional embedding [sin(fu) sin(fv) sin(fz) cos(fu) cos(fv) cos(fz) sin(2fu) sin(2fv) sin(2fz) cos(2fu) cos(2fv) cos(2fz) ...]; 3. the normalized position itself (u, v, z).
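
The three components listed above can be sketched as follows. This is an illustration, not the repository's code: the function names are hypothetical, and doubling the frequency at each level is an assumption (the list above only shows the first two levels, f and 2f). One split consistent with the 103-dimensional input mentioned earlier would be 64 feature channels + 6 × 6 encoding terms + 3 coordinates, but that split is also an assumption.

```python
import numpy as np

def positional_encoding(uvz, num_freqs=6, base_freq=1.0):
    """Encode (N, 3) normalized positions as [sin(f p), cos(f p), sin(2f p), ...]."""
    uvz = np.asarray(uvz, dtype=np.float64)
    blocks = []
    for k in range(num_freqs):
        f = base_freq * (2.0 ** k)      # assumed: frequency doubles per level
        blocks.append(np.sin(f * uvz))
        blocks.append(np.cos(f * uvz))
    return np.concatenate(blocks, axis=-1)   # (N, 6 * num_freqs)

def build_point_features(img_feat, uvz, num_freqs=6):
    """Concatenate image feature, positional encoding, and raw position."""
    uvz = np.asarray(uvz, dtype=np.float64)
    enc = positional_encoding(uvz, num_freqs)
    return np.concatenate([np.asarray(img_feat, dtype=np.float64), enc, uvz], axis=-1)
```

Only the first block of the resulting vector depends on image appearance, which is the part the post suspects of limiting cross-dataset generalization.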

I think this may be the key factor that weakens generalization: the image feature part of the point feature vector is probably the dominant factor in determining the decoded density at that point, and image features can differ greatly between dataset domains. So in a zero-shot setting on a completely new dataset, the model performs so badly that it has to be fine-tuned to adapt to the new image feature space. If the custom dataset has only a monocular camera, the training results will not be very good. This really hinders practical use in autonomous driving.

I wonder whether I have misunderstood something about this model. I would also like to know how to improve generalization and get good results on my custom monocular video sequences with moderately precise [R|T].

Why is the pose used for nerf calculation an identity matrix?

Hi,
I am confused: the pose matrix is inverted and then multiplied by itself. Isn't the result just the identity matrix?
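
The exact line in question is not quoted here, so the following is only a sketch of one common pattern that produces this effect: normalizing all poses by the inverse of a reference pose. Under that assumption, the reference frame does become the identity, while the other frames become poses relative to it. The function name is hypothetical.

```python
import numpy as np

def relative_poses(poses):
    """Express each 4x4 pose in the coordinate frame of the first pose."""
    poses = np.asarray(poses, dtype=np.float64)   # (N, 4, 4)
    ref_inv = np.linalg.inv(poses[0])
    return ref_inv @ poses                        # broadcasts over the N poses

def translation(x, y, z):
    """Helper for the example: a pure-translation pose."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

rel = relative_poses([translation(1, 2, 3), translation(2, 2, 3)])
# rel[0] is the identity; rel[1] keeps the 1-unit offset along x.
```

In this reading, an identity pose for the input view would be intended: the density field is defined in the input camera's frame, and only the other views need non-trivial poses.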

Question about generating novel view animations

Given the R and T of the current view, how do I generate the motion trajectory of the new views on the KITTI dataset to produce the demo? Can you provide the code?
How can I get the file "./scripts/videos/trajectories/simple_movement.npy"?

These files are not in the link mentioned by you.

As shown, the 0928 directory has only two files.

Others are missing!

Originally posted by @liguopeng0923 in #20 (comment)

Please add license

Dear authors,

Great work. Please add a license. Please also check and respect the licenses of the projects this repo was derived from (pixelNeRF, monodepth2).

Thanks

Does it support live demo from the webcam?

Does it support webcam for a live depth prediction?

Question about the ground truth occupancy.

Thanks for your great work! I have a question about using the ground truth occupancy. Could you please help me with it?
I think that if objects move during a shoot, there is a possibility that the whole object or parts of it will be carved out.
When you were creating the ground truth occupancy, how did you deal with moving objects?
Thank you!

Why is the input data shape [4, 8, 3, 192, 640] when the batch size is 16?

Hi,
I am confused about the input data shape. Can you help me?
The input data shape is [4, 8, 3, 192, 640] when the batch size is 16.
The input data shape is [2, 8, 3, 192, 640] when the batch size is 8.
The input data shape is [1, 8, 3, 192, 640] when the batch size is 2.
And what does the number 8 mean?

Question about changing the rotation to get a novel view for a custom image

Missing some files in KITTI-RAW Poses in datasets/kitti_raw/orb-slam_poses.

Occupancy visualization code

Do you plan to release the visualization code that represents density as occupancy, like the density field image shown in Figure 1?

Can the KITTI raw data be replaced with the KITTI odometry dataset?

Hi,
I have the KITTI odometry dataset, but I don't have the KITTI raw data.
Can I run the BehindTheScenes code on KITTI odometry data with the KITTI raw config?
I found that SceneRF is trained on the KITTI odometry dataset.

When will the code be released?

Hello. Thank you for sharing this amazing project!

May I know when the code will be released?

I am looking forward to running this project!

How can I get the point cloud and mesh from the density?

Hi,
Can I generate a point cloud and a mesh from the density,
like the example below?
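
One simple way to get a point cloud, sketched under stated assumptions: sample the density on a regular grid and keep the points above a threshold. Here `density_fn` is a hypothetical stand-in for the trained network (a toy solid sphere is used so the example runs on its own), and the threshold value is an assumption that would need tuning for a real model.

```python
import numpy as np

def density_to_points(density_fn, x_range, y_range, z_range, resolution, threshold):
    """Sample density_fn at voxel centers and return the points above threshold."""
    axes = []
    for lo, hi in (x_range, y_range, z_range):
        n = int(round((hi - lo) / resolution))
        axes.append(lo + (np.arange(n) + 0.5) * resolution)
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    sigma = np.asarray(density_fn(grid)).reshape(-1)
    return grid[sigma > threshold]

def toy_density(points):
    """Hypothetical density: 1 inside a unit sphere at the origin, else 0."""
    return (np.linalg.norm(points, axis=-1) < 1.0).astype(np.float64)

cloud = density_to_points(toy_density, (-2, 2), (-2, 2), (-2, 2), 0.5, 0.5)
```

For a mesh, the same dense grid of densities could be passed to a marching cubes implementation such as skimage.measure.marching_cubes, using the threshold as the iso-level.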

Why are the profile results different from yours?

Hello, thank you for sharing this wonderful work.

When I use gen_img_custom.py with the pre-trained model you shared to visualize the depth map and profile map for the KITTI-Raw data, I get results like this:

They are different from the results in Figure 4 of your paper.

I am not sure what the problem is.

FisheyeToPinholeSampler

hey brummi ~
I have a question about FisheyeToPinholeSampler.
I tried to visualize the fisheye image after this sampling function.
I used the KITTI-360 data_fisheye_calibration data and the calibration results you provide, and then applied the resample function. After resampling, the image is cropped heavily and a lot of information is lost. Could I work with the fisheye images directly? Maybe I don't need this resampling? Looking forward to your reply; I really appreciate it.

Confusion about frame_sample_mode

Hi,
I am confused about the setting of frame_sample_mode.

I don't know the meaning of the magic numbers 4 and 2 in the code block.
I also don't understand the purpose of this for loop:

                for cam in range(4):
                    ids_loss += [cam * steps + i for i in range(start_from, steps, 2)]
                    ids_render += [cam * steps + i for i in range(1 - start_from, steps, 2)]
                    start_from = 1 - start_from

When I train on my own dataset with only a front camera (no stereo camera and no fisheye cameras), how should I modify these settings to fit my dataset?
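
Reading only the snippet above, 4 appears to be the number of camera streams (the loop variable is `cam`) and 2 is the stride that alternates frames between the two lists. The sketch below wraps the same loop in a function so it can be run; `n_cams` and the initial `start_from=0` are assumptions, since the snippet shows neither where 4 comes from nor how `start_from` is initialized.

```python
def split_frame_ids(steps, n_cams=4, start_from=0):
    """Alternate frame indices between a loss set and a render set.

    Frames are assumed to be ordered camera-major: index = cam * steps + i.
    """
    ids_loss, ids_render = [], []
    for cam in range(n_cams):
        ids_loss += [cam * steps + i for i in range(start_from, steps, 2)]
        ids_render += [cam * steps + i for i in range(1 - start_from, steps, 2)]
        start_from = 1 - start_from   # flip the pattern for the next camera
    return ids_loss, ids_render

print(split_frame_ids(steps=2))            # ([0, 3, 4, 7], [1, 2, 5, 6])
print(split_frame_ids(steps=2, n_cams=1))  # ([0], [1])
```

Under this reading, a single front camera would correspond to n_cams=1, and the split would alternate over time steps only.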

Train on KITTI-360 with fisheye

The images show the reconstructed depth, the reconstructed image, and the invalid regions.
The training loss does not decrease when I use fisheye data to train on KITTI-360. I have checked that I use the same config as the original one, except that I did not preprocess the dataset. The invalid regions seem to be large.
Could you give me some advice on training properly?
