I project the centers of voxels to the corresponding image, and find that it does not

The voxel GTs do not align well with the images? (cvpr2023-3d-occupancy-prediction)

Comments (4)

waveleaf27 commented on May 28, 2024

Hi shikongzxz,
The deviation in height is caused by voxelization and by the lack of z-axis translation in the nuScenes poses. As a result, after all frames are aggregated, an object can appear taller than it really is.
As we do not have access to the original data, this deviation is considered acceptable for this competition. However, we plan to work with the official nuScenes team to address the pose issue after the competition.

from cvpr2023-3d-occupancy-prediction.

waveleaf27 commented on May 28, 2024

Can you further explain why the height of an object is calculated using the number of pixels in an image? Objects of the same height in 3D space may appear at different vertical positions in the image due to projection.

To answer your question, please check the projection process. Note that the voxel ground truth is in the vehicle (ego) coordinate frame. The following pseudocode shows how to project a 3D point onto an image:

    reference_points_cam = torch.matmul(torch.matmul(lidar2img.to(torch.float32), ego2lidar.to(torch.float32)), points.to(torch.float32))
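
For readers who want to try the projection outside the training code, here is a minimal NumPy sketch of the same chain. It assumes `ego2lidar` and `lidar2img` are 4x4 homogeneous matrices (with the intrinsics already folded into `lidar2img`); the function name `project_ego_points` is made up for this example and is not part of the repository.

```python
import numpy as np


def project_ego_points(points_ego, ego2lidar, lidar2img):
    """Project (N, 3) ego-frame points to pixel coordinates.

    Assumes both transforms are 4x4 homogeneous matrices.
    Returns (uv, depth): uv is (N, 2) and is NaN for points behind the camera.
    """
    n = points_ego.shape[0]
    # Homogeneous coordinates: (N, 4)
    pts_h = np.concatenate([points_ego, np.ones((n, 1))], axis=1)
    # Same composition order as the pseudocode: lidar2img @ ego2lidar @ point
    ego2img = lidar2img @ ego2lidar
    cam = pts_h @ ego2img.T
    depth = cam[:, 2]
    # Only points in front of the camera get a valid pixel location
    valid = depth > 1e-5
    uv = np.full((n, 2), np.nan)
    uv[valid] = cam[valid, :2] / depth[valid, None]
    return uv, depth
```

The perspective divide by depth is the step that is easy to forget: the matrix product alone gives homogeneous image coordinates, not pixels.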

There are also two further reasons for deviation between point clouds and image pixels:

Voxel size. In this challenge, we provided a voxel size of 0.4m to reduce the GPU memory requirements. As you mentioned, the voxel center is generally not the true point location because of discretization error. With a smaller voxel size, the point clouds and images align better, as shown in the figure below.
nuScenes (issues-721) lacks translation in the z-axis, which makes it difficult to achieve accurate 6D localization and can lead to misalignment of point clouds when accumulating them over the entire scene.
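
To get a feel for the size of the discretization error, here is a rough sketch. It snaps coordinates to voxel centers and converts a metric offset into an approximate pixel offset with the pinhole relation (focal length times offset divided by depth). The helper names are invented for this example, and the focal length used in the usage lines is a placeholder, not a value taken from the dataset.

```python
import numpy as np

VOXEL_SIZE = 0.4  # metres, the voxel size quoted in this thread


def snap_to_voxel_center(points, voxel_size=VOXEL_SIZE, origin=0.0):
    """Replace each coordinate by the center of the voxel that contains it."""
    idx = np.floor((points - origin) / voxel_size)
    return origin + (idx + 0.5) * voxel_size


def pixel_offset(metric_offset, depth, focal_px):
    """Approximate image shift in pixels of an offset parallel to the image plane."""
    return focal_px * metric_offset / depth


# Worst case per axis is half a voxel, i.e. 0.2 m for 0.4 m voxels.
z = np.array([1.75])
err = np.abs(snap_to_voxel_center(z) - z)
print("snap error (m):", err)
# focal_px=1260.0 is a hypothetical focal length chosen for illustration only.
print("worst-case shift at 10 m (px):", pixel_offset(VOXEL_SIZE / 2, 10.0, 1260.0))
```

The shift scales inversely with depth, so nearby objects show the largest mismatch between projected voxel centers and image content.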


shikongzxz commented on May 28, 2024

I am not counting the pixels, but the voxels which are projected to the image. If you doubt that, you can visualize the GT file mini/gts/scene-0061/ca9a282c9e77460f8360f564131a8af5/labels.npz, it gives this image:

The pink voxels, which stand for a person, occupy 7 voxels.
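
The count can also be checked directly on the label grid rather than by eye. The sketch below assumes the `semantics` array is laid out as (X, Y, Z) with z as the last axis; the class id to pass in depends on the dataset's label map and is left as a parameter. Both function names are made up for this example.

```python
import numpy as np


def column_heights(semantics, class_id):
    """For each (x, y) column, count the voxels along z labelled class_id.

    Assumes z is the last axis of the grid.
    """
    return (semantics == class_id).sum(axis=2)


def max_height(semantics, class_id, voxel_size=0.4):
    """Tallest column of class_id, as (voxel count, height in metres)."""
    n = int(column_heights(semantics, class_id).max())
    return n, n * voxel_size
```

If the 7 voxels are stacked vertically, at 0.4 m each they span about 2.8 m, which is taller than a person.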

P.S. I used the following code to project the center of the voxels to the image:

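    import json
    import os

    import cv2
    import numpy as np
    from pyquaternion import Quaternion
    # Assumption: transform_matrix is the nuScenes devkit helper
    from nuscenes.utils.geometry_utils import transform_matrix

    # get_color_map_nuscenes_seg, voxel2points and get_horizontal_fov are my
    # own helpers (not shown); data_root, save_dir and voxel_size are set elsewhere.

    meta_file = "annotations.json"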
    with open(meta_file, 'r') as f:
        meta = json.load(f)

    scene_name = meta["train_split"][0]
    scene_info = meta["scene_infos"][scene_name]

    color_map = get_color_map_nuscenes_seg()

    frame_info = next(iter(scene_info.values()))
    timestamp = frame_info["timestamp"]
    ts_save_path = os.path.join(save_dir, str(timestamp))
    os.makedirs(ts_save_path, exist_ok=True)

    voxel_gt_file = os.path.join(data_root, frame_info["gt_path"])
    voxel_gt = np.load(voxel_gt_file)
    voxel_semantics = voxel_gt["semantics"]
    mask_lidar = voxel_gt["mask_lidar"]
    mask_camera = voxel_gt["mask_camera"]
    grid_shape = voxel_semantics.shape
    voxel_center, voxel_semantics = voxel2points(
        voxel_semantics, [voxel_size] * 3)
    voxel_center = voxel_center.numpy()
    voxel_semantics = voxel_semantics.numpy()
    print(voxel_center.shape)
    print(voxel_semantics.shape)

    # render each image view
    for cam_name, cam_info in frame_info["camera_sensor"].items():
        img = cv2.imread("imgs/"+cam_info["img_path"])
        print(cam_info["img_path"])
        img_h, img_w = img.shape[:2]
        K = np.array(cam_info["intrinsics"])
        # get camera horizontal fov, if not given
        fov_half = get_horizontal_fov(img_w, K)
        print(f"Half of fov of {cam_name} is {fov_half}")

        extrinsic = cam_info["extrinsic"]
        T_ego2camera = transform_matrix(
            np.array(extrinsic['translation']),
            Quaternion(extrinsic['rotation']),
            inverse=True,
        )
        pts = np.concatenate([voxel_center,
                              np.ones((voxel_center.shape[0], 1))], axis=-1)
        pts = np.squeeze(np.matmul(T_ego2camera[None, :, :],
                                   pts[:, :, None]))
        pts = pts[:, :3]
        alpha = np.abs(np.arctan2(pts[:, 0], pts[:, 2]))
        view_mask = alpha < fov_half
        pts = pts[view_mask]
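        view_semantics = voxel_semantics[view_mask]

        # apply the intrinsics, then divide by depth to get pixel coordinates
        pts = np.squeeze(np.matmul(K[None, :, :],
                                   pts[:, :, None]))
        pts /= pts[:, [2]]
        pts = pts.astype(int)
        for i in range(pts.shape[0]):
            # keep only points inside the image (row/column 0 included)
            if 0 <= pts[i, 0] < img_w and 0 <= pts[i, 1] < img_h:
                # reverse the channel order (presumably RGB to BGR for OpenCV)
                color = [int(color_map[view_semantics[i], j]) for j in
                         range(2, -1, -1)]
                cv2.circle(img, (int(pts[i, 0]), int(pts[i, 1])), 1, color, -1)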

        cv2.imwrite(os.path.join(ts_save_path, f"{cam_name}.jpg"), img)
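
The snippet above relies on two helpers that are not shown, `voxel2points` and `get_horizontal_fov`. The following NumPy stand-ins are guesses at what such helpers might do, not the poster's actual code: the names, the `range_min` and `free_label` parameters, and the assumption that the principal point sits near the image center are all hypothetical.

```python
import numpy as np


def voxel_centers(semantics, voxel_size, range_min, free_label):
    """Centers (N, 3) and labels (N,) of all voxels not marked as free.

    voxel_size and range_min are per-axis values in metres; range_min is the
    minimum corner of the grid in the ego frame (hypothetical parameter).
    """
    idx = np.argwhere(semantics != free_label)  # (N, 3) integer indices
    centers = (idx + 0.5) * np.asarray(voxel_size) + np.asarray(range_min)
    return centers, semantics[tuple(idx.T)]


def horizontal_half_fov(img_w, K):
    """Half of the horizontal field of view in radians for a pinhole K.

    Assumes the principal point is close to the image center.
    """
    return np.arctan2(img_w / 2.0, K[0, 0])
```

One thing worth checking in any such helper is the grid origin: if the minimum corner of the grid is left out, every projected point is shifted by a constant offset, which can look like a misalignment of the ground truth.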


shikongzxz commented on May 28, 2024

Thanks for replying. Looking forward to a better version of the data :)

