Comments (5)
Hey @hayoung-jeremy, try reducing the value of global_step_period under val: in the train sample yaml file until it stops giving the error; that worked for me when I was training with 350 objects.
from openlrm.
Wow, you're my savior, thank you so much! I'll try it!
Thank you @kunalkathare, I've tried with the following config, with epochs and global_step_period modified:
```yaml
...
train:
  mixed_precision: bf16
  find_unused_parameters: false
  loss:
    pixel_weight: 1.0
    perceptual_weight: 1.0
    tv_weight: 5e-4
  optim:
    lr: 4e-4
    weight_decay: 0.05
    beta1: 0.9
    beta2: 0.95
    clip_grad_norm: 1.0
  scheduler:
    type: cosine
    warmup_real_iters: 3000
  batch_size: 16
  accum_steps: 1
  epochs: 100 # MODIFIED : 60 -> 100
  debug_global_steps: null

val:
  batch_size: 4
  global_step_period: 100 # MODIFIED : 1000 -> 100
  debug_batches: null
...
```
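For context, here is a quick back-of-the-envelope check of how many optimizer steps these values produce and how often validation fires. The dataset size of 350 objects is taken from the earlier comment and is an assumption about this setup; a single process without gradient accumulation beyond the configured accum_steps is also assumed.

```python
import math

# Assumed dataset size (350 objects, per the earlier comment) -- adjust to yours.
num_samples = 350
batch_size = 16           # train.batch_size from the config above
accum_steps = 1           # optimizer steps == batches when accum_steps is 1
epochs = 100              # train.epochs after the modification
global_step_period = 100  # val.global_step_period after the modification

# Batches (and, with accum_steps=1, optimizer steps) per epoch.
steps_per_epoch = math.ceil(num_samples / batch_size) // accum_steps
total_steps = steps_per_epoch * epochs
num_validations = total_steps // global_step_period

print(steps_per_epoch, total_steps, num_validations)
```

With the original global_step_period of 1000, validation would only run every ~45 epochs under these assumptions, which is why lowering it helps it trigger at all on small datasets.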
and successfully generated a checkpoint, with the following training log:
[TRAIN STEP]loss=0.642, loss_pixel=0.0695, loss_perceptual=0.572, loss_tv=0.7, lr=1.35e-5: 100%|███████████████████████████████████████████████| 100/100 [03:24<00:00, 5.10s/it]
But the loss value seems too high. What should I modify to decrease it?
Should I increase epochs to 1000?
And what is the ideal loss value for a successfully trained checkpoint?
Could you share your case with me?
Thank you so much for your help.
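For what it's worth, the logged total appears consistent with a weighted sum of the components under the config weights above, with the perceptual term dominating; so a total near 0.6 mostly reflects the perceptual loss, not the pixel loss. A minimal check, assuming the logged loss is pixel_weight * loss_pixel + perceptual_weight * loss_perceptual + tv_weight * loss_tv (an assumption about how OpenLRM combines them):

```python
# Weights from the train config; component values from the logged step.
pixel_weight, perceptual_weight, tv_weight = 1.0, 1.0, 5e-4
loss_pixel, loss_perceptual, loss_tv = 0.0695, 0.572, 0.7

total = (pixel_weight * loss_pixel
         + perceptual_weight * loss_perceptual
         + tv_weight * loss_tv)
print(round(total, 3))  # matches the logged loss=0.642
```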
The loss value decreases when the dataset is larger; you could also increase the epochs and see whether it helps.
Thank you for the kind reply @kunalkathare!
- I don't have a large enough dataset for now; can I just duplicate the same data to increase its size?
- And I've tried increasing epochs to 1000; it also generated a checkpoint, with a loss value of about 0.3. But the inference result quality from that checkpoint is not good, as you can see in this issue. So I'm going to try increasing epochs to 10000, is that okay? If it is, what values should I adjust in train_sample.yaml?

Really great help from you, many thanks for your assistance.
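One thing worth checking before pushing epochs higher: with warmup_real_iters: 3000 and a small dataset, training can end before the warmup finishes, so the learning rate never reaches its peak and the cosine decay never kicks in. Assuming a linear warmup from 0 to lr over warmup_real_iters (the logged lr=1.35e-5 near step 100 is roughly consistent with 4e-4 * 100 / 3000 ≈ 1.33e-5), a sketch; the exact OpenLRM schedule implementation may differ:

```python
import math

lr = 4e-4                  # train.optim.lr
warmup_real_iters = 3000   # train.scheduler.warmup_real_iters
total_steps = 2200         # e.g. ~22 steps/epoch * 100 epochs for 350 objects (assumption)

def lr_at(step):
    # Hypothetical linear-warmup + cosine-decay schedule matching the config's
    # `type: cosine` with `warmup_real_iters`; illustrative only.
    if step < warmup_real_iters:
        return lr * step / warmup_real_iters
    progress = (step - warmup_real_iters) / max(1, total_steps - warmup_real_iters)
    return 0.5 * lr * (1 + math.cos(math.pi * min(progress, 1.0)))

print(lr_at(100))   # ~1.33e-5, near the logged 1.35e-5
print(lr_at(2200))  # still mid-warmup: the run ends before warmup completes
```

If that matches how the scheduler behaves, then with more epochs (hence more total steps) the schedule would finally get past warmup, which may matter as much as the epoch count itself.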