Comments (10)

taolei87 avatar taolei87 commented on August 23, 2024

Hi @gailysun

Do you have the log of the pretraining run? I think it would be helpful to see the exact running options and the training information.

from rcnn.

gailysun avatar gailysun commented on August 23, 2024

Hi @taolei87,
The following is my pre-training log:

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', batch_size=256, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', heldout='/data1/gailsun/qa/data/train_random.txt', hidden_dim=200, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, max_epoch=50, max_seq_len=100, mode=1, model='model.pkl.gz', normalize=1, order=2, outgate=0, reweight=1, test='', train='/data1/gailsun/qa/data/train_random.txt', use_anno=1, use_body=1, use_title=1)

0 empty titles ignored.
100406 pre-trained embeddings loaded.
vocab size=100410, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
heldout examples=139570
2.94957613945 to create batches
num of parameters: 20503210
p_norm: ['5.773', '5.777', '8.155', '0.402', '0.393', '5.777', '5.771', '8.166', '0.415', '0.428', '0.000', '9.131']
0/111 ... 110/111 model saved.

gailysun avatar gailysun commented on August 23, 2024

hi @taolei87 ,
Another difference is that I set THEANO_FLAGS='device=gpu,floatX=float64'. Will this affect the result?
The following is the current fine-tuning log. p_norm is always "nan", and the metrics do not change.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', average=0, batch_size=40, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='/data1/gailsun/qa/data/dev.txt', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', hidden_dim=200, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, load_pretrain='/data1/gailsun/qa/code/pt/model.pkl.gz.pkl.gz', max_epoch=50, max_seq_len=100, mode=1, normalize=1, order=2, outgate=0, reweight=1, save_model='model_d200_qa', test='/data1/gailsun/qa/data/test.txt', train='/data1/gailsun/qa/data/train_random.txt')

0 empty titles ignored.
100406 pre-trained embeddings loaded.
vocab size=100408, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
23.4045739174 to create batches
315 batches, 35312679 tokens in total, 360602 triples in total
h_title dtype: float64
h_avg_title dtype: float64
h_final dtype: float64
num of parameters: 160400
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']
0/315 ... 310/315

Epoch 0 cost=nan loss=nan MRR=63.39,63.39 |g|=nan [58.735m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
0/315 ... 310/315

Epoch 1 cost=nan loss=nan MRR=63.39,63.39 |g|=nan [58.200m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
0/315 ... 220/315

taolei87 avatar taolei87 commented on August 23, 2024

p_norm is the L2 norm of the parameters. In the fine-tuning log, it is NaN right after loading the model:
num of parameters: 160400 p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

This means the pre-training did not run correctly or hit an error somewhere. During pre-training, I also print diagnostic information such as the p_norms (here), which seems to be missing from the log you showed me.
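As a rough illustration (not the repo's actual code), the p_norm values correspond to per-parameter L2 norms, and a single NaN anywhere in a loaded parameter array propagates into its norm:

```python
import numpy as np

def param_norms(params):
    """Return the L2 norm of each parameter array, formatted like the log."""
    return ["%.3f" % np.sqrt((p.astype(np.float64) ** 2).sum()) for p in params]

# A NaN anywhere in a parameter shows up as 'nan' in its norm,
# which is how a bad checkpoint produces p_norm: ['nan', 'nan', ...].
params = [np.ones((2, 2)), np.array([[1.0, np.nan]])]
print(param_norms(params))  # → ['2.000', 'nan']
```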

Could you attach or send me the full log of the pre-training run? I see that the dev set is empty (--dev option). Note that the model-saving logic is inside the dev evaluation part (here).

taolei87 avatar taolei87 commented on August 23, 2024

It's better to use "float32" by default. Most GPUs only support float32, and it seems Theano doesn't support float64 in GPU mode. Here's what I found on this webpage:

You will also need to set floatX to be float32, along with your path to CUDA. Theano does not yet support float64 (it will soon), so float32 must, for now, be assigned to floatX.
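In practice, that means launching with float32 in the flags, along these lines (the script name here is only a placeholder):

```shell
# floatX=float32 keeps shared variables GPU-compatible;
# float64 ops would otherwise fall back to CPU or fail outright.
THEANO_FLAGS='device=gpu,floatX=float32' python main.py
```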

gailysun avatar gailysun commented on August 23, 2024

hi, @taolei87 ,
I really appreciate your timely answers, thank you very much. The following is the current pre-training log, with the arguments set as you suggested. During pre-training, p_norm becomes "nan" after the first epoch. Hope you can help.

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', batch_size=256, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='/data1/gailsun/qa/data/dev.txt', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', heldout='/data1/gailsun/qa/data/heldout.txt', hidden_dim=400, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, max_epoch=50, max_seq_len=100, mode=1, model='model_pt_d400', normalize=1, order=2, outgate=0, reweight=1, test='/data1/gailsun/qa/data/test.txt', train='/data1/gailsun/qa/data/train_random.txt', use_anno=1, use_body=1, use_title=1)

0 empty titles ignored.
WARNING: n_d (400) != init word vector size (200). Use 200 instead.
100406 pre-trained embeddings loaded.
vocab size=100410, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
heldout examples=1989
3.02918314934 to create batches
num of parameters: 41066010
p_norm: ['8.165', '8.170', '14.155', '0.553', '0.562', '8.160', '8.164', '14.104', '0.602', '0.598', '0.000', '9.128']
0/732 ... 730/732 model saved.

Epoch 0 cost=nan loss=nan nan MRR=63.39,63.39 PPL=nan |g|=nan [39.961m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
0/732 ... 730/732

Epoch 1 cost=nan loss=nan nan MRR=63.39,63.39 PPL=nan |g|=nan [43.745m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+

taolei87 avatar taolei87 commented on August 23, 2024

Hi @gailysun

The training options look fine to me. I used to see a NaN issue at some point, but it disappeared after switching the Theano version.

The version on my machine is: 0.7.0.dev-8d3a67b73fda49350d9944c9a24fc9660131861c; but I think 0.8.0 should also work.

What's your Theano version? It's a bit late in Boston time now. I can try your version on my machine later.

gailysun avatar gailysun commented on August 23, 2024

Hi, @taolei87 ,
My Theano version is 0.8.2. Thank you very much.

taolei87 avatar taolei87 commented on August 23, 2024

@gailysun The error seems to come from a later commit I did on parameter initialization. See here.

Could you try changing "0.00" to "0.001"? The NaN issue disappeared on my machine after this fix.
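A hedged sketch of the kind of change being discussed (the function and variable names here are hypothetical, not the repo's exact code; the point is that the initialization bound must be non-zero):

```python
import numpy as np

def random_init(shape, bound=0.001, rng=None):
    """Uniform weight init in [-bound, bound). With bound=0.0 every
    weight is exactly zero, which can produce NaN downstream
    (e.g. when normalizing a zero-norm hidden vector)."""
    rng = rng or np.random.RandomState(1234)
    return rng.uniform(low=-bound, high=bound, size=shape)

W = random_init((200, 200))          # non-degenerate weights
assert 0.0 < np.abs(W).max() < 0.001
```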

gailysun avatar gailysun commented on August 23, 2024

Hi @taolei87 ,
Yes, after revising W_val to 0.001, the code runs successfully. Thank you very much.
