Comments (5)

dennybritz commented on June 3, 2024

"Couldn't find trained model at /home/admin1/exp/projects/chatbot-retrieval/runs/1471797260" is a strange error. I'm not sure why it would look for a trained model.

Try removing the EvaluationMonitor, or try passing first_n_steps=-1 to the monitor, and see if that solves it.
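One possible reading of why first_n_steps matters, sketched in plain Python: if the monitor also fires during the first few steps, an evaluation could be requested at step 1, before any checkpoint has been written. The function below is a simplified stand-in for an "every N steps" scheduling rule, not TensorFlow's actual implementation. The name firing_steps, the default of first_n_steps=1, and the value every_n_steps=300 are assumptions made for illustration only.

```python
def firing_steps(total_steps, every_n_steps, first_n_steps):
    """Return the steps at which a periodic monitor would fire.

    Simplified stand-in for an "every N steps" rule; this is NOT
    TensorFlow's code, only an illustration of the idea.
    """
    fired = []
    last_fired = 0  # assumption: the period is counted from step 0
    for step in range(1, total_steps + 1):
        early = step <= first_n_steps                   # forced firing on the first steps
        periodic = step >= last_fired + every_n_steps   # regular every-N firing
        if early or periodic:
            fired.append(step)
            last_fired = step
    return fired


# every_n_steps=300 is a hypothetical value, not taken from the project's flags.
with_early = firing_steps(700, every_n_steps=300, first_n_steps=1)
without_early = firing_steps(700, every_n_steps=300, first_n_steps=-1)

print(with_early)     # [1, 301, 601]
print(without_early)  # [300, 600]
```

Under these assumptions, first_n_steps=1 triggers the monitor at step 1, while first_n_steps=-1 delays the first trigger until a full period has passed. Whether this is the actual cause of the error here is not confirmed.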

from chatbot-retrieval.

amirj commented on June 3, 2024

Removing the EvaluationMonitor and calling the estimator as follows:
estimator.fit(input_fn=input_fn_train, steps=None, monitors=[])
solves the problem:

$ python udc_train.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
WARNING:tensorflow:Setting feature info to {'utterance': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(160)]), is_sparse=False), 'utterance_len': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False), 'context': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(160)]), is_sparse=False), 'context_len': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False)}
WARNING:tensorflow:Setting targets info to TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False)
INFO:tensorflow:No glove/vocab path specificed, starting with random embeddings.
INFO:tensorflow:Create CheckpointSaver
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:02:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 5858 get requests, put_count=3526 evicted_count=1000 eviction_rate=0.283607 and unsatisfied allocation rate=0.585865
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Step 1: loss = 4.48393
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1019 get requests, put_count=2032 evicted_count=1000 eviction_rate=0.492126 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1020 get requests, put_count=2037 evicted_count=1000 eviction_rate=0.490918 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1020 get requests, put_count=2043 evicted_count=1000 eviction_rate=0.489476 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1910 get requests, put_count=3940 evicted_count=2000 eviction_rate=0.507614 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7516 get requests, put_count=7598 evicted_count=3000 eviction_rate=0.394841 and unsatisfied allocation rate=0.39356
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 449 to 493
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1059 get requests, put_count=2118 evicted_count=1000 eviction_rate=0.472144 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7976 get requests, put_count=8510 evicted_count=3000 eviction_rate=0.352526 and unsatisfied allocation rate=0.319082
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 871 to 958
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 621 get requests, put_count=1748 evicted_count=1000 eviction_rate=0.572082 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7592 get requests, put_count=8161 evicted_count=1000 eviction_rate=0.122534 and unsatisfied allocation rate=0.0893045
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 2725 to 2997
INFO:tensorflow:Step 101: loss = 0.95188
INFO:tensorflow:Step 201: loss = 0.731247
INFO:tensorflow:Saving checkpoints for 300 into /home/admin1/exp/projects/chatbot-retrieval/runs/1471802620/model.ckpt.
INFO:tensorflow:Step 301: loss = 0.674725
INFO:tensorflow:Step 401: loss = 0.71134
INFO:tensorflow:Step 501: loss = 0.690085
INFO:tensorflow:Saving checkpoints for 600 into /home/admin1/exp/projects/chatbot-retrieval/runs/1471802620/model.ckpt.
INFO:tensorflow:Step 601: loss = 0.685092
INFO:tensorflow:Step 701: loss = 0.686206
INFO:tensorflow:Step 801: loss = 0.685859
INFO:tensorflow:Saving checkpoints for 900 into /home/admin1/exp/projects/chatbot-retrieval/runs/1471802620/model.ckpt.
INFO:tensorflow:Step 901: loss = 0.708556
INFO:tensorflow:Step 1001: loss = 0.670678
INFO:tensorflow:Step 1101: loss = 0.658413
INFO:tensorflow:Saving checkpoints for 1200 into /home/admin1/exp/projects/chatbot-retrieval/runs/1471802620/model.ckpt.
INFO:tensorflow:Step 1201: loss = 0.652786
INFO:tensorflow:Step 1301: loss = 0.650005

EvaluationMonitor has a comment on this; I think it may be a problem with a newer version of TensorFlow. Logging and Monitoring Basics may be helpful.

dennybritz commented on June 3, 2024

Try passing first_n_steps=-1

amirj commented on June 3, 2024

Passing first_n_steps=-1 as follows:
eval_monitor = EvaluationMonitor(every_n_steps=FLAGS.eval_every, first_n_steps=-1)
also solves the problem:

$ python udc_train.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
WARNING:tensorflow:Setting feature info to {'context_len': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False), 'utterance_len': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False), 'context': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(160)]), is_sparse=False), 'utterance': TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(160)]), is_sparse=False)}
WARNING:tensorflow:Setting targets info to TensorSignature(dtype=tf.int64, shape=TensorShape([Dimension(128), Dimension(1)]), is_sparse=False)
INFO:tensorflow:No glove/vocab path specificed, starting with random embeddings.
INFO:tensorflow:Create CheckpointSaver
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:02:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 5878 get requests, put_count=3567 evicted_count=1000 eviction_rate=0.280348 and unsatisfied allocation rate=0.580299
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Step 1: loss = 4.48393
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1016 get requests, put_count=2029 evicted_count=1000 eviction_rate=0.492854 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1016 get requests, put_count=2033 evicted_count=1000 eviction_rate=0.491884 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1020 get requests, put_count=2043 evicted_count=1000 eviction_rate=0.489476 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1765 get requests, put_count=3795 evicted_count=2000 eviction_rate=0.527009 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7500 get requests, put_count=7573 evicted_count=3000 eviction_rate=0.396144 and unsatisfied allocation rate=0.3956
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 449 to 493
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1060 get requests, put_count=2119 evicted_count=1000 eviction_rate=0.471921 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1085 get requests, put_count=2172 evicted_count=1000 eviction_rate=0.460405 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 458 get requests, put_count=1585 evicted_count=1000 eviction_rate=0.630915 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 14841 get requests, put_count=14728 evicted_count=1000 eviction_rate=0.0678979 and unsatisfied allocation rate=0.091638
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 2725 to 2997
INFO:tensorflow:Step 101: loss = 1.27205
INFO:tensorflow:Step 201: loss = 0.845954
INFO:tensorflow:Saving checkpoints for 300 into /home/admin1/exp/projects/chatbot-retrieval/runs/1471804077/model.ckpt.
INFO:tensorflow:Step 301: loss = 0.728196
INFO:tensorflow:Step 401: loss = 0.725865

dennybritz commented on June 3, 2024

Fixed this in the code, thanks for reporting it!
