Comments (14)

 avatar commented on May 25, 2024 1

Okay, I cleared my whole workspace and rebuilt caffe step by step.
And everything seems alright! Maybe it was a problem with my environment.

raw memory 3217966104 opt memory 1254105756
I0504 15:39:59.687398 2912 solver.cpp:47] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet50.caffemodel
Solving...
I0504 15:40:03.385043 2912 solver.cpp:240] Iteration 0, loss = 11.4597
I0504 15:40:03.385097 2912 solver.cpp:255] Train net output #0: det_accuracy = 0.679688
I0504 15:40:03.385118 2912 solver.cpp:255] Train net output #1: det_loss = 0.687115 (* 1 = 0.687115 loss)
I0504 15:40:03.385125 2912 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:40:03.385131 2912 solver.cpp:255] Train net output #3: id_loss = 9.26487 (* 1 = 9.26487 loss)
I0504 15:40:03.385136 2912 solver.cpp:255] Train net output #4: loss_bbox = 0.144 (* 1 = 0.144 loss)
I0504 15:40:03.385143 2912 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0759669 (* 1 = 0.0759669 loss)
I0504 15:40:03.385149 2912 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693476 (* 1 = 0.693476 loss)
I0504 15:40:03.385159 2912 solver.cpp:640] Iteration 0, lr = 0.001
I0504 15:41:01.273102 2912 solver.cpp:240] Iteration 20, loss = 10.1008
I0504 15:41:01.273152 2912 solver.cpp:255] Train net output #0: det_accuracy = 0.75
I0504 15:41:01.273165 2912 solver.cpp:255] Train net output #1: det_loss = 0.510666 (* 1 = 0.510666 loss)
I0504 15:41:01.273172 2912 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:41:01.273181 2912 solver.cpp:255] Train net output #3: id_loss = 9.63118 (* 1 = 9.63118 loss)
I0504 15:41:01.273190 2912 solver.cpp:255] Train net output #4: loss_bbox = 0.462268 (* 1 = 0.462268 loss)
I0504 15:41:01.273205 2912 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0575725 (* 1 = 0.0575725 loss)
I0504 15:41:01.273213 2912 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.678677 (* 1 = 0.678677 loss)
I0504 15:41:01.273222 2912 solver.cpp:640] Iteration 20, lr = 0.001
I0504 15:41:59.257951 2912 solver.cpp:240] Iteration 40, loss = 9.96599
I0504 15:41:59.258055 2912 solver.cpp:255] Train net output #0: det_accuracy = 0.882812
I0504 15:41:59.258088 2912 solver.cpp:255] Train net output #1: det_loss = 0.266811 (* 1 = 0.266811 loss)
I0504 15:41:59.258111 2912 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:41:59.258132 2912 solver.cpp:255] Train net output #3: id_loss = 9.3155 (* 1 = 9.3155 loss)
I0504 15:41:59.258153 2912 solver.cpp:255] Train net output #4: loss_bbox = 0.0908071 (* 1 = 0.0908071 loss)
I0504 15:41:59.258173 2912 solver.cpp:255] Train net output #5: rpn_bbox_loss = 1.06972 (* 1 = 1.06972 loss)
I0504 15:41:59.258193 2912 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.679143 (* 1 = 0.679143 loss)
I0504 15:41:59.258219 2912 solver.cpp:640] Iteration 40, lr = 0.001

//-------------------------------------------------------------------------------//

Thank you very much!

from person_search.

Cysu avatar Cysu commented on May 25, 2024

The id_loss = 87.3365 usually means the softmax loss is computing -log(FLT_MIN) here, which implies the probability is smaller than FLT_MIN.
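As a quick check of that number, here is a minimal sketch. It assumes the loss clamps the true-label probability at FLT_MIN before taking the log, as described above; clamped_softmax_loss is a hypothetical helper name used only for illustration, not a Caffe function.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>

// Per-sample softmax loss with the probability clamped at FLT_MIN.
// Hypothetical helper for illustration; not part of Caffe.
float clamped_softmax_loss(float prob_of_true_label) {
  assert(!(prob_of_true_label > 1.0f) && "a probability cannot exceed 1");
  // Any probability below FLT_MIN (including 0) is replaced by FLT_MIN,
  // so the loss can never exceed -log(FLT_MIN).
  return -std::log(std::max(prob_of_true_label, FLT_MIN));
}
```

Since FLT_MIN is 2^-126, clamped_softmax_loss(0.0f) evaluates to 126 * ln 2, which is about 87.3365.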

However, this probability is computed as softmax(Wx), where W is initialized as a zero matrix in the first iteration, so every class should get the same probability.
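To illustrate why a zero W should not produce 87.3365: with all-zero logits the softmax is uniform, so the loss for any label is log(N) for N classes. The sketch below works on a plain std::vector rather than Caffe blobs, and softmax and softmax_loss are illustrative names, not Caffe API. N = 10532 is an assumption taken from the prob shape printed further down in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one row of logits.
std::vector<double> softmax(const std::vector<double>& logits) {
  assert(!logits.empty());
  const double max_logit = *std::max_element(logits.begin(), logits.end());
  std::vector<double> probs(logits.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  for (double& p : probs) p /= sum;
  return probs;
}

// Softmax loss for one sample: -log of the probability of the true label.
double softmax_loss(const std::vector<double>& logits, std::size_t label) {
  assert(label < logits.size());
  return -std::log(softmax(logits)[label]);
}
```

For 10532 zero logits, softmax_loss returns log(10532), roughly 9.262. That is close to the id_loss of 9.26487 in the working log at the top of the thread and far from 87.3365.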

As I could not reproduce this situation on my machine, could you please do me a favor and add the following print function to caffe/src/caffe/layers/softmax_loss_layer.cu:

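#include <cfloat>  // FLT_MAX
#include <cstdio>  // printf

// Debug helper: print the first two dimensions of `prob`, then its
// minimum and maximum values together with their flat indices.
template <typename Dtype>
void debug_print(const Blob<Dtype>& prob) {
  printf("prob shape %d x %d\n", prob.shape(0), prob.shape(1));

  // Sentinels: any finite value replaces them on the first comparison.
  Dtype minval = (Dtype)FLT_MAX;
  Dtype maxval = -(Dtype)FLT_MAX;
  int minidx = 0;
  int maxidx = 0;

  // Read the values on the host side.
  const Dtype* data = prob.cpu_data();
  // Note: comparisons against NaN are always false, so NaN entries
  // never update minval or maxval.
  for (int i = 0; i < prob.count(); ++i) {
    if (data[i] <= minval) {
      minval = data[i];
      minidx = i;
    }
    if (data[i] >= maxval) {
      maxval = data[i];
      maxidx = i;
    }
  }

  printf("prob min = %.6f, at index %d\n", minval, minidx);
  printf("prob max = %.6f, at index %d\n", maxval, maxidx);
}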

Also add debug_print(prob_); after this line.

Then recompile caffe, rerun the experiment, and check the output printed before Iteration 0. Thank you very much in advance.

 avatar commented on May 25, 2024

Thank you for your reply.
I get this:
I0430 17:20:21.433068 1580 solver.cpp:47] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet50.caffemodel
Solving...
prob shape 1 x 2
prob min = 0.500000, at index 39689
prob max = 0.500000, at index 39689
prob shape 128 x 2
prob min = 0.500000, at index 255
prob max = 0.500000, at index 255
prob shape 128 x 10532
prob min = 340282346638528859811704183484516925440.000000, at index 0
prob max = -340282346638528859811704183484516925440.000000, at index 0
prob shape 1 x 2
prob min = 0.500000, at index 39689
prob max = 0.500000, at index 39689
prob shape 128 x 2
prob min = 0.500000, at index 255
prob max = 0.500000, at index 255
prob shape 128 x 10532
prob min = 340282346638528859811704183484516925440.000000, at index 0
prob max = -340282346638528859811704183484516925440.000000, at index 0
I0430 17:20:25.128804 1580 solver.cpp:240] Iteration 0, loss = 88.9257
I0430 17:20:25.128856 1580 solver.cpp:255] Train net output #0: det_accuracy = 0.117188
I0430 17:20:25.128872 1580 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0430 17:20:25.128881 1580 solver.cpp:255] Train net output #2: id_accuracy = 0
I0430 17:20:25.128893 1580 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0430 17:20:25.128903 1580 solver.cpp:255] Train net output #4: loss_bbox = 0.307287 (* 1 = 0.307287 loss)
I0430 17:20:25.128913 1580 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0201967 (* 1 = 0.0201967 loss)
I0430 17:20:25.128937 1580 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0430 17:20:25.128958 1580 solver.cpp:640] Iteration 0, lr = 0.001

Cysu avatar Cysu commented on May 25, 2024

That's quite weird. It seems that the memory is being overwritten. I'm not sure whether this is caused by the memory optimization. Could you please try changing both of these two lines to false and rerun the experiment?
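One possible reading of the 128 x 10532 output above, offered as an interpretation rather than something confirmed in this thread: the scan printed its initial sentinels (FLT_MAX and -FLT_MAX) at index 0, which is what happens when no element ever passes either comparison. That is the case when every entry is NaN, because any ordered comparison with NaN is false. The sketch below repeats the same scan over a plain std::vector and additionally counts NaN entries; MinMax and scan are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <vector>

struct MinMax {
  float minval;
  float maxval;
  std::size_t nan_count;
};

// Same min/max scan as debug_print, but over a std::vector<float>,
// with an extra counter for NaN entries.
MinMax scan(const std::vector<float>& data) {
  assert(!data.empty() && "nothing to scan");
  MinMax r{FLT_MAX, -FLT_MAX, 0};
  for (float v : data) {
    if (std::isnan(v)) ++r.nan_count;
    if (v <= r.minval) r.minval = v;  // false for NaN
    if (v >= r.maxval) r.maxval = v;  // false for NaN
  }
  return r;
}
```

On an all-NaN input the result keeps minval == FLT_MAX and maxval == -FLT_MAX, matching the shape of the printed output. On the healthy 0.5 probabilities both values come back as 0.5. Adding a NaN count to the debug function would be one way to tell the two cases apart.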

 avatar commented on May 25, 2024

I get this:

I0430 18:52:22.725214 7339 net.cpp:300] Network initialization done.
I0430 18:52:22.725220 7339 net.cpp:301] Memory required for data: 2011680780
I0430 18:52:22.725783 7339 solver.cpp:47] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet50.caffemodel
Solving...
prob shape 1 x 2
prob min = 0.500000, at index 39689
prob max = 0.500000, at index 39689
prob shape 128 x 2
prob min = 0.500000, at index 255
prob max = 0.500000, at index 255
prob shape 128 x 10532
prob min = 340282346638528859811704183484516925440.000000, at index 0
prob max = -340282346638528859811704183484516925440.000000, at index 0
prob shape 1 x 2
prob min = 0.500000, at index 39689
prob max = 0.500000, at index 39689
prob shape 128 x 2
prob min = 0.500000, at index 255
prob max = 0.500000, at index 255
prob shape 128 x 10532
prob min = 340282346638528859811704183484516925440.000000, at index 0
prob max = -340282346638528859811704183484516925440.000000, at index 0
I0430 18:52:26.542032 7339 solver.cpp:240] Iteration 0, loss = 89.1366
I0430 18:52:26.542096 7339 solver.cpp:255] Train net output #0: det_accuracy = 0.0390625
I0430 18:52:26.542109 7339 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0430 18:52:26.542114 7339 solver.cpp:255] Train net output #2: id_accuracy = 0
I0430 18:52:26.542119 7339 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0430 18:52:26.542125 7339 solver.cpp:255] Train net output #4: loss_bbox = 0.081182 (* 1 = 0.081182 loss)
I0430 18:52:26.542142 7339 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0845679 (* 1 = 0.0845679 loss)
I0430 18:52:26.542148 7339 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0430 18:52:26.542155 7339 solver.cpp:640] Iteration 0, lr = 0.001

 avatar commented on May 25, 2024

Hello, I've changed both of these lines to false,
but it doesn't seem to work. The log is above.
Thanks

Cysu avatar Cysu commented on May 25, 2024

Thank you very much for your cooperation. I'm aware of this, but currently I have no idea why it happens. I suspect it is caused by some mismatch between the libraries used and the hardware. Could you please check the output of the following command:

ldd caffe/build/install/bin/caffe

 avatar commented on May 25, 2024

Okay, I get:
ldd caffe/build/install/bin/caffe
linux-vdso.so.1 => (0x00007ffc7dfeb000)
libcaffe.so => /root/Workspace/person_search/caffe/build/install/lib/libcaffe.so (0x00007f510c691000)
libglog.so.0 => /usr/lib/x86_64-linux-gnu/libglog.so.0 (0x00007f510c444000)
libgflags.so.2 => /usr/lib/x86_64-linux-gnu/libgflags.so.2 (0x00007f510c224000)
libopencv_core.so.3.1 => /root/Util/miniconda/lib/libopencv_core.so.3.1 (0x00007f510b5ba000)
libpython2.7.so.1.0 => /root/Util/miniconda/lib/libpython2.7.so.1.0 (0x00007f510b1bf000)
libmpi_cxx.so.1 => /usr/lib/libmpi_cxx.so.1 (0x00007f510afa5000)
libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007f510ac24000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f510a91f000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f510a709000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f510a344000)
libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x00007f510a13f000)
libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x00007f5109f28000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5109d0a000)
libprotobuf.so.8 => /usr/lib/x86_64-linux-gnu/libprotobuf.so.8 (0x00007f5109a07000)
libhdf5_hl.so.7 => /usr/lib/x86_64-linux-gnu/libhdf5_hl.so.7 (0x00007f51097d7000)
libhdf5.so.7 => /usr/lib/x86_64-linux-gnu/libhdf5.so.7 (0x00007f510933b000)
liblmdb.so.0 => /usr/lib/x86_64-linux-gnu/liblmdb.so.0 (0x00007f5109128000)
libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x00007f5108edb000)
libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007f5108c7d000)
libcurand.so.7.5 => /usr/local/cuda/lib64/libcurand.so.7.5 (0x00007f5105414000)
libcublas.so.7.5 => /usr/local/cuda/lib64/libcublas.so.7.5 (0x00007f5103b35000)
libcudnn.so.5 => /usr/local/lib/libcudnn.so.5 (0x00007f50fff2a000)
libopencv_imgproc.so.3.1 => /root/Util/miniconda/lib/libopencv_imgproc.so.3.1 (0x00007f50fe5ba000)
libopencv_imgcodecs.so.3.1 => /root/Util/miniconda/lib/libopencv_imgcodecs.so.3.1 (0x00007f50fe0d2000)
libcblas.so.3 => /usr/lib/libcblas.so.3 (0x00007f50fdeb1000)
libboost_python-py27.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.55.0 (0x00007f50fdc63000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f50fd95d000)
libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f50fd741000)
libz.so.1 => /root/Util/miniconda/lib/./libz.so.1 (0x00007f50fd52b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50fd326000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f50fd11e000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f50fcf0f000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f50fcd0b000)
libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f50fcacb000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f50fc8c1000)
/lib64/ld-linux-x86-64.so.2 (0x000055cdb363f000)
libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x00007f50fc6ba000)
libjpeg.so.8 => /root/Util/miniconda/lib/./libjpeg.so.8 (0x00007f50fc480000)
libpng16.so.16 => /root/Util/miniconda/lib/./libpng16.so.16 (0x00007f50fc23e000)
libtiff.so.5 => /root/Util/miniconda/lib/./libtiff.so.5 (0x00007f50fbfc0000)
libatlas.so.3 => /usr/lib/libatlas.so.3 (0x00007f50fba2d000)
libgfortran.so.3 => /root/Util/miniconda/lib/libgfortran.so.3 (0x00007f50fb723000)
liblzma.so.5 => /root/Util/miniconda/lib/liblzma.so.5 (0x00007f50fb4fe000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f50fb2f2000)

Cysu avatar Cysu commented on May 25, 2024

I'm not sure whether the openmpi library is causing the problem. The binary links to the system library /usr/lib/libmpi.so.1, and I suspect the system's openmpi version might be too old. Could you please install a newer openmpi with:

wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.6.tar.gz
tar xf openmpi-1.10.6.tar.gz
cd openmpi-1.10.6
./configure --with-cuda=/usr/local/cuda --enable-mpi-thread-multiple
make -j8
sudo make install

It should be compiled and installed to /usr/local. If this succeeds (you may need to restart the terminal), running which mpirun should point to /usr/local/bin/mpirun, and mpirun --version should print mpirun (Open MPI) 1.10.6.

After that, you may remove the caffe/build directory and try to rebuild it. This time

ldd caffe/build/install/bin/caffe | grep mpi

should point to something like /usr/local/lib/libmpi.so.
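If you would rather check this programmatically than read the grep output, the following is a rough sketch that parses ldd output text and reports where a library resolves. The function names resolved_path and links_local_mpi are hypothetical, and the sketch assumes the usual "name => path (address)" layout that ldd prints in this thread.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the resolved path from the first ldd line whose library name
// starts with `prefix`, or an empty string if no such line has a path.
std::string resolved_path(const std::string& ldd_output,
                          const std::string& prefix) {
  assert(!prefix.empty());
  std::istringstream lines(ldd_output);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string name, arrow, path;
    if (!(fields >> name >> arrow >> path)) continue;
    if (arrow != "=>") continue;
    if (name.compare(0, prefix.size(), prefix) != 0) continue;
    // Entries such as linux-vdso carry only an address, not a path.
    if (path.empty() || path[0] != '/') continue;
    return path;
  }
  return "";
}

// True if libmpi resolves to the copy installed under /usr/local/lib.
bool links_local_mpi(const std::string& ldd_output) {
  return resolved_path(ldd_output, "libmpi.so").rfind("/usr/local/lib/", 0) == 0;
}
```

Feeding it the captured output of ldd caffe/build/install/bin/caffe should return true from links_local_mpi once the rebuilt binary picks up the newly installed openmpi, and false while it still resolves to /usr/lib/libmpi.so.1.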

 avatar commented on May 25, 2024

Sorry for the late reply.
I get:
~/Workspace/person_search# ldd caffe/build/install/bin/caffe | grep mpi
libmpi_cxx.so.1 => /usr/local/lib/libmpi_cxx.so.1 (0x00007fcd71d96000)
libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007fcd71abc000)
and:
// not using memory optimization:
I0504 15:51:59.577517 28784 solver.cpp:240] Iteration 0, loss = 45.2576
I0504 15:51:59.577574 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.0859375
I0504 15:51:59.577589 28784 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0504 15:51:59.577594 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:51:59.577607 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:51:59.577615 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.115893 (* 1 = 0.115893 loss)
I0504 15:51:59.577620 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.275724 (* 1 = 0.275724 loss)
I0504 15:51:59.577626 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0504 15:51:59.577635 28784 solver.cpp:640] Iteration 0, lr = 0.001
I0504 15:52:50.221169 28784 solver.cpp:240] Iteration 20, loss = 76.5974
I0504 15:52:50.221225 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.929688
I0504 15:52:50.221236 28784 solver.cpp:255] Train net output #1: det_loss = 0.615041 (* 1 = 0.615041 loss)
I0504 15:52:50.221242 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:52:50.221248 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:52:50.221254 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.118207 (* 1 = 0.118207 loss)
I0504 15:52:50.221261 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.00628357 (* 1 = 0.00628357 loss)
I0504 15:52:50.221266 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.680508 (* 1 = 0.680508 loss)
I0504 15:52:50.221278 28784 solver.cpp:640] Iteration 20, lr = 0.001
I0504 15:53:40.769912 28784 solver.cpp:240] Iteration 40, loss = 74.1462
I0504 15:53:40.769966 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.96875
I0504 15:53:40.769979 28784 solver.cpp:255] Train net output #1: det_loss = 0.508402 (* 1 = 0.508402 loss)
I0504 15:53:40.769996 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:53:40.770014 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:53:40.770020 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.0521594 (* 1 = 0.0521594 loss)
I0504 15:53:40.770025 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.00265183 (* 1 = 0.00265183 loss)
I0504 15:53:40.770031 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.661216 (* 1 = 0.661216 loss)
I0504 15:53:40.770056 28784 solver.cpp:640] Iteration 40, lr = 0.001
I0504 15:54:31.513696 28784 solver.cpp:240] Iteration 60, loss = 76.8197
I0504 15:54:31.513744 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.9375
I0504 15:54:31.513756 28784 solver.cpp:255] Train net output #1: det_loss = 0.455823 (* 1 = 0.455823 loss)
I0504 15:54:31.513762 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:54:31.513768 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:54:31.513775 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.123411 (* 1 = 0.123411 loss)
I0504 15:54:31.513782 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0825794 (* 1 = 0.0825794 loss)
I0504 15:54:31.513787 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.645016 (* 1 = 0.645016 loss)
I0504 15:54:31.513795 28784 solver.cpp:640] Iteration 60, lr = 0.001
I0504 15:55:21.943455 28784 solver.cpp:240] Iteration 80, loss = 76.5753
I0504 15:55:21.943536 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.804688
I0504 15:55:21.943558 28784 solver.cpp:255] Train net output #1: det_loss = 0.522056 (* 1 = 0.522056 loss)
I0504 15:55:21.943570 28784 solver.cpp:255] Train net output #2: id_accuracy = -nan
I0504 15:55:21.943589 28784 solver.cpp:255] Train net output #3: id_loss = 0 (* 1 = 0 loss)
I0504 15:55:21.943603 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.355632 (* 1 = 0.355632 loss)
I0504 15:55:21.943614 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.282069 (* 1 = 0.282069 loss)
I0504 15:55:21.943630 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.650235 (* 1 = 0.650235 loss)
I0504 15:55:21.943645 28784 solver.cpp:640] Iteration 80, lr = 0.001
I0504 15:56:12.627272 28784 solver.cpp:240] Iteration 100, loss = 76.715
I0504 15:56:12.627333 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.945312
I0504 15:56:12.627346 28784 solver.cpp:255] Train net output #1: det_loss = 0.369229 (* 1 = 0.369229 loss)
I0504 15:56:12.627351 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:56:12.627357 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:56:12.627363 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.156478 (* 1 = 0.156478 loss)
I0504 15:56:12.627369 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.00762856 (* 1 = 0.00762856 loss)
I0504 15:56:12.627374 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.605421 (* 1 = 0.605421 loss)
I0504 15:56:12.627393 28784 solver.cpp:640] Iteration 100, lr = 0.001
I0504 15:57:03.166474 28784 solver.cpp:240] Iteration 120, loss = 75.7795
I0504 15:57:03.166532 28784 solver.cpp:255] Train net output #0: det_accuracy = 0.75
I0504 15:57:03.166543 28784 solver.cpp:255] Train net output #1: det_loss = 0.562358 (* 1 = 0.562358 loss)
I0504 15:57:03.166559 28784 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:57:03.166566 28784 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:57:03.166584 28784 solver.cpp:255] Train net output #4: loss_bbox = 0.55591 (* 1 = 0.55591 loss)
I0504 15:57:03.166591 28784 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.105171 (* 1 = 0.105171 loss)
I0504 15:57:03.166597 28784 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.619909 (* 1 = 0.619909 loss)
I0504 15:57:03.166605 28784 solver.cpp:640] Iteration 120, lr = 0.001

// using memory optimization:
raw memory 3217966104 opt memory 1254105756
I0504 15:59:54.706725 29344 solver.cpp:47] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet50.caffemodel
Solving...
I0504 15:59:58.044013 29344 solver.cpp:240] Iteration 0, loss = 88.7875
I0504 15:59:58.044054 29344 solver.cpp:255] Train net output #0: det_accuracy = 0.0078125
I0504 15:59:58.044065 29344 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0504 15:59:58.044070 29344 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 15:59:58.044076 29344 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 15:59:58.044082 29344 solver.cpp:255] Train net output #4: loss_bbox = 0 (* 1 = 0 loss)
I0504 15:59:58.044087 29344 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0125439 (* 1 = 0.0125439 loss)
I0504 15:59:58.044092 29344 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0504 15:59:58.044098 29344 solver.cpp:640] Iteration 0, lr = 0.001
I0504 16:00:51.140867 29344 solver.cpp:240] Iteration 20, loss = nan
I0504 16:00:51.140920 29344 solver.cpp:255] Train net output #0: det_accuracy = 0.914062
I0504 16:00:51.140938 29344 solver.cpp:255] Train net output #1: det_loss = 0.622575 (* 1 = 0.622575 loss)
I0504 16:00:51.140946 29344 solver.cpp:255] Train net output #2: id_accuracy = 0
I0504 16:00:51.140956 29344 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0504 16:00:51.140965 29344 solver.cpp:255] Train net output #4: loss_bbox = nan (* 1 = nan loss)
I0504 16:00:51.140974 29344 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0211178 (* 1 = 0.0211178 loss)
I0504 16:00:51.140983 29344 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.679992 (* 1 = 0.679992 loss)
I0504 16:00:51.140993 29344 solver.cpp:640] Iteration 20, lr = 0.001

I have no idea why this happens.

Cysu avatar Cysu commented on May 25, 2024

That's really weird. Does /usr/local/lib/libcudnn.so.5 link to libcudnn.so.5.1.3?

 avatar commented on May 25, 2024

I checked:
ldd caffe/build/install/bin/caffe
linux-vdso.so.1 => (0x00007ffe01bfe000)
libcaffe.so => /root/Workspace/person_search/caffe/build/install/lib/libcaffe.so (0x00007f1918403000)
libglog.so.0 => /usr/lib/x86_64-linux-gnu/libglog.so.0 (0x00007f19181b6000)
libgflags.so.2 => /usr/lib/x86_64-linux-gnu/libgflags.so.2 (0x00007f1917f96000)
libopencv_core.so.3.1 => /root/Util/miniconda/lib/libopencv_core.so.3.1 (0x00007f191732c000)
libpython2.7.so.1.0 => /root/Util/miniconda/lib/libpython2.7.so.1.0 (0x00007f1916f31000)
libmpi_cxx.so.1 => /usr/local/lib/libmpi_cxx.so.1 (0x00007f1916d17000)
libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007f1916a3d000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1916738000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1916522000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f191615d000)
libboost_system.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.55.0 (0x00007f1915f58000)
libboost_thread.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (0x00007f1915d41000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1915b23000)
libprotobuf.so.8 => /usr/lib/x86_64-linux-gnu/libprotobuf.so.8 (0x00007f1915820000)
libhdf5_hl.so.7 => /usr/lib/x86_64-linux-gnu/libhdf5_hl.so.7 (0x00007f19155f0000)
libhdf5.so.7 => /usr/lib/x86_64-linux-gnu/libhdf5.so.7 (0x00007f1915154000)
liblmdb.so.0 => /usr/lib/x86_64-linux-gnu/liblmdb.so.0 (0x00007f1914f41000)
libleveldb.so.1 => /usr/lib/x86_64-linux-gnu/libleveldb.so.1 (0x00007f1914cf4000)
libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007f1914a96000)
libcurand.so.7.5 => /usr/local/cuda/lib64/libcurand.so.7.5 (0x00007f191122d000)
libcublas.so.7.5 => /usr/local/cuda/lib64/libcublas.so.7.5 (0x00007f190f94e000)
libcudnn.so.5 => /usr/local/lib/libcudnn.so.5 (0x00007f190bd43000)
libopencv_imgproc.so.3.1 => /root/Util/miniconda/lib/libopencv_imgproc.so.3.1 (0x00007f190a3d3000)
libopencv_imgcodecs.so.3.1 => /root/Util/miniconda/lib/libopencv_imgcodecs.so.3.1 (0x00007f1909eeb000)
libcblas.so.3 => /usr/lib/libcblas.so.3 (0x00007f1909cca000)
libboost_python-py27.so.1.55.0 => /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.55.0 (0x00007f1909a7c000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1909776000)
libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f190955a000)
libz.so.1 => /root/Util/miniconda/lib/./libz.so.1 (0x00007f1909344000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f190913f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f1908f37000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f1908d28000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f1908b24000)
libopen-pal.so.13 => /usr/local/lib/libopen-pal.so.13 (0x00007f190884b000)
libopen-rte.so.12 => /usr/local/lib/libopen-rte.so.12 (0x00007f19085cf000)
/lib64/ld-linux-x86-64.so.2 (0x0000561004504000)
libsnappy.so.1 => /usr/lib/libsnappy.so.1 (0x00007f19083c8000)
libjpeg.so.8 => /root/Util/miniconda/lib/./libjpeg.so.8 (0x00007f190818e000)
libpng16.so.16 => /root/Util/miniconda/lib/./libpng16.so.16 (0x00007f1907f4c000)
libtiff.so.5 => /root/Util/miniconda/lib/./libtiff.so.5 (0x00007f1907cce000)
libatlas.so.3 => /usr/lib/libatlas.so.3 (0x00007f190773b000)
libgfortran.so.3 => /root/Util/miniconda/lib/libgfortran.so.3 (0x00007f1907431000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f190720f000)

Cysu avatar Cysu commented on May 25, 2024

Well, I meant to ask whether libcudnn.so.5.1.3 (the build for cuda-7.5) is being used rather than libcudnn.so.5.1.5 (the build for cuda-8.0).

Some other issues also reported similar problems, and in those cases the cause was that the disk was full.

Cysu avatar Cysu commented on May 25, 2024

Good to know. Maybe caffe needs to be rebuilt from scratch after the environment changes.
