
Comments (17)

dbolya commented on May 13, 2024

I'm actually really glad you asked that! When I timed it, that step took a whopping 19 ms, which didn't seem right at all.

I then narrowed it down to this line
torch.Tensor(frame).float().cuda()
which took a full 16 ms on its own!

Turns out most of that was coming from the torch.Tensor constructor, so I changed that to
torch.from_numpy(frame).float().cuda()
but that still took 15 ms, most of which was coming from the .float() on the CPU.

So, I once again rearranged that to get
torch.from_numpy(frame).cuda().float()
which took only 1 ms...

So on the current master branch, step 1 takes 19 ms, but now it's down to 4. I'll push this along with my new rendering code and other speed improvements probably later today. Note though that evalvideo is very multithreaded and the torch.Tensor constructor likely releases the GIL (as it's in C++), so this doesn't look like it had as huge an impact on evaluation (though it did take me from 28 fps on one video to 31).
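For illustration, the whole fix is just moving the cast after the device transfer. A minimal sketch (not from eval.py; it falls back to CPU when CUDA is unavailable so it stays runnable):

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A stand-in for a decoded video frame (HxWx3 uint8), as cv2 would produce.
frame = np.random.randint(0, 256, (550, 550, 3), dtype=np.uint8)

# Slow: casts to float32 on the CPU, then copies 4x the bytes to the GPU.
# slow = torch.from_numpy(frame).float().to(device)

# Fast: copies the small uint8 buffer first, then casts on the device.
fast = torch.from_numpy(frame).to(device).float()
```

The uint8 frame is a quarter the size of its float32 version, so transferring before casting moves 4x less data and keeps the cast off the CPU.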

from yolact.

dbolya commented on May 13, 2024

@Rm1n90 Idk, I haven't tested it myself. It'll probably be slightly faster, but not that much (maybe 10%?)

rkishore commented on May 13, 2024

Just wanted to report in that running the benchmark on the COCO dataset as per your instructions gets me much closer to the reported numbers. Now I wonder what the difference is between the --benchmark code and the actual per-image instance segmentation code.

With Resnet-101

python3 eval.py --trained_model=weights/yolact_base_54_800000.pth --benchmark --max_images=1000

Config not specified. Parsed yolact_base_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 29.87 fps

Stats for the last frame:

           Name | Time (ms)
----------------+------------
  Network Extra |     0.1927
       backbone |     6.4748
            fpn |     0.6773
          proto |     0.4329
     pred_heads |     1.6741
     makepriors |     0.0078
         Detect |    21.1403
    Postprocess |     0.9253
           Copy |     1.8050
           Sync |     0.0101
----------------+------------
          Total |    33.3403

Average: 29.87 fps, 33.48 ms

With Resnet-50

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --benchmark --max_images=1000
Config not specified. Parsed yolact_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 40.06 fps

Stats for the last frame:

           Name | Time (ms)
----------------+------------
  Network Extra |     0.2002
       backbone |     3.3638
            fpn |     0.6709
          proto |     0.4280
     pred_heads |     1.6549
     makepriors |     0.0080
         Detect |    15.1751
    Postprocess |     0.9490
           Copy |     1.7866
           Sync |     0.0093
----------------+------------
          Total |    24.2458

Average: 40.06 fps, 24.96 ms

With Darknet-53

python3 eval.py --trained_model=weights/yolact_darknet53_54_800000.pth --benchmark --max_images=1000
Config not specified. Parsed yolact_darknet53_config from the file name.

loading annotations into memory...
Done (t=0.45s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 1000 / 1000 (100.00%) 34.68 fps

Stats for the last frame:

           Name | Time (ms)
----------------+------------
  Network Extra |     0.1865
       backbone |     3.4872
            fpn |     0.6769
          proto |     0.4471
     pred_heads |     1.6641
     makepriors |     0.0086
         Detect |    19.1224
    Postprocess |     0.9185
           Copy |     1.8112
           Sync |     0.0089
----------------+------------
          Total |    28.3314

Average: 34.68 fps, 28.84 ms

dbolya commented on May 13, 2024

The FPS we report comes from the command,
python eval.py --trained_model=<modelname> --benchmark --max_images=400
run on one Titan Xp. --benchmark mode times just the raw model.

Like other papers, our timing only reports the speed of the model itself. That is, timing starts when the image is finished loading and stops when the network outputs masks. Note that this timing does not include 1.) loading the image, 2.) rendering the mask onto the image, or 3.) displaying the image, all of which are included in evalvideo, and the first two of which are included in evalimage.

Right now, that step 2 is particularly limiting for us, and it's the bottleneck giving you that lower-than-reported fps. I'm working on fixing this so that we can run the full model from loading to displaying at 30 fps (see #17), but that's difficult to do in Python (thanks to the GIL) and without direct access to the graphics card (i.e., without CUDA or a graphics library like OpenGL or Vulkan).

A large amount of time right now is spent rendering the image on the GPU, copying the image to the CPU to draw boxes and text, and then passing the CPU image to OpenCV which just copies it back to the GPU internally. A real production-ready version of this would likely have to be in native C++ using a CUDA matrix as a texture in Vulkan or OpenGL to render directly to the screen, but I'd like to keep the project in native Pytorch for as long as possible (so that everyone can easily start using it / add to it).

Good news is though that I have updated rendering code in the works, and I think I'll be able to get close to that sweet sweet 30 fps with that. It should be out soon, so I'll keep you posted.

rkishore commented on May 13, 2024

@dbolya, thanks for the explanation and for taking the time to respond. Do you know how much time step 1 (i.e., loading the image) adds to the whole equation?

Also, excited to hear about the updated rendering code.

dbolya commented on May 13, 2024

Pushed the patch. Pull the latest commit and run

python eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --video_multiframe=2 --video=<your_video>

to test the new video speeds. --images should also have received a similar speed boost.

Also let the video play for a little bit before reporting the FPS because it goes up over time in my experience (seems like the first couple frames take longer than the rest even after initialization).

rkishore commented on May 13, 2024

@dbolya, thanks a lot for the patch. With the command you sent, my early tests show 22-23 fps with videos (when displaying the output) and 15-16 fps when writing to an output video. Definitely an improvement. My GPU is maxed out, so I likely need more GPU cores with this implementation.

For --images, I don't see a big change from before (I get ~23 fps with Resnet-50 when writing out the output and ~29 fps when I comment out the cv2.imwrite for the output image). It is likely that the GPU horsepower I have is insufficient and that the mask writeout has a sizable penalty. I have an RTX 2080 with 2300 CUDA cores, and AFAIK the Titan Xp has ~1500 more CUDA cores, which may be where the processing speed difference is coming from. What processing speed do you get on average with --images on the Titan Xp? I am using the following command:

python3 eval.py --trained_model=weights/yolact_resnet50_54_800000.pth --score_threshold=0.4 --top_k=15 --images=./test_images:./test_output_images

dbolya commented on May 13, 2024

For --images I'm getting 24.77 fps on a Titan Xp with the command you listed there when timing the whole evalimage function. I get 35.72 fps when I omit the cv2.imwrite call from timing (wow, why is my server's SSD so slow?). Finally, I get 45.87 fps (the FPS Resnet-50 runs at in benchmark mode) if I also omit the cv2.imread call (keeping the FastBaseTransform in).

That 45.87 comes from timing the following 3 lines:

batch = FastBaseTransform()(frame.unsqueeze(0))
preds = net(batch)
img_numpy = prep_display(preds, frame, None, None, undo_transform=False)

Note that I also included a torch.cuda.synchronize() in the timing for good measure, but that doesn't matter because of prep_display's call to .cpu().

I guess the bottleneck on my server is disk operations, but those should be done in a separate thread anyway. I haven't bothered to multithread evalimage because --benchmark on COCO is all I needed for the paper, but if you want a really fast evalimage, you'd better multithread the data loading and saving. Note that I also didn't bother multithreading savevideo (for writing to an output video). I only specifically optimized evalvideo, the real-time demo one.
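A minimal, library-agnostic sketch of that kind of multithreading (the function and its arguments are hypothetical, not part of eval.py): reads are prefetched and writes deferred to a small thread pool, while inference stays on the main thread:

```python
from concurrent.futures import ThreadPoolExecutor

def eval_images_threaded(paths, out_paths, load_fn, infer_fn, save_fn):
    """Overlap disk-bound load/save with GPU-bound inference."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        loads = [pool.submit(load_fn, p) for p in paths]   # prefetch reads
        saves = []
        for fut, out in zip(loads, out_paths):
            img = fut.result()          # usually ready before the GPU is free
            pred = infer_fn(img)        # inference stays on the main thread
            saves.append(pool.submit(save_fn, out, pred))  # defer the write
        for s in saves:                 # make sure every write finished
            s.result()
    return len(paths)
```

With cv2.imread/cv2.imwrite plugged in as load_fn/save_fn, the disk time overlaps the network time instead of adding to it.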

Also, when you're timing make sure to discard the first ~2 frametimes because Pytorch initializes things on the first or second pass through the network, so the first call for instance can take up to 4 seconds. You can run evalimage on a dummy image beforehand to counteract this if you'd like.
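A hedged sketch of such a timing loop (the helper name is made up; the points are discarding the warm-up passes and calling torch.cuda.synchronize() before reading the clock):

```python
import time
import torch

def time_model(net, make_input, iters=20, warmup=2):
    """Average per-iteration time, discarding warm-up passes
    (PyTorch/cuDNN set things up on the first couple of calls)."""
    times = []
    for i in range(warmup + iters):
        x = make_input()
        start = time.perf_counter()
        net(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # CUDA launches are async; wait for them
        if i >= warmup:                # drop the slow first frames
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```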

rkishore commented on May 13, 2024

@dbolya, thank you.

For --images, we are not that far off in performance. Since your result with cv2.imread is 10 fps slower (35 fps with it and 45 fps without), it looks like that call takes ~6-7 ms?

Also, there is only one cv2.imread in eval.py inside evalimage. When you say you omit this function, I assume you mean you omit it from your time/speed calculations, correct? Because otherwise, where else will you get the input image to process from?

dbolya commented on May 13, 2024

Yeah, I mean I omit it from the calculations.

It looks like this is the performance breakdown:

imread and GPU copy:  6.2 ms
    everything else: 21.8 ms
            imwrite: 12.4 ms
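As a quick sanity check (just arithmetic, not from eval.py), those components are consistent with the 24.77 fps measured for the whole evalimage function:

```python
imread_ms, infer_ms, imwrite_ms = 6.2, 21.8, 12.4
total_ms = imread_ms + infer_ms + imwrite_ms   # 40.4 ms end to end
fps = 1000.0 / total_ms
print(f"{fps:.2f} fps")   # -> 24.75 fps, matching the ~24.77 fps above
```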

rkishore commented on May 13, 2024

@dbolya, thank you.

zimenglan-sysu-512 commented on May 13, 2024

hi @dbolya
i use the cmd

CUDA_VISIBLE_DEVICES=0 python3.6 eval.py --trained_model=weights/yolact_base_54_800000.pth --score_threshold=0.3 --top_k=100 --image=0001.png

the fast_nms alone takes 0.11198925971984863 s. so how can i get the fast speed?
btw, i use Titan Xp.

dbolya commented on May 13, 2024

@zimenglan-sysu-512 Pytorch uses the first image passed through the network to set itself up, meaning that the first iteration will take much longer than the rest. So the first image you evaluate will be slow (still has some setting up to do), but every image after that will be fast. You need to evaluate multiple images (perhaps with --images or a video with --video) to properly benchmark the speed. Remember to ignore the first frame if you're timing it yourself.

To get the numbers in the paper, download COCO and run
python eval.py --trained_model=<model> --max_images=400 --benchmark

zimenglan-sysu-512 commented on May 13, 2024

thanks @dbolya
you are right: for the first few images, evaluation will be slow; after that, it will be fast.

Rm1n90 commented on May 13, 2024

Hey @dbolya,
I wonder, if I convert the code to C++ with CUDA, what FPS should I expect (assuming the maximum FPS I can achieve now is 14)?

syc10-09 commented on May 13, 2024

Thanks for your amazing work!
From the communication above, I learned something new. However, I still don't know how to solve my problem.

When I run eval.py on the COCO 2017 dataset with a Titan V, the following results appear:
Config not specified. Parsed yolact_plus_resnet50_config from the file name.

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Loading model... Done.

Processing Images ██████████████████████████████ 400 / 400 (100.00%) 19.93 fps
Saving data...
Calculating mAP...

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
   box | 37.25 | 56.08 | 54.41 | 52.81 | 49.87 | 46.45 | 41.13 | 34.19 | 24.24 | 12.14 |  1.24 |
  mask | 35.93 | 53.91 | 51.74 | 49.42 | 46.64 | 43.65 | 38.83 | 32.80 | 23.72 | 14.16 |  4.44 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Process finished with exit code 0
The command I am using is as below (except I change the model name as needed):

python3 eval.py --trained_model=weights/yolact_plus_resnet50_54_800000.pth --score_threshold=0.15 --top_k=15 --max_images=400

First, maybe it's a stupid question, but I really don't understand the meaning of the parameter top_k. Could you explain it to me?
Second, I don't know why the program is running at only 19.93 fps. Did I miss something important? What should I do to achieve the paper's 33.5 fps?
@dbolya @rkishore
I would appreciate your reply!

damghanian commented on May 13, 2024

Hello, first of all, @dbolya thank you for sharing this work. I have a question.
@rkishore how do you calculate time per image from FPS? For example, you said: "I get ~16fps (0.06sec/image) with the Resnet-101 model, ~20fps (0.05sec/image) with the Resnet-50 model and 17-18fps (0.055sec/image) with the Darket53 model."
I would appreciate your reply!
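For reference, time per image is just the reciprocal of the frame rate, which is where figures like "~16 fps ≈ 0.06 sec/image" come from:

```python
# seconds per image = 1 / frames per second
for fps in (16, 20, 18):
    print(f"{fps} fps -> {1 / fps:.4f} s/image ({1000 / fps:.1f} ms)")
# 16 fps -> 0.0625 s/image (62.5 ms)
```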
