
Comments (8)

harujoh commented on August 27, 2024

I'm sorry for the late reply.

I confirmed that the method you proposed is very effective.

However, it adds extra steps and special rules for handling this data, and that is not a desirable situation for beginners.

I have been looking for a better way to merge it into master, but I have not settled on a concrete approach yet.

Unfortunately, I will not have time for programming for a while, so for now I would like to adopt the idea of consolidating ComputeBuffer creation in advance.
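As a minimal C# sketch of what "consolidating ComputeBuffer creation in advance" could look like, assuming the Cloo API that KelpNet's Weaver wraps; the class and member names here (LinearGpuBuffers, GpuW, GpuX, GpuY) are hypothetical, not KelpNet's actual types:

    using System;
    using Cloo;

    // Hypothetical helper: allocate the buffers a layer needs once, up front, and
    // reuse them on every Forward call instead of creating and disposing them per call.
    class LinearGpuBuffers : IDisposable
    {
        public readonly ComputeBuffer<float> GpuW; // weights, uploaded once at construction
        public readonly ComputeBuffer<float> GpuX; // input, rewritten on every call
        public readonly ComputeBuffer<float> GpuY; // output, read back on every call

        public LinearGpuBuffers(ComputeContext context, float[] w, int inputSize, int outputSize)
        {
            GpuW = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, w);
            GpuX = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, inputSize);
            GpuY = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly, outputSize);
        }

        public void Dispose()
        {
            GpuW.Dispose();
            GpuX.Dispose();
            GpuY.Dispose();
        }
    }

With buffers held like this, each forward pass only needs to write the input and read back the output (Cloo's WriteToBuffer/ReadFromBuffer), so the allocation cost is paid once per layer rather than once per batch.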


harujoh commented on August 27, 2024

That's fine, I understand.

I arrived at the same idea myself in the past. I once made the modification you are suggesting, but saw no improvement in speed.

That is why it is not implemented that way.


gmlwns2000 commented on August 27, 2024

Hi, @harujoh

First of all, thanks for the response :3

About the problem: that is strange. In my case it is actually faster than before; this is my code. I have also heard that creating VRAM buffers many times causes memory fragmentation, so I think we should fix this even if it does not improve speed.

  • Before: (Imgur screenshot)
  • After: (Imgur screenshot)


harujoh commented on August 27, 2024

Thank you so much :)

I checked the source code. The modification you made, which actually produces results, is very impressive; I want to review it and incorporate as much of it as possible.

Previously, my approach was to pass the memory on to the next function instead of freeing it every time, but that method showed no improvement.
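For clarity, here is a rough C# sketch of that earlier approach (handing the buffer straight on to the next function), assuming a Cloo-style API; the Forward signature below is hypothetical, not KelpNet's actual one:

    using Cloo;

    static class BufferChainingSketch
    {
        // Hypothetical: instead of reading the result back to host memory and disposing
        // the buffer after each layer, the layer hands its output buffer to the next
        // layer as its input, so the data never leaves VRAM between layers.
        public static ComputeBuffer<float> Forward(ComputeCommandQueue queue, ComputeKernel kernel,
                                                   ComputeBuffer<float> gpuX, ComputeBuffer<float> gpuY,
                                                   long outputSize)
        {
            kernel.SetMemoryArgument(0, gpuX);
            kernel.SetMemoryArgument(1, gpuY);
            queue.Execute(kernel, null, new long[] { outputSize }, null, null);

            // gpuX is not disposed here; gpuY becomes the next layer's gpuX.
            return gpuY;
        }
    }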


gmlwns2000 commented on August 27, 2024

Hi, @harujoh,

First of all, thanks for reviewing it :3

I am not sure I understand how you passed the VRAM around. If you only pass gpuX and gpuY on to the next layers, it should not help much in the Linear layer: there, the input and output are much smaller than the weight variable, so caching only gpuX and gpuY should not be that effective.

By the way, I also checked the convolution layer's performance, and I now understand why it did not improve much: the convolution layer's compute time is much longer than the buffer creation time. Still, I saw that about 25% of CPU time is wasted on creating buffers. I also think Weaver.CommandQueue.Finish() spins in a while(true){} loop, so it uses 100% of a core while waiting. We should put the thread to sleep while heavy computation is running on the GPU.

As for the optimization approach, I think caching should happen where the NdArray is used (i.e. in the layers), not inside NdArray itself. My previous idea would be a problem because every NdArray would be affected by GpuEnable.

So, to make the work easier: first, variables that do not need to be passed around should be cached in the class.
Then add NdArray.CopyToGpu(), NdArray.CopyToCpu(), NdArray.GpuData and NdArray.GpuGrad to make the data easier to manage, and call CopyToGpu() on every NdArray that is used for computation (in layers, input data, test data, etc.). While the data lives in VRAM, NdArray.Data would still work but should not be used often, since each access copies VRAM back to RAM.

I think this last idea is reasonable and easy to apply; I can probably implement it tomorrow.
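Here is a minimal C# sketch of what that proposed NdArray surface could look like. The member names (CopyToGpu, CopyToCpu, GpuData, GpuGrad) come from the comment above, but the bodies are only an illustration assuming a Cloo-style backend, not KelpNet's actual implementation (float[] stands in for Real[] to keep the sketch self-contained):

    using Cloo;

    // Illustrative only: an NdArray-like type that can mirror its Data/Grad arrays into VRAM.
    class GpuNdArraySketch
    {
        public float[] Data;
        public float[] Grad;

        public ComputeBuffer<float> GpuData { get; private set; }
        public ComputeBuffer<float> GpuGrad { get; private set; }

        // Upload once; layers can then bind GpuData/GpuGrad as kernel arguments.
        public void CopyToGpu(ComputeContext context)
        {
            GpuData = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, Data);
            GpuGrad = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, Grad);
        }

        // Read results back; meant to be called rarely, since each call copies VRAM to RAM.
        public void CopyToCpu(ComputeCommandQueue queue)
        {
            queue.ReadFromBuffer(GpuData, ref Data, true, null);
            queue.ReadFromBuffer(GpuGrad, ref Grad, true, null);
        }
    }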


harujoh commented on August 27, 2024

The measures I took were exactly as you imagine, and they had no effect.

"I also think Weaver.CommandQueue.Finish() spins in a while(true){} loop."

I was not aware of this, and I now recognize that it is not a good state of affairs. I would like to look into improvements.

NdArray.CopyToGpu()
NdArray.CopyToCpu()
NdArray.GpuData
NdArray.GpuGrad

I support this plan as well :)
As long as it does not conflict with the architecture, I would like to adopt it actively.


gmlwns2000 commented on August 27, 2024

Thank you for reviewing this problem :3

About Weaver.CommandQueue.Finish(): if we avoid it and use Thread.Sleep() instead, it will reduce CPU usage. Computation speed probably won't change, since the work happens on the GPU; I only mentioned it because using Finish() to wait for a long CL computation wastes CPU resources on busy-waiting.
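A rough C# sketch of that idea, assuming Cloo exposes a completion status on its event objects (the exact property and enum names here are an assumption and should be checked against the Cloo version KelpNet ships):

    using System.Collections.Generic;
    using System.Threading;
    using Cloo;

    static class GpuWaitSketch
    {
        // Hypothetical: enqueue the kernel with an event list, then poll the event with
        // short sleeps instead of calling queue.Finish(), which spins at 100% of a core.
        public static void ExecuteAndWait(ComputeCommandQueue queue, ComputeKernel kernel, long[] globalWorkSize)
        {
            var events = new List<ComputeEventBase>();
            queue.Execute(kernel, null, globalWorkSize, null, events);
            queue.Flush(); // make sure the command is actually submitted to the device

            // Sleep while the GPU works; wall-clock time is unchanged, CPU usage drops.
            while (events[0].Status != ComputeCommandExecutionStatus.Complete)
            {
                Thread.Sleep(1);
            }
        }
    }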

I also implemented an experimental GPU RAM patch on my fork (it is not on the master branch but on the real_array branch; I'm not sure this approach is the answer yet).

I added a class called RealArray. It maintains a Real array buffer between the CPU (Real[]) and the GPU (ComputeBuffer<Real>) and helps move and copy data between them. For storing and accessing the data I wrote an interface called IDeviceBank, so it will be easy to support other kinds of storage later (mostly intended for a custom allocator). I changed Real[] in NdArray to RealArray, and as a result we can now re-use the ComputeBuffer held by a RealArray.

In summary, I added some data access layers: NdArray > RealArray > IDeviceBank > GpuDataBank/CpuDataBank > ComputeBuffer<Real>/Real[]. We can still use indexers, and they are acceptably fast with the CPU bank, but extremely slow with the GPU bank (naturally, since fetching a single double from VRAM carries huge overhead). For the GPU bank we need to move the data to the CPU first, but that is not difficult: just call .ToCpu() before CPU computations, then .ToGpu() afterwards.
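A condensed C# sketch of how those layers might fit together; the real_array branch is the authoritative version, and everything below beyond the names already mentioned (RealArray, IDeviceBank, ToCpu, ToGpu, CpuDataBank) is an illustrative assumption:

    // Illustrative layering (not the actual real_array branch code):
    // NdArray -> RealArray -> IDeviceBank -> Real[] or ComputeBuffer<Real>.
    interface IDeviceBank
    {
        double this[int index] { get; set; } // indexers still work, but are slow on a GPU bank
        int Length { get; }
    }

    class CpuDataBank : IDeviceBank
    {
        readonly double[] _data;
        public CpuDataBank(int length) { _data = new double[length]; }
        public double this[int index]
        {
            get { return _data[index]; }
            set { _data[index] = value; }
        }
        public int Length { get { return _data.Length; } }
    }

    class RealArray
    {
        IDeviceBank _bank; // the currently active storage

        public RealArray(int length) { _bank = new CpuDataBank(length); }

        public double this[int i]
        {
            get { return _bank[i]; }
            set { _bank[i] = value; }
        }

        // ToCpu()/ToGpu() would swap _bank between a CPU bank (Real[]) and a GPU bank
        // (ComputeBuffer<Real>), copying the contents across each time.
        public void ToCpu() { /* copy GPU contents back into a CpuDataBank if needed */ }
        public void ToGpu() { /* upload the CPU contents into a GPU-backed bank */ }
    }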

Currently I have found a fatal issue in RealArray when training with GPU data... I think I made some mistakes in my code, and I will keep working to fix the problem.

I hope my trials are helpful :3


harujoh commented on August 27, 2024

I implemented the method you proposed, but unfortunately I could not get the performance I was hoping for.
https://github.com/harujoh/KelpNet/tree/TryGenNdArrayNativeBase

As for the problem you pointed out, my investigation found that Delegate was the main cause of the slowdown; eliminating the use of Delegate resolved it.
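For context, a tiny generic C# illustration of the kind of overhead being described (this is not code from the branch above): invoking a delegate for every element in a hot loop versus doing the work directly.

    using System;

    static class DelegateOverheadDemo
    {
        // Delegate call per element: each iteration pays an indirect call that the
        // JIT usually cannot inline, which can dominate a tight loop over large arrays.
        // Example call: ApplyViaDelegate(data, x => x * x);
        static void ApplyViaDelegate(double[] data, Func<double, double> f)
        {
            for (int i = 0; i < data.Length; i++) data[i] = f(data[i]);
        }

        // Direct version of the same loop: no indirection, trivially inlined.
        static void ApplyDirect(double[] data)
        {
            for (int i = 0; i < data.Length; i++) data[i] = data[i] * data[i];
        }
    }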

