
Comments (8)

harujoh commented on August 27, 2024

I'm sorry for the late reply.

I confirmed that the method you proposed is very effective.

However, it adds extra steps and special rules for handling this data, and that is not a desirable situation for beginners.

I have been looking for a better way to merge it into master, but I have not settled on a concrete approach yet.

Unfortunately, I will not have time for programming for a while, so for now I would like to adopt the idea of consolidating ComputeBuffer creation in advance.
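As a minimal C# sketch of what "consolidating ComputeBuffer creation in advance" could look like, assuming the Cloo API that KelpNet's Weaver wraps; the class and member names here (LinearGpuBuffers, GpuW, GpuX, GpuY) are hypothetical, not KelpNet's actual types:

    using System;
    using Cloo;

    // Hypothetical helper: allocate the buffers a layer needs once, up front, and
    // reuse them on every Forward call instead of creating and disposing them per call.
    class LinearGpuBuffers : IDisposable
    {
        public readonly ComputeBuffer<float> GpuW; // weights, uploaded once at construction
        public readonly ComputeBuffer<float> GpuX; // input, rewritten on every call
        public readonly ComputeBuffer<float> GpuY; // output, read back on every call

        public LinearGpuBuffers(ComputeContext context, float[] w, int inputSize, int outputSize)
        {
            GpuW = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, w);
            GpuX = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, inputSize);
            GpuY = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly, outputSize);
        }

        public void Dispose()
        {
            GpuW.Dispose();
            GpuX.Dispose();
            GpuY.Dispose();
        }
    }

With buffers held like this, each forward pass only needs to write the input and read back the output (Cloo's WriteToBuffer/ReadFromBuffer), so the allocation cost is paid once per layer rather than once per batch.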


harujoh commented on August 27, 2024

That's fine, I understand.

I arrived at the same idea myself in the past. I once made the modification you are suggesting, but saw no improvement in speed.

That is why it is not implemented that way.


gmlwns2000 commented on August 27, 2024

Hi, @harujoh

First of all, thanks for the response :3

About the problem: that is strange. In my case it is actually faster than before; this is my code. I have also heard that creating VRAM buffers many times causes memory fragmentation, so I think we should fix this even if it does not improve speed.

  • Before: (Imgur screenshot)
  • After: (Imgur screenshot)


harujoh commented on August 27, 2024

Thank you so much :)

I checked the source code. The modification you made, which actually produces results, is very impressive; I want to review it and incorporate as much of it as possible.

Previously, my approach was to pass the memory on to the next function instead of freeing it every time, but that method showed no improvement.
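For clarity, here is a rough C# sketch of that earlier approach (handing the buffer straight on to the next function), assuming a Cloo-style API; the Forward signature below is hypothetical, not KelpNet's actual one:

    using Cloo;

    static class BufferChainingSketch
    {
        // Hypothetical: instead of reading the result back to host memory and disposing
        // the buffer after each layer, the layer hands its output buffer to the next
        // layer as its input, so the data never leaves VRAM between layers.
        public static ComputeBuffer<float> Forward(ComputeCommandQueue queue, ComputeKernel kernel,
                                                   ComputeBuffer<float> gpuX, ComputeBuffer<float> gpuY,
                                                   long outputSize)
        {
            kernel.SetMemoryArgument(0, gpuX);
            kernel.SetMemoryArgument(1, gpuY);
            queue.Execute(kernel, null, new long[] { outputSize }, null, null);

            // gpuX is not disposed here; gpuY becomes the next layer's gpuX.
            return gpuY;
        }
    }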


gmlwns2000 commented on August 27, 2024

Hi, @harujoh,

First of all, thanks for reviewing it :3

I am not sure I understand how you passed the VRAM around. If you only pass gpuX and gpuY on to the next layers, it should not help much in the Linear layer: there, the input and output are much smaller than the weight variable, so caching only gpuX and gpuY should not be that effective.

By the way, I also checked the convolution layer's performance, and I now understand why it did not improve much: the convolution layer's compute time is much longer than the buffer creation time. Still, I saw that about 25% of CPU time is wasted on creating buffers. I also think Weaver.CommandQueue.Finish() spins in a while(true){} loop, so it uses 100% of a core while waiting. We should put the thread to sleep while heavy computation is running on the GPU.

As for the optimization approach, I think caching should happen where the NdArray is used (i.e. in the layers), not inside NdArray itself. My previous idea would be a problem because every NdArray would be affected by GpuEnable.

So, to make the work easier: first, variables that do not need to be passed around should be cached in the class.
Then add NdArray.CopyToGpu(), NdArray.CopyToCpu(), NdArray.GpuData and NdArray.GpuGrad to make the data easier to manage, and call CopyToGpu() on every NdArray that is used for computation (in layers, input data, test data, etc.). While the data lives in VRAM, NdArray.Data would still work but should not be used often, since each access copies VRAM back to RAM.

I think this last idea is reasonable and easy to apply; I can probably implement it tomorrow.
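Here is a minimal C# sketch of what that proposed NdArray surface could look like. The member names (CopyToGpu, CopyToCpu, GpuData, GpuGrad) come from the comment above, but the bodies are only an illustration assuming a Cloo-style backend, not KelpNet's actual implementation (float[] stands in for Real[] to keep the sketch self-contained):

    using Cloo;

    // Illustrative only: an NdArray-like type that can mirror its Data/Grad arrays into VRAM.
    class GpuNdArraySketch
    {
        public float[] Data;
        public float[] Grad;

        public ComputeBuffer<float> GpuData { get; private set; }
        public ComputeBuffer<float> GpuGrad { get; private set; }

        // Upload once; layers can then bind GpuData/GpuGrad as kernel arguments.
        public void CopyToGpu(ComputeContext context)
        {
            GpuData = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, Data);
            GpuGrad = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, Grad);
        }

        // Read results back; meant to be called rarely, since each call copies VRAM to RAM.
        public void CopyToCpu(ComputeCommandQueue queue)
        {
            queue.ReadFromBuffer(GpuData, ref Data, true, null);
            queue.ReadFromBuffer(GpuGrad, ref Grad, true, null);
        }
    }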


harujoh commented on August 27, 2024

The measures I took were exactly as you imagine, and they had no effect.

"I also think Weaver.CommandQueue.Finish() spins in a while(true){} loop."

I was not aware of this, and I now recognize that it is not a good state of affairs. I would like to look into improvements.

NdArray.CopyToGpu()
NdArray.CopyToCpu()
NdArray.GpuData
NdArray.GpuGrad

I support this plan as well :)
As long as it does not conflict with the architecture, I would like to adopt it actively.


gmlwns2000 commented on August 27, 2024

Thank you for reviewing this problem :3

About Weaver.CommandQueue.Finish(): if we avoid it and use Thread.Sleep() instead, it will reduce CPU usage. Computation speed probably won't change, since the work happens on the GPU; I only mentioned it because using Finish() to wait for a long CL computation wastes CPU resources on busy-waiting.
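A rough C# sketch of that idea, assuming Cloo exposes a completion status on its event objects (the exact property and enum names here are an assumption and should be checked against the Cloo version KelpNet ships):

    using System.Collections.Generic;
    using System.Threading;
    using Cloo;

    static class GpuWaitSketch
    {
        // Hypothetical: enqueue the kernel with an event list, then poll the event with
        // short sleeps instead of calling queue.Finish(), which spins at 100% of a core.
        public static void ExecuteAndWait(ComputeCommandQueue queue, ComputeKernel kernel, long[] globalWorkSize)
        {
            var events = new List<ComputeEventBase>();
            queue.Execute(kernel, null, globalWorkSize, null, events);
            queue.Flush(); // make sure the command is actually submitted to the device

            // Sleep while the GPU works; wall-clock time is unchanged, CPU usage drops.
            while (events[0].Status != ComputeCommandExecutionStatus.Complete)
            {
                Thread.Sleep(1);
            }
        }
    }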

I also implemented an experimental GPU RAM patch on my fork (it is not on the master branch but on the real_array branch; I'm not sure this approach is the answer yet).

I added a class called RealArray. It maintains a Real array buffer between the CPU (Real[]) and the GPU (ComputeBuffer<Real>) and helps move and copy data between them. For storing and accessing the data I wrote an interface called IDeviceBank, so it will be easy to support other kinds of storage later (mostly intended for a custom allocator). I changed Real[] in NdArray to RealArray, and as a result we can now re-use the ComputeBuffer held by a RealArray.

In summary, I added some data access layers: NdArray > RealArray > IDeviceBank > GpuDataBank/CpuDataBank > ComputeBuffer<Real>/Real[]. We can still use indexers, and they are acceptably fast with the CPU bank, but extremely slow with the GPU bank (naturally, since fetching a single double from VRAM carries huge overhead). For the GPU bank we need to move the data to the CPU first, but that is not difficult: just call .ToCpu() before CPU computations, then .ToGpu() afterwards.
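A condensed C# sketch of how those layers might fit together; the real_array branch is the authoritative version, and everything below beyond the names already mentioned (RealArray, IDeviceBank, ToCpu, ToGpu, CpuDataBank) is an illustrative assumption:

    // Illustrative layering (not the actual real_array branch code):
    // NdArray -> RealArray -> IDeviceBank -> Real[] or ComputeBuffer<Real>.
    interface IDeviceBank
    {
        double this[int index] { get; set; } // indexers still work, but are slow on a GPU bank
        int Length { get; }
    }

    class CpuDataBank : IDeviceBank
    {
        readonly double[] _data;
        public CpuDataBank(int length) { _data = new double[length]; }
        public double this[int index]
        {
            get { return _data[index]; }
            set { _data[index] = value; }
        }
        public int Length { get { return _data.Length; } }
    }

    class RealArray
    {
        IDeviceBank _bank; // the currently active storage

        public RealArray(int length) { _bank = new CpuDataBank(length); }

        public double this[int i]
        {
            get { return _bank[i]; }
            set { _bank[i] = value; }
        }

        // ToCpu()/ToGpu() would swap _bank between a CPU bank (Real[]) and a GPU bank
        // (ComputeBuffer<Real>), copying the contents across each time.
        public void ToCpu() { /* copy GPU contents back into a CpuDataBank if needed */ }
        public void ToGpu() { /* upload the CPU contents into a GPU-backed bank */ }
    }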

Currently I have found a fatal issue in RealArray when training with GPU data... I think I made some mistakes in my code, and I will keep working to fix the problem.

I hope my trials are helpful :3


harujoh commented on August 27, 2024

I implemented the method you proposed, but unfortunately I could not get the performance I was hoping for.
https://github.com/harujoh/KelpNet/tree/TryGenNdArrayNativeBase

As for the problem you pointed out, my investigation found that Delegate was the main cause of the slowdown; eliminating the use of Delegate resolved it.
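For context, a tiny generic C# illustration of the kind of overhead being described (this is not code from the branch above): invoking a delegate for every element in a hot loop versus doing the work directly.

    using System;

    static class DelegateOverheadDemo
    {
        // Delegate call per element: each iteration pays an indirect call that the
        // JIT usually cannot inline, which can dominate a tight loop over large arrays.
        // Example call: ApplyViaDelegate(data, x => x * x);
        static void ApplyViaDelegate(double[] data, Func<double, double> f)
        {
            for (int i = 0; i < data.Length; i++) data[i] = f(data[i]);
        }

        // Direct version of the same loop: no indirection, trivially inlined.
        static void ApplyDirect(double[] data)
        {
            for (int i = 0; i < data.Length; i++) data[i] = data[i] * data[i];
        }
    }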

