
Comments (18)

pablogsal commented on June 13, 2024

Thanks for opening this issue @nrs-status. I tried using your Docker container but I cannot reproduce your issue. Indeed, memray correctly shows the leak both in the plots and the flamegraph. Take a look:
[memray plot: resident and heap memory over time]

That shows that the resident memory (and the heap memory) doesn't stop growing.

[memray flamegraph screenshots]

And in this flamegraph you can see that most memory comes from the PyTorch profiler (torch::autograd::profiler::disableProfiler() at <unknown>:0).

pablogsal commented on June 13, 2024

Notice that the resident size is much, much bigger than the heap size. Check out the docs to understand why that could be:

https://bloomberg.github.io/memray/memory.html

pablogsal commented on June 13, 2024

We will investigate to see whether we can understand why the resident memory grows without the heap memory growing.

pablogsal commented on June 13, 2024

We did some investigation and this is indeed heap fragmentation. We profiled a single iteration (between two hits of the breakpoint) and memray correctly accounts for every allocation and deallocation that happens. The problem is that posix_memalign (the allocator the PyTorch profiler uses underneath) internally calls brk, and when the pointer is freed the heap is never receded (the heap always grows; note that "heap" here is the brk-based heap). You can check this by running strace -e brk -p PID between iterations:

brk(0xaaab3660f000)                     = 0xaaab3660f000
brk(0xaaab3678f000)                     = 0xaaab3678f000
brk(0xaaab3690f000)                     = 0xaaab3690f000
brk(0xaaab36a8f000)                     = 0xaaab36a8f000
brk(0xaaab36c0f000)                     = 0xaaab36c0f000
brk(0xaaab36d8f000)                     = 0xaaab36d8f000
brk(0xaaab36f0f000)                     = 0xaaab36f0f000
brk(0xaaab3714f000)                     = 0xaaab3714f000
brk(0xaaab3750f000)                     = 0xaaab3750f000
brk(0xaaab37e0f000)                     = 0xaaab37e0f000
brk(0xaaab3840f000)                     = 0xaaab3840f000
brk(0xaaab3720f000)                     = 0xaaab3720f000
brk(0xaaab3744f000)                     = 0xaaab3744f000
brk(0xaaab3750f000)                     = 0xaaab3750f000
brk(0xaaab3768f000)                     = 0xaaab3768f000
brk(0xaaab3780f000)                     = 0xaaab3780f000
brk(0xaaab3798f000)                     = 0xaaab3798f000
brk(0xaaab37b0f000)                     = 0xaaab37b0f000
brk(0xaaab37c8f000)                     = 0xaaab37c8f000
brk(0xaaab37ecf000)                     = 0xaaab37ecf000
brk(0xaaab38290000)                     = 0xaaab38290000
brk(0xaaab38b90000)                     = 0xaaab38b90000
brk(0xaaab39190000)                     = 0xaaab39190000
brk(0xaaab37e10000)                     = 0xaaab37e10000
brk(0xaaab38050000)                     = 0xaaab38050000
brk(0xaaab38110000)                     = 0xaaab38110000
brk(0xaaab38290000)                     = 0xaaab38290000
brk(0xaaab38410000)                     = 0xaaab38410000
brk(0xaaab38590000)                     = 0xaaab38590000
brk(0xaaab38710000)                     = 0xaaab38710000
brk(0xaaab38890000)                     = 0xaaab38890000
brk(0xaaab38a10000)                     = 0xaaab38a10000
brk(0xaaab38c50000)                     = 0xaaab38c50000
brk(0xaaab39310000)                     = 0xaaab39310000
brk(0xaaab39910000)                     = 0xaaab39910000
brk(0xaaab39a90000)                     = 0xaaab39a90000
brk(0xaaab3a390000)                     = 0xaaab3a390000
brk(0xaaab3a990000)                     = 0xaaab3a990000
brk(0xaaab3b110000)                     = 0xaaab3b110000
brk(0xaaab3b710000)                     = 0xaaab3b710000
brk(0xaaab3a210000)                     = 0xaaab3a210000
brk(0xaaab3a450000)                     = 0xaaab3a450000
brk(0xaaab3a510000)                     = 0xaaab3a510000
brk(0xaaab3a690000)                     = 0xaaab3a690000
brk(0xaaab3a810000)                     = 0xaaab3a810000
brk(0xaaab3a990000)                     = 0xaaab3a990000
brk(0xaaab3ab10000)                     = 0xaaab3ab10000
brk(0xaaab3ac90000)                     = 0xaaab3ac90000
brk(0xaaab3aed0000)                     = 0xaaab3aed0000
brk(0xaaab3b590000)                     = 0xaaab3b590000
brk(0xaaab3bb90000)                     = 0xaaab3bb90000

As you can see, the heap pointer always grows. Note that calling brk is an implementation detail of glibc's posix_memalign, and what is being "leaked" is resident size (the allocation itself is freed later, and memray doesn't complain about it). The problem is that this leaves the heap fragmented. You can read more about this here:

https://bloomberg.github.io/memray/memory.html#memory-can-be-fragmented
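
For illustration, here is a minimal standalone sketch of that pattern (Linux/glibc assumed; block sizes and counts are made up): it allocates with posix_memalign via ctypes, frees every other block, and watches the program break with sbrk(0). The break grows while allocating and cannot recede afterwards because live blocks remain near the top of the heap, so resident memory stays high even though nothing is leaked.

import ctypes

# Sketch only: Linux/glibc assumed, sizes are illustrative.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sbrk.restype = ctypes.c_void_p
libc.sbrk.argtypes = [ctypes.c_ssize_t]
libc.posix_memalign.argtypes = [
    ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def current_break():
    # sbrk(0) returns the current program break.
    return libc.sbrk(0)

print(f"break at start:         {current_break():#x}")

blocks = []
for _ in range(2000):
    ptr = ctypes.c_void_p()
    if libc.posix_memalign(ctypes.byref(ptr), 64, 16 * 1024) != 0:
        raise MemoryError("posix_memalign failed")
    blocks.append(ptr)

print(f"break after allocating: {current_break():#x}")

# Free every other block: half the bytes go back to the allocator, but the
# heap is now fragmented and the break cannot move back past the blocks
# that are still live near the top.
for ptr in blocks[::2]:
    libc.free(ptr)

print(f"break after freeing:    {current_break():#x}")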

I am closing the issue as this is not a problem with memray.

nrs-status commented on June 13, 2024

Thank you very much for taking the time; I wasn't making proper use of the heap vs. resident distinction. I realize this might not be a memray issue, but I'm sharing the extra info in case it is of any use. I got somewhat lost in my debugging attempts and failed to notice that the flamegraph in the OP reported the existence of the profiler, but if you run the test again by cloning my repo https://github.com/nrs-status/shared (I've included the .bin this time) and building the Dockerfile there, the report instead looks like the next image. As you can see, there would have been no way to infer that the problem was mainly due to the PyTorch profiler.

https://imgur.com/a/RNTv9d9

Sorry for not providing the proper setup; I was doing some debugging on my own and stopped midway to make the Docker image. Also, thanks for the explanation of how the PyTorch profiler allocates memory.

As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!

pablogsal commented on June 13, 2024

I modified memray to show in the plot the memory that's fragmented (taken from calling mallinfo2()):

[Screenshot: memray plot with an additional series for fragmented memory reported by mallinfo2()]
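
For reference, the same counters can be read by hand from Python. This is only a rough sketch, assuming glibc >= 2.33 (where mallinfo2() was added); it is not the code memray uses:

import ctypes

class Mallinfo2(ctypes.Structure):
    # Field layout of glibc's struct mallinfo2 (glibc >= 2.33).
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena",      # bytes in the brk-managed heap
        "ordblks", "smblks", "hblks",
        "hblkhd",     # bytes in mmap'd regions
        "usmblks", "fsmblks",
        "uordblks",   # bytes currently in use
        "fordblks",   # bytes free but still held by the allocator
        "keepcost",
    )]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo2.restype = Mallinfo2

def print_heap_usage():
    info = libc.mallinfo2()
    print(f"heap (arena):         {info.arena / 2**20:8.1f} MiB")
    print(f"in use (uordblks):    {info.uordblks / 2**20:8.1f} MiB")
    print(f"free/held (fordblks): {info.fordblks / 2**20:8.1f} MiB")

print_heap_usage()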

pablogsal commented on June 13, 2024

As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!

I think we can do something like what I showed in the previous plot. It's still unclear how someone could fix things with this information, but at least you would know what's happening.

nrs-status commented on June 13, 2024

Very cool modification to memray! On my side, I've kept playing with this issue for a bit, this time using heaptrack and profiling CPython itself. Here's the resulting allocations flamegraph:

https://imgur.com/a/NnuXWvT

The above makes it evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program. Maybe this could help diagnose fragmentation issues with memray? I see we already have those numbers available in the current flamegraphs when hovering with the mouse.

pablogsal commented on June 13, 2024

The above makes it evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program.

Isn't that more or less the same information as in the flamegraphs in this comment: #565 (comment) ?

Maybe this could help diagnose fragmentation issues with memray? I see we already have those numbers available in the current flamegraphs when hovering with the mouse.

What numbers are you referring to? I guess I am failing to see what information from the heaptrack flamegraph is helping you here.

nrs-status commented on June 13, 2024

The above makes it evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program.

Isn't that more or less the same information as in the flamegraphs in this comment: #565 (comment) ?

Maybe this could help diagnose fragmentation issues with memray? I see we already have those numbers available in the current flamegraphs when hovering with the mouse.

What numbers are you referring to? I guess I am failing to see what information from the heaptrack flamegraph is helping you here.

Regarding the difference from your flamegraphs: the run you benchmarked indeed suggested that the profiler might be related to the memory issue. But take a look at my first flamegraph screenshot:

https://imgur.com/a/RNTv9d9

As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand. I'm dealing with a similar situation in my current benchmark: considering total memory allocation, the torch profiler uses at most 0 to 20 MB out of a max usage of 700 MB, so it would be totally invisible if considered only with respect to this metric. Yet, if we look at the total allocations not by size but by the sheer number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the heaptrack setup if you want).

As for the numbers I'm referring to: the current memray flamegraph shows the number of allocations when you hover over its components.

pablogsal commented on June 13, 2024

As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.

To include the C++ frames you need to pass --native to memray run (that's how I generated my run). That will give you the same information you are getting in the flamegraph you are showing: the same information heaptrack gives you, plus the Python frames instead of _PyEval_EvalFrameDefault.
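
For completeness, a hedged sketch of doing the same thing from the Python API instead of the CLI: the native_traces flag on memray.Tracker is the API counterpart of --native (check the docs for your memray version), and train_one_iteration below is a hypothetical placeholder for the profiled code.

import memray

def train_one_iteration():
    # Hypothetical placeholder for the profiled PyTorch code.
    pass

# Rough equivalent of `memray run --native`: native_traces enables native
# (C/C++) stack tracking, so libtorch frames show up in the report instead
# of only _PyEval_EvalFrameDefault.
with memray.Tracker("output.bin", native_traces=True):
    for _ in range(10):
        train_one_iteration()

You can then generate the report as usual, e.g. memray flamegraph output.bin.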

pablogsal commented on June 13, 2024

Yet, if we look at the total allocations not by size but by the sheer number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the heaptrack setup if you want).

Ah, this is an interesting point

nrs-status commented on June 13, 2024

As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.

To include the C++ frames you need to pass --native to memray run (that's how I generated my run). That will give you the same information you are getting in the flamegraph you are showing: the same information heaptrack gives you, plus the Python frames instead of _PyEval_EvalFrameDefault.

Ah my bad I completely missed this

pablogsal commented on June 13, 2024

As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.

To include the C++ frames you need to pass --native to memray run (that's how I generated my run). That will give you the same information you are getting in the flamegraph you are showing: the same information heaptrack gives you, plus the Python frames instead of _PyEval_EvalFrameDefault.

Ah my bad I completely missed this

Check out https://bloomberg.github.io/memray/run.html#native-tracking

nrs-status commented on June 13, 2024

As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.

To include the C++ frames you need to pass --native to memray run (that's how I generated my run). That will give you the same information you are getting in the flamegraph you are showing: the same information heaptrack gives you, plus the Python frames instead of _PyEval_EvalFrameDefault.

Ah my bad I completely missed this

Check out https://bloomberg.github.io/memray/run.html#native-tracking

I can confirm that my new flamegraph looks a bit more similar to yours, but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, but in my run the torch profiler was doing 99% of allocations by sheer number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared

pablogsal commented on June 13, 2024

but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, but in my run the torch profiler was doing 99% of allocations by sheer number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared

We capture that information, so it's there. You can see it a bit better in other reporters such as memray summary, where you can sort by number of allocations. The flamegraph doesn't allow sizing frames by number of allocations, but we can modify it to allow that.
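
In the meantime, here is a rough, hedged sketch of pulling allocation counts out of a capture file yourself. It assumes memray exposes FileReader, get_high_watermark_allocation_records(), stack_trace() and n_allocations as in recent releases; those names may differ in your version, and "output.bin" is just a placeholder path.

from collections import Counter

from memray import FileReader  # assumed public API; may vary across versions

def allocation_counts_by_function(path):
    # Sum the number of allocations (not bytes) attributed to each
    # innermost stack frame in the high-watermark snapshot.
    counts = Counter()
    records = FileReader(path).get_high_watermark_allocation_records(
        merge_threads=True)
    for record in records:
        trace = record.stack_trace()
        function = trace[0][0] if trace else "<unknown>"
        counts[function] += record.n_allocations
    return counts

for function, n in allocation_counts_by_function("output.bin").most_common(10):
    print(f"{n:>10}  {function}")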

nrs-status commented on June 13, 2024

but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, but in my run the torch profiler was doing 99% of allocations by sheer number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared

We capture that information, so it's there. You can see it a bit better in other reporters such as memray summary, where you can sort by number of allocations. The flamegraph doesn't allow sizing frames by number of allocations, but we can modify it to allow that.

Yeah, thanks for pointing out the --native option, which was the main thing I was missing. Thanks for taking the time!

nrs-status commented on June 13, 2024

Another bit of feedback (I imagine you're pretty busy, so I hope these are at least useful and not just a bother, heheh): using the summary command the allocations indeed show up, but the profiler doesn't appear with the enormous allocation count it shows in the flamegraph when you hover over it. I've updated the .bin file in my shared repo to be the last one I've been reporting on.
