Comments (18)
Thanks for opening this issue @nrs-status. I tried using your Docker container but I cannot reproduce your issue. Indeed, memray correctly shows the leak both in the plots and the flamegraph. Take a look:
That shows that the resident memory (and the heap memory) doesn't stop growing.
And in this flamegraph you can see that most memory comes from the PyTorch profiler (torch::autograd::profiler::disableProfiler() at <unknown>:0).
from memray.
Notice that the resident size is much, much bigger than the heap size. Check out the docs to understand why that can happen:
https://bloomberg.github.io/memray/memory.html
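To put numbers on the distinction the docs describe, here is a minimal stdlib-only sketch (my own illustration, not part of memray): tracemalloc sees only heap allocations made through Python's allocator, while getrusage reports the whole process's resident footprint, which can be far larger.

```python
import resource
import tracemalloc

# tracemalloc tracks allocations made through Python's memory allocator
tracemalloc.start()
blocks = [bytearray(1024) for _ in range(1000)]  # roughly 1 MiB on the heap

heap_now, heap_peak = tracemalloc.get_traced_memory()  # bytes
# ru_maxrss is the peak resident set size: KiB on Linux, bytes on macOS
resident_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"tracked heap: {heap_now} bytes, peak resident: {resident_peak}")
tracemalloc.stop()
```

The tracked-heap number only ever reflects live Python-level allocations; the resident number includes everything else the process holds (interpreter, shared libraries, fragmented heap pages), which is why the two curves can diverge so much.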
We will investigate why the resident memory is growing without the heap memory growing.
We did some investigation and this is indeed heap fragmentation. We profiled a single iteration (between two hits of the breakpoint) and memray correctly accounts for every allocation and deallocation that happens. The problem is that internally posix_memalign (which is the allocator that the PyTorch profiler is using underneath) calls brk, and when freeing the pointer it never recedes the heap (the heap always grows; also notice that "heap" here is the brk-based heap). You can check this by running strace -e brk -p PID between iterations:
brk(0xaaab3660f000) = 0xaaab3660f000
brk(0xaaab3678f000) = 0xaaab3678f000
brk(0xaaab3690f000) = 0xaaab3690f000
brk(0xaaab36a8f000) = 0xaaab36a8f000
brk(0xaaab36c0f000) = 0xaaab36c0f000
brk(0xaaab36d8f000) = 0xaaab36d8f000
brk(0xaaab36f0f000) = 0xaaab36f0f000
brk(0xaaab3714f000) = 0xaaab3714f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab37e0f000) = 0xaaab37e0f000
brk(0xaaab3840f000) = 0xaaab3840f000
brk(0xaaab3720f000) = 0xaaab3720f000
brk(0xaaab3744f000) = 0xaaab3744f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab3768f000) = 0xaaab3768f000
brk(0xaaab3780f000) = 0xaaab3780f000
brk(0xaaab3798f000) = 0xaaab3798f000
brk(0xaaab37b0f000) = 0xaaab37b0f000
brk(0xaaab37c8f000) = 0xaaab37c8f000
brk(0xaaab37ecf000) = 0xaaab37ecf000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38b90000) = 0xaaab38b90000
brk(0xaaab39190000) = 0xaaab39190000
brk(0xaaab37e10000) = 0xaaab37e10000
brk(0xaaab38050000) = 0xaaab38050000
brk(0xaaab38110000) = 0xaaab38110000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38410000) = 0xaaab38410000
brk(0xaaab38590000) = 0xaaab38590000
brk(0xaaab38710000) = 0xaaab38710000
brk(0xaaab38890000) = 0xaaab38890000
brk(0xaaab38a10000) = 0xaaab38a10000
brk(0xaaab38c50000) = 0xaaab38c50000
brk(0xaaab39310000) = 0xaaab39310000
brk(0xaaab39910000) = 0xaaab39910000
brk(0xaaab39a90000) = 0xaaab39a90000
brk(0xaaab3a390000) = 0xaaab3a390000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3b110000) = 0xaaab3b110000
brk(0xaaab3b710000) = 0xaaab3b710000
brk(0xaaab3a210000) = 0xaaab3a210000
brk(0xaaab3a450000) = 0xaaab3a450000
brk(0xaaab3a510000) = 0xaaab3a510000
brk(0xaaab3a690000) = 0xaaab3a690000
brk(0xaaab3a810000) = 0xaaab3a810000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3ab10000) = 0xaaab3ab10000
brk(0xaaab3ac90000) = 0xaaab3ac90000
brk(0xaaab3aed0000) = 0xaaab3aed0000
brk(0xaaab3b590000) = 0xaaab3b590000
brk(0xaaab3bb90000) = 0xaaab3bb90000
As you can see, the heap pointer always grows. Notice that brk being called is an implementation detail of glibc when it calls posix_memalign, and what's being "leaked" is resident size (the actual allocation is being freed later, and memray doesn't complain about it). The problem is that this leaves the heap fragmented. You can read more about this here:
https://bloomberg.github.io/memray/memory.html#memory-can-be-fragmented
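What the strace log shows can be sketched as a toy model (entirely my own simplification; real glibc reuses freed holes where it can, but the break cannot move below any still-live block): every allocation below is matched by a free, yet the simulated break only climbs, so the freed space stays inside the process as fragmentation instead of returning to the OS.

```python
# Toy model of a brk-style heap (my own illustration, not glibc's
# actual algorithm): the program break only ever moves up here.

class BrkHeap:
    def __init__(self):
        self.brk = 0    # simulated program break (top of the heap)
        self.live = {}  # offset -> size of live allocations

    def alloc(self, size):
        offset = self.brk
        self.brk += size       # bump the break; no hole reuse in this model
        self.live[offset] = size
        return offset

    def free(self, offset):
        del self.live[offset]  # the break does NOT move back down

heap = BrkHeap()
ptrs = [heap.alloc(4096) for _ in range(10)]
for p in ptrs:
    heap.free(p)

in_use = sum(heap.live.values())
print(f"live bytes: {in_use}, heap top: {heap.brk}")  # 0 live, top at 40960
```

Nothing is leaked in the allocator's accounting (live bytes end at zero), which matches memray's view; the growth is all in where the break ended up.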
I am closing the issue, as this is not a problem in memray.
Thank you very much for taking the time; I wasn't making proper use of the heap vs. resident distinction. I realize this might not be a memray issue, but I'm sharing the extra info in case it is of any use. I got somewhat lost in my debugging attempts and failed to notice that the flamegraph in the OP reported the existence of the profiler. However, if you run the test again after cloning my repo https://github.com/nrs-status/shared (I've included the .bin this time) and building the Dockerfile there, the report instead looks like the next image. As you can see, there would have been no way to infer that the problem was mainly due to the PyTorch profiler.
Sorry for not giving the proper setup; I was doing some debugging on my own and stopped midway to make the Docker image. Also, thanks for the explanation of how the PyTorch profiler allocates memory.
As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!
I modified memray to show in the plot the memory that's fragmented (taken from calling mallinfo2()):
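For anyone who wants to see those same numbers without patching memray, here is roughly how mallinfo2 can be read from Python via ctypes. This is a sketch that assumes glibc >= 2.33 (where mallinfo2 exists); the struct layout mirrors glibc's struct mallinfo2, and heap_stats is my own helper name. uordblks is heap memory in use and fordblks is freed-but-retained heap, i.e. the fragmented portion.

```python
import ctypes
import ctypes.util

class Mallinfo2(ctypes.Structure):
    # Field layout of glibc's struct mallinfo2 (glibc >= 2.33); all size_t.
    _fields_ = [(name, ctypes.c_size_t) for name in
                ("arena", "ordblks", "smblks", "hblks", "hblkhd",
                 "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

def heap_stats():
    """Return glibc heap stats, or None where mallinfo2 is unavailable."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        libc.mallinfo2.restype = Mallinfo2
    except (OSError, AttributeError):
        return None  # non-glibc platform (macOS, musl, Windows, old glibc)
    info = libc.mallinfo2()
    return {"in_use": info.uordblks, "free_in_heap": info.fordblks}

print(heap_stats())
```

Calling heap_stats() before and after an iteration and watching free_in_heap grow while in_use stays flat is exactly the fragmentation signature discussed above.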
> As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!

I think we can do something like I showed in the previous plot. It's still unclear how someone could fix things with this info, but at least you can see what's happening.
Very cool modification to memray! On my side, I've continued playing with this issue for a bit, this time using heaptrack and profiling CPython itself. Here's the result of an allocations flamegraph:
The above makes it evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program. Maybe this could help diagnose fragmentation issues with memray? I see we already have those numbers available on the current flamegraphs when hovering with the mouse.
> The above makes evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program.

Isn't that more or less the same information as in the flamegraphs in this comment: #565 (comment)?

> Maybe this could help diagnose fragmentation issues with memray? I'm seeing we already have those numbers available on the current flamegraphs when we hover with mouse

What numbers are you referring to? I guess I am failing to see what information from the heaptrack flamegraph is helping you here.
With respect to the difference with your flamegraphs: the run you benchmarked indeed suggested that the profiler might be related to the memory issue. But take a look at my first flamegraph screenshot:
As you can see, it is totally impossible, using only total memory allocated by size, to see that the torch profiler is in any way related to the problem at hand. I'm dealing with a similar situation in my current benchmark: by total memory allocated, the torch profiler uses at most 0 to 20 MB out of a max usage of 700 MB, so it would be totally invisible by this metric alone. Yet if we look at total allocations not by size but by the simple number of times an allocation has been made, it represents a staggering 42% of all allocations (I can send you the Dockerfile with the heaptrack setup if you want).
With respect to the numbers I'm referring to: the current memray flamegraph shows the number of allocations when you hover over the components of the flamegraph.
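The effect described here (huge by count, tiny by size) is easy to reproduce with made-up numbers. The site names and sizes below are entirely hypothetical, just to show why a count-based view surfaces what a size-based flamegraph hides:

```python
# Made-up allocation records: a site making many tiny allocations
# dominates the allocation count while staying nearly invisible by size.
from collections import Counter

records = (
    [("torch_profiler", 64)] * 9_000       # many tiny allocations
    + [("model_weights", 50_000_000)] * 3  # a few huge ones
)

bytes_by_site = Counter()
count_by_site = Counter()
for site, size in records:
    bytes_by_site[site] += size
    count_by_site[site] += 1

count_share = count_by_site["torch_profiler"] / len(records)
size_share = bytes_by_site["torch_profiler"] / sum(bytes_by_site.values())
print(f"profiler by count: {count_share:.1%}, by size: {size_share:.2%}")
```

Sorting by bytes buries the profiler at the bottom of the table; sorting by count puts it first, which is the distinction being argued for in this thread.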
> As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.

To include C++ code you need to pass --native to memray run (that's how I generated my run). That will allow you to get the same information you are getting in the flamegraph you are showing. It will include the same information as heaptrack is giving you, plus the Python frames instead of _PyEval_EvalFrameDefault.
> Yet, if we look at the total allocations not by size but by simple amount of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the heaptrack setup if you want)

Ah, this is an interesting point.
> To include C++ code you need to pass --native to memray run (that's how I generated my run).

Ah, my bad, I completely missed this.
> Ah, my bad, I completely missed this.

Check out https://bloomberg.github.io/memray/run.html#native-tracking
> Check out https://bloomberg.github.io/memray/run.html#native-tracking

Can confirm that my new flamegraph looks a bit more similar to yours, but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, yet in my run the torch profiler was doing 99% of allocations by simple number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared
> the profiler has a max usage of about 30% of the max allocation by size, yet in my run the torch profiler was doing 99% of allocations by simple number of allocations

We capture that information, so it's there. You can get at it a bit better in other reporters such as memray summary, where you can sort by number of allocations. The flamegraph doesn't allow sizing by number of allocations, but we can modify it to allow that.
> You can get at it a bit better in other reporters such as memray summary, where you can sort by number of allocations.

Yeah, thanks for pointing out the --native option, which was the main thing I was missing, and for taking the time!
Another bit of feedback (I imagine you're pretty busy, so I hope these are at least useful and not just a bother heheh): using the summary command, the allocations indeed show up, but not the profiler (it appears, but without the enormous allocation count it has in the flamegraph when you hover over it). I've updated the .bin file in my shared repo to be the last one I've been reporting on.