Comments (15)
Super weird, thanks for the detailed troubleshooting information.
Just to clarify: is every generation slower when using invoke.sh, or only the very first generation?
from invokeai.
Every generation.
[edit]
For testing purposes (I have mucked about with ROCm / PyTorch versions) I reinstalled 3.7 with the default automated installation and manually installed Juggernaut XL from Hugging Face in the installer. I experience the exact same behavior. It could be some interaction between the launch script and InvokeAI, but my bash scripting is very rusty.
I added my Discord username in case you need more information or testing done, and joined the InvokeAI Discord.
I reviewed the invoke.sh script, and it appears the only thing it does differently is set this suspicious environment variable:
# Avoid glibc memory fragmentation. See invokeai/backend/model_management/README.md for details.
export MALLOC_MMAP_THRESHOLD_=1048576
Could you try commenting out that line? Could be some unexpected interaction with your OS or hardware.
Some context for this setting.
The referenced note in the README:
On linux, it is recommended to run invokeai with the following env var:
`MALLOC_MMAP_THRESHOLD_=1048576`. For example: `MALLOC_MMAP_THRESHOLD_=1048576 invokeai --web`.
This helps to prevent memory fragmentation that can lead to memory accumulation over time. This
env var is set automatically when running via `invoke.sh`.
Some discussion on the PR that introduced the change: #4784 (comment)
The standard library documentation provides detail:
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from the free list, the memory-allocation functions employ mmap(2) instead of increasing the program break using sbrk(2).
Allocating memory using mmap(2) has the significant advantage that the allocated memory blocks can always be independently released back to the system. (By contrast, the heap can be trimmed only if memory is freed at the top end.) On the other hand, there are some disadvantages to the use of mmap(2): deallocated space is not placed on the free list for reuse by later allocations; memory may be wasted because mmap(2) allocations must be page-aligned; and the kernel must perform the expensive task of zeroing out memory allocated via mmap(2). Balancing these factors leads to a default setting of 128*1024 for the M_MMAP_THRESHOLD parameter.
The lower limit for this parameter is 0. The upper limit is DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems or 4*1024*1024*sizeof(long) on 64-bit systems.
Note: Nowadays, glibc uses a dynamic mmap threshold by default. The initial value of the threshold is 128*1024, but when blocks larger than the current threshold and less than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted upward to the size of the freed block. When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold. Dynamic adjustment of the mmap threshold is disabled if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.
And from the glibc docs, a more digestible description:
Tunable: glibc.malloc.mmap_threshold
This tunable supersedes the MALLOC_MMAP_THRESHOLD_ environment variable and is identical in features.
When this tunable is set, all chunks larger than this value in bytes are allocated outside the normal heap, using the mmap system call. This way it is guaranteed that the memory for these chunks can be returned to the system on free. Note that requests smaller than this threshold might still be allocated via mmap.
If this tunable is not set, the default value is set to '131072' bytes and the threshold is adjusted dynamically to suit the allocation patterns of the program. If the tunable is set, the dynamic adjustment is disabled and the value is set as static.
I just tested it, and this is it. When I comment out that line, launching from invoke.sh is just as fast as launching Invoke manually. As far as I remember, I have not experienced memory accumulation problems.
I'm curious what the underlying issue is, if it's just a special case or more repeatable.
Ok, that's good to hear. We'll need to make this configurable. I'm also very curious about the underlying issue. To be honest, this is a bit outside my comfort zone, and I'm not sure what prompted this env var to be set.
Do you know if your system is using glibc, musl, or some other libc implementation? It could be a difference in behaviour.
If you're not sure, run `ldd /bin/ls`. You should see something like this:
❯ ldd /bin/ls
linux-vdso.so.1 (0x00007fff9073a000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f7a60ce5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a60a00000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f7a60c4e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7a60d54000)
Look for the line that mentions libc, then copy and paste the full path and run that:
❯ /lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.6) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.
Please paste the output of both commands here.
Paging @RyanJDick - I'm in a bit too deep here, can you help us understand what could be going on? I think we'll need to address the problem by adding an arg for it to the invoke.sh script.
linux-vdso.so.1 (0x00007ffc33f5d000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007266f3cf0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007266f3a00000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007266f3c59000)
/lib64/ld-linux-x86-64.so.2 (0x00007266f3d5f000)
GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.6) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.
Reading the documentation about mallopt(), and M_MMAP_THRESHOLD in more detail: nowadays a dynamic threshold is used to balance the advantages and disadvantages. Setting it manually (and too high) might cause the latter disadvantages:
deallocated space is not placed on the free list for reuse by later allocations; memory may be wasted because [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) allocations must be page-aligned; and the kernel must perform the expensive task of zeroing out memory allocated via [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html). Balancing these factors leads to a default setting of 128*1024 for the M_MMAP_THRESHOLD parameter.
Memory allocations have never been my thing, but if the kernel has to traverse a heap structure of sufficient size, even O(log n) can be relatively slow. What I understand is that it's a trade-off between available memory and speed: using mmap, memory from the entire heap can be released back to the system, in contrast to only the top of the heap. That's the difference between O(1) and O(log n), because with a higher threshold the heap structure has to be traversed more, which costs time.
But again, memory allocations are not my thing; I only understand the underlying data structure.
[edit]
So I decided to test the above with a few test cases.
I tried the following settings and watched the dynamic memory consumption of invokeai-web in my resource manager, and indeed what I wrote above seems to be happening.
Case 1: MALLOC_MMAP_THRESHOLD_ = 0
This has the highest memory consumption; InvokeAI releases much less memory back to the system and keeps hanging around 8-9 GB after 5-6 runs. The start of image generation seems to be the fastest here, but not by much. In this case only the top of the heap is released back to the system (if I understand it correctly). Releasing the top of the heap is an O(1) operation.
Case 2: MALLOC_MMAP_THRESHOLD_ at the default dynamic setting
Memory consumption is more dynamic. invokeai-web releases more memory back to the system between image generation runs, but the peak is almost as high as with a value of 0. Almost as fast as a threshold of 0 when starting a new generation. Peak memory consumption seems to be a bit lower than with a threshold of 0.
Case 3: MALLOC_MMAP_THRESHOLD_ = 1048576
Best memory handling; invokeai-web releases more memory back to the system during and between runs. I see large variations in invokeai-web's memory usage, but the peak is not much lower than with the other two settings. However, it takes a lot of time for image generation to start. I suppose in this case the heap is traversed more to release the memory. Traversal is O(log n), and probably more kernel operations as well, as the system wants to find the nodes in the tree that can be released.
There are a few limitations of this 'test'. I only did 5-6 images per setting, but the trend can be seen and replicated on my n=1 system. I would propose running these tests on a few different linux systems to see how repeatable the behavior is.
If so, I would propose leaving the malloc settings at default, because the dynamic setting does a really nice job of balancing. Systems with very little system memory might benefit from a higher setting at the cost of slower generation times.
For completeness, I repeated these tests and recorded the generation time and memory consumption.
Methodology:
I generated 20 1024x1024 images of JuggernautXL v9 using the queue, with the threshold settings from the cases above: 0, the default (dynamic) setting, and the invoke.sh setting of 1048576.
I will disregard the first image of the batch, since that one is always the slowest and has a large variation every time.
Results
Case 1: value 0
Maximum memory usage of invokeai-web is about 9.1 GB. Memory usage increases a bit over the first few runs to this number.
Case 2: the default setting, which according to the documentation is mostly dynamic
Maximum memory usage of invokeai-web is 8.6 GB; memory usage starts a bit lower but after 2-3 images it holds around this value.
Case 3: the invoke.sh setting of 1048576
invokeai-web uses a maximum of 7.7 GB of memory, but consumption is very dynamic, both during a run and between runs. After the last image, invokeai-web still uses 7.7 GB.
I can give a little more context on why this setting was originally added. There's a decent explanation here: https://github.com/invoke-ai/InvokeAI/pull/4784/files#diff-aaa1044287f4abfd0e20c07530ec1d9f226e6b6eed7bdf31a51a6253fdbd5029R3 (not sure if you guys saw this already).
The TL;DR is that if someone has a bunch of models that they are switching between, memory usage will gradually accumulate until an OOM error.
There could be differences in behavior depending on the OS and libc implementation. I don't remember seeing a performance impact on linux with glibc, but I'll double check and post results here shortly.
By the sound of things, this should definitely be configurable, we just need to do a bit more testing to figure out what the recommendations should be.
The delay when I use invoke.sh is only when I hit Invoke: it then takes longer to present the latent noise, and at the end it takes longer to present the final image. Generation itself (in iterations per second) is similar in all conditions. Using invoke.sh I see crazy memory usage fluctuations. Is there a way to record/visualize the actual memory operations of InvokeAI on the heap? I assume the delay I experience is just the tree (heap) traversal to release the memory (zeroing); the mmap documentation does hint that there are performance penalties when setting MALLOC_MMAP_THRESHOLD very high.
I do have some time in the next few days. We could devise a few test cases and try to determine what happens in which condition, to make a better-informed decision. Usually I have the memory cache at 8 GB (out of my 32 GB RAM). I use Ubuntu 22.04 LTS, a Ryzen 5600 on an x470 board with a 7900 XT AMD GPU, and a Samsung 980 PRO NVMe 2 TB SSD. All drivers are up to date.
I just did some testing with the model cache at 16 GB, and yes, there seems to be a very slow memory creep. I start to notice it after 6-7 model changes: I generate 5 images per model, change the model (SDXL variants), and there is a slow upward trend of a few hundred megabytes. But on my system I think I would have to change models for hours before actually running out of memory. This is one of the situations we would write automated test cases for and let them run for hours on end (though it's been a decade since I worked in practice).
Thanks for digging in. I've created #6047 to make this configurable via an arg to invoke.sh.
I'll defer to you two for determining the best default value for this setting. Maybe it's reasonable to leave the default as it is now?
I wanted to test this today, but ended up getting derailed by this performance issue: #6052.
From some preliminary tests, I'm not seeing any slowdown on my system with MALLOC_MMAP_THRESHOLD_=1048576. I'll do some more rigorous testing tomorrow.
I did some more testing this morning. The issue is definitely less evident on my system, but I can see a measurable difference when I reduce my VRAM cache size to force more model copying.
Setup:
- Ubuntu 22.04
- glibc 2.35-0ubuntu3.1
- SDXL, 1024x1024, Euler, 20 steps
Configuration | First Generation | Second Generation
---|---|---
vram: 8.0 | 6.8s | 4.0s
vram: 8.0, MALLOC_MMAP_THRESHOLD_=1048576 | 6.8s | 4.2s
vram: 4.0 | 8.6s | 5.0s
vram: 4.0, MALLOC_MMAP_THRESHOLD_=1048576 | 9.0s | 6.1s
Looking back at when this was originally introduced, I don't think the problem was ever reported by someone running the OSS app on a local workstation. It was mainly intended to address an issue in the hosted version of the app, where we run many models and can see 10s of GBs of memory accumulation caused by fragmentation.
Here is what I propose:
- Remove all MALLOC_MMAP_THRESHOLD_ overrides.
- Add a note to the docs explaining how to set it manually if anyone encounters this issue.
- I'm not sure it's necessary to support it as a flag on invoke.sh (as proposed in #6047) - I suspect very few people will actually want to use it.
@psychedelicious @Adlermannnl Let me know what you think.
I left the vram cache at the default settings, or pretty low, since the installer mentions something along the lines of 'reserving a little vram'.
That might explain the difference:
Model Cache: ram: 16.0 vram: 0.25 lazy_offload: true log_memory_usage: false
Taking a requirements engineering perspective: having different settings for the hosted version and the community version is perfectly acceptable, since they have different usage scenarios. Removing all MALLOC_MMAP overrides for the community edition might be the best default setting. I suspect most people also follow the installation defaults of model cache size.
The way to override it is open for discussion and best framed around a particular scenario: which kind of community-edition user would experience such extreme fragmentation? I don't know whether hosting the community edition for multiple users is allowed (or even possible), but such a user would be capable of reading the docs and setting environment variables.
I don't see a normal user running out of memory because of the fragmentation issue when a single user runs a single instance of Invoke. I think it would require extreme model swapping over an extremely prolonged period of time (or being extremely low on RAM/swap?). These would be edge cases. Either option is fine by me, though passing a parameter is a little more user friendly.
The most important thing is that this issue gets clearly documented and a sensible default is chosen.
Removing the var from the launcher script and clearly documenting it sounds good to me. I'll make a new PR to do that.