
Comments (15)

psychedelicious commented on September 23, 2024

Super weird, thanks for the detailed troubleshooting information.

Just to clarify: is every generation slower when using invoke.sh, or only the very first one?

Adlermannnl commented on September 23, 2024

Every generation.
[edit]
For testing purposes (I have mucked about with ROCm / PyTorch versions), I reinstalled 3.7 with the default automated installation and manually installed Juggernaut XL from Hugging Face via the installer. I experience exactly the same behavior. It could be some interaction between the launch script and InvokeAI, but my bash scripting is very rusty.

I've joined the InvokeAI Discord and added my Discord username, in case you need more information or testing done.

psychedelicious commented on September 23, 2024

@Adlermannnl

I reviewed the invoke.sh script and it appears the only thing it does differently is set this suspicious environment variable:

# Avoid glibc memory fragmentation. See invokeai/backend/model_management/README.md for details.
export MALLOC_MMAP_THRESHOLD_=1048576

Could you try commenting out that line? Could be some unexpected interaction with your OS or hardware.
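You can also A/B test this without editing the script; a minimal sketch, using the `invokeai --web` entry point mentioned in the README:

```sh
# Launch with the override that invoke.sh sets (1048576 bytes = 1024*1024
# = 1 MiB, vs. glibc's default threshold of 128 KiB):
MALLOC_MMAP_THRESHOLD_=1048576 invokeai --web

# Launch without it, letting glibc use its dynamic threshold:
invokeai --web
```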


Some context for this setting.

The referenced note in the README:

On linux, it is recommended to run invokeai with the following env var:
`MALLOC_MMAP_THRESHOLD_=1048576`. For example: `MALLOC_MMAP_THRESHOLD_=1048576 invokeai --web`.
This helps to prevent memory fragmentation that can lead to memory accumulation over time. This
env var is set automatically when running via `invoke.sh`.

Some discussion on the PR that introduced the change: #4784 (comment)

The standard library documentation (mallopt(3)) provides more detail:

M_MMAP_THRESHOLD

For allocations greater than or equal to the limit
specified (in bytes) by M_MMAP_THRESHOLD that can't be
satisfied from the free list, the memory-allocation
functions employ mmap(2) instead of increasing the program
break using sbrk(2).

Allocating memory using mmap(2) has the significant
advantage that the allocated memory blocks can always be
independently released back to the system. (By contrast,
the heap can be trimmed only if memory is freed at the top
end.) On the other hand, there are some disadvantages to
the use of mmap(2): deallocated space is not placed on the
free list for reuse by later allocations; memory may be
wasted because mmap(2) allocations must be page-aligned;
and the kernel must perform the expensive task of zeroing
out memory allocated via mmap(2). Balancing these factors
leads to a default setting of 128*1024 for the
M_MMAP_THRESHOLD parameter.

The lower limit for this parameter is 0. The upper limit
is DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems
or 4*1024*1024*sizeof(long) on 64-bit systems.

Note: Nowadays, glibc uses a dynamic mmap threshold by
default. The initial value of the threshold is 128*1024,
but when blocks larger than the current threshold and less
than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the
threshold is adjusted upward to the size of the freed
block. When dynamic mmap thresholding is in effect, the
threshold for trimming the heap is also dynamically
adjusted to be twice the dynamic mmap threshold. Dynamic
adjustment of the mmap threshold is disabled if any of the
M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or
M_MMAP_MAX parameters is set.

And from the glibc manual, a more digestible description:

Tunable: glibc.malloc.mmap_threshold

This tunable supersedes the MALLOC_MMAP_THRESHOLD_ environment variable and is identical in features.

When this tunable is set, all chunks larger than this value in bytes are allocated outside the normal heap, using the mmap system call. This way it is guaranteed that the memory for these chunks can be returned to the system on free. Note that requests smaller than this threshold might still be allocated via mmap.

If this tunable is not set, the default value is set to ‘131072’ bytes and the threshold is adjusted dynamically to suit the allocation patterns of the program. If the tunable is set, the dynamic adjustment is disabled and the value is set as static.
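In practice this means there are two equivalent ways to pin the threshold from the shell, both of which disable the dynamic adjustment (a sketch):

```sh
# Legacy environment variable (this is what invoke.sh sets):
MALLOC_MMAP_THRESHOLD_=1048576 invokeai --web

# Newer tunables interface, per the glibc manual:
GLIBC_TUNABLES=glibc.malloc.mmap_threshold=1048576 invokeai --web
```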

Adlermannnl commented on September 23, 2024

I just tested it, and that's it: when I comment that line out, launching from invoke.sh is just as fast as launching invoke manually. As far as I remember, I have not experienced memory accumulation problems.
I'm curious what the underlying issue is, and whether it's a special case or more widely reproducible.

psychedelicious commented on September 23, 2024

Ok, that's good to hear. We'll need to make this configurable. I'm also very curious about the underlying issue. To be honest, this is a bit outside my comfort zone, and I'm not sure what prompted this env var to be set.

Do you know if your system is using glibc, musl, or some other libc implementation? It could be a difference in behaviour.

If you're not sure, run ldd /bin/ls. You should see something like this:

❯ ldd /bin/ls
	linux-vdso.so.1 (0x00007fff9073a000)
	libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f7a60ce5000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a60a00000)
	libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f7a60c4e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7a60d54000)

Look for the line that mentions libc, then copy and paste the full path and run that:

❯ /lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.6) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.

Please paste the output of both commands here.

psychedelicious commented on September 23, 2024

Paging @RyanJDick - I'm in a bit too deep here, can you help us understand what could be going on? I think we'll need to address the problem by adding an arg for it to the invoke.sh script.

Adlermannnl commented on September 23, 2024

linux-vdso.so.1 (0x00007ffc33f5d000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007266f3cf0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007266f3a00000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007266f3c59000)
/lib64/ld-linux-x86-64.so.2 (0x00007266f3d5f000)


GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.6) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.4.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.

Reading the documentation about mallopt(), and M_MMAP_THRESHOLD in more detail: nowadays a dynamic threshold is used to balance the advantages and disadvantages. Setting it manually (and too high) might bring on the latter set of disadvantages:


deallocated space is not placed on the free list for reuse by later allocations; memory may be wasted because [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) allocations must be page-aligned; and the kernel must perform the expensive task of zeroing out memory allocated via [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html). Balancing these factors leads to a default setting of 128*1024 for the M_MMAP_THRESHOLD parameter.

Memory allocations have never been my thing, but if the allocator has to traverse a heap structure of sufficient size, even O(log n) can be relatively slow. What I understand is that it's a trade-off between available memory and speed: using mmap, memory from the entire heap can be released back to the system, in contrast to just the top of the heap. That's the difference between O(1) and O(log n); with a higher threshold the heap structure has to be traversed more, which costs time.

But again, memory allocations are not my thing; I only understand the underlying data structure.
[edit]
So I decided to test the above with a few test cases.
I tried the following settings and watched the dynamic memory consumption of invokeai-web in my resource manager, and indeed what I wrote above seems to be happening.

Case 1: MALLOC_MMAP_THRESHOLD_ of 0

This has the highest memory consumption: invokeai releases much less memory back to the system and keeps hovering around 8-9 GB after 5-6 runs. The start of image generation seems to be fastest here, but not by much. In this case only the top of the heap is released back to the system (if I understand it correctly). Releasing the top of the heap is an O(1) operation.

Case 2: MALLOC_MMAP_THRESHOLD_ at the default (dynamic) setting

Memory consumption is more dynamic: invokeai-web releases more memory back to the system between image generation runs, but the peak is almost as high as with a value of 0, and starting a new generation is almost as fast. Peak memory consumption seems to be a bit lower than with a threshold of 0.

Case 3: MALLOC_MMAP_THRESHOLD_ of 1048576

Best memory handling: invokeai-web releases more memory back to the system during and between runs, and I see large variations in its memory usage, but the peak is not much lower than with the other two settings. However, it takes a lot of time for image generation to start. I suppose in this case the heap is traversed more to release the memory; traversal is O(log n), and probably costs the kernel more operations as well, since the system has to find the nodes in the tree that can be released.

There are a few limitations to this 'test'. I only did 5-6 images per setting, but the trend can be seen and replicated on my n=1 system. I would propose running these tests on a few different Linux systems to see how repeatable the behavior is; a minimal harness for that is sketched below.

If it holds up, I would propose leaving the malloc settings at their defaults, because the dynamic setting does a really nice job of balancing. Systems with very little RAM might benefit from a higher setting, at the cost of slower generation times.
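A minimal sketch of such a harness (assumptions: the server is started with `invokeai --web` as in the README; the thresholds and sample count are illustrative):

```sh
#!/usr/bin/env bash
# Start invokeai under each mmap threshold and sample its resident set size
# (RSS) once per second, so runs on different systems can be compared.
# Run once more without the variable to capture glibc's dynamic default.
for thr in 0 131072 1048576; do
  MALLOC_MMAP_THRESHOLD_=$thr invokeai --web &
  pid=$!
  for _ in $(seq 1 600); do
    # VmRSS is the kernel's view of the process's resident memory
    echo "$(date +%T) $(grep VmRSS /proc/$pid/status)" >> "rss_${thr}.log"
    sleep 1
  done
  kill "$pid"
done
```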

Adlermannnl commented on September 23, 2024

For completeness I repeated these tests and recorded the generation time and memory consumption.
Methodology:
I generated 20 1024x1024 images with JuggernautXL v9 using the queue, under the threshold settings from the cases above: 0, the default (dynamic) setting, and the invoke.sh setting of 1048576.
I disregard the first image of each batch, since that one is always the slowest and varies a lot between runs.

Results
Case 1: value 0
Maximum memory usage of invokeai-web is about 9.1 GB; usage increases a bit over the first few runs to reach this number.
[memory usage screenshot]

Case 2: the default setting, which according to the documentation is mostly dynamic
Maximum memory usage of invokeai-web is 8.6 GB; usage starts a bit lower, but after 2-3 images it holds around this value.
[memory usage screenshot]

Case 3: invoke.sh setting of 1048576
invokeai-web uses a maximum of 7.7 GB of memory, but usage fluctuates heavily during and between runs. After the last image, invokeai-web still uses 7.7 GB.

[memory usage screenshot]

RyanJDick commented on September 23, 2024

I can give a little more context on why this setting was originally added. There's a decent explanation here: https://github.com/invoke-ai/InvokeAI/pull/4784/files#diff-aaa1044287f4abfd0e20c07530ec1d9f226e6b6eed7bdf31a51a6253fdbd5029R3 (not sure if you guys saw this already).

The TL;DR is that if someone has a bunch of models that they are switching between, memory usage will gradually accumulate until an OOM error occurs.

There could be differences in behavior depending on the OS and libc implementation. I don't remember seeing a performance impact on Linux with glibc, but I'll double-check and post results here shortly.

By the sound of things, this should definitely be configurable; we just need to do a bit more testing to figure out what the recommendations should be.

Adlermannnl commented on September 23, 2024

The delay when I use invoke.sh occurs only when I hit Invoke: it takes longer to present the latent noise, and at the end it takes longer to present the final image. Generation itself (in iterations per second) is similar in all conditions. Using invoke.sh I see wild memory usage fluctuations. Is there a way to record/visualize the actual memory operations of invokeai on the heap? I assume the delay I experience is just the tree (heap) traversal to release the memory (zeroing); the mmap documentation does hint that there are performance penalties when MALLOC_MMAP_THRESHOLD is set very high.
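One way to at least watch the allocator's interactions with the kernel from the outside (a sketch; assumes strace is installed and the server process is named invokeai-web):

```sh
# Tally the mmap/munmap/brk syscalls of a running invokeai-web process.
# Attach before hitting Invoke; detach with Ctrl-C to print the summary.
sudo strace -f -c -e trace=mmap,munmap,brk -p "$(pgrep -f invokeai-web)"
```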

I do have some time over the next few days. We could devise a few test cases and try to determine what happens under which condition, to make a better-informed decision. I usually have the model cache at 8 GB (out of my 32 GB RAM). I'm on Ubuntu 22.04 LTS with a Ryzen 5600 on an X470 board, a 7900 XT (amdgpu), and a Samsung 980 PRO 2 TB NVMe SSD. All drivers are up to date.

I just did some testing with the model cache at 16 GB, and yes, there seems to be a very slow memory creep. I start to notice it after 6-7 model changes: I generate 5 images per model, then change the model (SDXL variants), and there is a slow upward trend of a few hundred megabytes. But on my system I think I would have to change models for hours to actually run out of memory. This is one of the situations we would write automated test cases for and let run for hours on end (and that was a decade ago, when I still worked in practice); a sketch follows below.
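A sketch of what such a soak test could look like; note that `invoke_generate` and the model names are placeholders, not real InvokeAI commands, since the swapping would have to be driven through the UI or whatever client fits your setup:

```sh
#!/usr/bin/env bash
# Hypothetical soak test: rotate through SDXL variants for many iterations
# and log RSS after each batch, to catch slow memory creep unattended.
models=(juggernautXL dreamshaperXL realvisXL)
for i in $(seq 1 1000); do
  m=${models[$((i % 3))]}
  invoke_generate --model "$m" --count 5   # hypothetical client call
  echo "$(date -Is) model=$m $(grep VmRSS /proc/"$(pgrep -f invokeai-web)"/status)" >> soak.log
done
```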

psychedelicious commented on September 23, 2024

Thanks for digging in. I've created #6047 to make this configurable via an arg to invoke.sh.

I'll defer to you two for determining the best default value for this setting. Maybe it's reasonable to leave the default as it is now?

RyanJDick commented on September 23, 2024

I wanted to test this today, but ended up getting derailed by this performance issue: #6052.

From some preliminary tests, I'm not seeing any slowdown on my system with MALLOC_MMAP_THRESHOLD_=1048576. I'll do some more rigorous testing tomorrow.

RyanJDick commented on September 23, 2024

I did some more testing this morning. The issue is definitely less evident on my system, but I can see a measurable difference when I reduce my VRAM cache size to force more model copying.

Setup:

  • Ubuntu 22.04
  • glibc 2.35-0ubuntu3.1
  • SDXL, 1024x1024, Euler, 20 steps
| Configuration | First Generation | Second Generation |
| --- | --- | --- |
| vram: 8.0 | 6.8s | 4.0s |
| vram: 8.0, MALLOC_MMAP_THRESHOLD_=1048576 | 6.8s | 4.2s |
| vram: 4.0 | 8.6s | 5.0s |
| vram: 4.0, MALLOC_MMAP_THRESHOLD_=1048576 | 9.0s | 6.1s |

Looking back at when this was originally introduced, I don't think the problem was ever reported by someone running the OSS app on a local workstation. It was mainly intended to address an issue in the hosted version of the app, where we run many models and can see tens of GB of memory accumulation caused by fragmentation.

Here is what I propose:

  • Remove all MALLOC_MMAP_THRESHOLD_ overrides.
  • Add a note to the docs to explain how to set it manually if anyone is encountering this issue.
  • I'm not sure if it's necessary to support it as a flag on invoke.sh (as proposed in #6047) - I suspect that very few people will actually want to use it.

@psychedelicious @Adlermannnl Let me know what you think.

Adlermannnl commented on September 23, 2024

I left the VRAM cache at the default settings, which are pretty low, since the installer mentions something along the lines of 'reserving a little VRAM'.
That might explain the difference:
Model Cache: ram: 16.0 vram: 0.25 lazy_offload: true log_memory_usage: false

Taking a requirements-engineering perspective: having different settings for the hosted version and the community version is perfectly acceptable, since they have different usage scenarios. Removing all MALLOC_MMAP overrides for the community edition might be the best default setting; I suspect most people also keep the installer's defaults for the model cache size.

How to override it is open for discussion, and is best framed as a concrete scenario: which kind of community-edition user would experience such extreme fragmentation? I don't know whether it is allowed (or even possible) to host the community edition for multiple users, but such a user would be capable of reading the docs and setting environment variables.

I don't see a normal user running out of memory because of the fragmentation issue when a single user runs a single instance of Invoke. I think it would require extreme model swapping for an extremely prolonged period of time (or being extremely low on RAM/swap?). But these would be edge cases. Either option is fine by me, though passing a parameter is a little more user-friendly.

The most important thing is that this issue gets clearly documented and that a sensible default setting is chosen.

psychedelicious commented on September 23, 2024

Removing the var from the launcher script and clearly documenting it sounds good to me. I'll make a new PR to do that.
