Comments (18)
There is a memory issue that only affects Python 3.12, due to a bug in that Python version: python/cpython#119118. I believe this is the issue affecting you.
There's a test you can run to check whether this is the issue we think it is. In coveragepy/coverage/phystokens.py, change the line:
return list(tokenize.generate_tokens(readline))
To:
return tokenize.generate_tokens(readline)
It should lessen the memory impact, if it's the same issue.
from coveragepy.
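For context on why that one-line change helps: `tokenize.generate_tokens` returns a lazy generator, so without the `list()` call tokens are produced one at a time instead of all being held in memory at once. A minimal illustration (not coverage.py code):

```python
import io
import tokenize

src = "x = 1\ny = 2\n"

# Lazy: tokens are produced one at a time as the consumer asks for them.
lazy = tokenize.generate_tokens(io.StringIO(src).readline)
first = next(lazy)  # only one TokenInfo has been produced so far
print(first.string)  # -> "x"

# Eager: list() materializes every TokenInfo (each carrying the text of
# its source line) in memory at once.
eager = list(tokenize.generate_tokens(io.StringIO(src).readline))
print(len(eager))  # -> 9 (NAME, OP, NUMBER, NEWLINE per line, plus ENDMARKER)
```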
This is now released as part of coverage 7.5.3.
Any further tips?
As I asked on that issue, I need clear complete instructions for how to reproduce the issue. No one has mentioned what version of coverage.py they are using. I don't know how to make the problem appear so that I can debug it. If you know of a way for me to do that, please provide the instructions.
@nedbat, before closing the issue I posted your request.
For the instructions, I've enclosed all I knew in here, here and here. All information can be found in the previous issue.
I'm sorry, that still doesn't give me a way to reproduce the issue. I'm available under contract if you need me to sign an NDA to see your private repo.
Sounds like #1785 to me: affects Python 3.12, eats up a lot of memory.
@0x78f1935, does your code happen to have very long lines in it?
I'm sorry, that still doesn't give me a way to reproduce the issue. I'm available under contract if you need me to sign an NDA to see your private repo.
I'm sorry, I cannot do that.
Although @sebastian-muthwill mentioned he was able to reproduce the issue I encountered.
Sounds like #1785 to me: affects Python 3.12, eats up a lot of memory.
@0x78f1935, does your code happen to have very long lines in it?
No, my unit tests do not use a large mock data set. That said, I also run Flake8 across my codebase.
@nedbat I'll take the discussion up here.
I tried yesterday to reproduce the issue with a clean Django project using the example from the Django docs, and everything went well.
My project is much bigger and is failing.
I'll try to reproduce the issue and come back.
My project is much bigger and is failing.
I can confirm that this happens to me with big project(s) too.
First, I thought maybe my project was the issue, but now that @sebastian-muthwill was able to reproduce the issue, I'm more confident we are onto something. I would love to be of help; unfortunately, I'm unable to share my codebase. Just like Sebastian's codebase, mine is big.
I took some time to take my codebase apart and basically started removing all components I didn't want to share so I could make a repository that reproduces the issue. After removing the frontend and Docker environments, I was left with my tests and my backend.
I trimmed my tests to a total of one test and was still able to reproduce the memory leak. The moment I started touching endpoints in my backend (which I don't want to share), things started to work, which is why I'm a bit uncertain. The strange part is that it doesn't really matter which component I turn on or off in any order. When they are turned on, there is a memory leak; when they are turned off, everything is okay (though there is no data to report), which might explain why that approach appears successful. This makes it difficult to pinpoint a specific component.
I thought to myself, that isn't very helpful. Surely I can find someone's existing project and transform it into a reproducible environment, but so far no luck with reproduction. I tried various things; I even tried my own project.toml file. No luck.
I hate to say it, but the best workaround I have right now, if you really need your coverage, is to downgrade to Python 3.11 or skip generating reports altogether.
@devdanzin Thanks for pointing that out.
@nedbat I just tested it and can confirm that the issue is gone when setting the return as @devdanzin described.
#return list(tokenize.generate_tokens(readline)) # <-- memory leakage happening
return tokenize.generate_tokens(readline) # <-- no issue
Since it is not reproducible with a fresh installation, I could imagine it maybe has something to do with the number of tests run?
So far, we had only seen this issue manifest on files with very long (many thousands of characters) lines, where memory ballooned very fast and resulted in a MemoryError. For those, removing the call to list() avoided the memory issue but still resulted in very slow (many minutes versus a couple of seconds) execution. Are you seeing a significant slowdown between 3.11 and 3.12?
Since the problem comes from a Python 3.12 bug, it seemed OK to keep the list() call, as those files were still unusable without it and a fix is being worked on.
However, given that your case indicates the list() call is detrimental even for non-pathological files, I believe removing it is the right thing to do. @nedbat?
Without having measured it, I would say it was no slower than on 3.11. However, I only have 53 tests, so the impact may not be that noticeable.
Thinking about it, the use of @functools.lru_cache(maxsize=100) is probably what causes the leak: the lists use a lot more memory on 3.12 compared to 3.11 (because each Token keeps a new copy of the text of the line it's in), but they should be garbage collected after being used. The cache keeps them around, inflating memory usage proportionally to the number and length of the files in the report.
Which raises the question: is removing the list() call correct given the presence of the cache? Wouldn't we be caching an exhausted iterator in that case?
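That concern is real: functools.lru_cache stores the return value, and a generator can only be consumed once, so a cache hit would hand back an already-exhausted iterator. A standalone sketch of the failure mode (not the actual coverage.py code):

```python
import functools
import io
import tokenize

@functools.lru_cache(maxsize=100)
def generate_tokens(text):
    """Cached, but returns a generator instead of a list."""
    readline = io.StringIO(text).readline
    return tokenize.generate_tokens(readline)

first = list(generate_tokens("a = 1\n"))   # consumes the generator
second = list(generate_tokens("a = 1\n"))  # cache hit: same, now-exhausted generator
print(len(first), len(second))  # -> 5 0
```

So with the cache in place, every repeated tokenization of the same file would silently see zero tokens.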
Confirmed that just removing the list() call causes many test failures. Removing both that call and the cache makes the tests pass again.
Given that tokenizing seems much faster on 3.12 compared to 3.9 (but uses a lot more memory), maybe we could define coverage.phystokens.generate_tokens to be cached and call list() on versions below 3.12, with no cache and no list() call on 3.12 and above?
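A sketch of what that version-dependent definition could look like (hypothetical and simplified, with the type annotations dropped; not the change that actually landed):

```python
import functools
import io
import sys
import tokenize

if sys.version_info >= (3, 12):
    # 3.12 tokenizes quickly but each token carries a copy of its line,
    # so skip both the cache and the list() materialization.
    def generate_tokens(text):
        readline = io.StringIO(text).readline
        return tokenize.generate_tokens(readline)
else:
    # Older versions tokenize more slowly, so cache the full token list.
    @functools.lru_cache(maxsize=100)
    def generate_tokens(text):
        readline = io.StringIO(text).readline
        return list(tokenize.generate_tokens(readline))

tokens = list(generate_tokens("a = 1\n"))
print(len(tokens))  # -> 5 on either branch
```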
There's a test you can run to check whether this is the issue we think it is. In coveragepy/coverage/phystokens.py, change the line:
return list(tokenize.generate_tokens(readline))
To:
return tokenize.generate_tokens(readline)
It should lessen the memory impact, if it's the same issue.
I've tested out your theory.
@functools.lru_cache(maxsize=100)
def generate_tokens(text: str) -> TokenInfos:
    """A cached version of `tokenize.generate_tokens`.

    When reporting, coverage.py tokenizes files twice, once to find the
    structure of the file, and once to syntax-color it. Tokenizing is
    expensive, and easily cached.

    Unfortunately, the HTML report code tokenizes all the files the first time
    before then tokenizing them a second time, so we cache many. Ideally we'd
    rearrange the code to tokenize each file twice before moving onto the next.
    """
    readline = io.StringIO(text).readline
    # return list(tokenize.generate_tokens(readline))
    return tokenize.generate_tokens(readline)
The behaviour at the end changed drastically:
- My memory usage is stable around 20-25%, about 270MB.
- My tests look to be stuck at 100%, but as you mentioned, it just takes a bit longer now.
- My tests finish and I have my HTML report (after an awfully long time; I had to wait more than 10 minutes).
I think your issue might be related, yes!
My current run command looks like pytest --exitfirst -vs --junitxml htmlcov/pytest.xml --cov --cov-report html
and consists of 42 tests.
Since it is not reproduceable with a fresh installation, I could imagine it has maybe something to do with the amount of tests run?
I don't think that is the case; while I was stripping my project down to a reproducible version, I was able to run into this issue with just one test.
are you seeing a significant slow down between 3.11 and 3.12?
I would say 3.11 is significantly faster in comparison. Which is weird, because wasn't 3.12 supposed to be 40% faster?
With 3.11 you can get yourself a coffee; with 3.12 you can get yourself a coffee, fetch a French baguette in France, and return safely home. It takes a very long time.
Thinking about it, the use of @functools.lru_cache(maxsize=100) is probably what causes the leak:
Let's test it:
# @functools.lru_cache(maxsize=100)
def generate_tokens(text: str) -> TokenInfos:
    """A cached version of `tokenize.generate_tokens`.

    When reporting, coverage.py tokenizes files twice, once to find the
    structure of the file, and once to syntax-color it. Tokenizing is
    expensive, and easily cached.

    Unfortunately, the HTML report code tokenizes all the files the first time
    before then tokenizing them a second time, so we cache many. Ideally we'd
    rearrange the code to tokenize each file twice before moving onto the next.
    """
    readline = io.StringIO(text).readline
    return list(tokenize.generate_tokens(readline))
    # return tokenize.generate_tokens(readline)
This causes the same issue to arise: a memory leak.
I'll try without list() for good measure.
# @functools.lru_cache(maxsize=100)
def generate_tokens(text: str) -> TokenInfos:
    """A cached version of `tokenize.generate_tokens`.

    When reporting, coverage.py tokenizes files twice, once to find the
    structure of the file, and once to syntax-color it. Tokenizing is
    expensive, and easily cached.

    Unfortunately, the HTML report code tokenizes all the files the first time
    before then tokenizing them a second time, so we cache many. Ideally we'd
    rearrange the code to tokenize each file twice before moving onto the next.
    """
    readline = io.StringIO(text).readline
    # return list(tokenize.generate_tokens(readline))
    return tokenize.generate_tokens(readline)
This seems to work, but it doesn't increase the speed. Just like the first attempt, it's stuck at 100%, memory is idle, and it takes a couple of minutes before HTML files start appearing. But it takes ages to complete.
@devdanzin You beat me to it :P, thank you for your research!
I think I've found my (big) class which might cause the memory leak. It's a serializer:
The test seemed to be stuck at 100%, but at that moment it was actually wrapping up. I saw the cities file being generated as an HTML file. The size of that file exceeds 8MB. Guess how many statements that class houses? 24914.
This is obviously something I should solve on my side, but is it something coverage.py is going to account for?
After reducing the class:
Five whole minutes to generate one (big) file. That is a long time.
Meanwhile, generating TypeScript classes based on those serializers works within seconds.
Very odd that tokenize makes new copies of the line for every token. That seems unnecessary. I wrote a CPython issue about it: python/cpython#119654
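The per-token line copies are easy to see from plain Python: every TokenInfo exposes the text of its source line via its .line attribute, so one long line is carried by every token it produces (and, per the linked issue, 3.12 stores fresh copies rather than one shared string). A quick check:

```python
import io
import tokenize

line = "value = " + "1 + " * 500 + "1\n"  # one long line, many tokens
tokens = list(tokenize.generate_tokens(io.StringIO(line).readline))

# Every token except the empty ENDMARKER carries the whole line text.
carried = [t for t in tokens if t.line]
assert all(t.line == line for t in carried)
print(len(carried), len(line))  # ~1000 tokens, each referencing a ~2000-char line
```

If those references are distinct copies instead of one shared string, memory scales roughly with tokens times line length, which matches the ballooning seen on long-line files.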
I did some quick experiments with both many-line files and very-long-line files, and found that on both 3.11 and 3.12 it was faster not to cache and not to use list(), so I've removed them in commit b666f3a.