pallets-eco / cachelib
Extract from werkzeug.cache
License: BSD 3-Clause "New" or "Revised" License
It has been reported here and there that the os.replace(...) used by cachelib is not supported in Python 2. Should we consider having a fallback implementation for Python 2? Something like this? I can provide a PR if necessary.
try:
    os.replace(tmp, filename)
except AttributeError:  # Python 2 workaround: os.replace is Python 3 only
    try:
        os.remove(filename)
    except OSError:
        pass
    os.rename(tmp, filename)
Quoting from #48
An interesting thing is how the API "uniformity" is reflected in BaseCache as type hints are added to the codebase. For example:

# BaseCache
def delete_many(...) -> _t.Union[bool, _t.Optional[int]]
def set_many(...) -> _t.Union[bool, _t.List[_t.Any]]
def has(self, key: str) -> _t.Union[bool, int]

...we can see how the delete_many, set_many and has methods have different return types across different cache clients. This means our cache types are diverging from the common interface, which is bad AFAIK since it makes the cache types not interchangeable (code written for a given cache type might not work for others) and it's also less intuitive for the user ("set_many returns a boolean for cache X but a list for Y (?)")...
Being able to swap between different cache types without changing any code, with a 100% guarantee that it will work, is something I would like to see in cachelib. For that, I plan on writing a PR to minimize as much as possible the differences in our API (described above) and finally turn BaseCache into a formal interface with something like Python's abc module, thus enforcing it for all supported cache types (and the ones possibly yet to come). This would also ensure the project grows in a uniform way, always abiding by BaseCache.
Since this is a fairly large change that would touch the public API, I would like to hear what people have to say. Any thoughts are welcome.
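A minimal sketch of what the abc-based interface could look like (the class and method names are illustrative, not cachelib's actual code):

```python
import abc
import typing as _t


class BaseCache(abc.ABC):
    """Hypothetical formal interface: every concrete backend must
    implement the abstract methods with the same return types."""

    @abc.abstractmethod
    def set(self, key: str, value: _t.Any, timeout: _t.Optional[int] = None) -> bool:
        ...

    @abc.abstractmethod
    def set_many(self, mapping: _t.Dict[str, _t.Any]) -> _t.List[_t.Any]:
        ...

    @abc.abstractmethod
    def has(self, key: str) -> bool:
        ...


class SimpleCache(BaseCache):
    """Toy in-memory backend satisfying the interface."""

    def __init__(self) -> None:
        self._store: _t.Dict[str, _t.Any] = {}

    def set(self, key: str, value: _t.Any, timeout: _t.Optional[int] = None) -> bool:
        self._store[key] = value
        return True

    def set_many(self, mapping: _t.Dict[str, _t.Any]) -> _t.List[_t.Any]:
        # uniform contract: always return the list of keys that were set
        return [k for k, v in mapping.items() if self.set(k, v)]

    def has(self, key: str) -> bool:
        return key in self._store
```

A backend that forgets to implement one of the abstract methods then fails with TypeError at construction time, which is exactly the enforcement the proposal is after.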
Sometimes it's useful to mark the tests that use the network so they can be skipped easily when we know the network is not available.
This is useful, for example, on SUSE and openSUSE's build servers. When building our packages the network is disabled so we can ensure reproducible builds (among other benefits). With this mark, it's easier to skip tests that cannot succeed.
The %check
section of our SPEC file is:
%check
# set up working directory
export BASETEMP=$(mktemp -d -t cachelib_test.XXXXXX)
trap "rm -rf ${BASETEMP}" EXIT
# Allow finding memcached
export PATH="%{_sbindir}/:$PATH"
export PYTEST_ADDOPTS="--capture=tee-sys --tb=short --basetemp=${BASETEMP}"
%pytest -rs
(%pytest basically means pytest -v with some variables set to take the build environment into consideration).
When running only plain pytest, I get the result “11 failed, 117 passed, 1 skipped, 3 errors”.
Complete build log in this situation
Obviously the hot candidates are the tests using DynamoDB, which is completely inaccessible, so I have created this patch to mark these tests as network-requiring so they can be easily skipped:
---
setup.cfg | 3 +++
tests/test_dynamodb_cache.py | 1 +
tests/test_interface_uniformity.py | 1 +
tests/test_redis_cache.py | 2 +-
4 files changed, 6 insertions(+), 1 deletion(-)
--- a/setup.cfg
+++ b/setup.cfg
@@ -34,11 +34,14 @@ python_requires = >= 3.7
where = src
[tool:pytest]
+addopts = --strict-markers
testpaths = tests
filterwarnings =
error
default::DeprecationWarning:cachelib.uwsgi
default::DeprecationWarning:cachelib.redis
+markers =
+ network: mark a test which requires net access
[coverage:run]
branch = True
--- a/tests/test_dynamodb_cache.py
+++ b/tests/test_dynamodb_cache.py
@@ -29,5 +29,6 @@ def cache_factory(request):
request.cls.cache_factory = _factory
+@pytest.mark.network
class TestDynamoDbCache(CommonTests, ClearTests, HasTests):
pass
--- a/tests/test_interface_uniformity.py
+++ b/tests/test_interface_uniformity.py
@@ -19,6 +19,7 @@ def create_cache_list(request, tmpdir):
request.cls.cache_list = [FileSystemCache(tmpdir), mc, rc, SimpleCache()]
+@pytest.mark.network
@pytest.mark.usefixtures("redis_server", "memcached_server")
class TestInterfaceUniformity:
def test_types_have_all_base_methods(self):
Just by applying this patch (and adding -k "not network" to my pytest invocation) I get much better results: “117 passed, 1 skipped, 12 deselected, 2 errors”. Again, complete build log in this situation.
Unfortunately, I don’t know how to skip the remaining two erroring tests. Both of them use such complicated constructs that I don’t know where to put @pytest.mark.skip or @pytest.mark.network, and none of my attempts made any difference. The only method that actually works (but I really don’t like it) is --ignore=tests/test_redis_cache.py --ignore=tests/test_memcached_cache.py, which truly makes the test suite pass.
Any ideas how to make the test suite work even without network access? Am I doing something wrong in arranging my test environment?
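One pattern that may help with the hard-to-mark tests (an assumption about their structure, not a verified fix): pytest applies a module-level pytestmark variable to every test collected from that module, so no per-test or per-class decorator placement is needed:

```python
import pytest

# Placed at the top level of e.g. tests/test_redis_cache.py, this marks
# every test in the module, including tests generated by fixtures.
pytestmark = pytest.mark.network
```

With the marker registered in setup.cfg, `pytest -m "not network"` then deselects them (`-m` matches markers directly, whereas `-k` matches by test name).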
Environment:
I'm using cachelib from Python 3.7 to write simple string values into Redis, with the set() method.
from cachelib import RedisCache
cache = RedisCache()
cache.set('FirstName', 'Trevor')
Using the redis-cli, you can see the value when it's set from cachelib (first), versus set with the redis-cli itself (second).
127.0.0.1:6379> get FirstName
"!\x80\x03X\b\x00\x00\x00Trevorq\x00."
127.0.0.1:6379> set FirstName trevor
OK
127.0.0.1:6379> get FirstName
"trevor"
Is there an explanation for why cachelib is writing data in this format?
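Those bytes are the serializer at work: values are pickled and prefixed with b"!" so the loader can distinguish pickled payloads from other data. A quick sketch reproducing the shape of the stored value (illustrative, not cachelib's actual code):

```python
import pickle

# "!" marker followed by a pickle payload, which is why redis-cli
# shows something like "!\x80\x03X...Trevorq\x00." instead of "Trevor"
stored = b"!" + pickle.dumps("Trevor", protocol=3)

assert stored.startswith(b"!")

# decoding strips the marker and unpickles the rest
assert pickle.loads(stored[1:]) == "Trevor"
```

A value set with redis-cli bypasses this serialization entirely, which is why the two reads look different.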
The functionality was added here: pallets-eco/flask-caching#109, but removed when using cachelib as the backend: https://github.com/pallets-eco/flask-caching/pull/308/files#diff-d15a59e5c66734a6b6d638efe7582b3a147d5a4213dc3b3ca4e3672c50607ae1.
Accidentally passing None as the key results in the rather mysterious error message
"UnboundLocalError: local variable 'bkey_hash' referenced before assignment"
thanks to the type check in _get_filename having no else branch.
A better way to handle this would be to raise an exception like "Key must be a string, received type [whichever type was received]" or something similar. Thank you!
I fixed this in flask-caching some time ago, but it looks like it's broken here as well (and with flask-caching now relying on this, it's broken there again too): pallets-eco/flask-caching#218
Following #35, using service containers as an alternative to manually installing cachelib's external dependencies in the tests.yml workflow (e.g. memcached, redis, pylibmc headers) seems like an interesting option.
Some thoughts:
One service container would be needed for redis and one for memcached. The pytest-xprocess fixtures should not start local instances of redis and memcached when running under CI, since the containers will already be up. One way of going about this is checking one of the many environment variables set by the CI and having xprocess start (or not) based on the result.
There is a type hint on line 31 of redis.py requiring the host to be a string, and if the host is not a string (as in the rediscluster case) then self.__client is not assigned on line 51 of redis.py, since the "else:" branch was removed. For everyone who passes a rediscluster client as the host, this change will break their application.
Environment:
As said in #35, callbacks are a more reliable way of detecting whether external dependencies have already started and are ready to accept queries, so it would be nice to rewrite the redis_server and memcached_server fixtures, since they currently rely on string patterns only.
I have a suggestion: it should be possible to sign (apply an HMAC to) cache values in the same way werkzeug.contrib.securecookie already does.
pickle is used to serialize the cache content. While this is absolutely fine as long as nobody can access the underlying cache back end (Redis, FS, Memcached), it may allow privilege escalation once an attacker gains access to it, as pickle allows arbitrary code execution on load.
Proposal:
Practically, pallets' ItsDangerous could be used here.
If wanted, I can create a pull request implementing my proposal.
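A minimal sketch of the idea using only the standard library (ItsDangerous would wrap this more conveniently; the secret handling here is an assumption):

```python
import hashlib
import hmac
import pickle

SECRET = b"app-provided-secret"  # assumption: supplied by the application

def dumps_signed(value):
    """Pickle a value and prepend an HMAC-SHA256 tag over the payload."""
    payload = pickle.dumps(value)
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return tag + payload

def loads_signed(blob):
    """Verify the tag before unpickling, so a tampered cache entry is
    rejected instead of executing an attacker-controlled pickle."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("cache value failed HMAC verification")
    return pickle.loads(payload)
```

This does not make pickle safe in general; it only ensures that whoever wrote the cache entry knew the secret.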
When using redis as the cache backend, it raises an AttributeError: 'module' object has no attribute 'Redis'. It imports cachelib.redis instead of redis because the file name is the same.
I believe memcached is the only backend that did not get a default serializer in #63.
Is this a conscious decision?
Using set() and get() of FileSystemCache raises exceptions on Windows 10 when they are called in rapid succession.
PermissionError: [Errno 13] Permission denied ...
PermissionError: [WinError 5] Access is denied ...
There are no errors when running the same script on Ubuntu 20.04.
How to replicate (optionally increase the range):
import cachelib
import threading

fsc = cachelib.file.FileSystemCache('.')

def set_get(i):
    fsc.set('key', 'val')
    val = fsc.get('key')

for i in range(10):
    t = threading.Thread(target=set_get, args=(i,))
    t.start()
Randomly generates tracebacks like:
WARNING:root:Exception raised while handling cache file '.\3c6e0b8a9c15224a8228b9a98ca1531d'
Traceback (most recent call last):
File "C:\Users\User...\lib\site-packages\cachelib\file.py", line 183, in get
with open(filename, "rb") as f:
PermissionError: [Errno 13] Permission denied: '.\3c6e0b8a9c15224a8228b9a98ca1531d'
WARNING:root:Exception raised while handling cache file '.\3c6e0b8a9c15224a8228b9a98ca1531d'
Traceback (most recent call last):
File "C:\Users\User...\lib\site-packages\cachelib\file.py", line 228, in set
os.replace(tmp, filename)
PermissionError: [WinError 5] Access is denied: 'C:\Users\User\python3.8\venvs\imapclient\cachelib_test\tmpop5jfhp8.__wz_cache' -> '.\3c6e0b8a9c15224a8228b9a98ca1531d'
The expected behavior is no errors. I ran into this with Flask-Session using cachelib's FileSystemCache on Windows 10, when sessions got lost or were not updated. Maybe it has to do with the fact that Windows 10 is running in VirtualBox(?). I did not try this on a native Windows 10 system. Maybe it has to do with the implementation of os.replace on Windows(?).
Environment:
The pruning code contains a magic number; it seems it deletes only every third entry: idx % 3.
Does anyone know the intention behind this?
I traced it back to the very beginning, there is no info there: mitsuhiko/zine@3502607#diff-b0794f264f02b7241cc1088d07e16659
FileSystemCache implies that the serializer is framed, but some serializers (like json) are not framed and don’t support multiple load/dump calls on a single file descriptor, since they can’t determine message borders. This complicates serializer integration.
https://github.com/pallets/cachelib/blob/3777a15cc01d55544bd63d6ffe7e680a823b58fa/src/cachelib/file.py#L231-L237
https://github.com/pallets/cachelib/blob/3777a15cc01d55544bd63d6ffe7e680a823b58fa/src/cachelib/file.py#L192-L195
Sure, a framing proxy could be implemented, but this would lead to storage and processing overhead. Since other backends don’t use framed serialization, maybe we could implement serialization in a universal (non-framed) way?
For example, via struct
:
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("I", timeout))
    self.serializer.dump(value, f)

with self._safe_stream_open(filename, "rb") as f:
    (pickle_time,) = struct.unpack("I", f.read(4))
    if pickle_time == 0 or pickle_time >= time():
        return self.serializer.load(f)
struct also reduces storage overhead (4 bytes vs 17 bytes via pickle).
Another solution is to store metadata in extended attributes (xattr
in linux/darwin or EA in windows). Modern filesystems (like xfs or ext4) have huge inode size (256 bytes by default), so 4 bytes won’t be a problem, but if we don’t fit into current inode, new inode will be created. Theoretically, this also should reduce file operations for expired keys since we work only with inodes, but additional research and benchmarks are required and I'm not sure about portability.
We currently use md5, but this could easily be changed to allow users to pass in custom hashing methods for keys. It seems like an interesting feature to have and would allow better integration with flask-caching (see related: pallets-eco/flask-caching#77).
I would like to be able to choose my serialization strategy when configuring caching. This would have benefits both in performance and in cache sharing.
The same bug described for cachelib version 0.1.1 in "#21" seems to appear also in version 0.2.0.
I faced the same issue with 0.1.1:
[Tue Jul 06 09:57:39.680042 2021] [wsgi:error] [pid 172159] [client 127.0.0.1:38240] self._prune()
[Tue Jul 06 09:57:39.680048 2021] [wsgi:error] [pid 172159] [client 127.0.0.1:38240] File "/lib/python3.6/site-packages/flask_app/cachelib/file.py", line 96, in _prune
[Tue Jul 06 09:57:39.680053 2021] [wsgi:error] [pid 172159] [client 127.0.0.1:38240] expires = pickle.load(f)
[Tue Jul 06 09:57:39.680059 2021] [wsgi:error] [pid 172159] [client 127.0.0.1:38240] EOFError: Ran out of input
After replacing cachelib with the new version 0.2.0, this error appears again:
[Thu Jul 08 12:41:55.193613 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] self._prune()
[Thu Jul 08 12:41:55.193626 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] File "/lib/python3.6/site-packages/flask_app/cachelib/file.py", line 122, in _prune
[Thu Jul 08 12:41:55.193639 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] self._remove_expired(now)
[Thu Jul 08 12:41:55.193652 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] File "/lib/python3.6/site-packages/flask_app/cachelib/file.py", line 91, in _remove_expired
[Thu Jul 08 12:41:55.193666 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] expires = pickle.load(f)
[Thu Jul 08 12:41:55.193679 2021] [wsgi:error] [pid 111282] [client 127.0.0.1:51574] EOFError: Ran out of input
In both cases, the error is in pickle.load(f).
Environment:
Following the discussion in #11 and the improvement made in #63, where custom serializers were added for each cache backend, I think it would be very nice to have a few generic serializers (Pickle, JSON, etc.) that could be passed as a parameter when initialising the cache backends.
Something that would look like:
from cachelib import FileSystemCache, RedisCache
from cachelib.serializers import JsonSerializer, PickleSerializer
redis_cache = RedisCache(serializer=JsonSerializer)
file_cache = FileSystemCache(serializer=PickleSerializer)
Each backend cache would obviously have a default serializer for backwards compatibility.
This would allow using more secure serialization alternatives than Pickle.
The ultimate goal that I would like to achieve would be to be able to use a custom serializer with the Flask-Caching library.
I could try to work out a solution for this and submit a PR if you think this approach would make sense.
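A sketch of what such a generic serializer could look like (the dump/load/dumps/loads interface is assumed from cachelib's existing per-backend serializers; this class itself is hypothetical):

```python
import json

class JsonSerializer:
    """Hypothetical pluggable serializer: JSON instead of pickle,
    trading arbitrary-object support for safety and interoperability."""

    def dumps(self, value):
        return json.dumps(value).encode("utf-8")

    def loads(self, bvalue):
        return json.loads(bvalue.decode("utf-8"))

    def dump(self, value, f):
        # file-based backends write to an open binary stream
        f.write(self.dumps(value))

    def load(self, f):
        return self.loads(f.read())
```

A backend would then call self.serializer.dumps(value) without caring which strategy the user configured, which is what makes the serializer= parameter possible.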
The docstring of RedisSerializer.dumps() suggests an integer value will be serialized as a string, but the actual implementation serializes any value with pickle regardless of its type.
The code is duplicated here for reference.
def dumps(self, value: _t.Any, protocol: int = pickle.HIGHEST_PROTOCOL) -> bytes:
    """Dumps an object into a string for redis. By default it serializes
    integers as regular string and pickle dumps everything else.
    """
    return b"!" + pickle.dumps(value, protocol)
Admittedly, this won't cause a runtime error, as RedisSerializer.loads() can still handle this case. However, to avoid confusion, either the docstring or the code should be updated to bring them into agreement. Preferably the code should be updated, because serializing ints as strings is the legacy behaviour from when the redis serialization code was still contained in the flask-caching repo.
Environment:
cachelib currently works for a single process only.
If multiple processes handle the same cache directory, the file count values become wrong.
How to reproduce:
Import flask_caching in any project. Afterwards, a warning "no boto3 module found" is shown.
Environment:
this print statement needs to be removed
The project currently has no tests. It would be good to have tests early on, as this would make for a safer and more robust development going forward.
FileSystemCache logs a warning with a FileNotFoundError exception when working with files that do not exist.
To reproduce, simply call FileSystemCache.get on a non-existing file (a new key).
I think there should be no such logs, because this is the most common case: when you work with a cache, you always try to get the value first and set it only if it does not exist. There can be a lot of such logs and they pollute the log output.
Also, looking at the code, I don't think this affects only FileSystemCache.get. I would reconsider the other places with these logs as well; consider the situation where two processes start pruning cached files and collide, trying to delete the same file twice.
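A sketch of the EAFP pattern the issue suggests, where a missing file is treated as an ordinary cache miss (the function name is illustrative, not cachelib's API):

```python
def read_cache_file(filename):
    """Return the file's bytes, or None for a cache miss.
    A missing file is the normal 'key not cached yet' case,
    so it is returned silently rather than logged as a warning."""
    try:
        with open(filename, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

Genuinely unexpected failures (permission errors, truncated files) could still be logged separately.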
Environment:
Thank you for your opinion on this.
Type hints help with maintaining a clear project architecture, debugging, code documentation, and linting (via static analysis), among several other benefits. Following other pallets projects, I plan on adding typing to cachelib too. This issue is to open the topic for discussion in case anyone has suggestions 🚀
Using set() of FileSystemCache raises errors on Ubuntu 20.04
WARNING:root:Exception raised while handling cache file '/home/site/wwwroot/deal_sourcing/flask_session/5651fc15999c50354843e09982ff80ed'
Traceback (most recent call last):
File "/antenv/lib/python3.8/site-packages/cachelib/file.py", line 238, in set
self._run_safely(os.replace, tmp, filename)
File "/antenv/lib/python3.8/site-packages/cachelib/file.py", line 299, in _run_safely
output = fn(*args, **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/home/site/wwwroot/deal_sourcing/flask_session/tmp8qwf4_ww.__wz_cache' -> '/home/site/wwwroot/deal_sourcing/flask_session/5651fc15999c50354843e09982ff80ed'
We are running a Flask/Dash app as an Azure web service on Ubuntu 20.04, which uses MSAL and AAD to authenticate.
The Flask app repeatedly tries to re-authenticate and does not allow the user to navigate the app as desired.
The above errors appear in the Azure Application Logs.
Environment:
The problem can be mitigated by editing _run_safely as follows (see the commented # lines):
def _run_safely(self, fn: _t.Callable, *args: _t.Any, **kwargs: _t.Any) -> _t.Any:
    """On Windows os.replace, os.chmod and open can yield
    permission errors if executed by two different processes."""
    # if platform.system() == "Windows":
    if True:
        output = None
        wait_step = 0.001
        max_sleep_time = 10.0
        total_sleep_time = 0.0

        while total_sleep_time < max_sleep_time:
            try:
                output = fn(*args, **kwargs)
            # except PermissionError:
            except OSError:
                sleep(wait_step)
                total_sleep_time += wait_step
                wait_step *= 2
            else:
                break
    else:
        output = fn(*args, **kwargs)

    return output
Int values are pickled in RedisCache.
In the Redis db the value looks like the byte string !\x80\x04\x95\x04\x00\x00\x00\x00\x00\x00\x00M\xC2z., not 31426.
Environment:
When I use FileSystemCache, I could not find a good way to get info on all cache entries and manage them.
Could you please add functions that get all cache info and manage the entries?
def get_all(self) -> _t.Any:
    infos = []
    for fname in self._list_dir():
        try:
            with self._safe_stream_open(fname, "rb") as f:
                pickle_time = struct.unpack("I", f.read(4))[0]
                if pickle_time == 0 or pickle_time >= time():
                    infos.append((fname, self.serializer.load(f)))
        except FileNotFoundError:
            pass
        except (OSError, EOFError, struct.error):
            logging.warning(
                "Exception raised while handling cache file '%s'",
                fname,
                exc_info=True,
            )
    return infos

def remove_from_fname(self, fname: str) -> None:
    try:
        os.remove(fname)
        self._update_count(delta=-1)
    except FileNotFoundError:
        pass
    except (OSError, EOFError, struct.error):
        logging.warning(
            "Exception raised while handling cache file '%s'",
            fname,
            exc_info=True,
        )
It would be nice to provide functions for other cache management methods.
I'm seeing a case where sometimes I get:
File "/.../site-packages/cachelib/file.py", line 147, in set
self._prune()
File "/.../site-packages/cachelib/file.py", line 96, in _prune
expires = pickle.load(f)
EOFError: Ran out of input
Should this error be in the try/except block?
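A sketch of the guard the question suggests, assuming f is the open cache-file handle (the function name is illustrative):

```python
import pickle

def read_expiry(f):
    """Load the pickled expiry value, treating a truncated or empty
    cache file as unreadable instead of crashing with EOFError."""
    try:
        return pickle.load(f)
    except (EOFError, pickle.UnpicklingError):
        return None
```

The caller can then skip (or prune) an entry whose expiry could not be read, which matters because a file may be mid-write by another process when _prune runs.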
Hi guys,
could you please remove the logging.warning calls? They use the basic configuration, which automatically creates undesired logging handlers and causes duplicate logs in projects with their own logging setup. This issue existed before, but it was conditional; now it happens for everyone who does not have the boto3 library installed.
Replication:
Simply import the library and see handlers in root logger.
Expectation:
There should be no handlers in root logger.
Environment:
Thank you for considering fixing this.
This repo is nearly identical to sh4nks/flask-caching.
Which is more likely to be maintained in the future? E.g. the former features a ready conda recipe and a package.
When a cache file is overwritten, the number of files does not really increase, but the file count is incremented unnecessarily.
Also, after os.remove() in FileSystemCache.get() and FileSystemCache.has(), self._update_count() should be called.
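A toy model (not cachelib's API) of the bookkeeping rule being proposed: the count changes only when a file is actually created or actually removed:

```python
import os
import tempfile

class CountingStore:
    """Toy file store: increment the count only for new files,
    decrement it whenever a file is actually removed."""

    def __init__(self, path):
        self.path = path
        self.count = 0

    def set(self, name, data):
        fname = os.path.join(self.path, name)
        existed = os.path.exists(fname)  # overwrite must not bump the count
        with open(fname, "wb") as f:
            f.write(data)
        if not existed:
            self.count += 1

    def remove(self, name):
        try:
            os.remove(os.path.join(self.path, name))
        except FileNotFoundError:
            return  # nothing was removed, so nothing to decrement
        self.count -= 1
```

Overwriting a key leaves the count unchanged, and every successful removal decrements it, mirroring the _update_count(delta=-1) calls the issue asks for.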
I am building and testing cachelib on the Linux platform. I have followed the below steps to build and test cachelib:
BUILD STEPS:
$ git clone https://github.com/pallets/cachelib
$ cd cachelib
$ python3 -m venv env
$ . env/bin/activate
$ pip install -e . -r requirements/dev.txt
$ pre-commit install
TEST STEPS:
$ pytest
The build is successful, but the tests in test_memcached_cache.py are failing. Below is the short test summary after running pytest:
E TimeoutError: The provided start pattern server listening could not be matched within the specified time interval of 120 seconds
env/lib/python3.9/site-packages/xprocess.py:284: TimeoutError
======================================================== short test summary info ========================================================
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_clear - TimeoutError: The provided start pattern server listening could ...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_set_get - TimeoutError: The provided start pattern server listening coul...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_set_get_many - TimeoutError: The provided start pattern server listening...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_get_dict - TimeoutError: The provided start pattern server listening cou...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_delete - TimeoutError: The provided start pattern server listening could...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_delete_many - TimeoutError: The provided start pattern server listening ...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_add - TimeoutError: The provided start pattern server listening could no...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_inc_dec - TimeoutError: The provided start pattern server listening coul...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_expiration - TimeoutError: The provided start pattern server listening c...
ERROR tests/test_memcached_cache.py::TestMemcachedCache::test_has - TimeoutError: The provided start pattern server listening could no...
========================================= 34 passed, 1 skipped, 10 errors in 137.85s (0:02:17) ==========================================
Please find the detailed error logs here: cachelib_memcache_failing_tests_logs.txt
Memcached was not installed on my system, so these tests were initially getting skipped.
Then I successfully installed memcache, libmemcached-dev, zlib1g-dev, libmemcached-tools and memcached from the apt repo, and installed pylibmc, pymemcache and python-memcached using pip. But the issue is always the same: the tests are failing.
I have restarted and checked the memcached service using service memcache restart; the tests are failing even though memcached is in an active state.
Can you please provide me with pointers on how to get these failing tests to pass?
Please let me know if more information is required.
Environment: