Comments (14)
There is a pull request open that holds the performance improvements. @ahmetkucuk, could you review it a bit? (Don't mind the Codacy errors; for some reason it reports syntax errors and I have yet to figure out why.)
from jsons.
Thank you for pointing this out, @ahmetkucuk.
There has been one report on performance before, though, on `jsons.load`: #63. Writing custom serialization methods will outperform jsons, probably because of the inspection jsons does on the object.
Could you maybe post some code that reproduces your benchmarks? I'd like to profile `jsons.dump` with a dataset similar to yours.
@ramonhagenaars I tried to write something similar to what I use today. Here is the code:
```python
import jsons
import uuid
from typing import List
from datetime import datetime
from django.utils import timezone
from dataclasses import dataclass
from dataclasses import asdict


@dataclass
class Test2:
    x: int = 1
    z: str = str(uuid.uuid4())


@dataclass
class Test:
    sub_classes: List[Test2]


@dataclass
class Holder:
    values: List[Test]


sub_classes = [Test2() for _ in range(1000)]
test_classes = [Test(sub_classes) for _ in range(1000)]
data = Holder(test_classes)

start = timezone.now()
data_dict1 = jsons.dump(data)
end = timezone.now()
print("jsons performance: ", end - start)

start = timezone.now()
data_dict2 = asdict(data)
end = timezone.now()
print("dataclass performance: ", end - start)

print("Compare: ", data_dict1 == data_dict2)
```
Results:

```
jsons performance: 0:01:21.195182
dataclass performance: 0:00:05.960181
Compare: True
```
So, jsons takes 1 minute and 20 seconds while `dataclasses.asdict` finishes in 5 seconds. I suspect there is a memory spike as well. These results are from a laptop with 32 GB of RAM and an i9 processor. I was on version 0.8.9; I updated to 1.0.0 and tried again, with no significant difference in runtimes.
To clarify, I am definitely not expecting better performance from jsons, but it is exceptionally slow compared to what it does.
Please share your thoughts and feel free to point me to places to improve. I can try to contribute.
Thanks @ahmetkucuk for the code, it's very helpful.
I've been looking into it, and it seems that the `default_object_serializer` is the biggest bottleneck. If you replace `[Test2() for _ in range(1000)]` with `[{'x': 1, 'z': str(uuid.uuid4())} for _ in range(1000)]`, you'll see that it takes half the time. That is still 5x slower than your manual solution, but it's a start.
Before thinking of a (partial) rewrite of `default_object_serializer`, it might be worthwhile to see if the `_cache.cached` decorator helps.
I'll see if I can improve something in the `default_object_serializer`. If you have any ideas, please do share! 🙂
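To illustrate what a per-class cache could buy (a hypothetical sketch with made-up names, not jsons' actual `_cache.cached` decorator): the expensive attribute discovery runs once per class instead of once per object.

```python
import functools
from dataclasses import dataclass


@functools.lru_cache(maxsize=None)
def serializable_attrs(cls):
    # The dir() walk and filtering run only once per class;
    # every later call for the same class hits the cache.
    return tuple(n for n in dir(cls)
                 if not n.startswith('_')
                 and not callable(getattr(cls, n, None)))


def dump_obj(obj):
    # Look up the attribute names via the cached, per-class function.
    return {n: getattr(obj, n) for n in serializable_attrs(type(obj))}


@dataclass
class Test2:
    x: int = 1
    z: str = 'abc'
```

With a million `Test2` instances, `serializable_attrs` would be computed once rather than a million times, which is the whole point of caching here.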
It seems like `default_object_serializer` is building a `dict` when it is not really needed.
I tried instead to collect the keys to be included in the serialized dictionary and build the final dict at the end. It seems to get faster. Something like this:
```python
if obj is None:
    return obj
strip_attr = strip_attr or []
if (not isinstance(strip_attr, MutableSequence)
        and not isinstance(strip_attr, tuple)):
    strip_attr = (strip_attr,)
cls = kwargs['cls'] or obj.__class__
obj_keys = _get_keys_from_obj(obj, strip_privates, strip_properties,
                              strip_class_variables, strip_attr, **kwargs)
return {
    key: dump(obj.__getattribute__(key), key_transformer=key_transformer,
              strip_nulls=strip_nulls, **kwargs)
    for key in obj_keys
}
```
Any ideas?
By doing so, you eliminated a loop, which increases performance: good going! 👍 Did you also try running the tests? We might also see if we can win any performance in the `default_dict` serializer, which is involved as well.
I've been quite busy myself too. Here's what I did.
If you have a list of objects (e.g. `List[Test2]`), the current implementation naively does an inspection on each and every object of that list (to find out which attributes to include, transform, exclude, etc.). This is what makes it slow. What if that inspection is done only once and then reused for each element in that list? We would save `1000 * 1000 - 2` inspections in your example.
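The "inspect once, reuse per element" idea might be sketched like this (simplified stand-ins, not the real jsons serializers):

```python
from dataclasses import dataclass
from typing import List


def inspect_attrs(cls):
    # Stand-in for the expensive inspection step
    # (deciding what to include, strip, transform, ...).
    return [n for n in dir(cls)
            if not n.startswith('_')
            and not callable(getattr(cls, n, None))]


def dump_list(objs: List[object]):
    if not objs:
        return []
    # Inspect the class of the first element once...
    keys = inspect_attrs(type(objs[0]))
    # ...then reuse that result for every element instead of re-inspecting.
    return [{k: getattr(o, k) for k in keys} for o in objs]


@dataclass
class Test2:
    x: int = 1
    z: str = 'a'
```

This assumes the list is homogeneous; a mixed-type list would still need per-element inspection (or at least a per-class cache).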
I've worked this out in the branch 1.1.0. I had to improve some stuff in the process. The performance seems tremendously improved: it is now "only" 3 times as slow as your manual approach.
We could see if it helps to integrate your solution. What do you think? Could you maybe try running your benchmarks again on this branch?
Edit 1:
Hmm, I seem to have cheered too early. The implementation on 1.1.0 seems to be 6-7 times slower than the manual approach (I used a different benchmark). Still some improvement, though; I will need to investigate further.
Edit 2:
The next performance boost, I think, can be obtained by eliminating the use of the `default_dict` serializer and also eliminating that extra loop like you did. I just tested this and it seems to run ~1.5 times faster now.
I pulled `Release/1.1.0`; it seems to be even slower than the latest master on my machine. Maybe we should have a `test_performance.py` that can provide an objective measure.
When I do two things, it gets significantly faster locally (although it fails 10 tests related to stripping etc.):
- Do not build a dictionary for the object; only retrieve attributes from `obj` and call `dump` for each value.
- Never recompute attributes for the same class. I know there is an issue with this: those calls are parametrized and the parameters are subject to change over time. However, most of these configurations should be set once and not change at runtime. It is fine to invalidate any cache once a single dump runs.
- In my use case, I am always using `dataclass(frozen=True)`, which means I don't set any attribute on an object at runtime. It would be great if I didn't need to call `dir(obj)` and iterate over the attributes again and again for the same class.
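For frozen dataclasses specifically, the per-instance `dir(obj)` walk could be avoided by reading the field set off the class instead. A sketch (the helper name is hypothetical, and this assumes `dataclasses.fields` covers all attributes you want serialized):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Test2:
    x: int = 1
    z: str = 'abc'


def dump_frozen(obj):
    # fields() reads the class-level __dataclass_fields__ mapping, so there
    # is no per-instance dir() walk; the field set is fixed per class, which
    # is exactly what frozen=True guarantees at the instance level too.
    return {f.name: getattr(obj, f.name) for f in fields(obj)}
```

This would not handle attributes added outside the dataclass machinery, but for frozen dataclasses those cannot exist anyway.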
On a separate note: I profiled for memory leaks with Pympler and did not see any leaking objects.
Thanks for looking into this issue; I feel like there is good performance improvement potential here!
> I pulled Release/1.1.0, it seems to be even slower than latest master in my local.

That is strange... That would only make sense if the caching mechanism fails on your machine. What OS are you working on, if I may ask?

> Maybe we should have a test_performance.py that can provide an objective measure.

Love the idea!

> I profiled memory leak with Pympler. I did not see any leaking objects.

Interesting, I haven't looked for any leaks yet. I'll keep an eye out; it could be the caching.
I think we're on the same road here. I made it possible for jsons to retrieve the signature of a class only once (and not per object). I had to redesign some stuff though. You can trigger this behavior by setting `strict=True` in `jsons.dump`.
On my machine, these are the results:
```python
data_dict1 = jsons.dump(data)
```

Out:

```
jsons performance: 0:01:02.452031
dataclass performance: 0:00:05.534002
Compare: True
```

```python
data_dict1 = jsons.dump(data, strict=True)
```

Out:

```
jsons performance: 0:00:23.728999
dataclass performance: 0:00:04.870003
Compare: True
```
I wonder what you'll get. Can you try your benchmarking code again?
I am working on macOS High Sierra, Python 3.7.4.
I did not realize `strict` was doing the trick :)
It seems like there is a great improvement with `strict=True`. After I pulled the latest changes and ran the script (I only changed `django.utils.timezone` to `datetime`), I got the following results:

```python
jsons.dump(data)
```

```
jsons performance: 0:01:04.375545
dataclass performance: 0:00:05.204434
Compare: True
```

```python
jsons.dump(data, strict=True)
```

```
jsons performance: 0:00:23.309205
dataclass performance: 0:00:05.110297
Compare: True
```
We've made some good progress so far! I'm currently exploring parallelization to further improve on speed. It seems promising. I'll push an update soon.
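As a rough, hypothetical sketch of what an opt-in parallel dump could look like (using `dataclasses.asdict` as a stand-in for the real serializer; not jsons' actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass


@dataclass
class Item:
    x: int = 1


def dump_many(items, parallel=False):
    # Parallelism is strictly opt-in: callers must ask for it explicitly.
    if not parallel:
        return [asdict(i) for i in items]
    # Fan the per-element work out to a thread pool. Note that for purely
    # CPU-bound serialization, Python threads are limited by the GIL; a real
    # implementation might prefer processes, at the cost of pickling overhead.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(asdict, items))
```

The pool also adds fixed startup cost, so parallelism likely only pays off above some input size.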
Great! I think any parallelization introduced in this library should be turned off by default.
> I think any parallelization introduced in this library should be turned off by default.

Couldn't agree more! 🙂
v1.1.0 has been released; it implements the improvements from this issue.