Comments (14)

ramonhagenaars commented on June 16, 2024

There is a pull request open that holds the performance improvements. @ahmetkucuk, could you review it a bit? (Don't mind the Codacy errors; for some reason it reports syntax errors and I have yet to figure out why.)

ramonhagenaars commented on June 16, 2024

Thank you for pointing this out, ahmetkucuk.

There has been one report on performance before, though that one was about jsons.load: #63. Writing custom serialization methods will outperform jsons, probably because of the inspection jsons does on the object.
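
For comparison, a hand-written serializer does no inspection at all, which is why it tends to win (a minimal sketch with a made-up class, purely to illustrate):

from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    x: int
    y: int

def dump_point(p: Point) -> dict:
    # No reflection: the attribute names are hard-coded, so each call
    # is just two attribute lookups and a dict literal.
    return {'x': p.x, 'y': p.y}

def dump_points(points: List[Point]) -> list:
    return [dump_point(p) for p in points]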

Could you maybe post some code that reproduces your benchmarks? I'd like to profile jsons.dump with a dataset similar to yours.

ahmetkucuk commented on June 16, 2024

@ramonhagenaars I tried to write something similar to what I use today. Here is the code:

import uuid
from dataclasses import asdict, dataclass
from typing import List

import jsons
from django.utils import timezone  # timing only; needs a configured Django project

@dataclass
class Test2:
    x: int = 1
    z: str = str(uuid.uuid4())

@dataclass
class Test:
    sub_classes: List[Test2]

@dataclass
class Holder:
    values: List[Test]

# 1000 Test objects, each holding the same list of 1000 Test2 objects.
sub_classes = [Test2() for _ in range(1000)]
test_classes = [Test(sub_classes) for _ in range(1000)]

data = Holder(test_classes)

start = timezone.now()
data_dict1 = jsons.dump(data)
end = timezone.now()
print("jsons performance: ", end - start)

start = timezone.now()
data_dict2 = asdict(data)
end = timezone.now()
print("dataclass performance: ", end - start)

print("Compare: ", data_dict1 == data_dict2)

Results:

jsons performance:  0:01:21.195182
dataclass performance:  0:00:05.960181
Compare:  True

So, jsons takes 1 minute and 20 seconds, while dataclasses.asdict finishes in 5 seconds. I suspect there is a memory spike as well. These results are from a 32 GB laptop with an i9 processor. I was on version 0.8.9; I updated to 1.0.0 and tried again, with no significant difference in runtimes.

ahmetkucuk commented on June 16, 2024

To clarify, I am definitely not expecting better performance from jsons, but it is exceptionally slow relative to what it does.

Please share your thoughts and feel free to point me to places to improve. I can try to contribute.

ramonhagenaars commented on June 16, 2024

Thanks @ahmetkucuk for the code, it's very helpful.

I've been looking into it, and it seems the default_object_serializer is the biggest bottleneck. If you replace [Test2() for _ in range(1000)] with [{'x': 1, 'z': str(uuid.uuid4())} for _ in range(1000)], you'll see that it takes half the time. That is still 5x slower than your manual solution, but it's a start.

Before thinking of a (partial) rewrite of default_object_serializer, it might be worthwhile to see if the _cache.cached decorator helps.
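
If the result of that inspection depends only on the class and the strip options, generic memoization in this spirit might capture most of the win (an illustrative sketch using functools.lru_cache; _cache.cached is jsons-internal and may work differently):

from functools import lru_cache

@lru_cache(maxsize=None)
def attributes_to_dump(cls, strip_privates=False):
    # The reflection below runs once per (class, option) pair; repeated
    # calls for the same class are plain cache lookups.
    attrs = [a for a in dir(cls) if not callable(getattr(cls, a, None))]
    if strip_privates:
        attrs = [a for a in attrs if not a.startswith('_')]
    return tuple(attrs)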

I'll see if I can improve something on the default_object_serializer. If you have any ideas, please do share! 🙂

ahmetkucuk commented on June 16, 2024

It seems like default_object_serializer is building a dict when it is not really needed.

I tried first collecting the keys to be included in the serialized dictionary and building the final dict at the end. It seems to be faster. Something like this:

    # Inside default_object_serializer:
    if obj is None:
        return obj
    strip_attr = strip_attr or []
    if (not isinstance(strip_attr, MutableSequence)
            and not isinstance(strip_attr, tuple)):
        strip_attr = (strip_attr,)
    cls = kwargs['cls'] or obj.__class__
    # Collect the keys once, then build the final dict in a single pass.
    obj_keys = _get_keys_from_obj(obj, strip_privates, strip_properties,
                                  strip_class_variables, strip_attr, **kwargs)
    return {
        key: dump(obj.__getattribute__(key), key_transformer=key_transformer,
                  strip_nulls=strip_nulls, **kwargs)
        for key in obj_keys
    }

Any ideas?

ramonhagenaars commented on June 16, 2024

By doing so, you eliminated a loop, which improves performance: good going! 👍 Did you also try running the tests? We might also see if we can gain some performance in the default_dict serializer, which is involved as well.

I've been quite busy myself too. Here's what I did.

If you have a list of objects (e.g. List[Test2]), the current implementation naively does an inspection on each and every object in that list (to find out which attributes to include, transform, exclude, etc.). This is what makes it slow. What if that inspection were done only once and the result reused for every element of the list? We would save 1000 * 1000 - 2 inspections in your example.
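
Roughly the idea (a sketch of the concept, not the actual 1.1.0 code; attrs_of is a hypothetical stand-in for the real inspection):

from dataclasses import fields

def attrs_of(cls):
    # Hypothetical stand-in for the real inspection: decide which
    # attributes to dump (here simply the dataclass fields).
    return [f.name for f in fields(cls)]

def dump_homogeneous_list(lst):
    if not lst:
        return []
    # Inspect the class of the first element once...
    plan = attrs_of(type(lst[0]))
    # ...then reuse that plan for every element instead of
    # re-inspecting each of the million objects.
    return [{name: getattr(obj, name) for name in plan} for obj in lst]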

I've worked this out in the 1.1.0 branch. I had to improve some stuff in the process. Performance seems tremendously improved, to the point of being "only" 3 times as slow as your manual approach.

We could see if it helps to integrate your solution. What do you think? Could you maybe try running your benchmarks again on this branch?

Edit1:

Hmm, I seem to have cheered too early. The implementation in 1.1.0 seems to be 6-7 times slower than the manual approach; I used a different benchmark. Still some improvement, though. I will need to investigate further.

Edit2:

The next performance boost, I think, can be obtained by eliminating the use of the default_dict serializer and also eliminating that extra loop, like you did. I just tested this and it seems to run ~1.5 times faster now.

ahmetkucuk commented on June 16, 2024

I pulled Release/1.1.0; it seems to be even slower than the latest master on my machine. Maybe we should have a test_performance.py that can provide an objective measure.

When I do 2 things, it gets significantly faster on my machine (although 10 tests related to stripping etc. fail):

  1. Do not build a dictionary for the object; only retrieve the attributes from obj and call dump for each value.
  2. Never recompute the attributes for the same class (see the sketch after this list).
  • I know there is an issue with this: those calls are parametrized, and the parameters are subject to change over time. However, most of these configurations are set once and never change at runtime. It is fine to invalidate any cache once a single dump runs.
  • In my use case I always use dataclass(frozen=True), which means I never set an attribute on an object at runtime. It would be great if I didn't need to call dir(obj) and iterate over the attributes again and again for the same class.
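
For point 2, a per-class cache could look roughly like this (a sketch that assumes attributes never change at runtime, as with frozen dataclasses):

_class_attr_cache = {}

def cached_attrs(obj):
    # dir() and the filtering run at most once per class; a frozen
    # dataclass never gains attributes at runtime, so reuse is safe.
    cls = type(obj)
    if cls not in _class_attr_cache:
        _class_attr_cache[cls] = [a for a in dir(obj)
                                  if not a.startswith('_')]
    return _class_attr_cache[cls]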

On a separate note:
I profiled for memory leaks with Pympler. I did not see any leaking objects.

Thanks for looking into this issue. I feel like there is good potential for performance improvement here!

ramonhagenaars commented on June 16, 2024

I pulled Release/1.1.0, it seems to be even slower than latest master in my local.

That is strange... That would make sense only if the caching mechanism fails on your machine. What OS are you working on, if I may ask?

Maybe we should have a test_performance.py that can provide an objective measure.

Love the idea!

I profiled memory leak with Pympler. I did not see any leaking objects.

Interesting, I haven't looked for any leaks yet. I'll keep an eye out; it could be the caching.

I think we're on the same road here. I made it possible for jsons to retrieve the signature of a class only once (not once per object). I had to redesign some stuff, though. You can trigger this behavior by setting strict=True in jsons.dump.

On my machine, these are the results:

data_dict1 = jsons.dump(data)

Out:

jsons performance:  0:01:02.452031
dataclass performance:  0:00:05.534002
Compare:  True

data_dict1 = jsons.dump(data, strict=True)

Out:

jsons performance:  0:00:23.728999
dataclass performance:  0:00:04.870003
Compare:  True

I wonder what you'll get. Can you try your benchmarking code again?

ahmetkucuk commented on June 16, 2024

I am working on macOS High Sierra, Python 3.7.4.

I did not realize strict was doing the trick :)

It seems like there is a great improvement with strict=True. After I pulled the latest changes and ran the script (only changing django.utils.timezone to datetime), I got the following results:

jsons.dump(data)

jsons performance:  0:01:04.375545
dataclass performance:  0:00:05.204434
Compare:  True

jsons.dump(data, strict=True)

jsons performance:  0:00:23.309205
dataclass performance:  0:00:05.110297
Compare:  True

ramonhagenaars commented on June 16, 2024

We've made some good progress so far! I'm currently exploring parallelization to further improve speed. It seems promising; I'll push an update soon.
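
One conceivable shape for it, fanning a large list out over worker processes (an illustrative sketch only, not necessarily how jsons will implement it):

from concurrent.futures import ProcessPoolExecutor

import jsons

def parallel_dump_list(items, workers=4):
    # Serialize chunks of the list in separate processes. This only
    # pays off for large, CPU-bound payloads, hence opt-in by default.
    size = max(1, len(items) // workers)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        dumped = pool.map(jsons.dump, chunks)
    return [d for chunk in dumped for d in chunk]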

ahmetkucuk commented on June 16, 2024

Great! I think any parallelization introduced in this library should be turned off by default.

ramonhagenaars commented on June 16, 2024

I think any parallelization introduced in this library should be turned off by default.

Couldn't agree more! 🙂

ramonhagenaars commented on June 16, 2024

v1.1.0, which implements this issue, has been released.
