Comments (14)

ramonhagenaars commented on June 16, 2024

There is a pull request open that holds the performance improvements. @ahmetkucuk, could you review it a bit? (Don't mind the Codacy errors; for some reason it reports syntax errors and I have yet to figure out why.)

ramonhagenaars commented on June 16, 2024

Thank you for pointing this out, ahmetkucuk.

There has been one report on performance before, though that one was about jsons.load: #63. Writing custom serialization methods will outperform jsons, probably because of the inspection jsons does on the object.
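
For comparison, a hand-written serializer does no inspection at all, which is why it tends to win (a minimal sketch with a made-up class, purely to illustrate):

from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    x: int
    y: int

def dump_point(p: Point) -> dict:
    # No reflection: the attribute names are hard-coded, so each call
    # is just two attribute lookups and a dict literal.
    return {'x': p.x, 'y': p.y}

def dump_points(points: List[Point]) -> list:
    return [dump_point(p) for p in points]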

Could you maybe post some code that reproduces your benchmarks? I'd like to profile jsons.dump with a dataset similar to yours.

ahmetkucuk commented on June 16, 2024

@ramonhagenaars I tried to write something similar to what I use today. Here is the code:

import uuid
from dataclasses import asdict, dataclass
from typing import List

import jsons
from django.utils import timezone  # timing only; needs a configured Django project

@dataclass
class Test2:
    x: int = 1
    z: str = str(uuid.uuid4())

@dataclass
class Test:
    sub_classes: List[Test2]

@dataclass
class Holder:
    values: List[Test]

# 1000 Test objects, each holding the same list of 1000 Test2 objects.
sub_classes = [Test2() for _ in range(1000)]
test_classes = [Test(sub_classes) for _ in range(1000)]

data = Holder(test_classes)

start = timezone.now()
data_dict1 = jsons.dump(data)
end = timezone.now()
print("jsons performance: ", end - start)

start = timezone.now()
data_dict2 = asdict(data)
end = timezone.now()
print("dataclass performance: ", end - start)

print("Compare: ", data_dict1 == data_dict2)

Results:

jsons performance:  0:01:21.195182
dataclass performance:  0:00:05.960181
Compare:  True

So, jsons takes 1 minute and 20 seconds, while dataclasses.asdict finishes in 5 seconds. I suspect there is a memory spike as well. These results are from a 32 GB laptop with an i9 processor. I was on version 0.8.9; I updated to 1.0.0 and tried again, with no significant difference in runtimes.

ahmetkucuk commented on June 16, 2024

To clarify, I am definitely not expecting better performance from jsons, but it is exceptionally slow relative to what it does.

Please share your thoughts and feel free to point me to places to improve. I can try to contribute.

ramonhagenaars commented on June 16, 2024

Thanks @ahmetkucuk for the code, it's very helpful.

I've been looking into it, and it seems the default_object_serializer is the biggest bottleneck. If you replace [Test2() for _ in range(1000)] with [{'x': 1, 'z': str(uuid.uuid4())} for _ in range(1000)], you'll see that it takes half the time. That is still 5x slower than your manual solution, but it's a start.

Before thinking of a (partial) rewrite of default_object_serializer, it might be worthwhile to see if the _cache.cached decorator helps.
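
If the result of that inspection depends only on the class and the strip options, generic memoization in this spirit might capture most of the win (an illustrative sketch using functools.lru_cache; _cache.cached is jsons-internal and may work differently):

from functools import lru_cache

@lru_cache(maxsize=None)
def attributes_to_dump(cls, strip_privates=False):
    # The reflection below runs once per (class, option) pair; repeated
    # calls for the same class are plain cache lookups.
    attrs = [a for a in dir(cls) if not callable(getattr(cls, a, None))]
    if strip_privates:
        attrs = [a for a in attrs if not a.startswith('_')]
    return tuple(attrs)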

I'll see if I can improve something on the default_object_serializer. If you have any ideas, please do share! 🙂

ahmetkucuk commented on June 16, 2024

It seems like default_object_serializer is building a dict when it is not really needed.

I tried first collecting the keys to be included in the serialized dictionary and building the final dict at the end. It seems to be faster. Something like this:

    # Inside default_object_serializer:
    if obj is None:
        return obj
    strip_attr = strip_attr or []
    if (not isinstance(strip_attr, MutableSequence)
            and not isinstance(strip_attr, tuple)):
        strip_attr = (strip_attr,)
    cls = kwargs['cls'] or obj.__class__
    # Collect the keys once, then build the final dict in a single pass.
    obj_keys = _get_keys_from_obj(obj, strip_privates, strip_properties,
                                  strip_class_variables, strip_attr, **kwargs)
    return {
        key: dump(obj.__getattribute__(key), key_transformer=key_transformer,
                  strip_nulls=strip_nulls, **kwargs)
        for key in obj_keys
    }

Any ideas?

ramonhagenaars commented on June 16, 2024

By doing so, you eliminated a loop, which improves performance: good going! 👍 Did you also try running the tests? We might also see if we can gain some performance in the default_dict serializer, which is involved as well.

I've been quite busy myself too. Here's what I did.

If you have a list of objects (e.g. List[Test2]), the current implementation naively does an inspection on each and every object in that list (to find out which attributes to include, transform, exclude, etc.). This is what makes it slow. What if that inspection were done only once and the result reused for every element of the list? We would save 1000 * 1000 - 2 inspections in your example.
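
Roughly the idea (a sketch of the concept, not the actual 1.1.0 code; attrs_of is a hypothetical stand-in for the real inspection):

from dataclasses import fields

def attrs_of(cls):
    # Hypothetical stand-in for the real inspection: decide which
    # attributes to dump (here simply the dataclass fields).
    return [f.name for f in fields(cls)]

def dump_homogeneous_list(lst):
    if not lst:
        return []
    # Inspect the class of the first element once...
    plan = attrs_of(type(lst[0]))
    # ...then reuse that plan for every element instead of
    # re-inspecting each of the million objects.
    return [{name: getattr(obj, name) for name in plan} for obj in lst]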

I've worked this out in the 1.1.0 branch. I had to improve some stuff in the process. Performance seems tremendously improved, to the point of being "only" 3 times as slow as your manual approach.

We could see if it helps to integrate your solution. What do you think? Could you maybe try running your benchmarks again on this branch?

Edit1:

Hmm, I seem to have cheered too early. The implementation in 1.1.0 seems to be 6-7 times slower than the manual approach; I used a different benchmark. Still some improvement, though. I will need to investigate further.

Edit2:

The next performance boost, I think, can be obtained by eliminating the use of the default_dict serializer and also eliminating that extra loop, like you did. I just tested this and it seems to run ~1.5 times faster now.

ahmetkucuk commented on June 16, 2024

I pulled Release/1.1.0; it seems to be even slower than the latest master on my machine. Maybe we should have a test_performance.py that can provide an objective measure.

When I do 2 things, it gets significantly faster on my machine (although 10 tests related to stripping etc. fail):

  1. Do not build a dictionary for the object; only retrieve the attributes from obj and call dump for each value.
  2. Never recompute the attributes for the same class (see the sketch after this list).
  • I know there is an issue with this: those calls are parametrized, and the parameters are subject to change over time. However, most of these configurations are set once and never change at runtime. It is fine to invalidate any cache once a single dump runs.
  • In my use case I always use dataclass(frozen=True), which means I never set an attribute on an object at runtime. It would be great if I didn't need to call dir(obj) and iterate over the attributes again and again for the same class.
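
For point 2, a per-class cache could look roughly like this (a sketch that assumes attributes never change at runtime, as with frozen dataclasses):

_class_attr_cache = {}

def cached_attrs(obj):
    # dir() and the filtering run at most once per class; a frozen
    # dataclass never gains attributes at runtime, so reuse is safe.
    cls = type(obj)
    if cls not in _class_attr_cache:
        _class_attr_cache[cls] = [a for a in dir(obj)
                                  if not a.startswith('_')]
    return _class_attr_cache[cls]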

On a separate note:
I profiled for memory leaks with Pympler. I did not see any leaking objects.

Thanks for looking into this issue. I feel like there is good potential for performance improvement here!

ramonhagenaars commented on June 16, 2024

I pulled Release/1.1.0, it seems to be even slower than latest master in my local.

That is strange... That would make sense only if the caching mechanism fails on your machine. What OS are you working on, if I may ask?

Maybe we should have a test_performance.py that can provide an objective measure.

Love the idea!

I profiled memory leak with Pympler. I did not see any leaking objects.

Interesting, I haven't looked for any leaks yet. I'll keep an eye out; it could be the caching.

I think we're on the same road here. I made it possible for jsons to retrieve the signature of a class only once (not once per object). I had to redesign some stuff, though. You can trigger this behavior by setting strict=True in jsons.dump.

On my machine, these are the results:

data_dict1 = jsons.dump(data)

Out:

jsons performance:  0:01:02.452031
dataclass performance:  0:00:05.534002
Compare:  True

data_dict1 = jsons.dump(data, strict=True)

Out:

jsons performance:  0:00:23.728999
dataclass performance:  0:00:04.870003
Compare:  True

I wonder what you'll get. Can you try your benchmarking code again?

ahmetkucuk commented on June 16, 2024

I am working on macOS High Sierra, Python 3.7.4.

I did not realize strict was doing the trick :)

It seems like there is a great improvement with strict=True. After I pulled the latest changes and ran the script (only changing django.utils.timezone to datetime), I got the following results:

jsons.dump(data)

jsons performance:  0:01:04.375545
dataclass performance:  0:00:05.204434
Compare:  True

jsons.dump(data, strict=True)

jsons performance:  0:00:23.309205
dataclass performance:  0:00:05.110297
Compare:  True

ramonhagenaars commented on June 16, 2024

We've made some good progress so far! I'm currently exploring parallelization to further improve speed. It seems promising; I'll push an update soon.
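
One conceivable shape for it, fanning a large list out over worker processes (an illustrative sketch only, not necessarily how jsons will implement it):

from concurrent.futures import ProcessPoolExecutor

import jsons

def parallel_dump_list(items, workers=4):
    # Serialize chunks of the list in separate processes. This only
    # pays off for large, CPU-bound payloads, hence opt-in by default.
    size = max(1, len(items) // workers)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        dumped = pool.map(jsons.dump, chunks)
    return [d for chunk in dumped for d in chunk]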

ahmetkucuk commented on June 16, 2024

Great! I think any parallelization introduced in this library should be turned off by default.

ramonhagenaars commented on June 16, 2024

I think any parallelization introduced in this library should be turned off by default.

Couldn't agree more! 🙂

ramonhagenaars commented on June 16, 2024

v1.1.0, which implements this issue, has been released.
