ericvsmith / dataclasses (License: Apache License 2.0)
Guido and I discussed this yesterday, and we decided we'd just copy.copy()
the default values when creating a new instance.
The `__init__` code I'm currently generating for:

```python
@dataclass
class C:
    x: list = []
```

looks something like:

```python
def __init__(self, x=[]):
    self.x = copy.copy(x)
```
But I don't think this is what we really want. I don't think we want to call `copy.copy()` if passed an unrelated list, like `C(x=mylist)`. Maybe `__init__` should check and only call `copy.copy()` if `x is C.x`?

So, perhaps:

```python
def __init__(self, x=C.x):
    self.x = copy.copy(x) if x is C.x else x
```
(I haven't checked that this actually works as written, but the idea is to only copy the argument if it's the same object as the default that was assigned in the class creation statement.)
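The identity-check idea can be sketched in plain, hand-written Python (not the generated code; the `_DEFAULT_X` sentinel name is my own):

```python
import copy

_DEFAULT_X = []  # the single default object created at class-definition time

class C:
    def __init__(self, x=_DEFAULT_X):
        # Copy only when the caller relied on the shared default;
        # an explicitly passed list is stored as-is.
        self.x = copy.copy(x) if x is _DEFAULT_X else x

a = C()
b = C()
mylist = [1, 2]
c = C(mylist)
```

Here `a.x` and `b.x` are distinct empty lists, while `c.x` is the very object the caller passed in.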
See the discussion starting at:
https://mail.python.org/pipermail/python-dev/2017-September/149461.html
The idea is to replace the existing `cmp` flag to `@dataclass` and instead have two boolean parameters, `eq` and `compare`. If `compare` is `True`, `eq` would be forced to `True`.

If `eq` is set, generate `__eq__` and `__ne__` methods. If `compare` is set, generate `__gt__`, `__ge__`, `__lt__`, and `__le__` methods.
This would allow you to have data classes that could be compared for equality, but are unordered.
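For illustration, this split eventually shipped in the stdlib under the names `eq` and `order`; a quick demonstration of equality-without-ordering using that final API:

```python
from dataclasses import dataclass

# eq=True is the default; order=False (also the default) means no
# __lt__/__le__/__gt__/__ge__ methods are generated.
@dataclass(eq=True, order=False)
class Point:
    x: int
    y: int

print(Point(1, 2) == Point(1, 2))  # equality works: True
try:
    Point(1, 2) < Point(3, 4)      # ordering does not
except TypeError as e:
    print('unorderable:', e)
```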
I'll also work on updating the PEP to match.
Renaming the `cmp` parameter to `field()` was also suggested, but I don't have a good idea for that. I'll eventually open a separate issue for that discussion.
In #3, we discussed mutable defaults which included talk of factory functions. Having resolved #3, this issue is about how to specify default value factory functions.
Some options:
1. A new parameter to `field()`. For now, let's call it `factory`:

```python
@dataclass
class Foo:
    x: list = field(factory=list, repr=False)
```

It would be an error to specify both `default=` and `factory=`.
2. Reuse the existing `default` parameter to `field()`, marking the value as a factory function so we can determine whether to call it. For now, let's mark it by using `Factory(callable)`:

```python
@dataclass
class Foo:
    x: list = field(default=Factory(list), repr=False)
```
3. A variant of `field` used only with factory functions. For now, let's assume it's called `factory_field`. It would not have a `default=` parameter:

```python
@dataclass
class Foo:
    x: list = factory_field(list, repr=False)
```
I don't have a real preference among these. I sense we should go with whatever makes mypy's life easier, but I don't know what that would be. I suspect it would be 3, since I think mypy could be told that the type of `factory_field(callable)` is the type of `callable()`. But I'm just hypothesizing, and am interested in the opinion of experts.
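Option 2 can be prototyped with a tiny marker class. The `Factory` name comes from the proposal above; the rest is my own sketch of how the generated `__init__` could decide whether to call it:

```python
class Factory:
    """Marker wrapping a zero-argument callable used as a default."""
    def __init__(self, callable_):
        self.callable = callable_

def make_default(default):
    # Called by the generated __init__: invoke factories to get a
    # fresh value; pass plain defaults through unchanged.
    return default.callable() if isinstance(default, Factory) else default

print(make_default(Factory(list)))   # a fresh list each call: []
print(make_default(0))               # plain default unchanged: 0
```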
I think it makes sense to add tests that check that dataclasses interact nicely with `typing.Generic` and type variables. Most likely no changes are needed to the code (only tests) since we use a decorator, `@dataclass`. But `Generic` still uses a complex metaclass (this will be removed if/when PEP 560 is accepted and replaced with `__init_subclass__` and new friends).
I think it makes sense to add tests checking that things like this work:

```python
from typing import Generic, List, TypeVar
from dataclasses import dataclass

T = TypeVar('T')

@dataclass
class LabeledBox(Generic[T]):
    content: T
    label: str = '<unknown>'

box = LabeledBox(42)
assert box.content == 42
assert box.label == '<unknown>'

boxes: List[LabeledBox[int]] = []  # subscripting the resulting class should work, etc.
```
I'm curious about super calls. When a data class inherits from a non-data-class, does the generated `__init__()` method call `super().__init__()`? Before or after setting the new attributes? Similarly, when a data class inherits from another data class, does it call `super().__init__()`? In both cases, what about inheritance?
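For reference, the behavior that eventually shipped in the stdlib: the generated `__init__` does not call `super().__init__()`. When a dataclass inherits from another dataclass, the derived class's `__init__` takes the combined field list (base fields first) and assigns all the fields itself. A quick check:

```python
from dataclasses import dataclass

@dataclass
class Base:
    x: int

@dataclass
class Derived(Base):
    y: int

# One generated __init__ accepting the base field first, then the new one.
d = Derived(1, 2)
print(d.x, d.y)   # 1 2
```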
```python
@dataclass(slots=True)
class C:
    x: int
    y: int = 0

c = C(10)
self.assertEqual(repr(c), 'C(x=10,y=0)')
```

Fails with:

```
Traceback (most recent call last):
  File "/cygdrive/c/home/eric/local/dataclasses/tst.py", line 627, in test_slots
    self.assertEqual(repr(c), 'C(x=10,y=0)')
AssertionError: "C(x=10,y=<member 'y' of 'C' objects>)" != 'C(x=10,y=0)'
- C(x=10,y=<member 'y' of 'C' objects>)
+ C(x=10,y=0)
```
Currently the post-init function `__dataclass_post_init__` takes no parameters. Should it be possible to pass one or more parameters from `__init__` to the post-init function? Is it possible to make the parameter optional?

It would be nice to pass parameters to `__init__` which do not initialize fields, but are accessible to the post-init function. But it might mess up the interface too much, and there are issues with calling `__dataclass_post_init__`'s `super()` function if we decide to change the number of parameters.
This issue is a placeholder to decide this issue before the code and PEP are finalized. Any suggestions or thoughts are welcome.
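For the record, the mechanism that eventually shipped addresses exactly this: pseudo-fields annotated with `InitVar` become `__init__` parameters that are forwarded to `__post_init__` instead of being stored on the instance. A sketch using that final API:

```python
from dataclasses import dataclass, field, InitVar

@dataclass
class C:
    x: int
    scale: InitVar[int] = 1            # __init__ parameter, not a stored field
    scaled: int = field(init=False, default=0)

    def __post_init__(self, scale):
        # 'scale' is passed straight through from __init__.
        self.scaled = self.x * scale

c = C(10, scale=3)
print(c.scaled)   # 30
```

Note that `scale` never lands in the instance dict; it only exists as an argument threading through `__init__` into `__post_init__`.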
Currently the draft PEP specifies, and the code supports, the optional ability to add `__slots__`. This is the one place where `@dataclass` cannot just modify the given class and return it: because `__slots__` must be specified at class creation time, it's too late by the time the `dataclass` decorator gets control. The current approach is to dynamically generate a new class while setting `__slots__` in the new class and copying over other class attributes. The decorator then returns the new class.

The question is: do we even want to support setting `__slots__`? Is having `__slots__` important enough to justify this deviation from the "we just add a few dunder methods to your class" behavior?
I see three options:

1. Keep the current behavior: `@dataclass(slots=True)` returning a new class.
2. Drop support for `__slots__`.
3. A separate decorator, `@add_slots`, which takes a data class and creates a new class with `__slots__` set.

I think we should either go with 2 or 3. I don't mind not supporting `__slots__`, but if we do want to support it, I think it's easier to explain with a separate decorator:
```python
@add_slots
@dataclass
class C:
    x: int
    y: int
```

It would be an error to use `@add_slots` on a non-dataclass class.
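A minimal sketch of what such a decorator could look like (my own illustration, rebuilding the class from its field list; a production version would need to validate its input and handle inheritance more carefully):

```python
from dataclasses import dataclass, fields

def add_slots(cls):
    # Rebuild the class with __slots__ naming every field.
    cls_dict = dict(cls.__dict__)
    field_names = tuple(f.name for f in fields(cls))
    cls_dict['__slots__'] = field_names
    for name in field_names:
        # Remove class-level defaults; they would shadow the slot descriptors.
        # __init__ already captured the defaults, so they still apply.
        cls_dict.pop(name, None)
    # Drop descriptors tied to the old class layout.
    cls_dict.pop('__dict__', None)
    cls_dict.pop('__weakref__', None)
    return type(cls)(cls.__name__, cls.__bases__, cls_dict)

@add_slots
@dataclass
class C:
    x: int
    y: int = 0

c = C(10)
print(c.x, c.y)                 # 10 0
print(hasattr(c, '__dict__'))   # False: attributes live in slots
```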
Sorry, it is not possible to comment directly in code, so here are some assorted comments (particularly on TODO items):

```python
# what exception to raise when non-default follows default? currently
# ValueError
```

The typing module raises `TypeError`; we might want to synchronize this.
```python
# what to do if a user specifies a function we're going to overwrite,
# like __init__? error? overwrite it?
```

The typing module defines a predefined set of "infrastructure-critical" attributes. It raises `AttributeError("Cannot overwrite NamedTuple attribute " + key)` for these, and allows overwriting others. I think we should do the same here.
```python
# use typing.get_type_hints() instead of accessing __annotations__
# directly? recommended by PEP 526, but that's importing a lot just
# to get at __annotations__
```

I think it is absolutely OK to use `__annotations__` directly here, as long as we don't do anything fancy/dangerous with `__annotations__`. These classes should be as fast as possible.
```python
# is __annotations__ guaranteed to be an ordered mapping?
```

PEP 526 allows this (emphasis mine): it stores the `__annotations__` attribute of that module or class (mangled if private) as an *ordered* mapping from names to evaluated annotations.
Support setting `__slots__`: I am thinking maybe `__slots__` should be added by default, with an option to not add them?
Optional support by type checkers (mypy) and typing.py (in the sense that you can declare and use data classes without using typing). mypy supports/understands the descriptor protocol, so a custom stub can be added to typeshed that will make mypy understand dataclasses semantics correctly (also, some special-casing in mypy can be added for even better support; I can take care of this when we decide on the semantics/implementation).

Provide a way to detect if an object is an instance of any data class, similar to using `_fields` on namedtuples. I'm not sure why people want to be able to detect this, but they do, and Raymond has always suggested looking for a `_fields` attribute.

People might want some structural subtyping support for dataclasses. Potentially we can support `hasattr` and/or a custom `__subclasshook__` in the spirit of `collections.abc` and PEP 544. Static support for this is also possible (again in the spirit of PEP 544; I hope mypy will support it soon, see python/mypy#3132).
EDIT: enumerated comments to simplify reading.
Is there a simple way to avoid including an annotated (for the purpose of static checking) class variable in the list of fields? Ideally, it would be possible to just write:

```python
@easy
class C:
    x: int
    inst_count: ClassVar[int] = 0
```

A possible solution would be to use a special marker for exclusion:

```python
@easy
class C:
    x: int
    inst_count: ClassVar[int] = classvar(0)  # Just inserts 0 here in generated code
```

I don't like re-using `field` here, since `inst_count` is not actually a field.
Imagine a situation with this class:

```python
class BigClass:
    def __init__(self, x, y, z, t, w, parent):
        self.x = x
        self.y = y
        self.z = z
        self.t = t
        self.w = w
        self.tmp = TempStorage()
        self.tmp.link(parent)
        ...

    def __repr__(self):
        # other boilerplate
        ...
```

It should be possible to refactor this into:

```python
@auto
class BigClass:
    x: int
    y: int
    z: int
    t: int
    w: int

    def initialize(self, parent):  # We could choose another name, maybe the name of the class?
        self.tmp = TempStorage()
        self.tmp.link(parent)
```

A simple way to achieve this is to add code like this at the end of the generated `__init__`:

```python
def __init__(self, <inserted args>, *extra_args):
    ...
    if hasattr(self, 'initialize'):
        self.initialize(*extra_args)
```
Because `typing` is a large module, it would be nice for smaller programs to be able to use dataclasses without needing to import typing. The only place dataclasses.py uses typing is in this check:

```python
if type(a_type) is typing._ClassVar:
    # Skip this field if it's a ClassVar.
    continue
```

I think I could avoid using typing unless I know it's already been imported, by dropping the import and changing that code to:

```python
# This is a hack for not depending on typing, unless it's already
# been loaded.
typing = sys.modules.get('typing')
if typing is not None:
    if type(a_type) is typing._ClassVar:
        # Skip this field if it's a ClassVar.
        continue
```
It seems to work:

```
% python3
>>> import sys
>>> from dataclasses import dataclass
>>> @dataclass
... class C:
...     x: int
...
>>> C.__dataclass_fields__
OrderedDict([('x', Field(name='x',type=<class 'int'>,default=_MISSING,default_factory=_MISSING,init=True,repr=True,hash=None,cmp=True))])
>>> 'typing' in sys.modules
False
>>> import typing
>>> @dataclass
... class C:
...     x: int
...     y: typing.ClassVar[int]
...
>>> C.__dataclass_fields__
OrderedDict([('x', Field(name='x',type=<class 'int'>,default=_MISSING,default_factory=_MISSING,init=True,repr=True,hash=None,cmp=True))])
>>>
```
I haven't thought through all of the implications, but I wanted to record this thought while I'm thinking about it.
EDIT: for clarity
Currently `__slots__` are not set by default (`slots=False`). I have two points here: should `__slots__` be set by default, and should there instead be a flag (say, `extendable`) that is `False` by default?

In this python-ideas post, Nick Coghlan says:
That said, even with this model, the base case of "fields with an immutable or shared default" could potentially be simplified to:

```python
from autoclass import data_record

@data_record
class Point3D:
    x: int = 0
    y: int = 0
    z: int = 0
```

However, the potentially surprising behaviour there is that to implement it, the decorator not only has to special case the output of "field()" calls, but also has to special case any object that implements the descriptor protocol, to avoid getting confused by normal method and property definitions.
I don't believe the last sentence is true, because `__annotations__` will not contain entries for properties or methods with type annotations. This issue is to remind me to add test cases for this.
The tag `last-version-with-ast` points to a version of the code that uses `ast` to create functions. The current master branch uses `exec`.

Presumably we want type annotations added to the generated methods, in particular `__init__`. This is a reminder to add annotations.
From the Abstract in the PEP, the comparison functions are given as:

```python
def __eq__(self, other):
    if other.__class__ is self.__class__:
        return ((self.name, self.unit_price, self.quantity_on_hand) ==
                (other.name, other.unit_price, other.quantity_on_hand))
    return NotImplemented
```
There's been discussion on whether this should be a subclass check, or an exact match. I plan on looking in to this and addressing it before the next PEP version. This is a placeholder to remind me, and for discussion.
In the "Specification" section, you write:

> If `cmp` and `hash` are both true, Data Classes will generate a `__hash__` for you

I believe this should be:

> If `cmp` and `frozen` are both true...
I often have instance variables that are not part of the constructor arguments. The current design doesn't seem to let me specify their types using the nice `x: int` notation, since that implies they are included in the constructor signature. E.g. (almost from asyncio):

```python
class Server:
    def __init__(self, loop, sockets):
        self._loop = loop
        self.sockets = sockets
        self._active_count = 0
        self._waiters = []
```

I'd like to add types, like so:

```python
@dataclass
class Server:
    loop: AbstractEventLoop
    sockets: List[socket.socket]
    _active_count: int = 0
    _waiters: List[Future] = []  # or field(factory=list)
```
But I'd need to have a way to say "these fields should not be part of the constructor".
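As it happens, the API that eventually shipped covers this with `field(init=False)`: such fields get type annotations and defaults but stay out of the generated `__init__`. A simplified sketch (plain `int` types stand in for the asyncio ones above):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Server:
    sockets: List[int]                                            # constructor argument
    _active_count: int = field(default=0, init=False)             # not in __init__
    _waiters: List[int] = field(default_factory=list, init=False)

s = Server([1, 2])
print(s._active_count, s._waiters)   # 0 []
```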
Which type annotation would be appropriate for `fields()`, `astuple()`, `asdict()`? Because `dataclass`es don't derive from a common ancestor, a protocol `SupportsFields` would be required to properly specify them.

```
> echo import dataclasses; dataclasses.astuple((1, 2, 3)) >xxx.py
> mypy xxx.py
```

(no warnings, as `astuple()` doesn't have a type annotation)
For:

```python
@dataclass
class C:
    x: int = 0
    y = 0
```

Is `y` a field? What are the params to `__init__()`? Just `x`? Or `x` and `y`?

`dataclass` doesn't need to look at annotations, so there's no technical reason they'd be required. My personal preference is to require them. That is, drive field discovery from `__annotations__`, not something like `[name for name in C.__dict__ if not name.startswith('_')]`.
When programming with an immutable/frozen data structure (which I personally prefer to do whenever it's reasonable), perhaps the most common operation is to replace some subset of the fields with new values, returning a new instance. It would be nice if dataclasses had similar functionality, for frozen classes at least.
Some prior art:
Python 3 dict:

```python
a = {'key1': 'value1', 'key2': 'value2'}
b = {**a, 'key1': 'VALUE1'}
```

namedtuple `_replace`:

```python
NT = namedtuple('NT', ('field1', 'field2'))
a = NT(field1='value1', field2='value2')
b = a._replace(field1='VALUE1')
```

attrs `evolve`:

```python
@attr.s
class C:
    field1 = attr.ib()
    field2 = attr.ib()

a = C(field1='value1', field2='value2')
b = attr.evolve(a, field1='VALUE1')
```

Kotlin data class `copy`:

```kotlin
data class C(val field1: String, val field2: String)

val a = C(field1 = "value1", field2 = "value2")
val b = a.copy(field1 = "VALUE1")
```

Clojure record `assoc`:

```clojure
(defrecord R [field1 field2])
(def a (R. "value1" "value2"))
(def b (assoc a :field1 "VALUE1"))
```
There are of course other examples, but I think this shows the variety of names.
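A dataclass version of `_replace`/`evolve` is a short module-level helper; this is my own sketch built on `fields()` (the stdlib later shipped an equivalent as `dataclasses.replace()`):

```python
from dataclasses import dataclass, fields

def replace(obj, **changes):
    # Collect current values of all init-able fields, overlay the
    # requested changes, and build a brand-new instance.
    kwargs = {f.name: getattr(obj, f.name) for f in fields(obj) if f.init}
    kwargs.update(changes)
    return type(obj)(**kwargs)

@dataclass(frozen=True)
class Point:
    x: int
    y: int

a = Point(1, 2)
b = replace(a, x=10)
print(b)   # Point(x=10, y=2)
print(a)   # original untouched: Point(x=1, y=2)
```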
Imagine users who are currently using named tuples or dictionaries as objects. They would probably want to switch to dataclasses when they appear, but their code likely has something like:

```python
name, age = person
# or
person['name'] = 'Eve'
```

I think it may be easier for them to switch to dataclasses if we provide something like this:

```python
@data(iterable=True)
class Point:
    x: int
    y: int

origin = Point(0, 0)
x, y = origin  # OK

@data(indexable=True)
class Person:
    name: str
    age: int

person = Person('John', 31)
name = person['name']
person['age'] = 32
```

I am not sure, but I think both flags should be `False` by default. If `iterable` is enabled, then we would add a corresponding `__iter__`; for `indexable` we would add `__getitem__` and (depending on the mutability/hashability flags) `__setitem__`.
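Both opt-in protocols are easy to prototype by hand on top of the field list; a sketch (`fields()` is the stdlib helper, while the dunder bodies are my own illustration of what could be generated):

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

    def __iter__(self):
        # Yield field values in definition order, enabling unpacking.
        return iter(getattr(self, f.name) for f in fields(self))

    def __getitem__(self, key):
        if key in {f.name for f in fields(self)}:
            return getattr(self, key)
        raise KeyError(key)

    def __setitem__(self, key, value):
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        setattr(self, key, value)

p = Person('John', 31)
name, age = p            # iterable
p['age'] = 32            # indexable
print(name, p['age'])    # John 32
```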
From a discussion with @raymondh: should we auto-generate some sort of docstring for `__init__`, or allow the user to specify one (maybe as a param to `@dataclass`)? I'm not sure a generated one wouldn't have too much noise to be useful.
The PEP doesn't say that fields are initialized in order, but I'd like to do so anyway. If a field has a `default_factory` and has `init=False`, it is initialized out of order. For this code:

```python
@dataclass
class C:
    x: int
    y: list = field(default_factory=list, init=False)
    z: int
```

the generated code looks like:

```python
def __init__(__dataclass_self__, x: _type_x, z: _type_z) -> _return_type:
    __dataclass_self__.x = x
    __dataclass_self__.z = z
    __dataclass_self__.y = _dflt_y()
```

I'll rework field initialization to make sure this is handled correctly. I think there are also likely other corner cases.
In particular, the None/True/False behavior and how it interacts with cmp is not fully tested.
After talking to @raymondh today, he questioned the need for `make_class()` and its many parameters. And he's correct: it's not needed. Python can already dynamically create classes; we can just leverage that. So, I plan to remove `make_class()`.

Instead of:

```python
C = make_class('C',
               [field('x', int),
                field('y', int, default=5),
               ])
```

we'd use:

```python
cls_dict = {'__annotations__': OrderedDict(x=int, y=int),
            'y': field(default=5),
            }
C = dataclass(type('C', (object,), cls_dict))
assert repr(C(4)) == 'C(x=4,y=5)'
```

And the beauty of this is that I can remove the `name` and `type` parameters to `field()`.
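This recipe runs essentially unchanged against the dataclasses module that eventually landed in the stdlib (the only visible difference being the space after the comma in the shipped repr):

```python
from collections import OrderedDict
from dataclasses import dataclass, field

cls_dict = {'__annotations__': OrderedDict(x=int, y=int),
            'y': field(default=5)}
C = dataclass(type('C', (object,), cls_dict))

print(C(4))   # C(x=4, y=5)
```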
In point 5 of issue #9, @ilevkivskyi suggests that setting slots should be the default, and setting slots=False would allow the caller to opt out of this behavior.
Given the typical use cases for this feature, I tend to agree. I'm opening this issue for discussion of this point.
As I've already mentioned by e-mail, I'm strongly opposed to calling this concept "data classes". Having an easy way to define many small classes with attributes is nothing about data; it's about good OO design. Calling it "data classes" implies that they differ from... "code classes", I guess?

One of the things people love about attrs is that it helps them write regular classes which they can add methods to, without any subclassing or other magic. IOW: to focus on the actual code they want to write, as opposed to generic boilerplate. Debasing them by name seems like a poor start to me. We do have data containers in the stdlib (namedtuples, SimpleNamespace), so I don't see a reason to add a third to the family, even if just by name.
The initializer should ideally be kwargs-only. That is:

```python
def __init__(self, name: str, …
```

should be:

```python
def __init__(self, *, name: str, …
```

The reason is that the automatically generated methods all use the definition order in the class. So if, for example, I add a new attribute and I want comparison to look at it before some other attribute, I might have to change the definition order, which would change the signature of `__init__` in an incompatible way. Making the method kwargs-only ensures you aren't breaking clients when you do that.
Quoting @ncoghlan on python-ideas:

Some of the key problems I personally see are that attrs reuses a general noun (attributes) rather than using other words that are more evocative of the "data record" use case, and many of the parameter names are about "how attrs works" and "how Python magic methods work" rather than "behaviours I would like this class to have".

That's fine for someone that's already comfortable writing those behaviours by hand and just wants to automate the boilerplate away (which is exactly the problem that attrs was written to solve), but it's significantly more problematic once we assume people will be using a feature like this before learning how to write out all the corresponding boilerplate themselves (which is the key additional complication that a language-level version of this will have to account for).

However, consider instead the following API sketch:

```python
from autoclass import data_record, data_field

@data_record(orderable=False, hashable=False)
class SvgTransform(SvgPicture):
    child = data_field()
    matrix = data_field(setter=numpy.asarray)
```

Here, the core concepts to be learned would be:

- the "autoclass" module lets you ask the interpreter to automatically fill in class details
- SvgTransform is a data record that cannot be hashed, and cannot be ordered
- it is a Python class inheriting from SvgPicture
- it has two defined fields, child & matrix
- we know "child" is an ordinary read/write instance attribute
- we know "matrix" is a property, using numpy.asarray as its setter

In this particular API sketch, data_record is just a class decorator factory, and data_field is a declarative helper type for use with that factory, so if you wanted to factor out particular combinations, you'd just write ordinary helper functions.

> Instead of trying to cover every possible use-case from a single decorator with a multitude of keyword arguments, I think covering the simple cases is enough. Explicitly overriding methods is not a bad thing! It is much more comprehensible to see an explicit class with methods than a decorator with multiple keyword arguments and callbacks.

This isn't the case for folks that have to actually read dunder methods to find out what a class does, though. Reading an imperatively defined class only works that way once you're able to mentally pattern match "Oh, that's a conventional `__init__`, that's a conventional `__repr__`, that's a conventional `__hash__`, that's a conventional `__eq__`, that's a conventional `__lt__` implementation, etc, etc".

Right now, telling Python "I want to do the same stock-standard things that everyone always does" means writing a lot of repetitive logic (or, more likely, copying the logic from an existing class that you or someone else wrote, and editing it to fit).

The idea behind offering some form of declarative class definitions is to build out a vocabulary of conventional class behaviours, and make that vocabulary executable such that folks can use it to write applications even if they haven't learned how it works under the hood yet. As with descriptors before it, that vocabulary may also take advantage of the fact that Python offers first class functions to allow callbacks and transformation functions to be injected at various steps in the process, without requiring you to also spell out all the other steps in the process that you don't want to alter.

> I like the namedtuple approach: I think it hits the sweet spot between "having to do everything by hand" and "everything is magical".

It's certainly a lot better than nothing at all, but it brings a lot of baggage with it due to the fact that it is a tuple. Declarative class definitions aim to offer the convenience of namedtuple definitions, without the complications that arise from the "it's a tuple with some additional metadata and behaviours" aspects.

Database object-relational-mapping layers like those in SQLAlchemy and Django would be the most famous precursors for this, but there are also things like Django Form definitions, and APIs like JSL (which uses Python classes to declaratively define JSON Schema documents).

For folks already familiar with ORMs, declarative classes are just a matter of making in-memory data structures as easy to work with as database-backed ones. For folks that aren't familiar with ORMs yet, declarative classes provide a potentially smoother learning curve, since the "declarative class" aspects can be better separated from the "object-relational mapping" aspects.
It's a tiny issue, but shouldn't there be spaces after commas in the generated reprs, i.e. `Point(x=1, y=2)` instead of `Point(x=1,y=2)`?
It would be helpful to have a list of functional differences between `dataclasses` and `attrs`, broken down by `@dataclass` vs `@attr.s` and `field` vs `attr.ib`.

This would be useful and illuminating for a few reasons:

It would make it easier to vet the logic behind, and need for, each of the proposed differences. @hynek and @Tinche have invested years of thought into the current design: deviating from it without fully understanding the history and reasoning behind each decision might lead to this project needlessly repeating mistakes. I'm glad to see that the `attrs` devs have already been brought into several issues. My hope is we can get a bird's-eye view so that nothing slips through the cracks.

If the differences aren't too great (and ideally they will not be, see above), I'd like to see a `dataclass` compatibility mode for `attrs` (e.g. `from attrs import dataclass, field`). I'm glad that this badly-needed feature is being worked on, but sadly I'm stuck on Python 2 for at least another 2 years, so it's important to me, and surely to many `attrs` users, to have an easy path to adoption once this becomes part of the stdlib.
Currently the code always uses `__dataclass_self__`, but that makes the generated code ugly in the normal case. Only use `__dataclass_self__` if there's a field named `self`.
I don't recall where I saw this, but I believe we had some example where a base class would declare a field of type `float` and the subclass would redefine it as `int`. I would just like to insert a note in that thread referencing python/mypy#3208, which points out that this is not actually a safe thing to do unless the class is immutable.
There is a need to have a number of helper functions, such as `asdict()`, `astuple()`, `replace()`, etc. This issue is not about which specific functions we need (that will come later), but rather how to make these functions available.

namedtuple handles this by having member functions that begin with an underscore, such as `._make()`, `._asdict()`, `._replace()`, `._fields`, etc.

attrs handles this by using module-level functions that take an instance as a parameter, such as `attr.fields()`, `attr.has()`, `attr.asdict()`, etc.

I'm leaning towards the attrs interface of using module-level functions. I don't like the underscores required by namedtuple's approach, and even then you could have name collisions.
As it was decided (issue #8) that `fields()`, `astuple()`, `asdict()` will be module-level functions (not member functions), users can (but shouldn't) call these for non-dataclasses too. I surmise this should raise `TypeError` (not `AttributeError`, which is raised now when `fields()` accesses the `_MARKER` attribute) to best inform users of their erroneous use:

```
>>> import dataclasses
>>> dataclasses.astuple((1, 2, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...dataclasses.py", line ...
    return tuple(getattr(obj, name) for name in fields(obj))
  File "...dataclasses.py", line ...
    return getattr(cls, _MARKER)
AttributeError: 'tuple' object has no attribute '__dataclass_fields__'
```
I don't mean this as a dig at the project here, but a serious question: is there any reason this project should exist?

Every aspect of the design here appears to be converging further in the direction of exact duplication of functionality available within `attrs`, except those which are features already on `attrs`'s roadmap. The `attrs` maintainers have carefully considered a number of subtle issues related to this problem domain, and every discussion I have observed thus far on the `dataclasses` repo has paralleled the reasoning process that went into attrs's original design or its maintenance. (Remembering, also, that `attrs` is itself a second-generation project, and there was a lot of learning that came from `characteristic`.)

I haven't had time to read all of the python-ideas thread, but all of the objections to attrs that I could find seem to have to do with whimsical naming, or a desire to shoehorn convenient class creation into a "special" role that is just for "data" and not for regular classes somehow. I shouldn't be passive-aggressive here, so I should just say: I can't follow Nick's reasoning on #1 at all :-).

The silly names could be modified by a trivially tiny fork, if that is really a deal-breaker for stdlib adoption; honestly, I find that the names grow on you. (More than one `attrs` user has separately come up with the idea that it is a lot like Python's significant whitespace.)

That said, of course there may be some reasons or some broader goal that I'm missing, but if this is the case, it seems like writing a very clear "goals / non-goals / rejected approaches" section for the PEP itself would be worthwhile. The reasons given in the PEP don't really make sense; the need to support Python 2 hasn't been a drag on Python 3 that I'm aware of, and annotation-based attribute definition is coming to attrs itself; it's a relatively small extension.
I want to make sure that the instance dict is being initialized when I think it should be. In:

```python
@dataclass
class C:
    a: int
    b: list = field(default_factory=list, init=False)
    c: list = field(default_factory=list)
    d: int = field(default=4, init=False)
    e: int = 0

c = C(0)
```
The reasoning is:

- `a` is set in `__init__`.
- `b` is set in `__init__`, even though it's not in the `__init__` param list. This is because the default factory still needs to be called from `__init__`.
- `c` is set in `__init__`.
- `d` is not set in `__init__`. It's in the class dict because it has a default value.
- `e` is set in `__init__`.

The generated `__init__` looks like:
```python
def __init__(self, a: _type_a, c: _type_c = _MISSING, e: _type_e = _dflt_e) -> _return_type:
    self.a = a
    self.b = _dflt_b()
    self.c = _dflt_c() if c is _MISSING else c
    self.e = e
```

with:

```
locals:  {'_type_a': <class 'int'>, '_type_b': <class 'list'>, '_type_c': <class 'list'>, '_type_d': <class 'int'>, '_type_e': <class 'int'>, '_return_type': None}
globals: {'_MISSING': <object object at 0x10ea26100>, '_dflt_b': <class 'list'>, '_dflt_c': <class 'list'>, '_dflt_e': 0}
```
Seeing as the scope of dataclasses still seems fairly pliable, I'll share an answer to #19 ("why not attrs?") which could greatly increase its utility: the serialization cliff.

My teammates and I have been using attrs almost as long as it's been around, and namedtuples for much longer. Great and fine solutions, if your data doesn't have to leave the process space. The boilerplate we're trying to avoid comes back in a severe way as soon as databases, SOA, or files get involved, let alone anything with complicated validation rules.

I think that to be true to the name, dataclasses need to account for the pervasiveness of data ingestion and emission. As it stands, attrs has `asdict()` and not much else: nothing to help recursively (re)construct instances from data.

There is another, somewhat more popular, approach in marshmallow, which also aims to make data-centric, serialization-agnostic container types. It's not perfect, but it's a starting point. A balance of attrs and marshmallow features may yield a powerful and sufficiently differentiated feature set that I think a lot of developers are missing in Python. The number of entries in this space seems to agree.

I'm happy to see more discussion going into these fundamentals, and I'm hopeful that more and more they take the whole data workflow into mind. :)
attrs has a `frozen=True` decorator parameter that causes all instances to be read-only. It does this by adding a `__setattr__` to the class which disallows overwriting attributes.

I like this feature, and I assume we should do the same thing. If anyone disagrees, please discuss here.
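The trick is small enough to show by hand; a minimal sketch of the `__setattr__` approach (hand-rolled illustration, not the generated code):

```python
class FrozenPoint:
    def __init__(self, x, y):
        # Bypass our own __setattr__ to perform the initial assignment.
        object.__setattr__(self, 'x', x)
        object.__setattr__(self, 'y', y)

    def __setattr__(self, name, value):
        raise AttributeError(f'cannot assign to field {name!r}')

p = FrozenPoint(1, 2)
print(p.x, p.y)   # 1 2
```

Attempting `p.x = 5` now raises `AttributeError` (the stdlib ultimately raises `FrozenInstanceError`, a subclass of it).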
`make_class` is a way to dynamically create classes, similar in spirit to `collections.namedtuple` or `attr.make_class`. The question is: what should the API be for specifying fields?

For 'normal' usage, a field requires 3 pieces of information, one of which is optional: name, type, and default value. The default value can be overridden with a `field()` call to specify not only the default value, but additional per-field attributes (use in hash, use in repr, etc.).

For `make_class`, I propose the per-field API be similar: 2 required items (name, type) and one optional one, the default or `field()`.
So, something like:

```python
C = make_class('C',
               [('a', int),
                ('x', float, 1001.0),
                ('b', str, field(default=1000, repr=False)),
               ],
               bases=(B1, B2))
```

which would be the dynamic equivalent of the static:

```python
@dataclass
class C(B1, B2):
    a: int
    x: float = 1001.0
    b: str = field(default=1000, repr=False)
```
I realize the dynamic make_class call is somewhat unsatisfying, but I don't expect it to get used at all with statically specified values. My use case is something like reading database schemas and generating a dataclass that will be the return value for each row. In that case, the list of fields would be generated in code after reading the schema.
Any suggestions for improvements here? One thought would be that instead of using a 2- or 3-tuple for each field, have another class that would represent them. I'm not sure if that's worth the hassle, though.
Input is welcomed.
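As a rough sketch of how the proposed make_class could be implemented on top of the @dataclass decorator (hypothetical, not a committed design), the tuples can be turned into an __annotations__ dict plus class-level defaults:

```python
from dataclasses import dataclass

def make_class(name, field_specs, bases=()):
    # Hypothetical sketch: build a namespace containing __annotations__ and
    # class-level defaults, create the class with type(), then decorate it.
    annotations = {}
    namespace = {'__annotations__': annotations}
    for spec in field_specs:
        if len(spec) == 2:
            fname, ftype = spec
        else:
            fname, ftype, default = spec
            namespace[fname] = default  # a plain default or a field() instance
        annotations[fname] = ftype
    return dataclass(type(name, bases, namespace))

C = make_class('C', [('a', int), ('x', float, 1001.0)])
c = C(a=1)
```

The stdlib eventually shipped a helper along these lines as dataclasses.make_dataclass.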
Since we decided in issue #8 to use module level helper functions instead of instance methods, I want to add the first such function.
dataclasses.fields(cls) will return a tuple of Field objects defined in cls. Each Field object represents one field in the class.
This will be the basic building block for a number of introspection methods.
attrs returns an object whose fields can be either indexed or accessed by field name. I think that's a good idea, but I'm not going to implement it at first.
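For illustration, the helper as eventually shipped works like this: each Field carries the per-field settings such as repr.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Point:
    x: int
    y: int = 0
    label: str = field(default='', repr=False)

# fields() returns a tuple of Field objects, in definition order.
names = [f.name for f in fields(Point)]
print(names)
```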
Since we're saying it should always match what's currently the "cmp" flag, let's just force it to be the same by getting rid of the "hash" flag. This will slightly simplify the field() call.
It occurs to me that I could also use this mechanism to generate namedtuples. I'm not saying it should happen in this PEP, but it might be worth considering when deciding on names and APIs.
In issue #14 we added support for typing.ClassVar annotations. If an annotation is a ClassVar, then it's considered to not be a field (that is, it's not set on instances of the dataclass, it's not in __init__, etc.).
There's ongoing discussion on python-ideas and python-dev about dropping the typing module from the stdlib.
I'm wondering what we should do about ClassVar if typing is in fact dropped from the stdlib.
Currently, the code doesn't import typing. It just looks for "typing" in sys.modules, and if that's present, assumes it's the typing module and looks inside of it for ClassVar. I think this is a good approach. However, if typing is no longer part of the stdlib, I guess it's possible for another module named typing to be used in its place, and then I'd need to be more defensive about looking inside sys.modules['typing']. Is that case worth worrying about? I sort of think it's not, although it would be easy enough to add a getattr(typing, 'ClassVar') to the code.
The other thing to worry about is: what if typing is removed, but something in the stdlib wants to have a dataclass with a ClassVar? In https://mail.python.org/pipermail/python-dev/2017-November/150176.html, @ncoghlan suggested having dataclasses create its own ClassVar. Another option that's just as good, although the syntax is somewhat worse, is to add a param to field() that says "this isn't really a field". Something like:
@dataclass
class C:
    x: int
    classvar: field(int, not_a_field=True)
In either event, mypy would need to know about it, to know that __init__ for the class would only have one parameter, x.
If we go with any of these approaches, I think we should still keep the code in dataclasses that understands real typing.ClassVar fields. That seems like the most natural way to write code outside of the stdlib.
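The typing.ClassVar behavior described above can be demonstrated with the module as eventually shipped: a ClassVar annotation is excluded from fields() and from __init__.

```python
from dataclasses import dataclass, fields
from typing import ClassVar

@dataclass
class C:
    x: int
    count: ClassVar[int] = 0  # not a field: a shared class attribute

c = C(x=5)   # __init__ takes only x
C.count += 1
print([f.name for f in fields(C)])
```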
Suggestion: have asdict() return an OrderedDict (like fields() already does), since order matters in a dataclass.
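For reference, asdict() as eventually shipped returns a plain dict, which on Python 3.7+ preserves insertion (and therefore field) order:

```python
from dataclasses import dataclass, asdict

@dataclass
class P:
    x: int
    y: int

d = asdict(P(1, 2))
print(d)  # keys appear in field-definition order
```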
If someone wants to create a data class in which all instances are immutable (i.e. each attribute cannot be changed after construction), I propose that an immutable parameter be added (which in the spirit of Python defaults to False). Note this is different from frozen, which applies to monkey-patching new attributes.
Currently, this can be done manually with normal classes, with a lot of boilerplate and the use of @property. In other languages, such as Kotlin, data classes are immutable by default.
A sketch of this proposal would be as follows:
@dataclass(immutable=True)
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand
Would desugar into something like:
def __init__(self, name: str, unit_price: float, quantity_on_hand: int = 0) -> None:
    self._name = name
    self._unit_price = unit_price
    self._quantity_on_hand = quantity_on_hand

@property
def name(self) -> str:
    return self._name

@property
def unit_price(self) -> float:
    return self._unit_price

@property
def quantity_on_hand(self) -> int:
    return self._quantity_on_hand
If one attempts to modify a property, an AttributeError is raised. IDEs can lint for this kind of thing while the user types, before runtime. PyCharm, for example, squiggles a warning if you try to set a property.
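The read-only behavior falls out of plain property objects: a property with no setter raises AttributeError on assignment. A small demonstration (hypothetical example class):

```python
class Item:
    def __init__(self, name: str) -> None:
        self._name = name

    @property
    def name(self) -> str:
        # No setter is defined, so assignment raises AttributeError.
        return self._name

item = Item('widget')
try:
    item.name = 'gadget'
except AttributeError:
    print('read-only')
```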
In issue #9 @ilevkivskyi suggests that we should raise an error if any class attribute we're trying to set is already used. This includes __init__, __eq__, etc. He notes that the typing module raises AttributeError in these cases.
Note that the caller can suppress the generation of __eq__ by setting cmp=False, can suppress __repr__ by setting repr=False, can suppress the generation of __init__ by setting init=False, etc.
I think raising an AttributeError if we try to overwrite a dunder method is a good idea. The caller has the choice to either write their own method and suppress ours, or use ours. But they can't try to do both.
Edit: fixed to say init=False instead of __init__=False.