ericvsmith / dataclasses (License: Apache License 2.0)
Guido and I discussed this yesterday, and we decided we'd just copy.copy()
the default values when creating a new instance.
The `__init__` code I'm currently generating for:

```python
@dataclass
class C:
    x: list = []
```

looks something like:

```python
def __init__(self, x=[]):
    self.x = copy.copy(x)
```
But I don't think this is what we really want. I don't think we want to call `copy.copy()` if passed an unrelated list, like `C(x=mylist)`. Maybe `__init__` should check and only call `copy.copy()` if `x is C.x`?

So, perhaps:

```python
def __init__(self, x=C.x):
    self.x = copy.copy(x) if x is C.x else x
```
(I haven't checked that this actually works as written, but the idea is to only copy the argument if it's the same object as the default that was assigned in the class creation statement.)
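The identity-check idea can be sketched in plain, hand-written Python (not the generated code; the `_DEFAULT_X` sentinel name is my own):

```python
import copy

_DEFAULT_X = []  # the single default object created at class-definition time

class C:
    def __init__(self, x=_DEFAULT_X):
        # Copy only when the caller relied on the shared default;
        # an explicitly passed list is stored as-is.
        self.x = copy.copy(x) if x is _DEFAULT_X else x

a = C()
b = C()
mylist = [1, 2]
c = C(mylist)
```

Here `a.x` and `b.x` are distinct empty lists, while `c.x` is the very object the caller passed in.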
See the discussion starting at:
https://mail.python.org/pipermail/python-dev/2017-September/149461.html
The idea is to replace the existing `cmp` flag to `@dataclass` and instead have two boolean parameters, `eq` and `compare`. If `compare` is `True`, `eq` would be forced to `True`.

If `eq` is set, generate `__eq__` and `__ne__` methods. If `compare` is set, generate `__gt__`, `__ge__`, `__lt__`, and `__le__` methods.
This would allow you to have data classes that could be compared for equality, but are unordered.
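For illustration, this split eventually shipped in the stdlib under the names `eq` and `order`; a quick demonstration of equality-without-ordering using that final API:

```python
from dataclasses import dataclass

# eq=True is the default; order=False (also the default) means no
# __lt__/__le__/__gt__/__ge__ methods are generated.
@dataclass(eq=True, order=False)
class Point:
    x: int
    y: int

print(Point(1, 2) == Point(1, 2))  # equality works: True
try:
    Point(1, 2) < Point(3, 4)      # ordering does not
except TypeError as e:
    print('unorderable:', e)
```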
I'll also work on updating the PEP to match.
Renaming the `cmp` parameter to `field()` was also suggested, but I don't have a good idea for that. I'll eventually open a separate issue for that discussion.
In #3, we discussed mutable defaults which included talk of factory functions. Having resolved #3, this issue is about how to specify default value factory functions.
Some options:
1. A new parameter to `field()`. For now, let's call it `factory`:

```python
@dataclass
class Foo:
    x: list = field(factory=list, repr=False)
```

It would be an error to specify both `default=` and `factory=`.
2. Reuse the existing `default` parameter to `field()`, marking the value as a factory function so we can determine whether to call it. For now, let's mark it by using `Factory(callable)`:

```python
@dataclass
class Foo:
    x: list = field(default=Factory(list), repr=False)
```
3. A variant of `field` used only with factory functions. For now, let's assume it's called `factory_field`. It would not have a `default=` parameter:

```python
@dataclass
class Foo:
    x: list = factory_field(list, repr=False)
```
I don't have a real preference among these. I sense we should go with whatever makes mypy's life easier, but I don't know what that would be. I suspect it would be 3, since I think mypy could be told that the type of `factory_field(callable)` is the type of `callable()`. But I'm just hypothesizing, and am interested in the opinion of experts.
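Option 2 can be prototyped with a tiny marker class. The `Factory` name comes from the proposal above; the rest is my own sketch of how the generated `__init__` could decide whether to call it:

```python
class Factory:
    """Marker wrapping a zero-argument callable used as a default."""
    def __init__(self, callable_):
        self.callable = callable_

def make_default(default):
    # Called by the generated __init__: invoke factories to get a
    # fresh value; pass plain defaults through unchanged.
    return default.callable() if isinstance(default, Factory) else default

print(make_default(Factory(list)))   # a fresh list each call: []
print(make_default(0))               # plain default unchanged: 0
```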
I think it makes sense to add tests that check that dataclasses interact nicely with `typing.Generic` and type variables. Most likely no changes are needed to the code (only tests) since we use a decorator, `@dataclass`. But `Generic` still uses a complex metaclass (this will be removed if/when PEP 560 is accepted and replaced with `__init_subclass__` and new friends).
I think it makes sense to add tests checking that things like this work:

```python
from typing import Generic, List, TypeVar
from dataclasses import dataclass

T = TypeVar('T')

@dataclass
class LabeledBox(Generic[T]):
    content: T
    label: str = '<unknown>'

box = LabeledBox(42)
assert box.content == 42
assert box.label == '<unknown>'

boxes: List[LabeledBox[int]] = []  # subscripting the resulting class should work, etc.
```
I'm curious about super calls. When a data class inherits from a non-data-class, does the generated `__init__()` method call `super().__init__()`? Before or after setting the new attributes? Similarly, when a data class inherits from another data class, does it call `super().__init__()`? In both cases, what about inheritance?
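For reference, the behavior that eventually shipped in the stdlib: the generated `__init__` does not call `super().__init__()`. When a dataclass inherits from another dataclass, the derived class's `__init__` takes the combined field list (base fields first) and assigns all the fields itself. A quick check:

```python
from dataclasses import dataclass

@dataclass
class Base:
    x: int

@dataclass
class Derived(Base):
    y: int

# One generated __init__ accepting the base field first, then the new one.
d = Derived(1, 2)
print(d.x, d.y)   # 1 2
```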
```python
@dataclass(slots=True)
class C:
    x: int
    y: int = 0

c = C(10)
self.assertEqual(repr(c), 'C(x=10,y=0)')
```

Fails with:

```
Traceback (most recent call last):
  File "/cygdrive/c/home/eric/local/dataclasses/tst.py", line 627, in test_slots
    self.assertEqual(repr(c), 'C(x=10,y=0)')
AssertionError: "C(x=10,y=<member 'y' of 'C' objects>)" != 'C(x=10,y=0)'
- C(x=10,y=<member 'y' of 'C' objects>)
+ C(x=10,y=0)
```
Currently the post-init function `__dataclass_post_init__` takes no parameters. Should it be possible to pass one or more parameters from `__init__` to the post-init function? Is it possible to make the parameter optional?

It would be nice to pass parameters to `__init__` which do not initialize fields, but are accessible to the post-init function. But it might mess up the interface too much, and there are issues with calling `__dataclass_post_init__`'s `super()` function if we decide to change the number of parameters.
This issue is a placeholder to decide this issue before the code and PEP are finalized. Any suggestions or thoughts are welcome.
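For the record, the mechanism that eventually shipped addresses exactly this: pseudo-fields annotated with `InitVar` become `__init__` parameters that are forwarded to `__post_init__` instead of being stored on the instance. A sketch using that final API:

```python
from dataclasses import dataclass, field, InitVar

@dataclass
class C:
    x: int
    scale: InitVar[int] = 1            # __init__ parameter, not a stored field
    scaled: int = field(init=False, default=0)

    def __post_init__(self, scale):
        # 'scale' is passed straight through from __init__.
        self.scaled = self.x * scale

c = C(10, scale=3)
print(c.scaled)   # 30
```

Note that `scale` never lands in the instance dict; it only exists as an argument threading through `__init__` into `__post_init__`.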
Currently the draft PEP specifies, and the code supports, the optional ability to add `__slots__`. This is the one place where `@dataclass` cannot just modify the given class and return it: because `__slots__` must be specified at class creation time, it's too late by the time the `dataclass` decorator gets control. The current approach is to dynamically generate a new class while setting `__slots__` in the new class and copying over other class attributes. The decorator then returns the new class.

The question is: do we even want to support setting `__slots__`? Is having `__slots__` important enough to justify this deviation from the "we just add a few dunder methods to your class" behavior?
I see three options:

1. Keep the current behavior: `@dataclass(slots=True)` returning a new class.
2. Drop support for `__slots__`.
3. A separate decorator, `@add_slots`, which takes a data class and creates a new class with `__slots__` set.

I think we should either go with 2 or 3. I don't mind not supporting `__slots__`, but if we do want to support it, I think it's easier to explain with a separate decorator:
```python
@add_slots
@dataclass
class C:
    x: int
    y: int
```

It would be an error to use `@add_slots` on a non-dataclass class.
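A minimal sketch of what such a decorator could look like (my own illustration, rebuilding the class from its field list; a production version would need to validate its input and handle inheritance more carefully):

```python
from dataclasses import dataclass, fields

def add_slots(cls):
    # Rebuild the class with __slots__ naming every field.
    cls_dict = dict(cls.__dict__)
    field_names = tuple(f.name for f in fields(cls))
    cls_dict['__slots__'] = field_names
    for name in field_names:
        # Remove class-level defaults; they would shadow the slot descriptors.
        # __init__ already captured the defaults, so they still apply.
        cls_dict.pop(name, None)
    # Drop descriptors tied to the old class layout.
    cls_dict.pop('__dict__', None)
    cls_dict.pop('__weakref__', None)
    return type(cls)(cls.__name__, cls.__bases__, cls_dict)

@add_slots
@dataclass
class C:
    x: int
    y: int = 0

c = C(10)
print(c.x, c.y)                 # 10 0
print(hasattr(c, '__dict__'))   # False: attributes live in slots
```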
Sorry, it is not possible to comment directly in code, so here are some assorted comments (particularly on TODO items):

```python
# what exception to raise when non-default follows default? currently
# ValueError
```

The typing module raises `TypeError`; we might want to synchronize this.
```python
# what to do if a user specifies a function we're going to overwrite,
# like __init__? error? overwrite it?
```

The typing module defines a predefined set of "infrastructure-critical" attributes. It raises `AttributeError("Cannot overwrite NamedTuple attribute " + key)` for these, and allows overwriting others. I think we should do the same here.
```python
# use typing.get_type_hints() instead of accessing __annotations__
# directly? recommended by PEP 526, but that's importing a lot just
# to get at __annotations__
```

I think it is absolutely OK to use `__annotations__` directly here, as long as we don't do anything fancy/dangerous with `__annotations__`. These classes should be as fast as possible.
```python
# is __annotations__ guaranteed to be an ordered mapping?
```

PEP 526 allows this (emphasis mine): it stores the `__annotations__` attribute of that module or class (mangled if private) as an *ordered* mapping from names to evaluated annotations.
Support setting `__slots__`: I am thinking maybe `__slots__` should be added by default, with an option to not add them?
Optional support by type checkers (mypy) and typing.py (in the sense that you can declare and use data classes without using typing). mypy supports/understands the descriptor protocol, so a custom stub can be added to typeshed that will make mypy understand dataclasses semantics correctly (also, some special-casing in mypy can be added for even better support; I can take care of this when we decide on the semantics/implementation).

Provide a way to detect if an object is an instance of any data class, similar to using `_fields` on namedtuples. I'm not sure why people want to be able to detect this, but they do, and Raymond has always suggested looking for a `_fields` attribute.

People might want some structural subtyping support for dataclasses. Potentially we can support `hasattr` and/or a custom `__subclasshook__` in the spirit of `collections.abc` and PEP 544. Static support for this is also possible (again in the spirit of PEP 544; I hope mypy will support it soon, see python/mypy#3132).
EDIT: enumerated comments to simplify reading.
Is there a simple way to avoid including an annotated (for the purpose of static checking) class variable in the list of fields? Ideally, it would be possible to just write:

```python
@easy
class C:
    x: int
    inst_count: ClassVar[int] = 0
```

A possible solution would be to use a special marker for exclusion:

```python
@easy
class C:
    x: int
    inst_count: ClassVar[int] = classvar(0)  # Just inserts 0 here in generated code
```

I don't like re-using `field` here, since `inst_count` is not actually a field.
Imagine a situation with this class:

```python
class BigClass:
    def __init__(self, x, y, z, t, w, parent):
        self.x = x
        self.y = y
        self.z = z
        self.t = t
        self.w = w
        self.tmp = TempStorage()
        self.tmp.link(parent)
        ...

    def __repr__(self):
        # other boilerplate
        ...
```

It should be possible to refactor this into:

```python
@auto
class BigClass:
    x: int
    y: int
    z: int
    t: int
    w: int

    def initialize(self, parent):  # We could choose another name, maybe the name of the class?
        self.tmp = TempStorage()
        self.tmp.link(parent)
```

A simple way to achieve this is to add code like this at the end of the generated `__init__`:

```python
def __init__(self, <inserted args>, *extra_args):
    ...
    if hasattr(self, 'initialize'):
        self.initialize(*extra_args)
```
Because `typing` is a large module, it would be nice for smaller programs to be able to use dataclasses without needing to import typing. The only place dataclasses.py uses typing is in this check:

```python
if type(a_type) is typing._ClassVar:
    # Skip this field if it's a ClassVar.
    continue
```

I think I could avoid using typing unless I know it's already been imported, by dropping the import and changing that code to:

```python
# This is a hack for not depending on typing, unless it's already
# been loaded.
typing = sys.modules.get('typing')
if typing is not None:
    if type(a_type) is typing._ClassVar:
        # Skip this field if it's a ClassVar.
        continue
```
It seems to work:

```
% python3
>>> import sys
>>> from dataclasses import dataclass
>>> @dataclass
... class C:
...     x: int
...
>>> C.__dataclass_fields__
OrderedDict([('x', Field(name='x',type=<class 'int'>,default=_MISSING,default_factory=_MISSING,init=True,repr=True,hash=None,cmp=True))])
>>> 'typing' in sys.modules
False
>>> import typing
>>> @dataclass
... class C:
...     x: int
...     y: typing.ClassVar[int]
...
>>> C.__dataclass_fields__
OrderedDict([('x', Field(name='x',type=<class 'int'>,default=_MISSING,default_factory=_MISSING,init=True,repr=True,hash=None,cmp=True))])
>>>
```
I haven't thought through all of the implications, but I wanted to record this thought while I'm thinking about it.
EDIT: for clarity
Currently `__slots__` are not set by default (`slots=False`). I have two points here: should `__slots__` be set by default, and should there instead be a flag (say, `extendable`) that is `False` by default?

In this python-ideas post, Nick Coghlan says:
That said, even with this model, the base case of "fields with an immutable or shared default" could potentially be simplified to:

```python
from autoclass import data_record

@data_record
class Point3D:
    x: int = 0
    y: int = 0
    z: int = 0
```

However, the potentially surprising behaviour there is that to implement it, the decorator not only has to special case the output of "field()" calls, but also has to special case any object that implements the descriptor protocol, to avoid getting confused by normal method and property definitions.
I don't believe the last sentence is true, because `__annotations__` will not contain entries for properties or methods with type annotations. This issue is to remind me to add test cases for this.
The tag `last-version-with-ast` points to a version of the code that uses `ast` to create functions. The current master branch uses `exec`.

Presumably we want type annotations added to the generated methods, in particular `__init__`. This is a reminder to add annotations.
From the Abstract in the PEP, the comparison functions are given as:

```python
def __eq__(self, other):
    if other.__class__ is self.__class__:
        return ((self.name, self.unit_price, self.quantity_on_hand) ==
                (other.name, other.unit_price, other.quantity_on_hand))
    return NotImplemented
```
There's been discussion on whether this should be a subclass check, or an exact match. I plan on looking in to this and addressing it before the next PEP version. This is a placeholder to remind me, and for discussion.
In the "Specification" section, you write:

> If `cmp` and `hash` are both true, Data Classes will generate a `__hash__` for you

I believe this should be:

> If `cmp` and `frozen` are both true...
I often have instance variables that are not part of the constructor arguments. The current design doesn't seem to let me specify their types using the nice `x: int` notation, since that implies they are included in the constructor signature. E.g. (almost from asyncio):

```python
class Server:
    def __init__(self, loop, sockets):
        self._loop = loop
        self.sockets = sockets
        self._active_count = 0
        self._waiters = []
```

I'd like to add types, like so:

```python
@dataclass
class Server:
    loop: AbstractEventLoop
    sockets: List[socket.socket]
    _active_count: int = 0
    _waiters: List[Future] = []  # or field(factory=list)
```
But I'd need to have a way to say "these fields should not be part of the constructor".
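As it happens, the API that eventually shipped covers this with `field(init=False)`: such fields get type annotations and defaults but stay out of the generated `__init__`. A simplified sketch (plain `int` types stand in for the asyncio ones above):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Server:
    sockets: List[int]                                            # constructor argument
    _active_count: int = field(default=0, init=False)             # not in __init__
    _waiters: List[int] = field(default_factory=list, init=False)

s = Server([1, 2])
print(s._active_count, s._waiters)   # 0 []
```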
Which type annotation would be appropriate for `fields()`, `astuple()`, `asdict()`? Because `dataclass`es don't derive from a common ancestor, a protocol `SupportsFields` would be required to properly specify them.

```
> echo import dataclasses; dataclasses.astuple((1, 2, 3)) >xxx.py
> mypy xxx.py
```

(no warnings, as `astuple()` doesn't have a type annotation)
For:

```python
@dataclass
class C:
    x: int = 0
    y = 0
```

Is `y` a field? What are the params to `__init__()`? Just `x`? Or `x` and `y`?

`dataclass` doesn't need to look at annotations, so there's no technical reason they'd be required. My personal preference is to require them. That is, drive field discovery from `__annotations__`, not something like `[name for name in C.__dict__ if not name.startswith('_')]`.
When programming with an immutable/frozen data structure (which I personally prefer to do whenever it's reasonable), perhaps the most common operation is to replace some subset of the fields with new values, returning a new instance. It would be nice if dataclasses had similar functionality, for frozen classes at least.
Some prior art:
Python 3 dict:

```python
a = {'key1': 'value1', 'key2': 'value2'}
b = {**a, 'key1': 'VALUE1'}
```

namedtuple `_replace`:

```python
NT = namedtuple('NT', ('field1', 'field2'))
a = NT(field1='value1', field2='value2')
b = a._replace(field1='VALUE1')
```

attrs `evolve`:

```python
@attr.s
class C:
    field1 = attr.ib()
    field2 = attr.ib()

a = C(field1='value1', field2='value2')
b = attr.evolve(a, field1='VALUE1')
```

Kotlin data class `copy`:

```kotlin
data class C(val field1: String, val field2: String)

val a = C(field1 = "value1", field2 = "value2")
val b = a.copy(field1 = "VALUE1")
```

Clojure record `assoc`:

```clojure
(defrecord R [field1 field2])
(def a (R. "value1" "value2"))
(def b (assoc a :field1 "VALUE1"))
```
There are of course other examples, but I think this shows the variety of names.
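A dataclass version of `_replace`/`evolve` is a short module-level helper; this is my own sketch built on `fields()` (the stdlib later shipped an equivalent as `dataclasses.replace()`):

```python
from dataclasses import dataclass, fields

def replace(obj, **changes):
    # Collect current values of all init-able fields, overlay the
    # requested changes, and build a brand-new instance.
    kwargs = {f.name: getattr(obj, f.name) for f in fields(obj) if f.init}
    kwargs.update(changes)
    return type(obj)(**kwargs)

@dataclass(frozen=True)
class Point:
    x: int
    y: int

a = Point(1, 2)
b = replace(a, x=10)
print(b)   # Point(x=10, y=2)
print(a)   # original untouched: Point(x=1, y=2)
```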
Imagine users who are currently using named tuples or dictionaries as objects. They would probably want to switch to dataclasses when they appear, but their code likely has something like:

```python
name, age = person
# or
person['name'] = 'Eve'
```

I think it may be easier for them to switch to dataclasses if we provide something like this:

```python
@data(iterable=True)
class Point:
    x: int
    y: int

origin = Point(0, 0)
x, y = origin  # OK

@data(indexable=True)
class Person:
    name: str
    age: int

person = Person('John', 31)
name = person['name']
person['age'] = 32
```

I am not sure, but I think both flags should be `False` by default. If `iterable` is enabled, then we would add a corresponding `__iter__`; for `indexable` we would add `__getitem__` and (depending on the mutability/hashability flags) `__setitem__`.
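Both opt-in protocols are easy to prototype by hand on top of the field list; a sketch (`fields()` is the stdlib helper, while the dunder bodies are my own illustration of what could be generated):

```python
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

    def __iter__(self):
        # Yield field values in definition order, enabling unpacking.
        return iter(getattr(self, f.name) for f in fields(self))

    def __getitem__(self, key):
        if key in {f.name for f in fields(self)}:
            return getattr(self, key)
        raise KeyError(key)

    def __setitem__(self, key, value):
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        setattr(self, key, value)

p = Person('John', 31)
name, age = p            # iterable
p['age'] = 32            # indexable
print(name, p['age'])    # John 32
```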
From a discussion with @raymondh: should we auto-generate some sort of docstring for `__init__`, or allow the user to specify one (maybe as a param to `@dataclass`)? I'm not sure a generated one wouldn't have too much noise to be useful.
The PEP doesn't say that fields are initialized in order, but I'd like to do so anyway. If a field has a `default_factory` and has `init=False`, it is initialized out of order. For this code:

```python
@dataclass
class C:
    x: int
    y: list = field(default_factory=list, init=False)
    z: int
```

the generated code looks like:

```python
def __init__(__dataclass_self__, x: _type_x, z: _type_z) -> _return_type:
    __dataclass_self__.x = x
    __dataclass_self__.z = z
    __dataclass_self__.y = _dflt_y()
```

I'll rework field initialization to make sure this is handled correctly. I think there are also likely other corner cases.
In particular, the None/True/False behavior and how it interacts with cmp is not fully tested.
After talking to @raymondh today, he questioned the need for `make_class()` and its many parameters. And he's correct: it's not needed. Python can already dynamically create classes; we can just leverage that. So, I plan to remove `make_class()`.

Instead of:

```python
C = make_class('C',
               [field('x', int),
                field('y', int, default=5),
               ])
```

we'd use:

```python
cls_dict = {'__annotations__': OrderedDict(x=int, y=int),
            'y': field(default=5),
            }
C = dataclass(type('C', (object,), cls_dict))
assert repr(C(4)) == 'C(x=4,y=5)'
```

And the beauty of this is that I can remove the `name` and `type` parameters to `field()`.
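This recipe runs essentially unchanged against the dataclasses module that eventually landed in the stdlib (the only visible difference being the space after the comma in the shipped repr):

```python
from collections import OrderedDict
from dataclasses import dataclass, field

cls_dict = {'__annotations__': OrderedDict(x=int, y=int),
            'y': field(default=5)}
C = dataclass(type('C', (object,), cls_dict))

print(C(4))   # C(x=4, y=5)
```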
In point 5 of issue #9, @ilevkivskyi suggests that setting slots should be the default, and setting slots=False would allow the caller to opt out of this behavior.
Given the typical use cases for this feature, I tend to agree. I'm opening this issue for discussion of this point.
As I've already mentioned by e-mail, I'm strongly opposed to calling this concept "data classes". Having an easy way to define many small classes with attributes is nothing about data; it's about good OO design. Calling it "data classes" implies that they differ from... "code classes", I guess?

One of the things people love about attrs is that it helps them write regular classes which they can add methods to, without any subclassing or other magic. IOW: to focus on the actual code they want to write, as opposed to generic boilerplate. Debasing them by name seems like a poor start to me. We do have data containers in the stdlib (namedtuples, SimpleNamespace), so I don't see a reason to add a third to the family, even if just by name.
The initializer should ideally be kwargs-only. That is:

```python
def __init__(self, name: str, …
```

should be:

```python
def __init__(self, *, name: str, …
```

The reason is that the automatically generated methods all use the definition order in the class. So if, for example, I add a new attribute and I want comparison to look at it before some other attribute, I might have to change the definition order, which would change the signature of `__init__` in an incompatible way. Making the method kwargs-only ensures you aren't breaking clients when you do that.
Quoting @ncoghlan on python-ideas:

Some of the key problems I personally see are that attrs reuses a general noun (attributes) rather than using other words that are more evocative of the "data record" use case, and many of the parameter names are about "how attrs works" and "how Python magic methods work" rather than "behaviours I would like this class to have".

That's fine for someone that's already comfortable writing those behaviours by hand and just wants to automate the boilerplate away (which is exactly the problem that attrs was written to solve), but it's significantly more problematic once we assume people will be using a feature like this before learning how to write out all the corresponding boilerplate themselves (which is the key additional complication that a language-level version of this will have to account for).

However, consider instead the following API sketch:

```python
from autoclass import data_record, data_field

@data_record(orderable=False, hashable=False)
class SvgTransform(SvgPicture):
    child = data_field()
    matrix = data_field(setter=numpy.asarray)
```

Here, the core concepts to be learned would be:

- the "autoclass" module lets you ask the interpreter to automatically fill in class details
- SvgTransform is a data record that cannot be hashed, and cannot be ordered
- it is a Python class inheriting from SvgPicture
- it has two defined fields, child & matrix
- we know "child" is an ordinary read/write instance attribute
- we know "matrix" is a property, using numpy.asarray as its setter

In this particular API sketch, data_record is just a class decorator factory, and data_field is a declarative helper type for use with that factory, so if you wanted to factor out particular combinations, you'd just write ordinary helper functions.

> Instead of trying to cover every possible use-case from a single decorator with a multitude of keyword arguments, I think covering the simple cases is enough. Explicitly overriding methods is not a bad thing! It is much more comprehensible to see an explicit class with methods than a decorator with multiple keyword arguments and callbacks.

This isn't the case for folks that have to actually read dunder methods to find out what a class does, though. Reading an imperatively defined class only works that way once you're able to mentally pattern match "Oh, that's a conventional `__init__`, that's a conventional `__repr__`, that's a conventional `__hash__`, that's a conventional `__eq__`, that's a conventional `__lt__` implementation, etc, etc".

Right now, telling Python "I want to do the same stock-standard things that everyone always does" means writing a lot of repetitive logic (or, more likely, copying the logic from an existing class that you or someone else wrote, and editing it to fit).

The idea behind offering some form of declarative class definitions is to build out a vocabulary of conventional class behaviours, and make that vocabulary executable such that folks can use it to write applications even if they haven't learned how it works under the hood yet. As with descriptors before it, that vocabulary may also take advantage of the fact that Python offers first class functions to allow callbacks and transformation functions to be injected at various steps in the process, without requiring you to also spell out all the other steps in the process that you don't want to alter.

> I like the namedtuple approach: I think it hits the sweet spot between "having to do everything by hand" and "everything is magical".

It's certainly a lot better than nothing at all, but it brings a lot of baggage with it due to the fact that it is a tuple. Declarative class definitions aim to offer the convenience of namedtuple definitions, without the complications that arise from the "it's a tuple with some additional metadata and behaviours" aspects.

Database object-relational-mapping layers like those in SQLAlchemy and Django would be the most famous precursors for this, but there are also things like Django Form definitions, and APIs like JSL (which uses Python classes to declaratively define JSON Schema documents).

For folks already familiar with ORMs, declarative classes are just a matter of making in-memory data structures as easy to work with as database-backed ones. For folks that aren't familiar with ORMs yet, declarative classes provide a potentially smoother learning curve, since the "declarative class" aspects can be better separated from the "object-relational mapping" aspects.
It's a tiny issue, but shouldn't there be spaces after commas in the generated reprs, i.e. `Point(x=1, y=2)` instead of `Point(x=1,y=2)`?
It would be helpful to have a list of functional differences between `dataclasses` and `attrs`, broken down by `@dataclass` vs `@attr.s` and `field` vs `attr.ib`.

This would be useful and illuminating for a few reasons:

It would make it easier to vet the logic behind, and need for, each of the proposed differences. @hynek and @Tinche have invested years of thought into the current design: deviating from it without fully understanding the history and reasoning behind each decision might lead to this project needlessly repeating mistakes. I'm glad to see that the `attrs` devs have already been brought into several issues. My hope is we can get a bird's-eye view so that nothing slips through the cracks.

If the differences aren't too great (and ideally they will not be, see above), I'd like to see a `dataclass` compatibility mode for `attrs` (e.g. `from attrs import dataclass, field`). I'm glad that this badly-needed feature is being worked on, but sadly I'm stuck on Python 2 for at least another 2 years, so it's important to me, and surely to many `attrs` users, to have an easy path to adoption once this becomes part of the stdlib.
Currently the code always uses `__dataclass_self__`, but that makes the generated code ugly in the normal case. Only use `__dataclass_self__` if there's a field named `self`.
I don't recall where I saw this, but I believe we had some example where a base class would declare a field of type `float` and the subclass would redefine it as `int`. I would just like to insert a note in that thread referencing python/mypy#3208, which points out that this is not actually a safe thing to do unless the class is immutable.
There is a need to have a number of helper functions, such as `asdict()`, `astuple()`, `replace()`, etc. This issue is not about which specific functions we need (that will come later), but rather how to make these functions available.

namedtuple handles this by having member functions that begin with an underscore, such as `._make()`, `._asdict()`, `._replace()`, `._fields`, etc.

attrs handles this by using module-level functions that take an instance as a parameter, such as `attr.fields()`, `attr.has()`, `attr.asdict()`, etc.

I'm leaning towards the attrs interface of using module-level functions. I don't like the underscores required by namedtuple's approach, and even then you could have name collisions.
As it was decided (issue #8) that `fields()`, `astuple()`, `asdict()` will be module-level functions (not member functions), users can (but shouldn't) call these for non-dataclasses too. I surmise this should raise `TypeError` (not `AttributeError`, which is raised now when `fields()` accesses the `_MARKER` attribute) to best inform users of their erroneous use:

```
>>> import dataclasses
>>> dataclasses.astuple((1, 2, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...dataclasses.py", line ...
    return tuple(getattr(obj, name) for name in fields(obj))
  File "...dataclasses.py", line ...
    return getattr(cls, _MARKER)
AttributeError: 'tuple' object has no attribute '__dataclass_fields__'
```
I don't mean this as a dig at the project here, but a serious question: is there any reason this project should exist?

Every aspect of the design here appears to be converging further in the direction of exact duplication of functionality available within `attrs`, except those which are features already on `attrs`'s roadmap. The `attrs` maintainers have carefully considered a number of subtle issues related to this problem domain, and every discussion I have observed thus far on the `dataclasses` repo has paralleled the reasoning process that went into attrs's original design or its maintenance. (Remembering, also, that `attrs` is itself a second-generation project, and there was a lot of learning that came from `characteristic`.)

I haven't had time to read all of the python-ideas thread, but all of the objections to attrs that I could find seem to have to do with whimsical naming, or a desire to shoehorn convenient class creation into a "special" role that is just for "data" and not for regular classes somehow. I shouldn't be passive-aggressive here, so I should just say: I can't follow Nick's reasoning on #1 at all :-).

The silly names could be modified by a trivially tiny fork, if that is really a deal-breaker for stdlib adoption; honestly, I find that the names grow on you. (More than one `attrs` user has separately come up with the idea that it is a lot like Python's significant whitespace.)

That said, of course there may be some reasons or some broader goal that I'm missing, but if this is the case, it seems like writing a very clear "goals / non-goals / rejected approaches" section for the PEP itself would be worthwhile. The reasons given in the PEP don't really make sense; the need to support Python 2 hasn't been a drag on Python 3 that I'm aware of, and annotation-based attribute definition is coming to attrs itself; it's a relatively small extension.
I want to make sure that the instance dict is being initialized when I think it should be. In:

```python
@dataclass
class C:
    a: int
    b: list = field(default_factory=list, init=False)
    c: list = field(default_factory=list)
    d: int = field(default=4, init=False)
    e: int = 0

c = C(0)
```
The reasoning is:

- `a` is set in `__init__`.
- `b` is set in `__init__`, even though it's not in the `__init__` param list. This is because the default factory still needs to be called from `__init__`.
- `c` is set in `__init__`.
- `d` is not set in `__init__`. It's in the class dict because it has a default value.
- `e` is set in `__init__`.

The generated `__init__` looks like:
```python
def __init__(self, a: _type_a, c: _type_c = _MISSING, e: _type_e = _dflt_e) -> _return_type:
    self.a = a
    self.b = _dflt_b()
    self.c = _dflt_c() if c is _MISSING else c
    self.e = e
```

with:

```
locals:  {'_type_a': <class 'int'>, '_type_b': <class 'list'>, '_type_c': <class 'list'>, '_type_d': <class 'int'>, '_type_e': <class 'int'>, '_return_type': None}
globals: {'_MISSING': <object object at 0x10ea26100>, '_dflt_b': <class 'list'>, '_dflt_c': <class 'list'>, '_dflt_e': 0}
```
Seeing as the scope of dataclasses still seems fairly pliable, I'll share an answer to #19 ("why not attrs?") which could greatly increase its utility: the serialization cliff.

My teammates and I have been using attrs almost as long as it's been around, and namedtuples for much longer. Great and fine solutions, if your data doesn't have to leave the process space. The boilerplate we're trying to avoid comes back in a severe way as soon as databases, SOA, or files get involved, let alone anything with complicated validation rules.

I think that to be true to the name, dataclasses need to account for the pervasiveness of data ingestion and emission. As it stands, attrs has `asdict()` and not much else: nothing to help recursively (re)construct instances from data.

There is another, somewhat more popular, approach in marshmallow, which also aims to make data-centric, serialization-agnostic container types. It's not perfect, but it's a starting point. A balance of attrs and marshmallow features may yield a powerful and sufficiently differentiated feature set that I think a lot of developers are missing in Python. The number of entries in this space seems to agree.

I'm happy to see more discussion going into these fundamentals, and I'm hopeful that more and more they take the whole data workflow into mind. :)
attrs has a `frozen=True` decorator parameter that causes all instances to be read-only. It does this by adding a `__setattr__` to the class which disallows overwriting attributes.

I like this feature, and I assume we should do the same thing. If anyone disagrees, please discuss here.
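The trick is small enough to show by hand; a minimal sketch of the `__setattr__` approach (hand-rolled illustration, not the generated code):

```python
class FrozenPoint:
    def __init__(self, x, y):
        # Bypass our own __setattr__ to perform the initial assignment.
        object.__setattr__(self, 'x', x)
        object.__setattr__(self, 'y', y)

    def __setattr__(self, name, value):
        raise AttributeError(f'cannot assign to field {name!r}')

p = FrozenPoint(1, 2)
print(p.x, p.y)   # 1 2
```

Attempting `p.x = 5` now raises `AttributeError` (the stdlib ultimately raises `FrozenInstanceError`, a subclass of it).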
`make_class` is a way to dynamically create classes, similar in spirit to `collections.namedtuple` or `attr.make_class`. The question is: what should the API be for specifying fields?

For 'normal' usage, a field requires 3 pieces of information, one of which is optional: name, type, and default value. The default value can be overridden with a `field()` call to specify not only the default value, but additional per-field attributes (use in hash, use in repr, etc.).

For `make_class`, I propose the per-field API be similar: 2 required items (name, type) and one optional one, the default or `field()`.
So, something like:

```python
C = make_class('C',
               [('a', int),
                ('x', float, 1001.0),
                ('b', str, field(default=1000, repr=False)),
               ],
               bases=(B1, B2))
```

which would be the dynamic equivalent of the static:

```python
@dataclass
class C(B1, B2):
    a: int
    x: float = 1001.0
    b: str = field(default=1000, repr=False)
```
I realize the dynamic make_class call is somewhat unsatisfying, but I don't expect it to get used at all with statically specified values. My use case is something like reading database schemas and generating a dataclass that will be the return value for each row. In that case, the list of fields would be generated in code after reading the schema.
Any suggestions for improvements here? One thought would be that instead of using a 2- or 3-tuple for each field, have another class that would represent them. I'm not sure if that's worth the hassle, though.
Input is welcomed.
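As a rough sketch of how the proposed make_class could be implemented on top of the @dataclass decorator (hypothetical, not a committed design), the tuples can be turned into an __annotations__ dict plus class-level defaults:

```python
from dataclasses import dataclass

def make_class(name, field_specs, bases=()):
    # Hypothetical sketch: build a namespace containing __annotations__ and
    # class-level defaults, create the class with type(), then decorate it.
    annotations = {}
    namespace = {'__annotations__': annotations}
    for spec in field_specs:
        if len(spec) == 2:
            fname, ftype = spec
        else:
            fname, ftype, default = spec
            namespace[fname] = default  # a plain default or a field() instance
        annotations[fname] = ftype
    return dataclass(type(name, bases, namespace))

C = make_class('C', [('a', int), ('x', float, 1001.0)])
c = C(a=1)
```

The stdlib eventually shipped a helper along these lines as dataclasses.make_dataclass.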
Since we decided in issue #8 to use module level helper functions instead of instance methods, I want to add the first such function.
dataclasses.fields(cls) will return a tuple of Field objects defined in cls. Each Field object represents one field in the class.
This will be the basic building block for a number of introspection methods.
attrs returns an object whose fields can be either indexed or accessed by field name. I think that's a good idea, but I'm not going to implement it at first.
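For illustration, the helper as eventually shipped works like this: each Field carries the per-field settings such as repr.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Point:
    x: int
    y: int = 0
    label: str = field(default='', repr=False)

# fields() returns a tuple of Field objects, in definition order.
names = [f.name for f in fields(Point)]
print(names)
```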
Since we're saying it should always match what's currently the "cmp" flag, let's just force it to be the same by getting rid of the "hash" flag. This will slightly simplify the field() call.
It occurs to me that I could also use this mechanism to generate namedtuples. I'm not saying it should happen in this PEP, but it might be worth considering when deciding on names and APIs.
In issue #14 we added support for typing.ClassVar annotations. If an annotation is a ClassVar, then it's considered to not be a field (that is, it's not set on instances of the dataclass, it's not in __init__, etc.).
There's ongoing discussion on python-ideas and python-dev about dropping the typing module from the stdlib.
I'm wondering what we should do about ClassVar if typing is in fact dropped from the stdlib.
Currently, the code doesn't import typing. It just looks for "typing" in sys.modules, and if that's present, assumes it's the typing module and looks inside of it for ClassVar. I think this is a good approach. However, if typing is no longer part of the stdlib, I guess it's possible for another module named typing to be used in its place, and then I'd need to be more defensive about looking inside sys.modules['typing']. Is that case worth worrying about? I sort of think it's not, although it would be easy enough to add a getattr(typing, 'ClassVar') to the code.
The other thing to worry about is: what if typing is removed, but something in the stdlib wants to have a dataclass with a ClassVar? In https://mail.python.org/pipermail/python-dev/2017-November/150176.html, @ncoghlan suggested having dataclasses create its own ClassVar. Another option that's just as good, although the syntax is somewhat worse, is to add a param to field() that says "this isn't really a field". Something like:
@dataclass
class C:
    x: int
    classvar: field(int, not_a_field=True)
In either event, mypy would need to know about it, to know that __init__ for the class would only have one parameter, x.
If we go with any of these approaches, I think we should still keep the code in dataclasses that understands real typing.ClassVar fields. That seems like the most natural way to write code outside of the stdlib.
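The typing.ClassVar behavior described above can be demonstrated with the module as eventually shipped: a ClassVar annotation is excluded from fields() and from __init__.

```python
from dataclasses import dataclass, fields
from typing import ClassVar

@dataclass
class C:
    x: int
    count: ClassVar[int] = 0  # not a field: a shared class attribute

c = C(x=5)   # __init__ takes only x
C.count += 1
print([f.name for f in fields(C)])
```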
Suggestion: have asdict() return an OrderedDict (like fields() already does), since order matters in a dataclass.
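For reference, asdict() as eventually shipped returns a plain dict, which on Python 3.7+ preserves insertion (and therefore field) order:

```python
from dataclasses import dataclass, asdict

@dataclass
class P:
    x: int
    y: int

d = asdict(P(1, 2))
print(d)  # keys appear in field-definition order
```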
If someone wants to create a data class in which all instances are immutable (i.e. each attribute cannot be changed after construction), I propose that an immutable parameter be added (which in the spirit of Python defaults to False). Note this is different from frozen, which applies to monkey-patching new attributes.
Currently, this can be done manually with normal classes, with a lot of boilerplate and the use of @property. In other languages, such as Kotlin, data classes are immutable by default.
A sketch of this proposal would be as follows:
@dataclass(immutable=True)
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand
Would desugar into something like:
def __init__(self, name: str, unit_price: float, quantity_on_hand: int = 0) -> None:
    self._name = name
    self._unit_price = unit_price
    self._quantity_on_hand = quantity_on_hand

@property
def name(self) -> str:
    return self._name

@property
def unit_price(self) -> float:
    return self._unit_price

@property
def quantity_on_hand(self) -> int:
    return self._quantity_on_hand
If one attempts to modify a property, an AttributeError is raised. IDEs can lint for this kind of thing while the user types, before runtime. PyCharm, for example, squiggles a warning if you try to set a property.
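The read-only behavior falls out of plain property objects: a property with no setter raises AttributeError on assignment. A small demonstration (hypothetical example class):

```python
class Item:
    def __init__(self, name: str) -> None:
        self._name = name

    @property
    def name(self) -> str:
        # No setter is defined, so assignment raises AttributeError.
        return self._name

item = Item('widget')
try:
    item.name = 'gadget'
except AttributeError:
    print('read-only')
```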
In issue #9 @ilevkivskyi suggests that we should raise an error if any class attribute we're trying to set is already used. This includes __init__, __eq__, etc. He notes that the typing module raises AttributeError in these cases.
Note that the caller can suppress the generation of __eq__ by setting cmp=False, can suppress __repr__ by setting repr=False, can suppress the generation of __init__ by setting init=False, etc.
I think raising an AttributeError if we try to overwrite a dunder method is a good idea. The caller has the choice to either write their own method and suppress ours, or use ours. But they can't try to do both.
Edit: fixed to say init=False instead of __init__=False.