Python 3.7 Data Classes
July 8, 2018
PEP 557 in the recently-released Python 3.7 added data classes to the standard Python library. Data classes can be thought of as mutable data holders and are somewhat similar to named tuples, although named tuples are immutable.
Data classes provide a lot of boilerplate code, saving time and effort on the part of the Python programmer, although it could be argued that this layer of abstraction makes debugging more difficult.
Comparing regular classes and data classes
Consider the following class:
class BankAccount():
    def __init__(self, id, balance, customer_id):
        self.id = id
        self.balance = balance
        self.customer_id = customer_id
This provides us with the minimal ability to initialise a new BankAccount object, although we’ve had to reference id, balance, and customer_id three times in this small piece of code.
Let’s initialise two new objects using our BankAccount class - my_account and your_account. We’ll initialise both with the same values, ignoring the fact that they should have different id and customer_id values, then try to compare them to each other.
>>> my_account = BankAccount(1, 0, 1)
>>> your_account = BankAccount(1, 0, 1)
>>> my_account == your_account
False
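The comparison returns False because, by default, == on instances of a regular class only checks whether both names refer to the same object. In order to compare our my_account and your_account objects by value, we’d need to add an __eq__ method to our class.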
class BankAccount():
    def __init__(self, id, balance, customer_id):
        self.id = id
        self.balance = balance
        self.customer_id = customer_id

    def __eq__(self, other):
        if self.__class__ is other.__class__:
            return (self.id, self.balance, self.customer_id) == (other.id, other.balance, other.customer_id)
        return NotImplemented
If we initialise our two objects again and compare them now, we’ll get the True response that we’re expecting. If we were to initialise the your_account object with an id value of 2, and a customer_id value of 2, we’d get the correct response of False when comparing the two objects.
>>> my_account = BankAccount(1, 0, 1)
>>> your_account = BankAccount(1, 0, 1)
>>> my_account == your_account
True
>>> your_account = BankAccount(2, 0, 2)
>>> my_account == your_account
False
This all makes sense so far, but it’s boilerplate code that we have to write each and every time that we write a new class. Let’s take a look at how we’d do the same thing with 3.7’s data classes.
from dataclasses import dataclass

@dataclass
class DataClassBankAccount():
    id: int
    balance: int
    customer_id: int
Data classes generate all of this boilerplate code for us, but they don’t stop at just the __init__ and __eq__ methods. A __repr__ method is also generated by default, and the ordering methods __lt__, __le__, __gt__, and __ge__ are generated too if the order parameter is specified as True (this is done at the @dataclass level, i.e. @dataclass(order=True)). No __ne__ method is generated, because Python already derives != from __eq__. The @dataclass decorator inspects a class definition for class variables with type annotations (added in PEP 526) and treats each one as a field. These type annotations are mandatory when creating data classes, as class attributes without type annotations will simply be ignored. We can now initialise and compare our two objects straight away:
>>> my_account = DataClassBankAccount(1, 0, 1)
>>> your_account = DataClassBankAccount(1, 0, 1)
>>> my_account == your_account
True
>>> your_account = DataClassBankAccount(2, 0, 2)
>>> my_account == your_account
False
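As a minimal sketch of the order parameter described above, the following shows the generated comparison methods and __repr__ in action. The class name OrderedBankAccount is made up for this illustration and isn’t used elsewhere in the post.

```python
from dataclasses import dataclass


@dataclass(order=True)
class OrderedBankAccount:
    # Hypothetical class, used only for this illustration.
    id: int
    balance: int
    customer_id: int


low = OrderedBankAccount(1, 0, 1)
high = OrderedBankAccount(1, 50, 1)

# Instances are compared as if they were tuples of their fields,
# in the order the fields were declared.
print(low < high)                      # True
print(sorted([high, low]) == [low, high])  # True

# __repr__ is generated whether or not order=True is passed.
print(low)  # OrderedBankAccount(id=1, balance=0, customer_id=1)
```

Without order=True, the low < high comparison would raise a TypeError, since no __lt__ method would exist on the class.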
As mentioned in PEP 557, there isn’t anything special about these classes. The decorator takes the class and adds generated methods to it, then returns the class it was given. This means adding your own methods to a data class is done in exactly the same way as you would for a regular class.
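To illustrate that point, here is a small sketch of a data class with a hand-written method alongside the generated ones. The DepositAccount class and its deposit method are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class DepositAccount:
    # Hypothetical class, used only for this illustration.
    id: int
    balance: int
    customer_id: int

    def deposit(self, amount: int) -> None:
        """Increase the balance by a positive amount."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount


account = DepositAccount(1, 0, 1)
account.deposit(25)
print(account.balance)  # 25
```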
Comparing named tuples and data classes
Let’s compare for a moment our bank account data class and an implementation of the bank account using a named tuple.
from typing import NamedTuple

class NamedTupleBankAccount(NamedTuple):
    id: int
    balance: int
    customer_id: int
There’s no great difference here, other than the fact that our data class was described using a decorator, whilst the named tuple subclasses NamedTuple. There are other similarities too. For instance, with our data class we can create a new object from an existing data class object.
>>> from dataclasses import replace
>>>
>>> replace(my_account, balance=100)
DataClassBankAccount(id=1, balance=100, customer_id=1)
We’d do this in a similar way with a named tuple, but the replace method here is preceded by an underscore. The underscore doesn’t mark the method as private; named tuples prefix their own methods this way so that they can’t clash with field names.
>>> our_account = NamedTupleBankAccount(3, 0, 3)
>>>
>>> our_account._replace(balance=100)
NamedTupleBankAccount(id=3, balance=100, customer_id=3)
The dataclasses module also provides functions for converting data class objects to dictionaries and tuples.
>>> from dataclasses import asdict, astuple
>>>
>>> asdict(my_account)
{'id': 1, 'balance': 0, 'customer_id': 1}
>>>
>>> astuple(my_account)
(1, 0, 1)
Similarly, our named tuple object has an _asdict method, with the key difference being that in Python 3.7 this returns an OrderedDict rather than a standard dict (from Python 3.8 onwards it returns a regular dict).
>>> our_account._asdict()
OrderedDict([('id', 3), ('balance', 0), ('customer_id', 3)])
You can unpack a named tuple rather simply, but must first wrap a data class object in a call to astuple before it is possible to unpack - this is because data classes don’t iterate by default.
>>> our_account_id, our_balance, our_customer_id = our_account
>>> our_account_id
3
>>>
>>> my_account_id, my_balance, my_customer_id = astuple(my_account)
>>> my_account_id
1
Data classes can’t be hashed by default, whereas named tuples can - with the default settings, data classes actually set __hash__ to None in order to avoid accidental hashability of mutable objects. Named tuples provide hashability and ordering out of the box, as they inherit both from tuple.
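The difference in hashability can be sketched as follows. The three point classes are hypothetical stand-ins chosen for brevity, and frozen=True is shown as one way of getting a hashable data class.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class MutablePoint:
    # Hypothetical class: default settings (eq=True, frozen=False).
    x: int
    y: int


@dataclass(frozen=True)
class FrozenPoint:
    # Hypothetical class: frozen, so a __hash__ is generated.
    x: int
    y: int


class TuplePoint(NamedTuple):
    # Hypothetical class: hashing is inherited from tuple.
    x: int
    y: int


# With the default settings, __hash__ is set to None.
print(MutablePoint.__hash__ is None)  # True

try:
    hash(MutablePoint(1, 2))
except TypeError as exc:
    print(f"TypeError: {exc}")

# Frozen data classes and named tuples can be used in sets or as dict keys.
print(hash(FrozenPoint(1, 2)) == hash(FrozenPoint(1, 2)))  # True
print(len({TuplePoint(1, 2), TuplePoint(1, 2)}))           # 1
```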
Equality methods between the two types are different as well. Two named tuple objects instantiated from two different named tuple classes compare as equal whenever they hold the same values, whatever their fields are called - this is because named tuples inherit tuple equality and lack the if self.__class__ is other.__class__: conditional that data classes provide in their equality methods.
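Here’s a short sketch of that difference in equality behaviour. All four classes are hypothetical and deliberately unrelated to each other.

```python
from dataclasses import dataclass
from typing import NamedTuple


class TupleAccount(NamedTuple):
    # Hypothetical named tuple class.
    id: int
    balance: int


class TupleAnimal(NamedTuple):
    # Hypothetical named tuple class with unrelated field names.
    legs: int
    eyes: int


@dataclass
class DataAccount:
    # Hypothetical data class.
    id: int
    balance: int


@dataclass
class DataAnimal:
    # Hypothetical data class with unrelated field names.
    legs: int
    eyes: int


# Named tuples use tuple equality, so only the values matter.
print(TupleAccount(4, 2) == TupleAnimal(4, 2))  # True
print(TupleAccount(4, 2) == (4, 2))             # True

# Data classes check the class first, so different classes never compare equal.
print(DataAccount(4, 2) == DataAnimal(4, 2))    # False
```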
As of Python 3.7 it is slower to access fields of a named tuple than those of a data class, though Raymond Hettinger mentions in his PyCon 2018 talk ‘Dataclasses: The code generator to end all code generators’ that this timing will be improved significantly in Python 3.8. You can find the slides for Raymond’s PyCon talk here.
You shouldn’t think of data classes as an improvement upon named tuples - if a named tuple fits the structure of your data, then that’s what you should use.
Additional data class usages
Default values
We can set default values for our specified data class fields. Let’s take a look at how we’d do that with a normal class.
class Animal:
    def __init__(self, type, legs=4):
        self.type = type
        self.legs = legs
When declaring our data class, we declare our default value(s) differently.
@dataclass
class Animal:
    type: str
    legs: int = 4
The above data class will give the below output when initialising objects.
>>> Animal("dog")
Animal(type='dog', legs=4)
>>> Animal("ostrich", 2)
Animal(type='ostrich', legs=2)
Building upon our original BankAccount class we can take a look at a more advanced default value. Let’s say for each bank account object, we want to track who accessed the bank account and when. We’ll create a more advanced BankAccount class that features this functionality.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AdvancedBankAccount():
    id: int
    balance: int = field(metadata={"currency": "GBP"})
    customer_id: int
    accessed_by: list = field(default_factory=list)

    def access(self, accessor_id):
        # Record who accessed the account and when.
        self.accessed_by.append((accessor_id, datetime.now()))
>>> advanced_account = AdvancedBankAccount(4, 10000, 4)
>>> advanced_account.access(1)
>>> advanced_account
AdvancedBankAccount(id=4, balance=10000, customer_id=4, accessed_by=[(1, datetime.datetime(2018, 7, 8, 19, 30, 40, 783467))])
The default_factory is called to create a fresh default value for each new object, which is how mutable defaults such as lists should be provided. Additionally, we’ve also passed a metadata parameter which specifies some metadata about the field, in this case the currency of the balance. The dataclass itself won’t do anything with this, but you can view it using the fields function.
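As a rough sketch of both points, the following reads the metadata back with fields and checks that default_factory gives each object its own list. MetadataBankAccount is a hypothetical, cut-down version of the class above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MetadataBankAccount:
    # Hypothetical, simplified class used only for this illustration.
    id: int
    balance: int = field(metadata={"currency": "GBP"})
    accessed_by: list = field(default_factory=list)


# fields() returns a tuple of Field objects; metadata is a read-only mapping
# that is empty for fields declared without any metadata.
for f in fields(MetadataBankAccount):
    print(f.name, dict(f.metadata))

# default_factory is called once per object, so the lists are not shared.
first = MetadataBankAccount(1, 0)
second = MetadataBankAccount(2, 0)
first.accessed_by.append("someone")
print(second.accessed_by)  # []
```

Writing accessed_by: list = [] instead would raise a ValueError when the class is defined, since a single shared list is almost never what’s wanted.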
Field arguments
We can pass some additional arguments when creating our data classes.
We can exclude a specific field from the output of the class’s __repr__ method.
from dataclasses import dataclass, field

@dataclass
class Animal():
    type: str = field(repr=False)
    legs: int = 4
And we could also exclude a specific field when comparing two objects of the same data class, using the compare argument.
from dataclasses import dataclass, field

@dataclass
class Animal():
    type: str = field(compare=False)
    legs: int = 4
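To see what these two field arguments do in practice, here’s a minimal sketch. The Pet class and its fields are hypothetical and are only here to show repr=False and compare=False side by side.

```python
from dataclasses import dataclass, field


@dataclass
class Pet:
    # Hypothetical class, used only for this illustration.
    name: str = field(compare=False)   # ignored by == (and by ordering methods)
    nickname: str = field(repr=False)  # left out of the generated __repr__
    legs: int = 4


rex = Pet("Rex", "Rexy")
fido = Pet("Fido", "Rexy")

# nickname is missing from the output because of repr=False.
print(rex)                         # Pet(name='Rex', legs=4)

# name is ignored, so only nickname and legs are compared.
print(rex == fido)                 # True
print(rex == Pet("Rex", "Other"))  # False
```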
Immutable data classes
Data classes are mutable by default, but there might be scenarios where we want to maintain the immutability that a named tuple offers us.
from dataclasses import dataclass

@dataclass(frozen=True)
class Animal():
    type: str
    legs: int
The frozen=True argument that we’ve passed to the @dataclass decorator means that we won’t be able to assign values to the fields of any objects created from this data class after their initialisation.
>>> cat = Animal("cat", 4)
>>> cat.legs = 3
dataclasses.FrozenInstanceError: cannot assign to field 'legs'
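If we do need a changed version of a frozen object, one option is to build a new object with the replace function we used earlier. The sketch below assumes a hypothetical FrozenAnimal class so that it can be run on its own.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenAnimal:
    # Hypothetical class, used only for this illustration.
    type: str
    legs: int


cat = FrozenAnimal("cat", 4)

# Assigning to a field of a frozen instance raises FrozenInstanceError,
# which is a subclass of AttributeError.
try:
    cat.legs = 3
except FrozenInstanceError as exc:
    print(f"FrozenInstanceError: {exc}")

# replace() creates a new object and leaves the original untouched.
three_legged_cat = replace(cat, legs=3)
print(three_legged_cat.legs)  # 3
print(cat.legs)               # 4
```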