Quicker startup with module-level __getattr__
This morning, someone on Twitter pointed me to PEP 562, which introduces getattr and dir at the module level. While dir helps control which attributes are printed when calling dir(module), getattr is the more interesting addition.
The getattr method in a module works similarly to how it does in a Python class. For example:
In this class, getattr defines what happens when specific attributes are accessed, allowing you to manage how missing attributes behave. Since Python 3.7, you can also define getattr at the module level to handle attribute access on the module itself.
For instance, if you have a module my_module.py:
Using this module:
If an attribute isn't found through the regular lookup (using object.getattribute), Python will look for getattr in the module's dict. If found, it calls getattr with the attribute name and returns the result. But if you're looking up a name directly as a module global, it bypasses getattr. This prevents performance issues that would arise from repeatedly invoking getattr for built-in or common attributes.
One practical use for module-level getattr is lazy-loading heavy dependencies to improve startup performance. Imagine you have a module that relies on a large library but don't need it immediately at import.
With this setup, importing heavy_module doesn't immediately import NumPy. Only when you access heavy_module.np does it trigger the import:
The first access to heavy_module.np imports NumPy (adding ~150ns), but since we cache np with globals()['np'] = np, subsequent accesses are fast, as the module now holds the reference to NumPy.
This approach is handy in scenarios like CLIs where you want to keep startup quick. For example, if you need to initialize a database connection but only for specific commands, you can defer the setup until needed.
Here's an example with SQLite (though SQLite connections are quick, imagine a slower connection here):
In this setup, nothing is instantiated when you import db_module. The connection is only initialized on the first access of db_module.connection. Later calls use the cached _connection, making subsequent access fast.
Here's how you might use it in a CLI:
When you run python cli.py greet, the CLI starts quickly since it doesn't initialize the database connection. But running python cli.py show_data accesses db_module.connection, which triggers the connection setup.
This could also be achieved by defining a function that initializes the database connection and caches it for subsequent calls. However, using module-level getattr can be more convenient if you have multiple global variables that require expensive calculations or initializations. Instead of writing separate functions for each variable, you can handle them all within the getattr method.
Here's one example of using it for a non-trivial case in the wild.
Discussion in the ATmosphere