{
"$type": "site.standard.document",
"canonicalUrl": "https://rednafi.com/python/django-bulk-operation-with-process-pool/",
"description": "Speed up Django bulk operations with ProcessPoolExecutor while preserving signals and hooks that bulk_create/bulk_update bypass.",
"path": "/python/django-bulk-operation-with-process-pool/",
"publishedAt": "2022-06-27T00:00:00.000Z",
"site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
"tags": [
"Python",
"Django",
"Concurrency",
"Database"
],
"textContent": "I've rarely been able to take advantage of Django's bulk_create / bulk_update APIs in\nproduction applications; especially in the cases where I need to create or update multiple\ncomplex objects with a script. Often time, these complex objects trigger a chain of signals\nor need non-trivial setups before any operations can be performed on each of them.\n\nThe issue is, bulk_create / bulk_update doesn't trigger these signals or expose any hooks\nto run any setup code. The Django doc mentions these [bulk_create caveats] in detail. Here\nare a few of them:\n\n- The model's save() method will not be called, and the pre_save and post_save signals\n will not be sent.\n- It does not work with child models in a multi-table inheritance scenario.\n- If the model's primary key is an AutoField, the primary key attribute can only be\n retrieved on certain databases (currently PostgreSQL, MariaDB 10.5+, and SQLite 3.35+). On\n other databases, it will not be set.\n- It does not work with many-to-many relationships.\n- It casts objs to a list, which fully evaluates objs if it's a generator. Here, obj is\n the iterable that passes the information necessary to create the database objects in a\n single go.\n\nTo solve this, I wanted to take advantage of Python's concurrent.futures module. It\nexposes a similar API for both thread-based and process-based concurrency. The snippet below\ncreates ten thousand user objects in the database and runs some setup code before creating\neach object.\n\nHere, the create_user_setup function runs some complex setup code before the creation of\neach user object. We wrap the user creation process in a function named create_user and\ncall the setup code in that. This allows us to run the complex setup code concurrently. The\nmagic happens in the bulk_create_users function. 
It takes in an iterable containing the\ninformation to create the users and runs the create_user function concurrently.\n\nThe ProcessPoolExecutor forks 4 processes and starts consuming the iterable. We use the\nexecutor.submit method for maximum flexibility. This allows us to further process the\nreturned value from the create_user function (in this case it's None). Running this\nsnippet will also show a progress bar as the processes start chewing through the work.\n\nYou can also try experimenting with ThreadPoolExecutor, executor.map, and chunksize. I\ndidn't choose executor.map because it's tricky to show the progress bar with map. Also,\nI encountered some psycopg2 errors in a PostgreSQL database whenever I switched to the\nThreadPoolExecutor. Another gotcha is that psycopg can complain about closed cursors, and\nclosing the database connection before running each process is a way to avoid that. Notice\nthat the script above runs django.db.connections.close_all() before entering into the\nProcessPoolExecutor context manager.\n\nThis approach will run the pre_save and post_save signals, which allows me to take\nadvantage of these hooks without losing the ability to perform concurrent row\noperations.\n\nBreadcrumbs\n\nThe example shown here performs a trivial task of creating 10k user objects. In cases like this,\nyou might find that a simple for-loop is faster. Always run at least a rudimentary\nbenchmark before adding concurrency to your workflow.\n\nAlso, this approach primarily targets ad-hoc scripts and tasks. I don't recommend forking\nmultiple processes in your views or forms since Python processes aren't cheap.\n\nFurther reading\n\n- [concurrent.futures documentation]\n\n\n\n\n[bulk_create caveats]:\n https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create\n\n[concurrent.futures documentation]:\n https://docs.python.org/3/library/concurrent.futures.html",
"title": "Bulk operations in Django with process pool"
}