Faster bulk_update in Django
Django has a Model.objects.bulk_update method that lets you update a given set of fields on multiple objects, generally in a single query. While this method is a great way to speed up the update process, oftentimes it's not fast enough. Recently, at my workplace, I found myself writing a script to update half a million user records, and it was taking quite a bit of time to mutate them even after leveraging bulk update. So I wanted to see if I could use multiprocessing with .bulk_update to quicken the process even more. Turns out, yep, I can!
Here's a script that creates 100k users in a PostgreSQL database and updates their usernames via vanilla .bulk_update. Notice how we're timing the update duration:
This can be executed as a script like this:
It'll return:
A little over 9 seconds isn't too bad for 100k users, but we can do better. Here's how I've updated the above script to make it roughly 4x faster:
This script splits the updated user list into a list of smaller user chunks and assigns it to the user_chunks variable. The update_users function takes a single user chunk and runs .bulk_update on it. Then we fork a bunch of processes and run update_users over user_chunks via multiprocessing.Pool.map. Each task handed to a worker process contains 10 user chunks, as determined by the chunksize parameter of pool.map. Running the updated script will give you similar output as before but with a much smaller runtime:
This will print the following:
Whoa! This updated the records in under 2.5 seconds. Quite a bit of performance gain there.
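The chunk-and-map pattern described above can be sketched without Django at all. This is a minimal illustration, not the original script: update_users and user_chunks are the names used in this post, while chunked, run, the plain-string "users", the chunk length of 1,000, and the pool of 4 processes are all assumptions. The body of update_users is a stand-in that only counts items; in a real project it would call .bulk_update on the chunk.

```python
from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def update_users(user_chunk):
    # Stand-in for the real work. In a Django project this would be roughly:
    #     User.objects.bulk_update(user_chunk, ["username"])
    # Here we only report how many items the chunk held.
    return len(user_chunk)


def run(total_users=100_000, users_per_chunk=1_000, processes=4):
    # Hypothetical data: plain strings instead of User instances.
    users = [f"user_{i}" for i in range(total_users)]
    user_chunks = chunked(users, users_per_chunk)

    with Pool(processes=processes) as pool:
        # chunksize=10 hands each worker 10 user chunks per task.
        updated = pool.map(update_users, user_chunks, chunksize=10)
    return sum(updated)


if __name__ == "__main__":
    print(run())
```

One thing to keep in mind when swapping the stand-in for real Django calls: worker processes created by fork inherit the parent's open database connection. A common precaution is to call django.db.connections.close_all() before creating the pool so that each worker opens its own connection.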
This won't work if you're using SQLite as your database backend, since SQLite allows only one writer at a time. Running the second script against SQLite will most likely fail with a "database is locked" error.
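To see the SQLite limitation in isolation, here's a small sketch that uses Python's built-in sqlite3 module rather than Django. It's a simplification: both connections live in one process, but SQLite's locking is done on the database file, so separate processes run into the same thing. The function name, table, and file name are made up for the demo, and timeout=0 is set so the second writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked():
    """Return True if a second connection fails to write while the first holds the write lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite3")
        # isolation_level=None means we manage transactions by hand.
        first = sqlite3.connect(path, isolation_level=None)
        second = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            first.execute(
                "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)"
            )
            # BEGIN IMMEDIATE takes the write lock right away and keeps it
            # until the transaction ends.
            first.execute("BEGIN IMMEDIATE")
            first.execute("INSERT INTO users (username) VALUES ('a')")
            try:
                # The second connection now tries to write to the same file.
                second.execute("INSERT INTO users (username) VALUES ('b')")
            except sqlite3.OperationalError as exc:
                return "locked" in str(exc)
            return False
        finally:
            first.close()
            second.close()


if __name__ == "__main__":
    print(second_writer_is_blocked())
```

With a longer timeout the second writer would wait for the lock instead of failing right away, which serializes the writes and removes most of the benefit of running them in parallel.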