Kuba's Lab Notes

Postgress progress

Kuba Suder 🇵🇱🇺🇦 January 16, 2025

I think I've finally made some good progress with my Postgres database. Things are working pretty well now, and I still have a few fixes in the queue. It's not all finished yet, but I feel like I'm coming out of the woods now.

For those out of the loop: I have a server which streams and saves all posts from the Bluesky firehose, and uses this data to run several feeds on Bluesky used by 1k+ people, plus generates daily post statistics, the directory of custom handles, and things like that.

This all has been running on SQLite so far (200+ GB file now), and it has served me well far longer than I would ever have expected, but in the end I'm hitting its limits. The main problem is lack of support for concurrent writes, which makes it unrealistic to e.g. split the work into multiple workers, which I'm gonna have to do sooner or later.

I've only really used MySQL before on servers, but everyone was telling me I need to use Postgres, so I did what any normal person would have done (right?): I've set up two test servers, one on MySQL and one on Postgres, running from two modified branches of the code, and I made the SQLite-based production server proxy some requests to one or the other in order to test them on real traffic. A sort of database A/B testing 🤪

It has been working like this for a few months now, and I've been slowly tweaking things, mostly on the Postgres side. Since I got it to work at all, they've both been working ok, but MySQL has generally been doing more write I/O doing the same things, and Postgres has been doing comparably more read I/O.

The key part are the Replies feeds (Follows & Replies and Only Replies, it's basically the same code). These two are kind of variants of the built-in Following feed, showing the connecting user their own chronological timeline, generated live, filtered in some way. These feeds work by fetching the requesting user's follows list, either from the AppView or from a local cache, and then making a query that's basically: "give me the most recent 100 posts from any of these 50/200/1000/5000 users".

There are two relevant indexes in that table: one on (repo, time) (repo = user's DID), and one on just (time). Roughly speaking, for those users who follow e.g. 80 or 200 people, it makes more sense to scan the (repo, time) index those 80-200 times and collect the 100 most recent posts from all of those found, and for those who follow e.g. 9000 (yes, that happens 😛), it's faster to scan the single (time) index until you find 100 relevant posts. But I've been struggling to make Postgres always use the right index.

At first it tried to use the (time) index way too much. I tried to find a way to force it to use a specific index, but I was told it's not possible (it might be though!). With the help of some people on Bluesky, I managed to rewrite the query to use the other index, but then it worked bad for those people with a lot of follows. I interpreted the EXPLAINs as the query planner not having a good enough info about the structure of the data (it estimated n_distinct, the total number of distinct values in a column, as much lower than the actual number of users in the table), so I started messing with the "statistics" settings. I ended up bumping up the statistics target to the maximum 10,000 (making ANALYZE take really long) and hardcoded n_distinct to 4.5M, which was closer to the total count. I also increased shared_buffers to 1 GB. This improved things somewhat, to the "not great, not terrible" level, but I was still not happy with the performance.

But eventually, I was able to improve things a lot over the last few weeks:

I'm really looking forward to when I can finally set up a production instance with Postgres and move everything there and drop the other branches - but I feel like this moment is getting close 😅

Update: I've summed up this whole adventure in a blog post here:

https://mackuba.eu/2025/10/15/three-bases-one-app/ External Link • mackuba.eu

Discussion in the ATmosphere