Big Indexing

bryan newbold (⛱️ sabbatical edition) December 6, 2025

A couple of weeks ago at Eurosky I talked to some friendly Erlang hackers who wanted to get involved with AT development. The AT network is a big data-intensive distributed system, and it is the sort of thing the BEAM runtime is well-suited for. I know Chad Miller has been using Gleam for parts of Slices, and services like relays, jetstream, and the forthcoming tap tool could all be re-implemented using Erlang-y tools. But I think these tools, written in Go, are already in a pretty good place: they are efficient enough, are not too hard to operate (IMO), and have capacity to scale. hose.cam lists 9 full-network relays from 7 distinct parties.

I think the harder technical problem in the network today is patterns for big indexing: datastores for indexing billions of records. This is an area where there aren't clear choices, and folks are feeling pain. There probably isn't going to be one solution that works for everything, and more experiments and write-ups would be welcome, particularly from folks from outside the AT ecosystem who have a hammer they love (eg, a particular database) and are looking for a nail (a fun use case).

There definitely are folks doing great work in this direction already:

Building alternative bluesky-compatible appviews is definitely a big motivating use case, but I think the need is broader than something that works for one project and codebase. At a minimum, it should be possible to develop new product features which would require additional data types and indices, like Blacksky is planning with community features. And we expect other projects and apps in the network to grow over time: we need to be ready for non-bsky record types with millions and billions of records, which will have their own unique indexing needs.

The sweet spot to me is: what systems and design patterns work well to index tens of billions of records on low-end bare metal servers? By "low-end" I mean cheap dedicated servers on the order of $100 to $600 a month: very expensive compared to a Raspberry Pi or basic VPS, but much cheaper than buying a $20k to $50k closet monster, and usually a lot cheaper than a dedicated/managed database server (eg, a big AWS RDS instance). This class of machine usually has tens of GB of RAM, 12+ vCPUs, and most importantly, several TBytes of fast directly attached NVMe storage. You can get good deals on this sort of hardware from OVH or Hetzner; don't bother with cloud providers like AWS, GCP, or Azure. Ideally the datastore would support horizontal scalability for read load and availability, but still work well on a single instance for prototyping.
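As a rough sanity check on whether "tens of billions of records" fits on "several TBytes" of NVMe, here is a back-of-envelope sketch. Every number in it is a made-up placeholder, not a measurement: the record count, the average encoded record size, and the index overhead factor are all assumptions you would replace with figures from your own datastore and record types.

```python
def estimate_storage_tb(record_count, avg_record_bytes, index_overhead=1.0):
    """Estimate on-disk size in terabytes (1 TB = 10**12 bytes).

    index_overhead is the extra space taken by indices, expressed as a
    fraction of the raw data size (1.0 means indices double the footprint).
    This ignores compression, write amplification, and free-space headroom.
    """
    raw_bytes = record_count * avg_record_bytes
    total_bytes = raw_bytes * (1.0 + index_overhead)
    return total_bytes / 1e12


# Hypothetical inputs, chosen only to illustrate the arithmetic.
records = 20_000_000_000   # assumed: 20 billion records
avg_bytes = 200            # assumed: average stored record size in bytes
overhead = 1.0             # assumed: indices as large as the raw data

print(f"{estimate_storage_tb(records, avg_bytes, overhead):.1f} TB")
```

Under those placeholder numbers the estimate comes out in the single-digit TB range, which is the same order of magnitude as the storage on this class of server. Real results will depend heavily on compression and on how many secondary indices you need.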

Some of the broad categories I can think of are:

Datastores are often branded or perceived as more transaction-oriented (OLTP) or analytics-oriented (OLAP), but can often work well enough for the opposite use case, especially if there is flexibility around performance or eventual consistency.

What I would love to see emerge is a bunch of blog posts and trip reports talking about big AT data indexing attempts, and what the resource costs, bottlenecks, and pain points were. Maybe even a benchmark/leaderboard could emerge around how long it takes to backfill the full network and how much it costs (though this might be reductive, and fair comparisons would be hard). I'm less interested in making a big list of hypothetical options, or "what about XYZ" questions: there are a bajillion ideas and options, and what we need are real attempts for real use cases.
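For anyone writing up such a trip report, the two headline numbers (backfill time and cost) fall out of simple arithmetic once you have measured a sustained ingest rate. The sketch below is only an illustration: the function name, the record count, the ingest rate, and the server price are all hypothetical, and it assumes a single server billed monthly and a constant ingest rate, which real backfills rarely achieve.

```python
HOURS_PER_MONTH = 730  # approximate average (365 * 24 / 12)


def backfill_estimate(record_count, records_per_second, monthly_server_cost):
    """Return (hours, cost) for a full backfill at a constant ingest rate.

    cost is the share of the monthly server price consumed by the backfill,
    in the same currency as monthly_server_cost.
    """
    seconds = record_count / records_per_second
    hours = seconds / 3600
    cost = monthly_server_cost * hours / HOURS_PER_MONTH
    return hours, cost


# Hypothetical inputs: replace with measured values from a real attempt.
hours, cost = backfill_estimate(
    record_count=10_000_000_000,   # assumed: 10 billion records
    records_per_second=50_000,     # assumed: sustained ingest rate
    monthly_server_cost=300.0,     # assumed: mid-range dedicated server
)
print(f"backfill: {hours:.1f} hours, roughly ${cost:.2f} of server time")
```

A report that states the measured ingest rate and the hardware price alongside these derived numbers makes it easier for others to compare attempts, even if the comparison is imperfect.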
