downloading a slice of atproto

Ben Bleything May 8, 2026

I'm working on a little experiment for which I want to analyze all of the starter packs people have created on Bluesky. All of the data on atproto is public by design so this shouldn't be too hard, right?

I did some digging and came across this document about how to backfill data from the network. tl;dr: use tap with a client library. Shouldn't be too hard, right? Right??

wait hang on what is tap

tap is a self-hosted service that handles the nitty-gritty* of synchronizing with the network. One important detail is that it synchronizes existing data directly from the PDS network and then processes events from the firehose* so you can stay in sync.

The idea is that you identify what parts of the network you want to have locally and configure tap to backfill those records. tap does its magic and emits events that you can consume to integrate the data into your application.

A complete backfill system is tap + a consumer. Consumers are custom code that takes the tap events and does whatever you need with them. In my case that's just shoving them into the database but you can get fancier if you like.
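To make the "shove them into the database" kind of consumer a bit more concrete, here's a minimal Ruby sketch. The event shape (a hash with `did`, `collection`, `rkey`, `action`, and `record` keys) is an assumption for illustration and not tap's documented wire format, `RecordStore` is a made-up name, and an in-memory Hash stands in for a real database.

```ruby
# A minimal consumer sketch. The event shape used here is an assumption made
# for illustration; check your client library for the real structure.
class RecordStore
  attr_reader :rows

  def initialize
    @rows = {} # stands in for a database table keyed by AT-URI
  end

  # Apply a single event: upsert on create/update, remove on delete.
  def apply(event)
    uri = "at://#{event.fetch(:did)}/#{event.fetch(:collection)}/#{event.fetch(:rkey)}"

    case event.fetch(:action)
    when "create", "update"
      @rows[uri] = event.fetch(:record)
    when "delete"
      @rows.delete(uri)
    else
      raise ArgumentError, "unknown action: #{event[:action].inspect}"
    end

    uri
  end
end

store = RecordStore.new
store.apply(
  did: "did:plc:example", collection: "app.bsky.graph.starterpack",
  rkey: "abc123", action: "create", record: { "name" => "cool people" }
)
```

Keying rows by AT-URI means a later update for the same record simply overwrites the earlier copy, which fits the fact that each update arrives as a full record.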

You'll probably want a library to help your consumer interface with tap. @atproto/tap is the reference typescript client but there are plenty of options. I'm working in Ruby and I picked tapfall.

how to tap

Today I'm going to be talking about running tap locally to fetch data for development purposes but it's worth knowing that tap is designed to run as a service alongside your production infrastructure. It requires a database and you will need to write some code to process the records from tap.

The database is used to manage tap's internal state so you can start and stop it without losing your backfill progress. Tap uses sqlite by default but you can also use postgres if you like.

You'll need to have go installed and then you can install tap:

$ go install github.com/bluesky-social/indigo/cmd/tap@latest

Then you can run tap with whatever arguments make sense for your application. Here I'm filtering it to three collections:

$ ~/go/bin/tap run --no-replay \
    --collection-filters app.bsky.graph.starterpack \
    --collection-filters app.bsky.graph.list \
    --collection-filters app.bsky.graph.listitem

Your go binaries might be somewhere else*. Also note that tap will create a sqlite database called tap.db in the current directory. You can use --db-url sqlite:///path/to/db if you want it elsewhere.

By default tap starts in a mode where you must explicitly add repos* using tap's HTTP API. Your client library probably knows how to do this. For example, in tapfall it's #add_repo(did).

You can continue with this mode for as long as it suits you but see Network Boundary Modes for alternatives. For my case I want every starter pack so I'll add --signal-collection app.bsky.graph.starterpack to my args which tells tap to automatically add any repo that has starter pack records.

For reasons I'll discuss shortly, I'm only going to backfill starterpack and app.bsky.actor.profile* with tap so my final command line looks like this:

$ ~/go/bin/tap run --no-replay \
    --signal-collection app.bsky.graph.starterpack \
    --collection-filters app.bsky.graph.starterpack \
    --collection-filters app.bsky.actor.profile

the thing about filtering

tap's filtering can only take you so far. If you need all of a specific record type tap has you covered. If you need some of a specific record type then it's up to you to filter the records downstream of tap. This is easy enough to do but it does mean that you will be spending a lot of network resources* on content you're going to throw away*.

tap protip #1: know your data model

This brings me to my first protip: take the time to understand the data model of the records you're interested in. I did not do that and got to enjoy the following experiences:

To give a concrete example, I'm trying to scrape every starter pack. Starter packs are stored as app.bsky.graph.starterpack objects that refer* to an app.bsky.graph.list to handle list membership.
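The reference from a starter pack to its list is an AT-URI stored in the record's `list` field. Here's a small sketch of pulling that apart into repo, collection, and record key. `AtUri` and `parse_at_uri` are hypothetical helper names, and the record values are made up.

```ruby
# Split an AT-URI of the form at://<repo>/<collection>/<rkey> into its parts.
AtUri = Struct.new(:repo, :collection, :rkey)

AT_URI_PATTERN = %r{\Aat://([^/]+)/([^/]+)/([^/]+)\z}

def parse_at_uri(uri)
  match = AT_URI_PATTERN.match(uri)
  raise ArgumentError, "not a record AT-URI: #{uri.inspect}" unless match

  AtUri.new(match[1], match[2], match[3])
end

# A trimmed-down starter pack record; the values are made up.
starterpack = {
  "$type" => "app.bsky.graph.starterpack",
  "name" => "cool people",
  "list" => "at://did:plc:example/app.bsky.graph.list/3kexample"
}

list_ref = parse_at_uri(starterpack.fetch("list"))
```

With the three parts in hand you know exactly which repo and record to go looking for.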

Unfortunately list is also how you store moderation lists and lists for the Lists tab in the bsky UI, so if you tell tap you want to backfill app.bsky.graph.list you're getting all of those as well. You can filter these out during the consumption phase but that happens after you download the records from the atproto network and process them with tap.
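For illustration, here's roughly what that consumption-phase filter could look like. It assumes that lists backing starter packs carry the `app.bsky.graph.defs#referencelist` purpose, which is worth verifying against the lexicon before relying on it. `starter_pack_list?` is a hypothetical helper and the sample records are made up.

```ruby
# Assumption: starter pack lists use the "referencelist" purpose.
REFERENCE_LIST = "app.bsky.graph.defs#referencelist"

def starter_pack_list?(record)
  record["purpose"] == REFERENCE_LIST
end

lists = [
  { "name" => "spam accounts", "purpose" => "app.bsky.graph.defs#modlist" },
  { "name" => "bird posters", "purpose" => "app.bsky.graph.defs#curatelist" },
  { "name" => "cool people", "purpose" => REFERENCE_LIST }
]

# Only the last record survives; the other two were downloaded for nothing.
kept = lists.select { |record| starter_pack_list?(record) }
```

The filter itself is trivial. The cost is everything that had to be fetched before it could run.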

The problem is even worse with list membership. Those records are stored as app.bsky.graph.listitem objects with references to the list and subject*. Based on my experience with list I suspect I'm going to be throwing away far more objects than I keep.
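A sketch of the same kind of downstream filter for memberships: keep only the listitem records whose `list` reference points at a list you already know you want. `select_memberships` is a hypothetical helper and the DIDs and record keys are made up.

```ruby
require "set"

# Keep only listitem records whose "list" field points at a wanted list.
def select_memberships(listitems, wanted_list_uris)
  wanted = wanted_list_uris.to_set
  listitems.select { |item| wanted.include?(item["list"]) }
end

wanted = ["at://did:plc:example/app.bsky.graph.list/3kexample"]

listitems = [
  { "subject" => "did:plc:alice", "list" => "at://did:plc:example/app.bsky.graph.list/3kexample" },
  { "subject" => "did:plc:bob", "list" => "at://did:plc:example/app.bsky.graph.list/3kmodlist" }
]

members = select_memberships(listitems, wanted).map { |item| item["subject"] }
```

Note that this only works once you know which lists you care about, so the list records have to be processed before the memberships.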

a (hopefully) more efficient approach

I'm going to continue using tap for app.bsky.graph.starterpack records. These are the only records that we need help finding; every other record in our little slice of the network is related to a specific starterpack so we can crawl records starting from there. I'll also use it for bsky profile data since we always want the profile of anyone who owns a starter pack and it's easier to get that delivered than to fetch it ourselves*.

To get the rest of the data we'll use slingshot (for list records) and constellation (for list memberships) from @microcosm.blue, facilitated by a background processing system. I'm using sidekiq because it's the first thing I thought of but anything* should work fine.
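As a rough sketch of the list-record half of this, here's how a background job might build the request URL for a single record using the standard `com.atproto.repo.getRecord` XRPC method. The hostname is an assumption (check the service's own docs), `get_record_url` is a hypothetical helper, and the sketch stops short of making the HTTP call.

```ruby
require "uri"

# Assumed hostname for the slingshot service; verify before using.
SLINGSHOT_HOST = "slingshot.microcosm.blue"

# Build the URL for a com.atproto.repo.getRecord request.
def get_record_url(repo:, collection:, rkey:, host: SLINGSHOT_HOST)
  query = URI.encode_www_form(repo: repo, collection: collection, rkey: rkey)
  URI::HTTPS.build(host: host, path: "/xrpc/com.atproto.repo.getRecord", query: query)
end

url = get_record_url(
  repo: "did:plc:example",
  collection: "app.bsky.graph.list",
  rkey: "3kexample"
)
```

Keeping the host as a keyword argument makes it easy to point the same job at a different server that speaks the same XRPC method.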

Here's the general processing flow:

Nothing especially groundbreaking but it works pretty well.

🕖🕘🕚 a few hours later 🕐🕑🕒


Let me just take a big sip of protein shake and have a quick peek at my network graphs hey how did this bird get in here*

I dug into it and as best I can tell I was done fetching new records within a couple of hours and everything I processed after that was an update to either a starter pack or a profile. Here's the thing: people update their profiles a lot. And every time you get an update from tap it's the full record.

I don't have anything instrumented and I only have port-level network stats so I can't say for sure it was tap but it was definitely tap. That said, I did have a lot of inefficiency in my data processing flow (including a ton of redundant xrpc calls) so there are certainly multiple factors. That 12 hour window also captures the tail end of my initial testing when I was firehosing WAY too many records but I can see from the graph that it's a minor contribution.

tap protip #2: don't run it at home

Unless you've got good bandwidth and truly unlimited transfer I would not recommend running this kind of setup at home. I spent 250gb of my data cap* on:

Not a bad haul but it's certainly not 250gb on disk and I'm also sure it's not a complete dataset. Good thing my data cap resets in just 23 short days!!!!!

At peak I was averaging 102mbps over a 5 minute period. 12.5mbps avg/5min was the lowest I ever saw. It's just a lot of data, and remember I'm talking about small numbers of records here. If you're trying to backfill more or busier records you might be in for a Fun time. Be careful and do your job and you should be fine.
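To put those rates in perspective, here's some back-of-the-envelope arithmetic. It assumes "mbps" means megabits per second and "gb" means decimal gigabytes, and it pretends each rate is perfectly steady, which real traffic is not.

```ruby
# Units are assumptions: rates in megabits per second, sizes in decimal gigabytes.
BITS_PER_MEGABIT = 1_000_000
BYTES_PER_GIGABYTE = 1_000_000_000

# Gigabytes transferred at a steady rate over a number of seconds.
def gigabytes_transferred(mbps, seconds)
  mbps * BITS_PER_MEGABIT * seconds / 8.0 / BYTES_PER_GIGABYTE
end

# Hours needed to move a number of gigabytes at a steady rate.
def hours_to_transfer(gigabytes, mbps)
  gigabytes * BYTES_PER_GIGABYTE * 8.0 / (mbps * BITS_PER_MEGABIT) / 3600
end

peak_window  = gigabytes_transferred(102, 5 * 60)  # about 3.8 GB in five minutes
quiet_window = gigabytes_transferred(12.5, 5 * 60) # about 0.47 GB in five minutes
cap_at_peak  = hours_to_transfer(250, 102)         # about 5.4 hours
cap_at_floor = hours_to_transfer(250, 12.5)        # about 44.4 hours
```

Even at the lowest observed rate, 250gb is under two days of continuous transfer.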

I think a cheap VPS is probably the best way to run tap so I'll experiment with that.

what's this about an incomplete dataset

Yeah so honestly the data is kind of a mess. Here's everything I've found so far that should be true but isn't:

I should be able to fix the problems with starter packs with the data I already have but list membership is trickier. I didn't know this when I started but I have learned that constellation does not have a full copy of the network*. That means that I can't rely on it for list membership. I've already found several lists that have members in reality but that constellation does not have indexed.

I'm not entirely sure what to do about this. I think I'll need to scrape each user's listitem records and match them up with their corresponding list. I was trying to avoid that amount of manual scraping but I'm not sure I can get around it. If you have any ideas please let me know.
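One possible shape for that manual scrape, sketched under assumptions: page through a user's listitem records using the cursor convention of `com.atproto.repo.listRecords` (each page has `records` and an optional `cursor`), then group member DIDs by the list they point at. `fetch_page` is a hypothetical callable standing in for the HTTP request, and the canned pages below are made up.

```ruby
# Collect every record across pages. `fetch_page` takes a cursor (nil for the
# first page) and returns a hash with "records" and an optional "cursor".
def all_records(fetch_page)
  records = []
  cursor = nil

  loop do
    page = fetch_page.call(cursor)
    batch = page.fetch("records")
    records.concat(batch)
    cursor = page["cursor"]
    break if cursor.nil? || batch.empty?
  end

  records
end

# Group member DIDs by the AT-URI of the list they belong to.
def members_by_list(records)
  records.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |record, grouped|
    value = record.fetch("value")
    grouped[value.fetch("list")] << value.fetch("subject")
  end
end

# Canned pages standing in for real HTTP responses.
LIST_A = "at://did:plc:example/app.bsky.graph.list/a"
PAGES = {
  nil => { "cursor" => "page2", "records" => [
    { "uri" => "at://did:plc:example/app.bsky.graph.listitem/1",
      "value" => { "list" => LIST_A, "subject" => "did:plc:alice" } }
  ] },
  "page2" => { "records" => [
    { "uri" => "at://did:plc:example/app.bsky.graph.listitem/2",
      "value" => { "list" => LIST_A, "subject" => "did:plc:bob" } }
  ] }
}

grouped = members_by_list(all_records(->(cursor) { PAGES.fetch(cursor) }))
```

Passing the page fetcher in as a callable keeps the paging logic testable without touching the network.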

okay that's enough for today

You said it, friend. Thanks for reading and find me on bluesky at @bleything.net if you want to tell me what I'm doing wrong or otherwise discuss any of this.

Stay tuned for our next episode where I'll have figured all of this out and will also finally reveal why exactly I want all this data. Coming... soon?
