{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigriviufsbxjgtathiwtn4xiynaojpckzbxkzje4ciydzdohg6u6i",
"commit": {
"cid": "bafyreihtcqihp6gygfu3khf44umexufmes4tbflm4uu7eqbdkzam57wedq",
"rev": "3mlejorxorf2n"
},
"uri": "at://did:plc:2ha7bym7sxhtpt3du2lasczt/app.bsky.feed.post/3mlejorujxs2t",
"validationStatus": "valid"
},
"content": {
"$type": "pub.leaflet.content",
"pages": [
{
"$type": "pub.leaflet.pages.linearDocument",
"blocks": [
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "I'm working on a little experiment for which I want to analyze all of the starter packs people have created on Bluesky. All of the data on atproto is public by design so this shouldn't be too hard, right?"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://atproto.com/guides/backfilling"
}
],
"index": {
"byteEnd": 48,
"byteStart": 35
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#bold"
}
],
"index": {
"byteEnd": 101,
"byteStart": 94
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/bluesky-social/indigo/blob/main/cmd/tap/README.md"
}
],
"index": {
"byteEnd": 108,
"byteStart": 105
}
}
],
"plaintext": "I did some digging and came across this document about how to backfill data from the network. tl;dr: use tap with a client library. Shouldn't be too hard, right? Right??"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "wait hang on what is tap"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentFacets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://atproto.com/guides/backfilling#how-backfilling-works"
}
],
"index": {
"byteEnd": 26,
"byteStart": 5
}
}
],
"contentPlaintext": "see How Backfilling Works ",
"footnoteId": "019dff04-64cf-788b-a454-ba5b5240631e"
}
],
"index": {
"byteEnd": 59,
"byteStart": 58
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://atproto.com/guides/streaming-data"
}
],
"index": {
"byteEnd": 227,
"byteStart": 215
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "have you noticed there are a lot of birds out today?",
"footnoteId": "019e0880-0613-755a-9475-afaf2c8e4aee"
}
],
"index": {
"byteEnd": 228,
"byteStart": 227
}
}
],
"plaintext": "tap is a self-hosted service that handles the nitty-gritty* of synchronizing with the network. One important detail is that it synchronizes existing data directly from the PDS network and then processes events from the firehose* so you can stay in sync."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "The idea is that you identify what parts of the network you want to have locally and configure tap to backfill those records. tap does its magic and emits events that you can consume to integrate the data into your application."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "A complete backfill system is tap + a consumer. Consumers are custom code that takes the tap events and does whatever you need with them. In my case that's just shoving them into the database but you can get fancier if you like."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/bluesky-social/atproto/tree/main/packages/tap"
}
],
"index": {
"byteEnd": 85,
"byteStart": 73
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://ruby.sdk.blue/tapfall/"
}
],
"index": {
"byteEnd": 194,
"byteStart": 187
}
}
],
"plaintext": "You'll probably want a library to help your consumer interface with tap. @atproto/tap is the reference typescript client but there are plenty of options. I'm working in Ruby and I picked tapfall."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "how to tap"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Today I'm going to be talking about running tap locally to fetch data for development purposes but it's worth knowing that tap is designed to run as a service alongside your production infrastructure. It requires a database and you will need to write some code to process the records from tap."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "The database is used to manage tap's internal state so you can start and stop it without losing your backfill progress. tap uses sqlite by default but you can also use postgres if you like."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://go.dev/doc/install"
}
],
"index": {
"byteEnd": 32,
"byteStart": 20
}
}
],
"plaintext": "You'll need to have go installed and then you can install tap:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "shellscript",
"plaintext": "$ go install github.com/bluesky-social/indigo/cmd/tap@latest",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Then you can run tap with whatever arguments make sense for your application. Here I'm filtering it to three collections:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "shellscript",
"plaintext": "$ ~/go/bin/tap run --no-replay \\\n --collection-filters app.bsky.graph.starterpack \\\n --collection-filters app.bsky.graph.list \\\n --collection-filters app.bsky.graph.listitem",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "I'm on macos with go installed via homebrew and that's where it is for me",
"footnoteId": "019e0355-60fa-7dd5-be30-40b46bc95034"
}
],
"index": {
"byteEnd": 41,
"byteStart": 40
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 105,
"byteStart": 99
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 173,
"byteStart": 144
}
}
],
"plaintext": "Your go binaries might be somewhere else*. Also note that tap will create a sqlite database called tap.db in the current directory. You can use --db-url sqlite:///path/to/db if you want it elsewhere."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentFacets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://atproto.com/specs/repository"
}
],
"index": {
"byteEnd": 55,
"byteStart": 46
}
}
],
"contentPlaintext": "in atproto a \"repo\" is the user's datastore. read more",
"footnoteId": "019dff4a-1126-788b-a49a-09bc1bde433a"
}
],
"index": {
"byteEnd": 68,
"byteStart": 67
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/bluesky-social/indigo/blob/main/cmd/tap/README.md#http-api"
}
],
"index": {
"byteEnd": 89,
"byteStart": 81
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 185,
"byteStart": 171
}
}
],
"plaintext": "By default tap starts in a mode where you must explicitly add repos* using tap's HTTP API. Your client library probably knows how to do this. For example, in tapfall it's #add_repo(did)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/bluesky-social/indigo/blob/main/cmd/tap/README.md#network-boundary-modes"
}
],
"index": {
"byteEnd": 90,
"byteStart": 68
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 205,
"byteStart": 159
}
}
],
"plaintext": "You can continue with this mode for as long as it suits you but see Network Boundary Modes for alternatives. For my case I want every starter pack so I'll add --signal-collection app.bsky.graph.starterpack to my args which tells tap to automatically add any repo that has starter pack records."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 72,
"byteStart": 61
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 99,
"byteStart": 77
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "you hear a crow in the distance",
"footnoteId": "019e0865-0250-7eef-ae3b-84ad61e60fb1"
}
],
"index": {
"byteEnd": 100,
"byteStart": 99
}
}
],
"plaintext": "For reasons I'll discuss shortly, I'm only going to backfill starterpack and app.bsky.actor.profile* with tap so my final command line looks like this:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "shellscript",
"plaintext": "$ ~/go/bin/tap run --no-replay \\\n --signal-collection app.bsky.graph.starterpack \\\n --collection-filters app.bsky.graph.starterpack \\\n --collection-filters app.bsky.actor.profile",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "the thing about filtering"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 57,
"byteStart": 54
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 121,
"byteStart": 117
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "both your own and the atproto network",
"footnoteId": "019e02e6-d1d2-7dd5-bdef-075babce38e5"
}
],
"index": {
"byteEnd": 305,
"byteStart": 304
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "your crow friend alights upon the power line outside your window",
"footnoteId": "019e08ac-f153-755a-94ae-ca2ec2661d13"
}
],
"index": {
"byteEnd": 344,
"byteStart": 343
}
}
],
"plaintext": "tap's filtering can only take you so far. If you need all of a specific record type tap has you covered. If you need some of a specific record type then it's up to you to filter the records downstream of tap. This is easy enough to do but it does mean that you will be spending a lot of network resources* on content you're going to throw away*."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "tap protip #1: know your data model"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "This brings me to my first protip: take the time to understand the data model of the records you're interested in. I did not do that and got to enjoy the following experiences:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "backfilled an entire lexicon only to discover that it's a wrapper around a different lexicon I didn't know about"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 147,
"byteStart": 144
}
}
],
"plaintext": "backfilled the second lexicon only to discover that it's used for several different things and I just downloaded a million records I don't need and it refers to a third lexicon that I also didn't know about"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "also the third lexicon is used for several different things"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "presumably other things that I simply haven't experienced yet!"
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
},
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://lexicon.garden/lexicon/did:plc:4v4y5r3lwsbtmsxhile2ljac/app.bsky.graph.starterpack"
}
],
"index": {
"byteEnd": 123,
"byteStart": 97
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "defer?",
"footnoteId": "019e0343-3c59-7dd5-be1c-64c3ee6ea5b3"
}
],
"index": {
"byteEnd": 143,
"byteStart": 142
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
},
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://lexicon.garden/lexicon/did:plc:4v4y5r3lwsbtmsxhile2ljac/app.bsky.graph.list"
}
],
"index": {
"byteEnd": 169,
"byteStart": 150
}
}
],
"plaintext": "To give a concrete example, I'm trying to scrape every starter pack. Starter packs are stored as app.bsky.graph.starterpack objects that refer* to an app.bsky.graph.list to handle list membership. "
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 18,
"byteStart": 14
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://bsky.app/lists"
}
],
"index": {
"byteEnd": 100,
"byteStart": 76
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 161,
"byteStart": 142
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 275,
"byteStart": 270
}
}
],
"plaintext": "Unfortunately list is also how you store moderation lists and lists for the Lists tab in the bsky UI, so if you tell tap you want to backfill app.bsky.graph.list you're getting all of those as well. You can filter these out during the consumption phase but that happens after you download the records from the atproto network and process them with tap."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
},
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://lexicon.garden/lexicon/did:plc:4v4y5r3lwsbtmsxhile2ljac/app.bsky.graph.listitem"
}
],
"index": {
"byteEnd": 99,
"byteStart": 76
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 135,
"byteStart": 131
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentFacets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 73,
"byteStart": 69
}
}
],
"contentPlaintext": "\"subject\" is a general term in atproto design that means \"the record this record is about\". in the case of listitem subject is a reference to a user",
"footnoteId": "019e030b-5ed1-7dd5-be04-aea0bc1961b0"
}
],
"index": {
"byteEnd": 148,
"byteStart": 147
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 182,
"byteStart": 178
}
}
],
"plaintext": "The problem is even worse with list membership. Those records are stored as app.bsky.graph.listitem objects with references to the list and subject*. Based on my experience with list I suspect I'm going to be throwing away far more objects than I keep."
}
},
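{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "A minimal Ruby sketch of the downstream filtering for list records, for illustration only. It assumes the list record's purpose field marks starter pack lists as app.bsky.graph.defs#referencelist and that records arrive as plain Hashes. The starter_pack_list? helper and the sample records are made up. It only helps after the records have already been downloaded, so it saves disk but not bandwidth."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "ruby",
"plaintext": "# Assumption: lists backing starter packs use the \"referencelist\" purpose.\nSTARTER_PACK_PURPOSE = \"app.bsky.graph.defs#referencelist\"\n\n# Hypothetical helper: true when a list record looks like a starter pack list.\ndef starter_pack_list?(record)\n  record[\"purpose\"] == STARTER_PACK_PURPOSE\nend\n\n# Made-up sample records shaped like app.bsky.graph.list.\nrecords = [\n  { \"name\" => \"blocklist\", \"purpose\" => \"app.bsky.graph.defs#modlist\" },\n  { \"name\" => \"cool birds\", \"purpose\" => \"app.bsky.graph.defs#curatelist\" },\n  { \"name\" => \"crow pack\", \"purpose\" => \"app.bsky.graph.defs#referencelist\" }\n]\n\nkept = records.select { |record| starter_pack_list?(record) }\nputs kept.map { |record| record[\"name\"] }.inspect # prints [\"crow pack\"]",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},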
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "a (hopefully) more efficient approach"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 62,
"byteStart": 36
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 216,
"byteStart": 205
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "the crow tilts its head and peers at you",
"footnoteId": "019e088c-dbdf-755a-947c-6470ac259dcf"
}
],
"index": {
"byteEnd": 434,
"byteStart": 433
}
}
],
"plaintext": "I'm going to continue using tap for app.bsky.graph.starterpack records. These are the only records that we need help finding; every other record in our little slice of the network is related to a specific starterpack so we can crawl records starting from there. I'll also use it for bsky profile data since we always want the profile of anyone who owns a starter pack and it's easier to get that delivered than to fetch it ourselves*."
}
},
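{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Crawling outward from a starter pack starts with the at:// URI it uses to point at its list. Here's a minimal Ruby sketch of splitting such a URI into the pieces a record lookup needs. AtUri and parse_at_uri are names made up for illustration, and the sample URI is a placeholder, not a real record."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "ruby",
"plaintext": "# Hypothetical helper: split an at:// URI into repo DID, collection, and record key.\nAtUri = Struct.new(:did, :collection, :rkey)\n\ndef parse_at_uri(uri)\n  match = uri.match(%r{\\Aat://([^/]+)/([^/]+)/([^/]+)\\z})\n  raise ArgumentError, \"not a record at-uri: #{uri}\" unless match\n\n  AtUri.new(*match.captures)\nend\n\n# Placeholder URI, not a real record.\nlist = parse_at_uri(\"at://did:plc:example/app.bsky.graph.list/3kexamplerkey\")\nputs list.collection # prints app.bsky.graph.list",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},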
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://slingshot.microcosm.blue/"
}
],
"index": {
"byteEnd": 47,
"byteStart": 38
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://constellation.microcosm.blue/"
}
],
"index": {
"byteEnd": 84,
"byteStart": 71
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#didMention",
"did": "did:plc:lulmyldiq4sb2ikags5sfb25"
}
],
"index": {
"byteEnd": 128,
"byteStart": 113
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://sidekiq.org/"
}
],
"index": {
"byteEnd": 194,
"byteStart": 187
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentFacets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://en.wikipedia.org/wiki/Maildir"
}
],
"index": {
"byteEnd": 56,
"byteStart": 48
}
}
],
"contentPlaintext": "I once worked on a system that used a series of Maildirs as a queue. don't do that. you can do anything but that.",
"footnoteId": "019e047f-59f5-7558-b176-41bd8555ac9a"
}
],
"index": {
"byteEnd": 250,
"byteStart": 249
}
}
],
"plaintext": "To get the rest of the data we'll use slingshot (for list records) and constellation (for list memberships) from @microcosm.blue, facilitated by a background processing system. I'm using sidekiq because it's the first thing I thought of but anything* should work fine."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Here's the general processing flow:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.orderedList",
"children": [
{
"$type": "pub.leaflet.blocks.orderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 88,
"byteStart": 62
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 115,
"byteStart": 93
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "\"caw,\" says the crow. it doesn't seem like it should be this loud",
"footnoteId": "019e0874-7176-755a-946d-3c9f770b7ab0"
}
],
"index": {
"byteEnd": 116,
"byteStart": 115
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 163,
"byteStart": 152
}
}
],
"plaintext": "tap is started using the command line given above to backfill app.bsky.graph.starterpack and app.bsky.actor.profile* records for any repo that contains starterpack records"
}
},
{
"$type": "pub.leaflet.blocks.orderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 62,
"byteStart": 54
}
}
],
"plaintext": "the tap consumer listens for those events (as well as identity events) and injects them into the job queue for further processing"
}
},
{
"$type": "pub.leaflet.blocks.orderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "a dispatcher job picks up those records and decides what to do with them, enqueueing one or more further jobs to process them or to fetch additional data"
}
}
],
"startIndex": 1
}
},
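{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "The dispatcher step above can be sketched in plain Ruby, for illustration only. The event keys, the job names, and the Array standing in for the job queue are all assumptions here, not tap's documented event format or a real sidekiq setup."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "ruby",
"plaintext": "# Stand-in for a real job queue such as sidekiq.\nQUEUE = []\n\n# Hypothetical mapping from collection to follow-up job names.\nJOBS_FOR = {\n  \"app.bsky.graph.starterpack\" => %w[StoreStarterPack FetchList],\n  \"app.bsky.actor.profile\" => %w[StoreProfile]\n}.freeze\n\n# Dispatcher: pick the jobs for one event and enqueue them.\n# The event keys are assumed here, not taken from tap's documented format.\ndef dispatch(event)\n  jobs = JOBS_FOR.fetch(event[\"collection\"], [])\n  jobs.each { |job| QUEUE << [job, event[\"did\"], event[\"rkey\"]] }\n  jobs.size\nend\n\ndispatch({ \"collection\" => \"app.bsky.graph.starterpack\", \"did\" => \"did:plc:example\", \"rkey\" => \"abc\" })\ndispatch({ \"collection\" => \"app.bsky.feed.post\", \"did\" => \"did:plc:example\", \"rkey\" => \"def\" })\nputs QUEUE.length # prints 2",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},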
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Nothing especially groundbreaking but it works pretty well."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"alignment": "lex:pub.leaflet.pages.linearDocument#textAlignCenter",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#italic"
}
],
"index": {
"byteEnd": 30,
"byteStart": 13
}
}
],
"plaintext": "🕖🕘🕚 a few hours later 🕐🕑🕒",
"textSize": "large"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "\"CAW, CAW,\" cries the crow,\"I AM THE CROW OF FORESHADOWING. I'VE BEEN HERE THE WHOLE TIME\" before instantly vanishing without a trace",
"footnoteId": "019e0896-c307-755a-9494-a5da169a57d2"
}
],
"index": {
"byteEnd": 122,
"byteStart": 121
}
}
],
"plaintext": "\nLet me just take a big sip of protein shake and have a quick peek at my network graphs hey how did this bird get in here*"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.bskyPost",
"clientHost": "bsky.app",
"postRef": {
"cid": "bafyreiez7yswcvp6744xsvh7rspyqbf6tvxyvfpexmrliskw6bc7gocnya",
"uri": "at://did:plc:2ha7bym7sxhtpt3du2lasczt/app.bsky.feed.post/3mle27ltmxk25"
}
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "I dug into it and as best I can tell I was done fetching new records within a couple of hours and everything I processed after that was an update to either a starter pack or a profile. Here's the thing: people update their profiles a lot. And every time you get an update from tap it's the full record."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://atproto.com/specs/xrpc"
}
],
"index": {
"byteEnd": 243,
"byteStart": 239
}
}
],
"plaintext": "I don't have anything instrumented and I only have port-level network stats so I can't say for sure it was tap but it was definitely tap. That said, I did have a lot of inefficiency in my data processing flow (including a ton of redundant xrpc calls) so there are certainly multiple factors. That 12 hour window also captures the tail end of my initial testing when I was firehosing WAY too many records but I can see from the graph that it's a minor contribution."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "tap protip #2: don't run it at home"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "my 1tb data cap, of which ~450gb has now been consumed",
"footnoteId": "019e08b3-b75f-755a-94b7-852378990c1b"
}
],
"index": {
"byteEnd": 149,
"byteStart": 148
}
}
],
"plaintext": "Unless you've got good bandwidth and truly unlimited transfer I would not recommend running this kind of setup at home. I spent 250gb of my data cap* on:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "75,309 starter packs"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "75,245 lists"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "63,310 users"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "2,559,463 list membership records"
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Not a bad haul but it's certainly not 250gb on disk and I'm also sure it's not a complete dataset. Good thing my data cap resets in just 23 short days!!!!!"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "At peak I was averaging 102mbps over a 5 minute period. 12.5mbps avg/5min was the lowest I ever saw. It's just a lot of data, and remember I'm talking about small numbers of records here. If you're trying to backfill more or busier records you might be in for a Fun time. Be careful and do your job and you should be fine."
}
},
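{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "To put those rates in perspective, here's some back-of-the-envelope Ruby. It assumes mbps means megabits per second and gb means 10^9 bytes, and dividing total transfer by record count is only a rough average since updates and protocol overhead are in there too. With the numbers above it comes to roughly 90 KB of transfer per record kept."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "ruby",
"plaintext": "# Assumptions: \"mbps\" is megabits per second, \"gb\" is 10**9 bytes.\ndef gigabytes_transferred(mbps, seconds)\n  mbps * 1_000_000 / 8.0 * seconds / 1_000_000_000\nend\n\npeak = gigabytes_transferred(102, 5 * 60)   # 5 minutes at the peak rate\nfloor = gigabytes_transferred(12.5, 5 * 60) # 5 minutes at the lowest rate\n\n# Record counts from the list above.\nrecords = 75_309 + 75_245 + 63_310 + 2_559_463\nbytes_per_record = 250 * 1_000_000_000 / records\n\nputs format(\"%.2f GB per 5 min at peak\", peak)\nputs format(\"%.2f GB per 5 min at the low end\", floor)\nputs \"#{records} records, roughly #{bytes_per_record / 1000} KB of transfer each\"",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},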
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "I think a cheap VPS is probably the best way to run tap so I'll experiment with that."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "what's this about an incomplete dataset"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Yeah so honestly the data is kind of a mess. Here's everything I've found so far that should be true but isn't:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "there should be a 1:1 mapping of lists to starter packs"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "every starter pack should have a corresponding user"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"content": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "every list should have at least one member"
}
}
]
}
},
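{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Those three expectations can be checked mechanically. A minimal Ruby sketch, assuming the data has been loaded into plain Arrays and Hashes; the sample values and the owner key are made up for illustration."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "ruby",
"plaintext": "# Made-up sample data standing in for whatever ended up in the database.\n# The \"owner\" key is hypothetical.\npacks = [\n  { \"list\" => \"at://did:plc:a/app.bsky.graph.list/1\", \"owner\" => \"did:plc:a\" },\n  { \"list\" => \"at://did:plc:b/app.bsky.graph.list/2\", \"owner\" => \"did:plc:b\" }\n]\nlists = [\"at://did:plc:a/app.bsky.graph.list/1\"]\nusers = [\"did:plc:a\"]\nmembers = { \"at://did:plc:a/app.bsky.graph.list/1\" => [] }\n\n# Starter packs whose list never showed up (breaks the 1:1 mapping).\npacks_without_list = packs.reject { |pack| lists.include?(pack[\"list\"]) }\n# Starter packs whose owner is missing.\npacks_without_user = packs.reject { |pack| users.include?(pack[\"owner\"]) }\n# Lists with no known members.\nempty_lists = lists.select { |list| members.fetch(list, []).empty? }\n\nputs \"packs without a list: #{packs_without_list.size}\"\nputs \"packs without a user: #{packs_without_user.size}\"\nputs \"lists without members: #{empty_lists.size}\"",
"syntaxHighlightingTheme": "catppuccin-latte"
}
},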
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://constellation.microcosm.blue/"
}
],
"index": {
"byteEnd": 189,
"byteStart": 176
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#footnote",
"contentPlaintext": "their website says they have the past 465 days of data (as of 2026-05-08) and \"indexing new records in real time, backfill coming soon!\"",
"footnoteId": "019e08d8-9328-755a-94c7-a5dc240fd03b"
}
],
"index": {
"byteEnd": 231,
"byteStart": 230
}
}
],
"plaintext": "I should be able to fix the problems with starter packs with the data I already have but list membership is trickier. I didn't know this when I started but I have learned that constellation does not have a full copy of the network*. That means that I can't rely on it for list membership. I've already found several lists that have members in reality but that constellation does not have indexed."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 93,
"byteStart": 85
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#code"
}
],
"index": {
"byteEnd": 149,
"byteStart": 145
}
}
],
"plaintext": "I'm not entirely sure what to do about this. I think I'll need to scrape each user's listitem records and match them up with their corresponding list. I was trying to avoid that amount of manual scraping but I'm not sure I can get around it. If you have any ideas please let me know."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": ""
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.header",
"level": 2,
"plaintext": "okay that's enough for today"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#didMention",
"did": "did:plc:2ha7bym7sxhtpt3du2lasczt"
}
],
"index": {
"byteEnd": 80,
"byteStart": 66
}
}
],
"plaintext": "You said it, friend. Thanks for reading and find me on bluesky at @bleything.net if you want to tell me what I'm doing wrong or otherwise discuss any of this."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"plaintext": "Stay tuned for our next episode where I'll have figured all of this out and will also finally reveal why exactly I want all this data. Coming... soon?"
}
}
],
"id": "019dfef6-29eb-7cc3-ae7c-9dadba23cf2a"
}
]
},
"description": "using tap and ruby to fetch every Bluesky Starter Pack",
"path": "/3mlejolb4ds2t",
"publishedAt": "2026-05-08T20:04:54.175Z",
"site": "at://did:plc:2ha7bym7sxhtpt3du2lasczt/site.standard.publication/3mlejj2a5zk26",
"tags": [],
"title": "downloading a slice of atproto"
}