{
"$type": "site.standard.document",
"content": {
"$type": "site.standard.content.markdown",
"text": "I've spent the last few months working on indexing and [building data pipelines for the Filecoin blockchain](https://github.com/filecoin-project/filet). While it's been a great and exciting learning experience, I've realized the space can learn a few things from the so-called Modern Data ecosystem.\n\nThe main thing I'd love to explore is how, as a community, we're building the same pipelines over and over and not collaborating on them. Let's start with a bit of context.\n\n## Existing Projects\n\nIf you do a bit of browsing, you'll find many companies and tools building ETLs for different blockchains. I compiled this non-exhaustive list of tools to index chains and companies providing the final datasets. I'm sure I missed a few, so please [let me know](https://twitter.com/davidgasquez) if you know of any other projects!\n\n### Companies\n\n- [Nansen](https://www.nansen.ai/)\n- [Tokenflow](https://docs.tokenflow.live/)\n- [BitQuery](https://bitquery.io/) ([GitHub](https://github.com/bitquery/explorer))\n- [Coherent](https://coherent.xyz/) ([GitHub](https://github.com/coherentopensource))\n- [Transpose](https://www.transpose.io/) ([GitHub](https://github.com/TransposeData))\n- [Covalent](https://www.covalenthq.com/) ([GitHub](https://github.com/covalenthq))\n- [Indexed.xyz](https://github.com/indexed-xyz)\n- [Footprint](https://www.footprint.network/) ([GitHub](https://github.com/footprintanalytics))\n- [Sentio](https://www.sentio.xyz/) ([GitHub](https://github.com/sentioxyz))\n- [GeniiData](https://geniidata.com/)\n- [Allium](https://twitter.com/alliumlabs)\n- [Kyve](https://www.kyve.network/)\n- [Token Terminal](https://tokenterminal.com/)\n- [Probably Nothing Labs](https://www.probablynothinglabs.xyz/)\n- [Space And Time](https://www.spaceandtime.io/)\n- [Credmark](https://credmark.com/) ([GitHub](https://github.com/credmark))\n\n### Tools\n\n- [Trueblocks](https://trueblocks.io/)\n- [Blockchain ETL](https://github.com/blockchain-etl)\n- 
[Mars](https://github.com/deepeth/mars)\n- [Algorand Indexer](https://github.com/algorand/indexer)\n- [Tezos Indexer](https://github.com/baking-bad/tzkt)\n- [db3](https://github.com/db3-teams/db3)\n- [IceFireDB](https://www.icefiredb.xyz/icefiredb_docs/)\n- [Apollo](https://github.com/chainbound/apollo)\n- [Ether SQL](https://github.com/analyseether/ether_sql)\n- [Spec](https://github.com/spec-dev)\n- [Cosmos ETL](https://github.com/bizzyvinci/cosmos-etl)\n- [Luabase](https://github.com/luabase)\n- [Digital Assets Examples](https://github.com/aws-samples/digital-assets-examples)\n\n## The Problem\n\nAfter [compiling the list](https://publish.obsidian.md/davidgasquez/Web3#Blockchain+Indexing+Projects), I realized that only a few of these projects are open. **We can read the source code of the chains we use, but can't read the code of their data pipelines?** That's a bit weird. Especially when the data world is moving in the opposite direction [^1]!\n\nAs the folks from [OpenDataCommunity](https://opendatacommunity.org/) pointed out, the data layer is, currently, mostly centralized [^2] and closed source, properties that don't represent the open spirit of the movement.\n\nOn the other hand, projects like [Airbyte](https://airbyte.com/), [Estuary's Flow](https://github.com/estuary/flow), [Meltano](https://meltano.com/), [Cloudquery](https://github.com/cloudquery/cloudquery), and many others from the [MDS](https://www.moderndatastack.xyz/), are not only building tools to extract, transform, and load data, but also working on standards and protocols. And the community is building on top of them. This makes it possible to have an end-to-end data pipeline in a matter of minutes.\n\n1. Connect a [Stripe source](https://github.com/singer-io/tap-stripe) to your warehouse. A few clicks.\n2. Import the data. Some extra clicks.\n3. Add a [dbt Package](https://hub.getdbt.com/fivetran/stripe/latest/) to your project.\n4. Done. 
You've gone from zero to a somewhat complex report of your Stripe subscriptions!\n\nWouldn't it be great if we could do the same with blockchain data? **Add the `tap-ethereum` connector, install the `dbt-ethereum` package and start to collaborate! [^3]**\n\nPersonally, I think projects like [Trueblocks](https://trueblocks.io/), [Kamu](https://kamu.dev), and [Bacalhau](https://bacalhau.org/) hold the key to opening the data layer a bit more [^4], but that won't be possible without a community and standards we can all agree upon and share.\n\n[^1]: And blockchain data is such a [great candidate for the open data movement](https://publish.obsidian.md/davidgasquez/Open+Data). **Chain data is open, verifiable and immutable! All great properties for open data.**\n\n[^2]: And I totally get it. Centralization makes working with data a less painful job!\n\n[^3]: Dune has [the spellbook](https://github.com/duneanalytics/spellbook) and it is awesome. I wish they did something similar for their pipelines though.\n\n[^4]: [Bacalhau](https://www.bacalhau.org/) indexes the chain with [Trueblocks](https://trueblocks.io/), puts the data into Parquet files on IPFS and you query it from anywhere using DuckDB!",
"version": "1.0"
},
"description": "I've spent the last few months working on indexing and building data pipelines for the Filecoin blockchain. While it's been a great and exciting learning experience, I've realized the space can learn a few things from the so-called Modern Data ecosystem. The main thing I'd lov...",
"path": "/blockchain-data-pipelines",
"publishedAt": "2023-04-07T00:00:00.000Z",
"site": "at://did:plc:4z5i7njrld66ew36htufcwry/site.standard.publication/3mo43d2tmt2ov",
"textContent": "I've spent the last few months working on indexing and building data pipelines for the Filecoin blockchain. While it's been a great and exciting learning experience, I've realized the space can learn a few things from the so-called Modern Data ecosystem.\n\nThe main thing I'd love to explore is how, as a community, we're building the same pipelines over and over and not collaborating on them. Let's start with a bit of context.\n\nExisting Projects\n\nIf you do a bit of browsing, you'll find many companies and tools building ETLs for different blockchains. I compiled this non-exhaustive list of tools to index chains and companies providing the final datasets. I'm sure I missed a few, so please let me know if you know of any other projects!\n\nCompanies\nNansen\nTokenflow\nBitQuery (GitHub)\nCoherent (GitHub)\nTranspose (GitHub)\nCovalent (GitHub)\nIndexed.xyz\nFootprint (GitHub)\nSentio (GitHub)\nGeniiData\nAllium\nKyve\nToken Terminal\nProbably Nothing Labs\nSpace And Time\nCredmark (GitHub)\n\nTools\nTrueblocks\nBlockchain ETL\nMars\nAlgorand Indexer\nTezos Indexer\ndb3\nIceFireDB\nApollo\nEther SQL\nSpec\nCosmos ETL\nLuabase\nDigital Assets Examples\n\nThe Problem\n\nAfter compiling the list, I realized that only a few of these projects are open. We can read the source code of the chains we use, but can't read the code of their data pipelines? That's a bit weird. Especially when the data world is moving in the opposite direction 1!\n\nAs the folks from OpenDataCommunity pointed out, the data layer is, currently, mostly centralized 2 and closed source, properties that don't represent the open spirit of the movement.\n\nOn the other hand, projects like Airbyte, Estuary's Flow, Meltano, Cloudquery, and many others from the MDS, are not only building tools to extract, transform, and load data, but also working on standards and protocols. And the community is building on top of them. 
This makes it possible to have an end-to-end data pipeline in a matter of minutes.\nConnect a Stripe source to your warehouse. A few clicks.\nImport the data. Some extra clicks.\nAdd a dbt Package to your project.\nDone. You've gone from zero to a somewhat complex report of your Stripe subscriptions!\n\nWouldn't it be great if we could do the same with blockchain data? Add the tap-ethereum connector, install the dbt-ethereum package and start to collaborate! 3\n\nPersonally, I think projects like Trueblocks, Kamu, and Bacalhau hold the key to opening the data layer a bit more 4, but that won't be possible without a community and standards we can all agree upon and share.\n\n1: And blockchain data is such a great candidate for the open data movement. Chain data is open, verifiable and immutable! All great properties for open data.\n\n2: And I totally get it. Centralization makes working with data a less painful job!\n\n3: Dune has the spellbook and it is awesome. I wish they did something similar for their pipelines though.\n\n4: Bacalhau indexes the chain with Trueblocks, puts the data into Parquet files on IPFS and you query it from anywhere using DuckDB!",
"title": "On Blockchain Data Pipelines"
}