Raw Record Source

{
  "$type": "site.standard.document",
  "author": "did:plc:ticz2qmqh2vqxnehf5zalkpl",
  "content": {
    "$type": "pub.leaflet.content",
    "pages": [
      {
        "$type": "pub.leaflet.pages.linearDocument",
        "blocks": [
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I've been building longform.social — a reader and discovery tool for long-form writing published on the AT Protocol. We index site.standard.document records from across the network, and after crawling 2.3 million of them, I wanted to share what we're actually seeing out there. It's messier than you'd expect (or not), and it raised some security issues I should have caught earlier."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 21,
                    "byteStart": 0
                  }
                }
              ],
              "level": 2,
              "plaintext": "The Spec vs. The Wild"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The site.standard.document lexicon has two main fields for article content:"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "• content — meant for structured data. Leaflet uses this with a pages[].blocks[] format for rich text, images, and embeds."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "• textContent — intended as a plain text representation of the article. Think of it as the fallback."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 245,
                    "byteStart": 236
                  }
                }
              ],
              "plaintext": "\nSo how are publishers actually using these fields? I ran the numbers across just site.standard.document records — excluding WhiteWind, which uses its own com.whtwnd.blog.entry lexicon and stores markdown in content by design. Out of 2,375,668 site.standard.document records:"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 21,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 2,149,242 (90.5%) have plain text in textContent and nothing in content. These are overwhelmingly link-card-style documents — a URL, a title, and a blob of plain text scraped or pasted from a web article. This is the dominant use case by far."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 16,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 189,726 (8%) have what looks like markdown in textContent — headers with #, bold with **, code blocks with backticks. Not in the content field where structured data belongs, but in the plain text fallback."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 17,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 13,819 (0.6%) have raw HTML in textContent. Full <p> tags, <h4> headers, <a> links, <img> tags with responsive srcset attributes, <iframe> embeds."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 17,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 3,209 (0.14%) use Leaflet-style structured blocks in content."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 15,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 400 (0.02%) store markdown strings in content."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 17,
                    "byteStart": 4
                  }
                }
              ],
              "plaintext": "• 17,227 (0.7%) have neither field populated with readable content."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The structured field that the spec is designed around? It accounts for 0.14% of documents. The plain text field is carrying 99% of the content, and a significant chunk of it isn't plain text at all."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 25,
                    "byteStart": 0
                  }
                }
              ],
              "level": 2,
              "plaintext": "Where the HTML Comes From"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The HTML-in-textContent pattern isn't random. The Chicago Sun-Times and WBEZ — major news organizations — account for over 4,100 documents with embedded HTML. Their pipeline takes article HTML straight from the CMS and drops it into textContent. You get the full experience: <picture> elements with responsive sources, <figure> with <figcaption>, <iframe> embeds to Airtable forms, and interactive accordion widgets."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "These aren't intentionally bad actors (I don't think). They're publishers trying to adapt an emerging protocol to their existing workflows."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Of course, this is a big security issue if we render this stuff. And Claude missed it. (Okay, it's my fault, but at least I had the sense to recognize it and direct Claude to fix it.) When you encounter HTML tags in textContent, the natural thing is to render them. And our rendering path was using raw() to output content directly into the page — no escaping, no filtering."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "That means a malicious actor could publish a site.standard.document containing something like:"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.code",
              "plaintext": "<img src=\"x\" onerror=\"document.location='https://phishing-site.com'\">"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "And it would execute in every reader's browser. The WBEZ articles we were rendering contained <iframe> elements pointing to third-party services. Legitimate in their case, but the same mechanism could embed anything."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Our markdown rendering path had the same problem — marked.parse() doesn't sanitize by default, so <script> tags embedded in markdown would pass right through."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I chose to keep supporting these non-standard documents (for now) rather than rejecting them, but with strict sanitization added. Every rendering path runs through sanitize-html with an explicit allowlist."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "What gets through: standard content tags (<p>, <h1>-<h6>, <a>, <img>, <blockquote>, lists, tables, <figure>), inline formatting (<strong>, <em>, <code>), and safe attributes like href, src, and alt."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "What gets stripped: <script>, <iframe>, event handlers (onclick, onerror, onload), javascript: URIs, <object>, <embed>, <form>, and anything not on the allowlist."
            }
          },
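          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "As a rough sketch of what that allowlist could look like with sanitize-html and marked: the tag list below is reconstructed from the description above, and figcaption, pre, the table sub-tags, the http/https scheme list, and the function names renderUntrustedHtml and renderMarkdown are illustrative assumptions rather than the production config."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.code",
              "plaintext": "import sanitizeHtml from 'sanitize-html';\nimport { marked } from 'marked';\n\n// Assumed allowlist, reconstructed from the description above; not the production config.\nconst options: sanitizeHtml.IOptions = {\n  allowedTags: [\n    'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'a', 'img', 'blockquote',\n    'ul', 'ol', 'li', 'table', 'thead', 'tbody', 'tr', 'th', 'td',\n    'figure', 'figcaption', 'strong', 'em', 'code', 'pre',\n  ],\n  // Only these attributes survive; onclick, onerror, onload and the rest are dropped.\n  allowedAttributes: { a: ['href'], img: ['src', 'alt'] },\n  // href and src values with javascript: or other unlisted schemes are removed.\n  allowedSchemes: ['http', 'https'],\n};\n\n// HTML found in textContent: sanitize before it reaches the page.\nexport function renderUntrustedHtml(html: string): string {\n  return sanitizeHtml(html, options);\n}\n\n// Markdown path: marked does not sanitize, so its output gets the same treatment.\nexport async function renderMarkdown(markdown: string): Promise<string> {\n  const html = await marked.parse(markdown);\n  return sanitizeHtml(html, options);\n}"
            }
          },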
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "For iframe embeds in structured Leaflet blocks (like YouTube videos), I maintain a separate allowlist of trusted domains. Anything else gets a safe external link."
            }
          },
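          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The domain check for iframe embeds can be quite small. A minimal sketch, assuming an exact-hostname match over HTTPS; the host list and the isTrustedEmbed name are hypothetical, not the actual allowlist."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.code",
              "plaintext": "// Hypothetical list of embed hosts; the real allowlist is not shown in this post.\nconst TRUSTED_EMBED_HOSTS = new Set(['www.youtube.com', 'www.youtube-nocookie.com']);\n\n// Returns true only for HTTPS URLs whose hostname is exactly on the list.\nexport function isTrustedEmbed(src: string): boolean {\n  try {\n    const url = new URL(src);\n    return url.protocol === 'https:' && TRUSTED_EMBED_HOSTS.has(url.hostname);\n  } catch {\n    return false; // unparseable or relative URL: fall back to a plain link\n  }\n}"
            }
          },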
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I also locked down facet-based links to block javascript: URIs, since facets in structured blocks generate <a> tags directly."
            }
          },
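          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "For facet links, one way to do this is to parse the URI and allow only known schemes. A minimal sketch, assuming http and https are the only schemes worth linking; safeHref is a hypothetical helper name."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.code",
              "plaintext": "// Assumed set of link schemes that are safe to emit in an <a> tag.\nconst SAFE_PROTOCOLS = new Set(['http:', 'https:']);\n\n// Returns a normalized href, or null when the link should be rendered as plain text.\nexport function safeHref(uri: string): string | null {\n  try {\n    const url = new URL(uri);\n    return SAFE_PROTOCOLS.has(url.protocol) ? url.href : null;\n  } catch {\n    return null; // not an absolute URL, so it is rejected as well\n  }\n}"
            }
          },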
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 28,
                    "byteStart": 0
                  }
                }
              ],
              "level": 2,
              "plaintext": "We've Seen This Movie Before"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "If you were around for the early web, this all feels familiar."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "HTML was a document markup language. Then people started embedding scripts, inline styles, and entire applications inside documents. Email clients rendered HTML and became phishing vectors. RSS readers that rendered feed HTML had the same XSS problems. Every time, the cycle was the same: a format gets adopted, people push beyond its intended boundaries, and then security catches up."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The AT Protocol is in that phase right now. site.standard.document defines a structured content model, but the network has already decided that textContent is a general-purpose content container. 190,000 documents with markdown in a \"plain text\" field is a de facto standard, whether the spec acknowledges it or not."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 42,
                    "byteStart": 0
                  }
                }
              ],
              "level": 2,
              "plaintext": "Should Readers Support Non-Standard Usage?"
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "There are two ways to look at it."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The purist argument: reject anything that doesn't match the spec. If textContent is plain text, render it as plain text. HTML tags become visible <p> literals on screen. Publishers learn to use the structured format. The ecosystem stays clean."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ],
                  "index": {
                    "byteEnd": 150,
                    "byteStart": 137
                  }
                }
              ],
              "plaintext": "My take: that means telling a user \"sorry, this Chicago Sun-Times article can't be displayed\" while showing them raw HTML tags. Remember Postel's Law."
            }
          },
          {
            "$type": "pub.leaflet.pages.linearDocument#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Yes, the better fix is upstream, but for now we will support it."
            }
          }
        ],
        "id": "6ae78728-61d2-4e55-866c-32c7a0ad4eb4"
      }
    ]
  },
  "description": "",
  "path": "/307nrf56zd",
  "publishedAt": "2026-05-17T13:43:09.671Z",
  "site": "at://did:plc:ticz2qmqh2vqxnehf5zalkpl/site.standard.publication/self",
  "tags": [],
  "title": "When Standards Meet Reality: What 2.3 Million AT Protocol Documents Actually Look Like"
}