Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihhkggk67x3fefxecnbn5ievmm2yaomdtandciw5h67cpekve5iba",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moojdukx6ut2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreidi6unl6w2px5cd7ewa5sbreywu2x3topasuvgkx5zzznljdwovaa"
    },
    "mimeType": "image/webp",
    "size": 396018
  },
  "path": "/gbadedata/spatial-transcriptomics-plus-topological-data-analysis-a-complete-end-to-end-tutorial-with-squidpy-1np2",
  "publishedAt": "2026-06-19T23:00:55.000Z",
  "site": "https://dev.to",
  "tags": [
    "spatialtranscriptomics",
    "tutorial",
    "bioinformatics",
    "singlecellevaluation",
    "github.com/gbadedata/squidpy-spatial-benchmark",
    "GitHub",
    "LinkedIn"
  ],
  "textContent": "__Build a real spatial transcriptomics pipeline on Visium mouse brain, then go one step past every tutorial: use persistent homology to find spatial structure that Moran's I misses. Every number here is reproducible from a fresh clone.__\n\nMost spatial transcriptomics tutorials end at the same place. You load a Visium dataset, run Moran's I, colour a few genes on the tissue, and admire the picture. You never find out whether the standard spatial statistic actually captured everything that matters.\n\nThis tutorial goes further. Complete spatial pipeline was built with **squidpy** , then add a second lens that almost no tutorial covers: **topological data analysis** with **GUDHI**. By the end you will have a working benchmark that names a specific gene whose spatial structure Moran's I underestimates, and you will understand exactly what that claim does and does not mean.\n\n##  What we are building\n\nThree tasks, each answering a question a real spatial analysis should ask:\n\n  1. **Spatially variable gene detection.** Does Moran's I recover the genes we know are spatially patterned?\n  2. **Neighbourhood enrichment.** Do anatomically adjacent brain regions actually sit together in the data?\n  3. **TDA versus Moran's I.** Does persistent homology see structure that autocorrelation misses?\n\n\n\nThe dataset is the canonical **V1 Adult Mouse Brain Visium** section: 2,688 spots, 15 annotated anatomical regions, about 18,000 genes.\n\nA quick note on what each task is _for_. Tasks 1 and 2 are validation: they prove the pipeline reproduces known biology, so you can trust it. Task 3 is the contribution: it shows the standard method has a blind spot and that topology fills it.\n\n##  Setup\n\n\n    python3 -m venv .venv && source .venv/bin/activate\n    pip install squidpy gudhi scanpy anndata scipy matplotlib pandas\n\n\nA note on versions: this tutorial targets **squidpy 1.8.x** , which renamed several parameters from older releases. If you copy from a 2022 era tutorial you will hit errors. I flag each one inline.\n\n##  Step 1: Load the data\n\nsquidpy ships the dataset, so there is nothing to download manually:\n\n\n\n    import squidpy as sq\n\n    adata = sq.datasets.visium_hne_adata()\n    print(adata)\n    # AnnData object with n_obs x n_vars = 2688 x 18078\n    #   obs: 'cluster'        (15 anatomical region labels)\n    #   obsm: 'spatial'       (x, y coordinates on the tissue)\n\n\nTwo things make this _spatial_ data. `obsm['spatial']` holds the physical coordinates of each spot on the tissue slide, and `obs['cluster']` holds the manually annotated brain region for each spot. Those region labels are our ground truth in Tasks 1 and 2.\n\n##  Step 2: Preprocess and build the spatial graph\n\nStandard scanpy preprocessing, with one critical addition at the end, the spatial neighbourhood graph.\n\n\n\n    import scanpy as sc\n\n    # Filter, normalise, log transform\n    sc.pp.filter_genes(adata, min_cells=10)\n    adata.layers[\"counts\"] = adata.X.copy()\n    sc.pp.normalize_total(adata, target_sum=1e4)\n    sc.pp.log1p(adata)\n    adata.layers[\"log_norm\"] = adata.X.copy()   # keep a clean log norm copy\n\n    # Standard embedding\n    sc.pp.highly_variable_genes(adata, n_top_genes=3000)\n    sc.pp.pca(adata, n_comps=30)\n\n    # The spatial graph. This is what makes squidpy \"spatial\"\n    sq.gr.spatial_neighbors(adata, coord_type=\"grid\", n_rings=1)\n\n\n**squidpy 1.8.x gotcha number 1:** older tutorials use `coord_type=\"visium\"`. That value no longer exists. For Visium's hexagonal array you now use `coord_type=\"grid\"` with `n_rings=1`, which connects each spot to its roughly six immediate neighbours.\n\nAfter this you have about 15,580 edges, roughly 5.8 neighbours per interior spot. That spatial graph is the foundation for both Moran's I and neighbourhood enrichment.\n\nKeeping the `log_norm` layer matters later: the TDA step reads expression from it, so we want a clean copy taken before any scaling.\n\n##  Step 3: Task 1, spatially variable genes with Moran's I\n\nMoran's I asks one question per gene: do spatially adjacent spots have similar expression? It returns a value roughly between 0 and 1. High Moran's I means the gene's expression is spatially organised rather than scattered at random.\n\n\n\n    sq.gr.spatial_autocorr(\n        adata,\n        mode=\"moran\",\n        genes=adata.var_names.tolist(),\n        n_perms=100,\n        corr_method=\"fdr_bh\",       # gotcha number 2: was \"correction\" pre 1.8\n        show_progress_bar=False,\n    )\n\n    moran = adata.uns[\"moranI\"].sort_values(\"I\", ascending=False)\n    print(moran.head())\n    #          I        pval_norm   pval_norm_fdr_bh\n    # Mbp      0.788    ...\n    # Slc17a7  0.775    ...\n    # Nrgn     0.743    ...\n    # Cck      0.727    ...\n    # Itpka    0.698    ...\n\n\n**squidpy 1.8.x gotcha number 2:** the false discovery rate parameter is now `corr_method`, not `correction`. Results land in `adata.uns[\"moranI\"]` with columns `I`, `pval_norm`, and `pval_norm_fdr_bh`.\n\nThe top gene is **Mbp** at I = 0.788, a myelin marker sharply concentrated in the white matter fibre tract. That is exactly what we expect. The sharpest spatial domain in the section ranks first.\n\n**Now validate it against known biology.** A high Moran's I is only convincing if the genes it promotes are genuinely the ones biology says are spatially patterned. I use a curated panel of Allen Brain Atlas region markers as an oracle. Here is the complete panel, so you can reproduce the exact number:\n\n\n\n    oracle_markers = {\n        \"Layer 1\":      [\"Reln\", \"Ndnf\", \"Cxcl14\"],\n        \"Layer 2/3\":    [\"Cux1\", \"Cux2\", \"Calb1\"],\n        \"Layer 4\":      [\"Rorb\", \"Scnn1a\", \"Rspo1\"],\n        \"Layer 5\":      [\"Bcl11b\", \"Fezf2\", \"Etv1\"],\n        \"Layer 6\":      [\"Ntsr1\", \"Syt6\", \"Ctgf\"],\n        \"White matter\": [\"Mbp\", \"Mog\", \"Plp1\"],\n        \"Hippocampus\":  [\"Prox1\", \"Dkk3\", \"C1ql2\"],\n    }\n\n    # Flatten, keep only markers that survived HVG filtering\n    all_markers = [g for grp in oracle_markers.values() for g in grp]\n    tested = [g for g in all_markers if g in adata.var_names]   # 17 of 21 survive\n    top_500 = set(moran.head(500).index)\n    found = [m for m in tested if m in top_500]\n\n    sensitivity = len(found) / len(tested)\n    print(f\"tested={len(tested)}  found={len(found)}  sensitivity={sensitivity:.2f}\")\n    # tested=17  found=7  sensitivity=0.41\n\n\nSensitivity is **0.41** : seven of the seventeen testable markers appear in the top 500 SVGs. The seven it finds are `Mbp`, `Mog`, `Plp1` (white matter), `Dkk3` (hippocampus), and the strong cortical markers `Calb1`, `Fezf2`, `C1ql2`.\n\nWhich markers does it miss, and why? This is the important part. Print their ranks:\n\n\n\n    missed = [m for m in tested if m not in found]\n    for m in missed:\n        rank = moran.index.get_loc(m) + 1\n        print(f\"{m:10s} rank {rank:5d}/{len(moran)}   I={moran.loc[m,'I']:.3f}\")\n    # Cux2       rank   948   I=0.191\n    # Reln       rank   989   I=0.181\n    # Rorb       rank   996   I=0.179\n    # ...\n\n\nThe missed markers are smooth, low amplitude cortical layer gradients. `Cux2` (rank 948), `Reln` (989), and `Rorb` (996) are real layer markers, but their expression changes gradually across the cortex rather than forming a sharp boundary.\n\nThis is not a bug. **Moran's I ranks sharp concentrated domains above broad gradients** , because a smooth gradient has lower local spot to spot contrast. Hold that thought. It is the entire reason for Step 5.\n\n##  Step 4: Task 2, neighbourhood enrichment\n\nDo brain regions that are anatomically adjacent actually co-localise in the data? squidpy runs a permutation test on the spatial graph:\n\n\n\n    sq.gr.nhood_enrichment(adata, cluster_key=\"cluster\", show_progress_bar=False)\n    z = adata.uns[\"cluster_nhood_enrichment\"][\"zscore\"]   # 15 x 15 z score matrix\n\n\nFor every pair of regions, this returns a z score: how much more adjacent they are than you would expect by chance, given the size of each region. A z score above about 1 is genuine enrichment; a negative z score means the two regions are _more separated_ than chance.\n\nI validate five pairs that real neuroanatomy says should co-localise:\n\n\n\n    expected_pairs = [\n        (\"Hippocampus\",       \"Pyramidal_layer\"),                 # z = 24.9\n        (\"Cortex_4\",          \"Cortex_5\"),                         # z = 14.7\n        (\"Hippocampus\",       \"Pyramidal_layer_dentate_gyrus\"),   # z = 13.3\n        (\"Lateral_ventricle\", \"Striatum\"),                        # z =  5.1\n        (\"Fiber_tract\",       \"Lateral_ventricle\"),               # z =  5.0\n    ]\n\n\nAll five enrich strongly. The hippocampal complex lights up at z = 24.9.\n\n**A trap worth knowing about.** My first attempt validated pairs like `Thalamus_1` plus `Thalamus_2`, assuming the shared name meant they sit together. They do not. Their z score is about minus 3.8. Those numbered subclusters are transcriptionally distinct territories in _separate_ parts of the tissue. They are spatially segregated, not adjacent.\n\nThe lesson is concrete: **spatial adjacency and statistical neighbourhood enrichment are different things.** Trust the enrichment matrix, not the cluster names. I corrected the validated pairs to reflect true anatomy, and the matrix confirmed every one.\n\n##  Step 5: Task 3, where TDA sees what Moran's I cannot\n\nHere is the step no tutorial covers, and the reason this project exists.\n\nMoran's I is a single number measuring _local pairwise similarity_. It is blind to the _shape_ of an expression pattern. A gene whose expression forms a ring, or a reticular multi focal network, has rich spatial structure that pairwise correlation barely registers.\n\n**Persistent homology** measures shape. Before the code, the one paragraph of theory you actually need:\n\n> Imagine each expressing spot as a point on the tissue. Now grow a disc of radius _r_ around every point and let the discs merge as _r_ increases. At small _r_ you have many separate blobs; as _r_ grows they connect, and sometimes a ring of points encloses an empty region before finally filling in. Persistent homology records, across every value of _r_ at once, how many connected pieces exist (called **H0**) and how many enclosed loops exist (called **H1**), and crucially _how long each one survives_ as _r_ grows. A loop that persists over a wide range of _r_ is a real, robust ring in the data. A loop that appears and vanishes immediately is noise. That construction, discs growing over a point cloud, is a **Vietoris Rips filtration** , and GUDHI computes it for us.\n\n###  Build a point cloud per gene\n\nFor each gene, take the spots where it is expressed and treat their coordinates as a point cloud:\n\n\n\n    import numpy as np\n    import gudhi as gd\n\n    def gene_point_cloud(adata, gene, max_points=250):\n        idx = adata.var_names.get_loc(gene)\n        expr = adata.layers[\"log_norm\"][:, idx]\n        expr = np.asarray(expr.todense()).ravel() if hasattr(expr, \"todense\") else expr\n\n        mask = expr > np.median(expr[expr > 0])    # spots expressing above median\n        coords = adata.obsm[\"spatial\"][mask].astype(float)\n\n        # Subsample dense clouds. Critical for memory, see below\n        if len(coords) > max_points:\n            rng = np.random.default_rng(42)\n            coords = coords[rng.choice(len(coords), max_points, replace=False)]\n\n        # Normalise to the unit square so the filtration scale is comparable across genes\n        coords -= coords.min(0)\n        coords /= coords.max(0).clip(min=1e-9)\n        return coords\n\n\n**Memory gotcha number 3:** a Vietoris Rips complex grows roughly _cubically_ with point count. A gene expressed in 800 spots generates tens of millions of simplices and gets killed by the operating system's out of memory killer. Subsampling to 250 points keeps the topology intact while bounding memory. Because the final analysis uses _ranks_ rather than raw values, it is robust to the subsample.\n\n###  Compute persistence and summarise it\n\n\n    def h1_entropy(coords, max_edge=1.5, min_persistence=0.01):\n        rips = gd.RipsComplex(points=coords, max_edge_length=max_edge)\n        st = rips.create_simplex_tree(max_dimension=2)\n        st.compute_persistence()\n\n        # Keep H1 (loops) that survive longer than the noise floor\n        h1 = [(b, d) for dim, (b, d) in st.persistence()\n              if dim == 1 and (d - b) >= min_persistence]\n        if not h1:\n            return 0.0\n\n        lifetimes = np.array([d - b for b, d in h1])\n        p = lifetimes / lifetimes.sum()\n        return float(-(p * np.log(p)).sum())     # persistence entropy\n\n\nPersistence entropy is one scalar summarising the whole H1 diagram: it is high when a gene has _many loops of comparable importance_ and low when it has few loops or one dominant one. It is a reasonable single number for \"topological complexity\", though it is not the only choice, and I will name its limits below.\n\n**Threshold gotcha number 4, the one that nearly fooled me.** My first run returned H1 entropy of exactly `0.0` for _every_ gene, including the most structured. The output was clean and uniform, and it was tempting to accept.\n\nIt was completely wrong. GUDHI was finding more than twenty genuine loops in `Mbp`, but my `min_persistence=0.1` filter discarded all of them. Spatial loops in unit normalised point clouds are _short lived_ : they live in the 0.01 to 0.1 persistence band. Lowering the threshold to **0.01** recovered them.\n\nThe takeaway is an instinct worth more than the fix: **distrust a result that looks too clean.** Zero complexity on a visibly structured dataset is a red flag, not a pass.\n\n###  Compare the two rankings\n\nRank a panel of genes both ways, by Moran's I and by H1 entropy, then look for disagreement:\n\n\n\n    import pandas as pd\n\n    panel = moran.head(18).index   # mix of high, mid, low Moran's I genes\n    rows = []\n    for g in panel:\n        rows.append({\n            \"gene\": g,\n            \"morans_i\":   moran.loc[g, \"I\"],\n            \"h1_entropy\": h1_entropy(gene_point_cloud(adata, g)),\n        })\n\n    df = pd.DataFrame(rows).set_index(\"gene\")\n    df[\"morans_rank\"] = df[\"morans_i\"].rank(ascending=False)\n    df[\"tda_rank\"]    = df[\"h1_entropy\"].rank(ascending=False)\n    df[\"divergence\"]  = (df[\"morans_rank\"] - df[\"tda_rank\"]).abs()\n    print(df.sort_values(\"divergence\", ascending=False).head())\n\n\nYour output will look like this:\n\n\n\n              morans_i  h1_entropy  morans_rank  tda_rank  divergence\n    gene\n    Apbb2        0.087       3.340          8.0       1.0         7.0\n    Gpr17        0.087       3.154          6.0       2.0         4.0\n    Mbp          0.788       3.035          1.0       8.0         7.0\n    Cck          0.727       2.905          4.0       9.0         5.0\n\n\n###  The result, stated precisely\n\n**Apbb2** is the standout. Moran's I ranks it around tenth in the wider gene set; by H1 persistence entropy it ranks **first** , the highest topological complexity in the panel. **Gpr17** shows the same direction of divergence. The mirror image holds for **Mbp** : Moran's rank 1, the sharpest domain in the whole section, but only TDA rank 8, because a single solid blob has strong autocorrelation and _simple_ topology.\n\nHere is the honest framing, because it is what a careful reviewer will probe. What I have measured is that Apbb2 has a high H1 persistence entropy signal, and that this signal disagrees with its Moran's I rank. That is a real, reproducible methodological result. What I have _not_ done is confirm by independent histology that Apbb2 forms a literal anatomical ring. The topology is evidence of loop like structure in the expressing spots, not proof of a biological annulus. Stating that boundary precisely is part of the result, not a hedge.\n\nOne more honest caveat. The H1 entropy values across the panel cluster fairly tightly, mostly between about 2.9 and 3.3. The _ranking_ is stable and reproducible, but the absolute separation between genes is modest. The correct reading is \"the two methods order genes differently, with Apbb2 the clearest case\", not \"TDA found a dramatic outlier Moran's I completely missed\". The honest version is the more defensible one.\n\n##  What the figures show\n\nThe persistence diagram for Apbb2 makes the construction concrete. One H0 component spans the full filtration, the single connected tissue domain, while the rest die near the diagonal. Thirty two H1 loops sit in the 0.05 to 0.2 band, exactly the short lived features the threshold fix recovered. If you ever see an empty H1 diagram on real tissue, you have a threshold bug, not a structureless gene.\n\nThe Moran versus TDA scatter places each gene by its two ranks. Points on the diagonal are genes the two methods agree about. Apbb2 and Gpr17 sit well below the diagonal, low Moran's rank but high TDA rank. That visible distance from the diagonal _is_ the finding.\n\n##  The five gotchas, collected\n\nIf you build this yourself, these will save you hours:\n\n  1. `coord_type=\"grid\"`, not `\"visium\"` (squidpy 1.8.x)\n  2. `corr_method=`, not `correction=` (squidpy 1.8.x)\n  3. Subsample dense point clouds before the Rips complex, or run out of memory\n  4. Set `min_persistence` low, around 0.01, because real spatial loops are short lived\n  5. Validate neighbourhood pairs against the enrichment matrix, not against shared cluster names\n\n\n\n##  Why this matters beyond the tutorial\n\nAnyone can run Moran's I. The skill that separates a spatial genomics engineer from a tutorial follower is knowing the method's blind spots and reaching for the right complementary tool, then being precise about what the complement does and does not prove.\n\nPersistent homology is that complement for loop rich and multi focal expression patterns. It is not a replacement for Moran's I. It is a second lens that catches a different kind of structure, and now you have a working, validated, honestly caveated example that names a real gene.\n\nFull code: github.com/gbadedata/squidpy-spatial-benchmark. Runtime is about 105 seconds end to end.\n\nThe full pipeline, 135 tests, continuous integration, structured logging, and a unified JSON benchmark report, is on GitHub:\n\n**github.com/gbadedata/squidpy-spatial-benchmark**\n\n_Building spatial transcriptomics tools, topological data analysis, or single cell evaluation frameworks? Connect on GitHub or LinkedIn._",
  "title": "Spatial Transcriptomics Plus Topological Data Analysis: A Complete End-to-End Tutorial with squidpy and GUDHI"
}