{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid4i73pw7hdkejs2qrybimygpelzymcogfxkdko2dqxoz6uq2wbny",
    "uri": "at://did:plc:4rgrdigiftglskeax4wvmsev/app.bsky.feed.post/3mjivzqvk7w22"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiflo6xt7is6b2iafwghkjahlgggocme5jwjsbeuqqwcywuvjhmszm"
    },
    "mimeType": "image/png",
    "size": 24783
  },
  "path": "/abs/2604.11656v1",
  "publishedAt": "2026-04-14T00:00:00.000Z",
  "site": "https://arxiv.org",
  "tags": [
    "Victor Maus",
    "Vinicius Pozzobon Borin"
  ],
  "textContent": "**Authors:** Victor Maus, Vinicius Pozzobon Borin\n\nExact hierarchical agglomerative clustering (HAC) of large spatial datasets is limited in practice by the $\\mathcal{O}(n^2)$ time and memory required for the full pairwise distance matrix. We present GSHAC (Geographically Sparse Hierarchical Agglomerative Clustering), a system that makes exact HAC feasible at scales of millions of geographic features on a commodity workstation. GSHAC replaces the distance matrix with a sparse geographic distance graph containing only pairs within a user-specified geodesic bound~$h_{\\max}$, constructed in $\\mathcal{O}(n \\cdot k)$ time via spatial indexing, where~$k$ is the mean number of neighbors within~$h_{\\max}$. Connected components of this graph define independent subproblems, and we prove that the resulting assignments are exact for all standard linkage methods at any cut height $h \\le h_{\\max}$. For single linkage, an MST-based path keeps memory at $\\mathcal{O}(n_k + m_k)$ per component. Applied to a global mining inventory ($n = 261{,}073$), the system completes in 12\\,s (109\\,MiB peak HAC memory) versus $\\approx 545$\\,GiB for the dense baseline. On a 2-million-point GeoNames sample, all tested thresholds completed in under 3\\,minutes with peak memory under 3\\,GiB. We provide a scikit-learn-compatible implementation for direct integration into GIS workflows.",
  "title": "Scalable Exact Hierarchical Agglomerative Clustering via Sparse Geographic Distance Graphs"
}