External Publication
Visit Post

Scalable Exact Hierarchical Agglomerative Clustering via Sparse Geographic Distance Graphs

cstheory.com April 14, 2026
Source

Authors: Victor Maus, Vinicius Pozzobon Borin

Exact hierarchical agglomerative clustering (HAC) of large spatial datasets is limited in practice by the $\mathcal{O}(n^2)$ time and memory required for the full pairwise distance matrix. We present GSHAC (Geographically Sparse Hierarchical Agglomerative Clustering), a system that makes exact HAC feasible at scales of millions of geographic features on a commodity workstation. GSHAC replaces the distance matrix with a sparse geographic distance graph containing only pairs within a user-specified geodesic bound$h_{\max}$, constructed in $\mathcal{O}(n \cdot k)$ time via spatial indexing, where$k$ is the mean number of neighbors within~$h_{\max}$. Connected components of this graph define independent subproblems, and we prove that the resulting assignments are exact for all standard linkage methods at any cut height $h \le h_{\max}$. For single linkage, an MST-based path keeps memory at $\mathcal{O}(n_k + m_k)$ per component. Applied to a global mining inventory ($n = 261{,}073$), the system completes in 12,s (109,MiB peak HAC memory) versus $\approx 545$,GiB for the dense baseline. On a 2-million-point GeoNames sample, all tested thresholds completed in under 3,minutes with peak memory under 3,GiB. We provide a scikit-learn-compatible implementation for direct integration into GIS workflows.

Discussion in the ATmosphere

Loading comments...