Integrating DeepSeek's Engram into OLMo-core — proof of concept complete, looking for compute advice
Hi all,
I’ve been independently integrating DeepSeek’s Engram conditional memory module into AI2’s OLMo-core as an optional architectural component.
What I built:
Native integration via a single config flag
All 4 architecture configurations verified (Attention + Dense FFN, Attention + MoE, GDN + Dense FFN, GDN + MoE)
First training run completed last night: loss trending down, and the run finished cleanly on 4×A40 GPUs
The research question: The original paper only benchmarks against MoE. My hypothesis is Engram’s gain is largest in dense FFN, where every token pays full compute with no sparsity escape valve. I’ve designed a 2×2 ablation to test this.
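To make the comparison concrete, here is a minimal sketch of how the 2×2 result could be read out. It assumes the two axes are FFN type (dense vs. MoE) and Engram (off vs. on), and that each cell yields one final validation loss. `AblationCell`, `engram_gain`, and `interaction` are illustrative names, not part of OLMo-core or the Engram code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationCell:
    """One cell of the assumed 2x2 design (hypothetical naming)."""
    ffn: str       # "dense" or "moe"
    engram: bool   # whether the Engram module is enabled


def ablation_grid():
    """Enumerate the four cells: {dense, moe} x {Engram off, Engram on}."""
    return [AblationCell(ffn, engram)
            for ffn, engram in product(("dense", "moe"), (False, True))]


def engram_gain(losses, ffn):
    """Loss reduction from enabling Engram for one FFN type.

    `losses` maps AblationCell -> final validation loss.
    A positive value means Engram lowered the loss.
    """
    return losses[AblationCell(ffn, False)] - losses[AblationCell(ffn, True)]


def interaction(losses):
    """Difference-in-differences between the dense and MoE gains.

    A positive value is consistent with the hypothesis that Engram
    helps dense FFN models more than MoE models.
    """
    return engram_gain(losses, "dense") - engram_gain(losses, "moe")
```

The hypothesis predicts `interaction(losses) > 0`. A single run per cell would not separate that from seed noise, so in practice each cell would need several seeds.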
The ask: As an independent researcher without institutional affiliation, I’m looking for advice on compute access — grants, programs, or anything others have found useful for running training experiments at this scale.
GitHub
Full writeup and training run details here and here.
Any pointers appreciated.