
I'm not an engineer. I just wanted to see if a 3D cube of cells could learn to talk

Hugging Face Forums [Unofficial] May 27, 2026

For now, I asked it to organize the points:


This is a very interesting project. I especially like that it is not presented as “yet another small Transformer,” but as a recurrent 3D substrate whose internal dynamics can be visualized, perturbed, and probed.

I cannot help with large-scale compute, and I am not suggesting that “just train v6.2 harder” should be the immediate next step. My impression is almost the opposite: before scaling the training, it may be more useful to separate the questions that are currently mixed together.

In particular, I think the project becomes much easier to evaluate if we distinguish these questions:

  1. Can this architecture produce language-like sequences at all?
  2. Does the 3D spatial structure actually matter?
  3. Which design choices are doing the work: dilation, fatigue, learned initial state, number of update steps, output-face readout, etc.?
  4. Are the reported spatial specializations functionally causal, or mainly visual/interpretive patterns?
  5. Is this best understood as a standalone language model, a recurrent memory substrate, a synthetic-data generator, an interpretable dynamical system, or an adapter-like module?

To me, the most interesting question may not be:

Can this beat Transformers?

but rather:

Does a local recurrent 3D system develop reproducible, causal internal organization when trained on language-like tasks?

That seems like a genuinely interesting research direction even if the model never becomes practically competitive as a language model.

1. A useful framing

I would frame the project less as:

“A new language model architecture that competes with Transformers”

and more as:

“A probeable 3D recurrent cellular substrate that can be trained on symbolic, semantic, and language-like tasks.”

That framing avoids making the project depend on beating GPT-like baselines, while preserving the interesting part: local communication, emergent spatial organization, recurrent computation, and visible internal dynamics.

This is also closer to how Neural Cellular Automata are usually studied. The classic reference is “Growing Neural Cellular Automata,” where the point is not just raw task performance, but how local learned update rules can produce stable, self-organizing, regenerative behavior.

There is also recent work connecting NCA-like dynamics to language model training, but in a different way: “Training Language Models via Neural Cellular Automata” uses NCA-generated spatiotemporal data as synthetic pre-pre-training data for language models. That paper is not doing exactly the same thing as this project, but it suggests that NCA dynamics may be useful as structured non-linguistic training signals, not only as standalone models.

So I think there are several possible interpretations of this project:

| Interpretation | What would be tested? |
|---|---|
| Standalone NCA language model | Can the 3D recurrent cube directly predict/generate language? |
| Recurrent memory substrate | Can the cube store and propagate information better than simpler recurrent baselines? |
| Synthetic pretraining generator | Can its dynamics produce useful structured data for other models? |
| Interpretable dynamics model | Do grammar/semantic/emotion-like regions emerge in a reproducible, causal way? |
| Adapter/refinement block | Can NCA-like local updates improve a Transformer/RNN/ConvLM as a component? |

I think the fourth interpretation — interpretable recurrent dynamics — is currently the most exciting one.

2. Suggested ablations

The first thing I would want is a small ablation table. Not necessarily huge training runs; just enough to clarify what is essential.

Possible variants:

| Variant | Question |
|---|---|
| Full v5 | Current reference point |
| No dilation | Is global coverage through dilation essential? |
| Dilation cycle changed | Is [1, 2, 4, 8] special, or just one reasonable schedule? |
| No synaptic fatigue | Does fatigue actually reduce repetition collapse? |
| Fatigue only at inference | Is it a training-time mechanism, inference-time heuristic, or both? |
| Fixed initial state | How much comes from the learned init_state? |
| Random initial state | Does the model rely on a learned “brain prior”? |
| Output face only | Is the opposite-face readout important? |
| Global pooled readout | Does reading from the whole cube improve or erase the spatial story? |
| Random output face | Is the z-axis information-flow interpretation robust? |
| Fewer update steps | Where does performance appear? |
| More update steps | Does extra recurrent computation help or degrade output? |
| 1D version | Is this just sequence convolution? |
| 2D version | Is 3D actually useful over a simpler spatial substrate? |
| 3D ConvNet without recurrence | Is recurrence doing real work? |
| Recurrent Conv3D without NCA framing | Is the “cellular” framing adding anything beyond a recurrent ConvNet? |

The goal would not be to “disprove” the model. The goal would be to locate the actual source of the effect.

For example, if removing synaptic fatigue causes much more “the the the” collapse, that is useful evidence. If global pooling beats output-face readout, that would weaken the “wave reaches the opposite face” story. If 2D performs similarly to 3D, then maybe the important thing is recurrence + convolution, not specifically a 3D cube. If the learned initial state is crucial, then the “brain DNA” idea becomes a real object of study rather than just a metaphor.
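To keep an ablation table like this honest, it can help to generate the variants from one reference config and change exactly one factor at a time. The following is only a sketch: the `CubeConfig` fields, their default values, and the alternative settings in `ABLATIONS` are hypothetical placeholders, not the project's actual options.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class CubeConfig:
    # Hypothetical field names and defaults; substitute the real v5 settings.
    dilations: tuple = (1, 2, 4, 8)
    fatigue: str = "train+infer"      # alternatives: "none", "infer_only"
    init_state: str = "learned"       # alternatives: "fixed", "random"
    readout: str = "opposite_face"    # alternatives: "pooled", "random_face"
    steps: int = 8                    # assumed number of update steps
    spatial_dims: int = 3             # 1, 2 or 3

# One entry per ablated factor; each value replaces only that field.
ABLATIONS = {
    "dilations": [(1, 1, 1, 1), (1, 3, 9, 27)],
    "fatigue": ["none", "infer_only"],
    "init_state": ["fixed", "random"],
    "readout": ["pooled", "random_face"],
    "steps": [4, 16],
    "spatial_dims": [1, 2],
}

def one_factor_variants(base):
    """Return {name: config} where every variant differs from base in one field."""
    variants = {"full": base}
    for field_name, values in ABLATIONS.items():
        for value in values:
            variants[f"{field_name}={value}"] = replace(base, **{field_name: value})
    return variants

variants = one_factor_variants(CubeConfig())
for name, cfg in variants.items():
    print(name, asdict(cfg))
```

Changing one factor at a time will not reveal interactions (for example, fatigue mattering only with a particular dilation cycle), but it keeps the number of runs small enough to be feasible without much compute.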

3. Suggested baselines

I would also suggest a few simple baselines with roughly similar parameter counts and the same data/tokenizer where possible.

Possible baselines:

| Baseline | Why it matters |
|---|---|
| Small GRU/LSTM | Minimal recurrent sequence baseline |
| Small Transformer | Standard language-modeling baseline |
| 1D ConvLM | Convolutional sequence baseline |
| Temporal CNN / TCN | Stronger non-attention sequence baseline |
| Recurrent Conv1D | Similar recurrence, no 3D substrate |
| Recurrent Conv2D | Spatial recurrence without full 3D |
| Recurrent Conv3D | Same broad compute family without the NCA interpretation |
| Neural GPU-like model | Classical recurrent convolutional algorithm-learning comparison |

The Neural GPU is especially relevant historically because it is a convolutional gated recurrent architecture that was studied for learning algorithmic sequence transformations. It is not the same as this project, but it is a useful comparison point for “local recurrent computation over a grid.”

I would not compare only against GPT-2 or modern Transformers. That comparison is too harsh and not very informative. A more useful question is:

Compared with other small recurrent/convolutional baselines, what does the 3D NCA-like substrate uniquely buy us?

4. Dynamics analysis

The internal dynamics are probably the most interesting part of the project. I would try to turn the qualitative “wave/eureka/decision” story into plots.

Useful measurements:

| Measurement | Purpose |
|---|---|
| Loss vs update step | Does performance really improve at a particular recurrent depth? |
| Entropy vs update step | Does the model become more confident during propagation? |
| Top-k distribution vs step | Does the predicted word sharpen over time? |
| Activation norm vs step | Does the cube stabilize, explode, or collapse? |
| Spatial activation center of mass | Does information actually move from input face to output face? |
| Mutual information with input tokens | Does input information propagate spatially over time? |
| Region-wise contribution to logits | Which regions causally affect output? |
| Seed-to-seed consistency | Does the same specialization reappear across runs? |

The reported “steps 6-7 eureka” is particularly interesting. I would want to see:

  • Does the same step transition appear across many prompts?
  • Does it appear across random seeds?
  • Does it align with the dilation schedule?
  • Does it still appear when the dilation cycle is changed?
  • Does it appear on arithmetic/relations/language tasks equally?
  • Does running more steps help, saturate, or degrade?

If the “eureka” phase is stable across prompts and seeds, that is much stronger than a single visualization.
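As a rough sketch of how the step-wise curves could be computed: assuming the readout can be applied after every update step to give one logit vector per step (an assumption about the model, not something I have checked), loss, entropy, and top-1 confidence per step follow directly. The logits below are made-up toy numbers, and `step_metrics` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_metrics(logits_per_step, target):
    """One row of metrics per recurrent step, for a single prompt and target token."""
    rows = []
    for step, logits in enumerate(logits_per_step, start=1):
        probs = softmax(logits)
        rows.append({
            "step": step,
            "loss": -math.log(max(probs[target], 1e-12)),  # clamp to avoid log(0)
            "entropy": entropy(probs),
            "top1_prob": max(probs),
        })
    return rows

# Toy logits that sharpen toward token 0 (illustrative only, not model output).
toy_logits = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]]
rows = step_metrics(toy_logits, target=0)
for r in rows:
    print(f"step {r['step']}: loss={r['loss']:.3f} "
          f"entropy={r['entropy']:.3f} top1={r['top1_prob']:.3f}")
```

Averaging these rows over a fixed prompt set, and repeating across seeds, would show whether the drop really concentrates around steps 6-7 or is specific to a few prompts.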

5. Interpretability/probing checklist

The reported spatial organization is the most exciting claim, but also the claim that needs the most care. Humans are very good at seeing meaning in visualizations. So I would try to convert each qualitative observation into a causal or statistical test.

Possible checks:

| Claim | Possible test |
|---|---|
| Region x=12 produces better language | Ablate x=12 and compare loss/generation quality |
| Region x=6 produces garbage | Patch x=6 into good generations or ablate it |
| Grammar is central | Train POS/syntax probes on cell states by location |
| Semantics is peripheral | Train semantic-category probes by location |
| Emotional words use z=12 | Compare activation maps for emotional vs neutral words |
| Semantic clusters exist | UMAP/PCA of cell states with word-category labels |
| Wave carries answer | Intervene on intermediate slices and measure output damage |
| Learned init_state contains specialization | Probe/visualize init_state before any input |
| Good/bad regions are stable | Repeat over seeds and datasets |

Some concrete interventions:

  • Zero out one spatial region at a time.
  • Add noise to one region at a time.
  • Swap activations between two prompts.
  • Patch the “good region” from one run into another.
  • Freeze parts of the cube during training.
  • Train linear probes per coordinate or per region.
  • Compare probes against shuffled labels.
  • Compare spatial maps across random seeds.
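The shuffled-label comparison can be sketched in a few lines. To stay dependency-free, this uses a nearest-centroid classifier as a stand-in for a trained linear probe, and the "cell states" are synthetic vectors rather than real activations; `probe_with_control` and the data layout are hypothetical.

```python
import random

def centroid_probe_accuracy(train, test):
    """Fit class centroids on (vector, label) pairs and score held-out pairs."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(vec):
        # Pick the label whose centroid is closest in squared Euclidean distance.
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])))

    return sum(predict(vec) == label for vec, label in test) / len(test)

def probe_with_control(vectors, labels, seed=0):
    """Return (accuracy with real labels, accuracy with shuffled labels)."""
    rng = random.Random(seed)
    order = list(range(len(vectors)))
    rng.shuffle(order)                 # random train/test split
    split = len(order) // 2
    shuffled = labels[:]
    rng.shuffle(shuffled)              # control: break the vector/label link

    def run(labs):
        pairs = [(vectors[i], labs[i]) for i in order]
        return centroid_probe_accuracy(pairs[:split], pairs[split:])

    return run(labels), run(shuffled)

# Synthetic stand-in for cell states at one coordinate: class 1 is offset on dim 0.
data_rng = random.Random(0)
labels = [i % 2 for i in range(200)]
vectors = [[2.0 * lab + data_rng.gauss(0, 0.3)] + [data_rng.gauss(0, 0.3) for _ in range(3)]
           for lab in labels]
real_acc, control_acc = probe_with_control(vectors, labels)
print(f"real labels: {real_acc:.2f}  shuffled labels: {control_acc:.2f}")
```

A probe map is only informative where the real-label accuracy clearly exceeds the shuffled-label accuracy at the same location.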

The key distinction is:

A region lighting up is not the same as a region being causally necessary.

If region ablation damages the relevant capability selectively, then the spatial specialization claim becomes much stronger.
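A minimal sketch of the ablation loop, under heavy assumptions: the cube state is modelled as a plain dict from `(x, y, z)` to a float, and `toy_loss` is a placeholder for "continue the remaining update steps, read out, and compute the loss". In the toy, only the slab x == 2 matters, so the loop should single it out.

```python
import itertools

SIZE = 4  # toy cube edge length (hypothetical; the real cube is larger)

def make_state(size=SIZE):
    """Toy cube state: one float per cell, keyed by (x, y, z)."""
    return {(x, y, z): 1.0 + 0.1 * x
            for x, y, z in itertools.product(range(size), repeat=3)}

def toy_loss(state):
    """Placeholder loss that depends only on cells in the slab x == 2."""
    signal = sum(v for (x, _, _), v in state.items() if x == 2)
    return 1.0 / (1.0 + signal)

def ablate_slab(state, axis, index):
    """Return a copy of the state with one slab along `axis` set to zero."""
    return {c: (0.0 if c[axis] == index else v) for c, v in state.items()}

def slab_ablation_deltas(state, loss_fn, axis, size=SIZE):
    """Loss increase caused by zeroing each slab along `axis`, one at a time."""
    base = loss_fn(state)
    return [loss_fn(ablate_slab(state, axis, i)) - base for i in range(size)]

deltas = slab_ablation_deltas(make_state(), toy_loss, axis=0)
for i, d in enumerate(deltas):
    print(f"x={i}: loss increase {d:.4f}")
```

With a real model, the same loop run separately on syntax-heavy and semantics-heavy evaluation sets would show whether damage is selective, which is the stronger version of the specialization claim.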

6. Synthetic tasks before full natural language

Natural language is very hard to diagnose because many failure modes are entangled: tokenization, data size, frequency bias, repetition collapse, long-range dependency, objective mismatch, readout design, and recurrent depth.

Before focusing too much on open-ended text generation, I would test a ladder of synthetic tasks.

Suggested task ladder:

| Task | Capability tested |
|---|---|
| Copy | Can the cube preserve input? |
| Shift | Can information propagate directionally? |
| Reverse | Can it perform nontrivial sequence manipulation? |
| Parity | Can it aggregate global information? |
| Modular addition | Can it learn algorithmic rules? |
| Bracket matching / Dyck language | Can it model stack-like structure? |
| Associative recall | Can it bind keys and values? |
| Small symbolic grammar | Can it learn controlled next-token structure? |
| Character-level corpus | Can it model language without large word vocab issues? |
| Word-level small corpus | Can it handle sparse word prediction? |

This would clarify where the architecture fails. If it cannot solve copy/reverse/parity reliably, then weak natural-language generation is unsurprising. If it solves synthetic grammar but fails at word-level language, then the bottleneck may be vocabulary/readout/data rather than the recurrent substrate itself.
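The first few rungs of that ladder are cheap to generate. Here is a small sketch; the exact task definitions are my own assumptions (for instance, "shift" is taken to mean a cyclic shift by one position), and the vocabulary and lengths are arbitrary.

```python
import random

def make_example(task, rng, length=8, vocab=("a", "b", "c", "d")):
    """Return an (input, target) pair for one synthetic task."""
    if task == "parity":
        bits = [rng.randint(0, 1) for _ in range(length)]
        return bits, [sum(bits) % 2]
    seq = [rng.choice(vocab) for _ in range(length)]
    if task == "copy":
        return seq, list(seq)
    if task == "reverse":
        return seq, seq[::-1]
    if task == "shift":
        # Assumed definition: cyclic shift right by one position.
        return seq, seq[-1:] + seq[:-1]
    raise ValueError(f"unknown task: {task}")

def is_balanced(brackets):
    """Check membership in the one-bracket Dyck language."""
    depth = 0
    for ch in brackets:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def make_dyck(rng, pairs=4):
    """Sample a balanced bracket string with `pairs` pairs (not uniformly)."""
    out, open_left, depth = [], pairs, 0
    while open_left or depth:
        if open_left and (depth == 0 or rng.random() < 0.5):
            out.append("(")
            open_left -= 1
            depth += 1
        else:
            out.append(")")
            depth -= 1
    return "".join(out)

rng = random.Random(0)
for task in ("copy", "shift", "reverse", "parity"):
    print(task, make_example(task, rng))
print("dyck", make_dyck(rng))
```

Sweeping `length` upward for each task gives a simple length-generalization curve, which is often more diagnostic than a single accuracy number.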

Related work such as LifeGPT, AutomataGPT, and “Learning Elementary Cellular Automata with Transformers” studies the opposite direction (Transformers learning CA dynamics), but those papers are still useful because they suggest evaluation patterns for local-rule systems: forecasting, rule inference, intermediate-state prediction, and generalization to unseen dynamics.

7. Possible alternative roles for the model

I would not restrict the project to “standalone language model.” There are other ways the idea could be valuable.

A. Recurrent memory substrate

The cube could be a memory/update substrate that receives token embeddings and evolves for several steps. Then another model reads from it.

Questions:

  • Does it store local context better than a simple recurrent state?
  • Does it denoise representations?
  • Does it preserve information over many update steps?
  • Does it help on associative recall or algorithmic tasks?

B. Adapter/refinement module

NCA-like blocks could be used inside another model instead of replacing the whole model. For example, AdaNCA uses NCA-style adapters between Vision Transformer layers to improve robustness. That is vision, not language, but the architectural idea is relevant: NCA as a plug-in refinement module rather than the entire model.

Possible language analogues:

  • Transformer + NCA adapter
  • RNN + NCA memory
  • ConvLM + NCA refinement block
  • Decoder-only LM with local NCA hidden-state smoothing
  • NCA block between attention and MLP layers

C. Synthetic dynamics generator

The cube may be more useful for generating structured non-linguistic trajectories than for directly generating language. This connects to “Training Language Models via Neural Cellular Automata,” where NCA-generated data is used as synthetic pre-pre-training data before natural-language training.

Questions:

  • Can 3D NCA trajectories produce useful synthetic curricula?
  • Does pretraining on those trajectories help small language models?
  • Does the complexity of the NCA dynamics matter?
  • Are 3D dynamics more useful than 1D/2D CA dynamics?

D. Interpretable dynamical system

Even if the model is weak as an LM, it may be valuable as a visible dynamical system trained on language-like tasks.

Questions:

  • Does syntax-like information localize?
  • Does semantic category information localize?
  • Are localized regions stable across seeds?
  • Are regions causally necessary?
  • Can “thought over time” be measured through recurrent steps?

This seems like the most compelling direction to me.

8. What I would prioritize

If I were organizing the next steps without providing compute, I would prioritize:

  1. Minimal reproducibility
  2. Ablation table
  3. Baseline table
  4. Step-wise dynamics plots
  5. Region ablation
  6. Simple probes
  7. Synthetic task ladder
  8. Only then larger training

A possible short roadmap:

Phase 1: Reproduce and measure

  • Run v5 inference.
  • Record predictions by recurrent step.
  • Plot loss/entropy/confidence over steps.
  • Check repetition rate.
  • Test more/fewer recurrent steps.
  • Save activation maps for a fixed prompt set.

Phase 2: Ablate

  • Remove or alter dilation.
  • Remove fatigue.
  • Compare output-face readout vs pooled readout.
  • Compare learned vs fixed init state.
  • Compare 1D/2D/3D variants if feasible.

Phase 3: Baseline

  • Train/evaluate small GRU, small Transformer, 1D ConvLM, and recurrent Conv3D baselines under similar conditions.
  • Use the same tokenizer/data where possible.
  • Report validation loss, next-token accuracy, repetition rate, and parameter count.
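"Repetition rate" has no single standard definition, so it is worth fixing one and using it for every model in the comparison. One possible choice, sketched below, is the fraction of repeated n-grams plus the longest run of an identical token; both function names are hypothetical.

```python
def repeated_ngram_rate(tokens, n=2):
    """Fraction of n-grams that duplicate an earlier n-gram in the same sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def longest_run(tokens):
    """Length of the longest run of one token repeated back to back."""
    best = run = 0
    prev = object()  # sentinel that compares unequal to every token
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best

sample = "the the the cat sat".split()
print(repeated_ngram_rate(sample, n=2))  # 0.25 (4 bigrams, 3 distinct)
print(longest_run(sample))               # 3
```

Reporting both numbers on a fixed prompt set makes the fatigue ablation directly comparable with the GRU, Transformer, and ConvLM baselines.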

Phase 4: Probe

  • POS probe.
  • Semantic category probe.
  • Emotion/neutral contrast.
  • Region ablation.
  • Activation patching.
  • Seed consistency.

Phase 5: Reframe

Depending on results, decide whether the model is best pursued as:

  • standalone NCA-LM,
  • recurrent memory,
  • interpretable dynamics system,
  • synthetic data generator,
  • or adapter module.

9. A compact experiment matrix

One possible experiment table:

| Experiment | Minimal output |
|---|---|
| Step sweep | loss/entropy/confidence vs recurrent step |
| Dilation ablation | validation loss + repetition rate |
| Fatigue ablation | repetition/collapse metrics |
| Init-state ablation | performance drop from learned to fixed/random init |
| Readout ablation | output face vs pooled vs random face |
| Region ablation | heatmap of loss increase by region |
| Probe map | spatial map of syntax/semantic probe accuracy |
| Seed repeat | whether specialization recurs |
| Baseline comparison | small Transformer/GRU/ConvLM/recurrent Conv3D |
| Synthetic tasks | copy/reverse/parity/Dyck/associative recall |

This would make the project much easier to discuss.

10. Suggested wording of the main contribution

If this were written as a more formal project note, I would avoid claiming:

“A 3D brain learned language.”

I would phrase it more conservatively:

“We explore a recurrent 3D neural cellular substrate for language-like prediction, and investigate whether local update dynamics produce reproducible spatial specialization.”

That keeps the interesting claim while making it more testable.

A stronger version, if supported by ablations, could be:

“Although not competitive with Transformer baselines as a language model, the system shows measurable recurrent phase transitions and spatially localized representations that can be probed and causally intervened upon.”

That would be a very interesting result.

11. Why I think this is worth exploring

The raw language modeling performance is not the main reason this is interesting. The interesting part is that the model gives you a spatial, recurrent, perturbable object.

Transformers are powerful, but their internal representations are not naturally laid out as a physical 3D substrate. Here, even if the model is much weaker, you can ask questions like:

  • Where does information enter?
  • How does it move?
  • When does a prediction become confident?
  • Which regions matter?
  • Do regions specialize?
  • Does specialization survive retraining?
  • Can we damage a region and observe selective failure?
  • Can we watch recurrence improve or destroy the answer?

That makes the project interesting as a small experimental system.

12. Final take

My current impression:

  • I would not spend the next effort mainly on larger training.
  • I would not frame it as a Transformer competitor.
  • I would focus on ablations, baselines, and causal interpretability.
  • I would test synthetic tasks before open-ended language.
  • I would consider roles other than standalone LM: memory substrate, adapter, synthetic-data generator, or interpretable dynamical system.

The most valuable next contribution might simply be a clear evaluation map.

Something like:

“Here are the claims, here are the ablations that test each claim, here are the baselines, and here are the probes that would make the emergent-organization story stronger.”

That kind of organization could make the project much easier for others to engage with, even without anyone immediately providing large-scale compute.
