
Subject: 72 Hours of Systematic Hypothesis Denial: From Geometric Algebra to Factorized Attention (Experiment Logs)

Hugging Face Forums [Unofficial] May 13, 2026

I recently spent a 72-hour sprint trying to answer one core question: Can we find a better semantic transmission unit than the static token embedding?

I ended up systematically falsifying three of my own hypotheses, but in the process I found one signal that I have not yet managed to falsify. I’m sharing the process, data, and failures here in the hope that it saves time for anyone working on representation learning.

Hardware: 2x RTX 4090 (24GB).


1. The Core Problem: Static Embeddings

Current LLM token embeddings are static lookup tables. “Apple” gets the same initial vector whether it’s in “eat an apple” or “Apple Event.” The model relies on 12+ layers of Transformer blocks to fix this ambiguous starting point.

I explored three paths to solve this:

  • BIIC (Geometric Algebra)
  • SFE (Dynamic Modulation)
  • BIF (Factorized Low-Dim Interaction)


2. Path 1: BIIC (Geometric Algebra Representations)

The Idea

Using Clifford Algebra Cl(4,1), multivectors can be decomposed by grade:

  • Grade-0 (Scalars): Strictly invariant under rotation—the “anchor” of a word’s identity.
  • Grade-2 (Bivectors): Equivariant under rotation—intended to carry context-dependent syntactic/semantic relations.

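I used the sandwich product R x R_{rev} (geometric products, where R_{rev} is the reverse of the rotor R) for token transformations. This mathematically guarantees that Grade-0 components stay invariant while Grade-2 components co-vary with the rotation.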
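The invariance claim can be checked in two lines. This is a sketch that assumes the rotor is normalized so that R R_{rev} = 1 (the post does not state the normalization); \tilde{R} stands for R_{rev}, and \langle x \rangle_k is the grade-k part of x.

```latex
\begin{aligned}
R\,\langle x\rangle_0\,\tilde{R} &= \langle x\rangle_0\,R\tilde{R} = \langle x\rangle_0
  && \text{(scalars commute with every multivector)}\\
R\,(a\wedge b)\,\tilde{R} &= (R\,a\,\tilde{R})\wedge(R\,b\,\tilde{R})
  && \text{(a bivector stays grade 2, but is rotated)}
\end{aligned}
```

The second line extends to general bivectors by linearity, since any bivector is a sum of such blades.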

The Failure: Equivariant Components Remain Inactive

The algebraic sanity checks passed (gradient flow through 10 layers was healthy), but training on WikiText-103 showed that the Grade-2 components contributed almost nothing.

| Metric | Result |
|---|---|
| Full BIIC loss | 10.8285 |
| Grade-0 only loss | 10.8271 |
| Difference | 0.0014 (negligible) |
| Transformer baseline PPL | 53.9 (52M params) |
| BIIC PPL | 390+ (massive regression) |

Key Lesson: Equivariance works in molecular design or DNA modeling because those domains have explicit physical symmetries. Language does not. Next-token prediction cares about “what comes next,” not the geometric symmetry between tokens.


3. Path 2: SFE (Dynamic Embedding Modulation)

The Idea

If the embedding layer can adjust based on local context (the previous 4 tokens), the Transformer won’t have to waste early layers “fixing” ambiguity.

e_i = (\alpha_{static, i} + g(ctx_i)) \otimes B

Where B is a global semantic basis and g(ctx) is a lightweight correction network.
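A minimal NumPy sketch of the formula above, under several assumptions that are not in the post: the ⊗ is read as mixing the rows of B with a coefficient vector (a matrix product), g is a single tanh layer over the mean coefficients of the previous 4 tokens, and the names `sfe_embed`, `W_g`, and the toy sizes are hypothetical.

```python
import numpy as np

def sfe_embed(token_ids, alpha, W_g, B, window=4):
    """e_i = (alpha_static[i] + g(ctx_i)) @ B, with ctx_i = previous `window` tokens.

    alpha: (V, K) static coefficients, W_g: (K, K) correction weights (hypothetical),
    B: (K, D) global semantic basis.
    """
    token_ids = np.asarray(token_ids)
    coeffs = alpha[token_ids]                      # (T, K) static coefficients
    out = np.empty((len(token_ids), B.shape[1]))
    for i in range(len(token_ids)):
        ctx = coeffs[max(0, i - window):i]         # up to `window` previous tokens
        ctx_mean = ctx.mean(axis=0) if len(ctx) else np.zeros(alpha.shape[1])
        g = np.tanh(ctx_mean @ W_g)                # assumed form of g(ctx_i)
        out[i] = (coeffs[i] + g) @ B               # corrected coefficients mix the basis
    return out

# Toy usage: token 3 appears twice, with and without left context.
rng = np.random.default_rng(0)
V, K, D = 10, 8, 16
alpha = rng.normal(size=(V, K))
W_g = rng.normal(size=(K, K)) * 0.1
B = rng.normal(size=(K, D))
E = sfe_embed([3, 1, 3], alpha, W_g, B)
```

In this sketch the first occurrence of token 3 has no context, so it falls back to the purely static embedding, while the second occurrence is shifted by the correction term.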

The Result: The “Suppression” Mechanism

In every iteration (including direct gradient paths and auxiliary losses), the Transformer’s attention mechanism systematically suppressed the dynamic modulation.

  • Observation: The g(ctx) network initially learned to differentiate tokens, but as training progressed, the attention layers “decided” it was more efficient to handle disambiguation themselves, driving the modulation signal to zero via backprop.
  • Conclusion: In architectures with standard Attention, dynamic embedding modulation has no “evolutionary niche”.
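One way to watch this suppression happen is to log the size of the correction relative to the static part at regular training steps. The helper below is a hypothetical diagnostic, not something taken from the repo; the name `modulation_ratio` and the choice of mean L2 norms are assumptions.

```python
import numpy as np

def modulation_ratio(alpha_rows, g_rows, eps=1e-12):
    """Mean ||g(ctx_i)|| divided by mean ||alpha_static,i|| over a batch of tokens."""
    g_norm = np.linalg.norm(g_rows, axis=-1).mean()
    a_norm = np.linalg.norm(alpha_rows, axis=-1).mean()
    return float(g_norm / (a_norm + eps))

# Toy check: a correction with half the norm of the static part.
alpha_rows = np.ones((2, 4))          # each row has norm 2
g_rows = 0.5 * np.ones((2, 4))        # each row has norm 1
ratio = modulation_ratio(alpha_rows, g_rows)
suppressed = modulation_ratio(alpha_rows, np.zeros((2, 4)))
```

A ratio that starts well above zero and decays toward zero over training would match the behaviour described above.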

4. Path 3: BIF (Factorized Interaction) — The “Open” Signal

The Idea

Instead of doing token interaction in 256-dim space, move it to a 64-dim “recipe” space (based on PCA evidence that semantic variance effectively lives in ~50 dimensions).
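One possible reading of "interaction in a 64-dim space" is a causal attention step whose scores come from a single bilinear form over the low-dimensional vectors. The post does not spell out the actual computation, so the bilinear scoring, the single r×r matrix `W`, the causal mask, and the reuse of `z` as values are all assumptions in this sketch.

```python
import numpy as np

def factorized_attention(z, W):
    """Causal attention computed entirely in the low-dim space.

    z: (T, r) low-dim token "recipes"; W: (r, r) the only learned matrix here.
    """
    T, r = z.shape
    scores = (z @ W) @ z.T / np.sqrt(r)               # (T, T) bilinear scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ z                                # (T, r) mixed recipes

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 64))
W = rng.normal(size=(64, 64)) * 0.1
out = factorized_attention(z, W)
```

With r = 64 the matrix `W` has 64 × 64 = 4,096 entries, and the first position can only attend to itself, so its output equals its input.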

Factorized Attention Module (FAM):

  • Traditional Embedding: ~12.9M params
  • BIF Embedding: ~3.2M params (75% reduction)
  • FAM Layer Params: Only 4,096 (vs 262k in standard attention)
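The counts above are consistent with simple arithmetic under a few assumptions that the post does not state: a GPT-2-sized vocabulary of 50,257 tokens, four bias-free 256×256 projections (Q, K, V, output) in standard attention, and a single 64×64 matrix in the FAM layer.

```python
VOCAB = 50_257    # ASSUMPTION: GPT-2 BPE vocabulary size; not stated in the post
D_MODEL = 256     # from the post
D_RECIPE = 64     # from the post

def embedding_params(vocab, dim):
    """Parameters in a plain lookup-table embedding."""
    return vocab * dim

def attention_params(dim, n_proj=4):
    """ASSUMPTION: Q, K, V, and output projections, no biases."""
    return n_proj * dim * dim

trad = embedding_params(VOCAB, D_MODEL)     # 12,865,792  (~12.9M)
bif = embedding_params(VOCAB, D_RECIPE)     # 3,216,448   (~3.2M)
reduction = 1 - bif / trad                  # 0.75
std_attn = attention_params(D_MODEL)        # 262,144     (~262k)
fam = D_RECIPE * D_RECIPE                   # 4,096 (one 64x64 matrix, assumed)
```

If the real tokenizer or layer layout differs, the same two helpers give the corrected numbers.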

Current Status

Early experiments with FAM showed a 3.9-point reduction in perplexity (-3.9 PPL) while using fewer parameters. I am currently running “Phase 1” to check whether this gain holds when parameters and FLOPs are strictly matched against a compressed baseline.


5. Methodology: 5 Gates for New Hypotheses

I’ve started using this framework to “self-attack” an idea before burning GPU hours:

  1. Computational Cost: Is the core op N times more expensive than standard attention?
  2. Condition Transfer: Does the success of the prior work (e.g., Geometric Hyena) depend on conditions (like physical symmetry) that don’t exist here?
  3. Ablation Prediction: Can you quantify how much better the “full” version should be compared to the “simple” version before the run?
  4. Task Fitness: Is the math “beautiful” but the task “indifferent”?
  5. Minimum Falsifiable Point: What is the quickest way to prove this wrong?
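Gate 3 can be made mechanical by writing the predicted gain down before the run and comparing it with the ablation afterwards. The sketch below applies this to the BIIC losses reported earlier; the function name `ablation_gate` and the 0.05 threshold are hypothetical choices for illustration.

```python
def ablation_gate(full_loss, simple_loss, predicted_min_gain):
    """Return (observed_gain, passed); gain is positive when "full" beats "simple"."""
    observed_gain = simple_loss - full_loss
    return observed_gain, observed_gain >= predicted_min_gain

# BIIC numbers from this post: full model vs. Grade-0 only.
gain, passed = ablation_gate(
    full_loss=10.8285,
    simple_loss=10.8271,
    predicted_min_gain=0.05,   # hypothetical pre-registered threshold
)
```

Here the observed gain is about -0.0014, so the full model is marginally worse than the Grade-0 ablation and the gate fails.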

Confirmed Findings & Source

  • Grade-2 bivectors carry syntactic info (Probing: POS=0.789, DEP=0.823), but this info is not extractable via geometric products.
  • Effective semantic dimension per token is roughly 46–57 (PCA PR p95=49.6).
  • Contextualization happens in the middle layers, not the embedding layer.
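If "PR" in the effective-dimension finding stands for the participation ratio of the PCA spectrum (an assumption; the post does not expand the abbreviation), it can be computed from a matrix of embeddings as follows. The function name is hypothetical.

```python
import numpy as np

def participation_ratio(X):
    """Effective dimension (sum lam)^2 / sum(lam^2) of the covariance of X, shape (n, d)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the covariance eigenvalues;
    # the proportionality constant cancels in the ratio.
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = s ** 2
    return float(lam.sum() ** 2 / (lam ** 2).sum())

# Toy check: equal variance along exactly 3 of 8 axes.
X = np.vstack([np.eye(8)[:3], -np.eye(8)[:3]])
pr = participation_ratio(X)
```

The toy data spreads its variance evenly over 3 directions, so the ratio comes out as 3; a spectrum dominated by one direction would push it toward 1.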

Repo & Ongoing Logs:

I will be updating the experimental process and data here:

GitHub - val1813/BIIC: Research on a possible replacement for tokens; in more formal terms, an Algebraic Invariant Decomposition based Language Information Processing Method and System. A study of replacing token embeddings with algebraically grounded invariant and equivariant representations.

I’d love to hear from anyone who has tackled the “embedding modulation suppression” problem or worked with Factorized Attention. Let’s discuss!
