Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihjgxacaiyaidyudmlmqr2v3qnnd2ac6nrnbyahjg5knttvsazawa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlktma2nvsr2"
  },
  "path": "/t/biic-replacing-tokens-with-geometric-algebra-multivectors-early-results/175911#post_1",
  "publishedAt": "2026-05-11T07:01:52.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - val1813/BIIC: 一种可以替代token的研究，专业点叫：一种基于代数不变量分解的语言信息处理方法及系统。Algebraic Invariant Decomposition based Language Information Processing Method and System – A research on replacing token embeddings with algebraically grounded invariant & equivariant representations. · GitHub"
  ],
  "textContent": "BIIC: Replacing Tokens with Geometric Algebra Multivectors — Early Results (and a reality check)\n\nHi everyone,\n\nRecently, I’ve been exploring a simple question: **what if we ditch conventional flat token embeddings entirely?**\n\nI swapped standard token vectors for Cl(4,1) Clifford algebra multivectors, aiming to break a core inherent limitation of current language model representation structures. I’m sharing this research at a very early stage, mainly to collect feedback on the mathematical logic and overall conception.\n\nA critical reminder: **the current GitHub repository is just a temporary placeholder**. The complete core code has not been uploaded yet, so the experimental data I mention cannot be reproduced by outsiders right now. Please treat this content as a research proposal and a preliminary data verification draft, not an available open-source library.\n\n* * *\n\n## TL;DR\n\nI replaced flat one-dimensional token embeddings with Cl(4,1) multivectors. This mathematical structure can naturally split token information into two independent parts: **invariant components (Grade-0)** and **equivariant components (Grade 1-4)**.\n\n  * Internal early tests verified that Grade-0 features maintain strict invariance even after 100+ layers of inference computation.\n\n  * The full-grade decoder has **5.3 times higher decodable information volume** than the decoder that only retains Grade-0 features.\n\n  * Phase 3 controlled comparison experiments are currently running on dual RTX 4090 GPUs.\n\n  * **Important Note** : All experimental data are local internal test results, and the public repository cannot reproduce them temporarily.\n\n\n\n\n* * *\n\n## Core Defects of Flat Token Embeddings\n\nThe mainstream token embedding design has an obvious flaw: all information including original semantics, contextual updates and position features is compressed into a single flat vector. 
As network layers stack up, the original information is continuously overwritten.\n\nResidual connections improve gradient propagation by summing state information, but they cannot separate the original token semantics from the newly added contextual state. This defect causes three persistent problems:\n\n  1. Irreversible semantic drift during inference;\n\n  2. Continuous accumulation of redundant state information with no effective mechanism for removing it;\n\n  3. Linear growth of KV Cache memory overhead with sequence length.\n\n\n\n\nInstead of relying on residual connections as a surface-level fix, I chose to redesign the underlying information carrier of tokens.\n\n* * *\n\n## My Solution: Cl(4,1) Conformal Geometric Algebra\n\nMultivectors in the Cl(4,1) conformal geometric algebra are 32-dimensional: the algebra has $$2^5 = 32$$ basis blades, split across grades 0 to 5 as 1 + 5 + 10 + 10 + 5 + 1. Under the rotor sandwich product $$R M \\tilde{R}$$, the Grade-0 scalar component has **strict mathematical invariance**. Because $$R \\tilde{R} = 1$$, a scalar $$s$$ maps to $$R s \\tilde{R} = s$$, so this follows from the algebra itself and is not an approximate effect obtained by model training.\n\nComponent Classification | Grade | Dimension | Transformation Property | Semantic Function\n---|---|---|---|---\nImmutable Core | 0 | 1×C | Invariant | Token intrinsic attribute, never overwritten\nMutable State | 1-4 | 30×C | Equivariant | Contextual inference information, dynamically updated\nPseudoscalar | 5 | 1×C | Invariant (pure rotor) | Spatial chirality and polarity feature\n\nThe 32-dimensional layout matches the 32-thread warp size of NVIDIA GPUs, so no extra dimension padding is required. The 2026 paper by Huy & Hirst has already verified the applicability of the Cl(4,1) structure in sequence modeling tasks with a low-parameter model.\n\n* * *\n\n## Active Forgetting Mechanism: GradeAwareEraser\n\nThe design borrows an analogy from how DNA stores information: Grade-0 plays the role of the immutable genome, while the Grade 1-4 components are comparable to the dynamically editable epigenome. 
To address the accumulation of redundant state, I developed the `GradeAwareEraser` selective erasure module.\n\nThis module decays redundant information in the equivariant grades toward a baseline (prior) state while leaving the Grade-0 invariant core untouched.\n\n\n    # Core conceptual formula\n    # \\rho_{t+1} = (1 - \\lambda_i) \\cdot \\rho_t + \\lambda_i \\cdot \\rho_{prior}\n\n    # \\rho_t: state at step t; \\rho_{prior}: baseline state that the erasure decays toward\n    # \\lambda_i: Learnable decay coefficient for each dimension\n    # Verified: The attenuation coefficient for Grade-0 is exactly 0.0, so Grade-0 is never erased\n\n\nAs far as I know, few LLM designs adopt **internal active selective forgetting**. Most models simply accumulate state information, and this erasure mechanism is one of the core innovations of this project.\n\n* * *\n\n## Early Experimental Data (Internal Test Only)\n\n### Phase 1 & Phase 2: Mathematical Verification + Encoding-Decoding Pipeline Test\n\n  * **Invariance Verification** : After 100 transformation iterations, the Grade-0 invariance error is only $$6.5 \\times 10^{-6}$$;\n\n  * **Eraser Validity** : The module has zero interference with Grade-0 invariant components;\n\n  * **Decoding Performance** : The loss of the full-grade decoder is far lower than that of the Grade-0-only decoder (0.006 vs 0.032), a ratio of about 5.3. 
Among all grades, Grade-2 bivectors contribute the most to feature extraction, which is consistent with the geometry, since bivectors are the generators of rotors.\n\n\n\n\n### Phase 4: Architecture Feasibility Test\n\nI conducted a small-scale test with 8 channels and 10 million parameters:\n\n  * Peak GPU memory usage: 295 MB;\n\n  * GPU memory scaling: **constant, O(1) in sequence length** (the mutable state replaces the KV Cache entirely, eliminating linear memory growth).\n\n\n\n\nNote: There is no side-by-side comparison with baseline models such as BERT and GPT-2 yet, so the current loss figures are for reference only.\n\n* * *\n\n## Ongoing Experiment: Phase 3 Controlled Test\n\nI set up 7 groups of controlled experiments to isolate what actually drives the performance improvement, built around three core research hypotheses, each framed as a question:\n\n  1. **H1** : Is the geometric algebra structure itself effective, or does the performance improvement come only from orthogonal constraints? (Control group: Token + orthogonal transformation + tanh activation)\n\n  2. **H2** : Do equivariant components have unique functions, or does the advantage come just from higher dimensionality?\n\n  3. **H3** : Can the erasure module effectively control information entropy in long-sequence tasks?\n\n\n\n\nTo be honest, I cannot predict the experimental results of H1 and H3. I will organize and release all test data after the experiments are completed and the code is cleaned up.\n\n* * *\n\n## Repository Status (Honest Explanation)\n\n**Repository Link** : GitHub - val1813/BIIC: 一种可以替代token的研究，专业点叫：一种基于代数不变量分解的语言信息处理方法及系统。Algebraic Invariant Decomposition based Language Information Processing Method and System – A research on replacing token embeddings with algebraically grounded invariant & equivariant representations. 
· GitHub\n\nTo be frank about the current state of the repository:\n\n  * Most .py files in the `src/` and `tests/` directories are blank placeholder files;\n\n  * The JSON result files are local test snapshots, not output generated by the public repository code;\n\n  * Cloning the repository and running the test scripts directly will fail.\n\n\n\n\nI am actively organizing and migrating the complete local code. This post is meant to solicit feedback on the concept and the mathematics; it is not an official release.\n\nIf you work on geometric algebra or experimental design, or are also interested in the defects of flat token embeddings, feel free to reach out. Once the reproducible code is uploaded, I look forward to in-depth collaboration with interested researchers.\n\n* * *\n\n## References\n\n  * Brehmer et al. (2023). Geometric Algebra Transformer.\n\n  * Huy & Hirst (2026). Versor: A Geometric Sequence Architecture.\n\n  * Ji (2026). CliffordNet: All You Need is Geometric Algebra.\n\n  * Dasgupta et al. (2026). Invariant Features in Language Models.\n\n  * Wu & Zhang (2017). TET-mediated active DNA demethylation.\n\n\n",
  "title": "BIIC: Replacing Tokens with Geometric Algebra Multivectors — Early Results"
}