HoLo/HSL: a 100M change-rate-based multimodal toy model on a single RTX 4070
Thank you for this review — it is exactly the kind of feedback this project needs, and we are adopting your framing more or less wholesale. The layer separation you propose (byte-native / dense packing / HSL substrate / architecture / binding / reproducibility) is now literally our experiment plan. A few things can be answered today; the rest we are committing to with a concrete roadmap.
One factual correction first
“the original byte-to-signal encoder/codec is withheld”
This premise is no longer true (and mostly never was). The encoder is fully public: `pip install hsl-embedding`, MIT license, with formulas, tests, and codec included. The only private artifacts are the trained HoLo model weights, which are work in progress. So every encoder-side claim is directly reproducible from PyPI today; no surrogate substrate is needed. (Your “which results require the private encoder” table item resolves to: none of the substrate-side ones.)
Shipped today, in direct response to this review (v0.5.0)
The package now contains `hsl_embedding/ablation.py`, so that anyone (including you) can run the controlled comparisons you asked for with a one-line swap:
- `ControlEmbedding(kind, seed)`, with four variants sharing the identical 27-D layout and the same 9 context dims:
  - `hsl`: bit-identical to the real encoder, test-enforced.
  - `learned`: trainable byte projection, +4,608 params (your “learned byte projection” arm).
  - `random`: seeded random injective LUT, moment-matched per channel (your “random fixed invertible map” arm).
  - `permuted`: HSL’s own 256 LUT rows shuffled, so per-channel distributions are exactly identical and only the value-adjacency geometry is destroyed. We think this is a sharper control than channel shuffling for the capacity-vs-geometry question.
- `feature_groups()` / `select_channels()` for the feature-family ablations (drop Δ², drop FFT, …), plus a `value` (18) / `context` (9) split.
- `value_lut()`: the frozen 256×18 table exported as a tensor (your “exported feature tensors” item).
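To make the idea behind the `permuted` arm concrete, here is a minimal standard-library sketch. It does not use the package: the stand-in table (8 Gray-code bits per byte) and the helper names `gray_bits`, `permute_rows`, and `adjacent_flips` are assumptions made for illustration, not the real 256×18 LUT or the `hsl_embedding` API. It shows only that shuffling whole rows leaves every channel's value distribution untouched while destroying the adjacency structure.

```python
import random

def gray_bits(v):
    """Stand-in value row: the 8 bits of v ^ (v >> 1). NOT the real 256x18 LUT."""
    g = v ^ (v >> 1)
    return [(g >> i) & 1 for i in range(8)]

def permute_rows(lut, seed):
    """Shuffle whole rows with a seeded RNG; each column keeps its multiset of values."""
    order = list(range(len(lut)))
    random.Random(seed).shuffle(order)
    return [lut[i] for i in order]

def adjacent_flips(lut):
    """Mean number of channels that change between the rows for v and v + 1."""
    diffs = [
        sum(a != b for a, b in zip(lut[v], lut[v + 1]))
        for v in range(len(lut) - 1)
    ]
    return sum(diffs) / len(diffs)

lut = [gray_bits(v) for v in range(256)]
permuted = permute_rows(lut, seed=0)

# Per-channel distributions are identical before and after the shuffle.
same_marginals = all(
    sorted(row[c] for row in lut) == sorted(row[c] for row in permuted)
    for c in range(8)
)
# The shuffled table is still injective: 256 distinct rows.
still_injective = len({tuple(row) for row in permuted}) == 256

print(same_marginals, still_injective)
print(adjacent_flips(lut), adjacent_flips(permuted))
```

In the stand-in table, adjacent rows differ in exactly one channel before the shuffle and in several on average afterwards, which is the geometry-only difference the `permuted` control is meant to isolate.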
What we can already state (with honest caveats)
Structural facts (verifiable from the public package):
- The per-byte Δ is exactly the binary-reflected Gray code `v ^ (v >> 1)`: adjacent byte values differ in exactly one Δ coordinate (raw bits: up to 8). This is the mathematical content behind the “change-rate” framing, and it connects the substrate to the minimal-change-encoding literature.
- The 27-D base decomposes as a frozen 256×18 value LUT + 9 context dims (Δ², boundary). So the substrate question reduces cleanly to: does this particular frozen embedding geometry beat a learned/random/permuted one at matched everything? (Linear rank of the value dims is 17/18, with one dependency: `dxor0` lies in the FFT span. So we will not claim “18 independent channels”.)
- One channel-scale caveat we will control for: `fft_re0` (the DC term) spans 0–8 while other channels are ±1–2, so input-normalization placement will be held identical across all ablation arms.
- The FFT dims are the spectrum of the bit pattern of each byte, not a temporal spectrum of the waveform. Your documentation-clarity point is correct and the docs now say so.
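The Gray-code property in the first bullet can be checked in a few lines without installing anything. This is a sketch of the general identity only; the helper names `gray` and `hamming` are made up here, and how the package lays the eight Δ bits out across channels is not assumed.

```python
def gray(v):
    """Binary-reflected Gray code of a byte value."""
    return v ^ (v >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Bit flips between each pair of adjacent byte values (0->1, 1->2, ..., 254->255).
gray_steps = [hamming(gray(v), gray(v + 1)) for v in range(255)]
raw_steps = [hamming(v, v + 1) for v in range(255)]

# The map is a bijection on 0..255, so it carries the same information as raw bits.
is_bijection = len({gray(v) for v in range(256)}) == 256

print(set(gray_steps))  # {1}: every adjacent pair differs in exactly one bit
print(max(raw_steps))   # 8: the raw-bit worst case is 127 -> 128
print(is_bijection)     # True
```

Because both encodings are 8-bit bijections of the same byte, a raw-bits versus Δ comparison changes only which values sit next to each other, not how much information or how many dimensions the model sees.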
Preliminary measurements (small scale, single-seed, prior encoder revision — to be re-run multi-seed with the ablation kit before we treat them as findings):
- Same decoder-only architecture, same data/steps: 27-D HSL input 2.058 bpb vs learned byte embedding 2.118 bpb on a byte-level LM task. This is the single number that most needs the multi-seed matched-baseline treatment you describe, and it is first in the queue.
- Architecture axis at matched data/budget: decoder-only prefix-LM (~11M params) outperformed an encoder–decoder twice its size (2.227 vs 2.275 bpb) across all depths tested.
- Mechanism (not quality) results for the disk-offload tier: with the answer present only in a disk-resident value, retrieval-ON reaches 1.000 task accuracy while ablated retrieval and no-memory controls sit at chance — the read mechanism is load-bearing, not decorative.
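For readers comparing these numbers with their own runs, the sketch below shows one common way to obtain bits per byte (bpb) from a byte-level language-model loss. It assumes the loss is a summed cross-entropy in nats over the evaluated bytes; the function name `bits_per_byte` is hypothetical, and the post does not state how its bpb figures were computed.

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood in nats into bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))

# Sanity check: a model that is uniform over 256 byte values scores 8 bpb.
uniform_bpb = bits_per_byte(1000 * math.log(256), 1000)

# Gap between the two preliminary numbers quoted above (single seed, toy scale).
hsl_bpb, learned_bpb = 2.058, 2.118
gap = learned_bpb - hsl_bpb
relative_gap = gap / learned_bpb

print(round(uniform_bpb, 6))
print(round(gap, 3), f"{relative_gap:.1%}")
```

The 0.060 bpb gap is a bit under 3% of the baseline, which is small enough that seed-to-seed variance could plausibly account for it until the multi-seed runs are in.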
Claim table (current, your format)
| Claim | Evidence today | Caveat | Next test |
|---|---|---|---|
| Byte-native pipeline runs end-to-end | text/chat/knowledge/video (539B windows) through one trainer | works ≠ quality; generation demo pending trained ckpt | fixed demo + small checkpoint |
| HSL substrate is useful beyond learned bytes | 2.058 vs 2.118 bpb (matched arch/data) | single seed, toy scale, prior encoder rev | multi-seed ControlEmbedding A/B (hsl/learned/random/permuted) |
| Substrate geometry (not just invertibility) matters | Gray-code structure; permuted control exists | unmeasured | the raw-bits(8) vs Δ(8) minimal pair — identical information/dims/scale, geometry only |
| Dense prefix + byte-AR decoder helps | dec-only beat 2× enc-dec at matched budget | params/context confounds partially addressed, not fully | same-param/-context/-FLOP grid |
| Cross-modal binding | matched/mismatched gaps in earlier prototypes | shortcuts not excluded | hard negatives (same-class wrong-instance, entropy-matched) + top-k retrieval |
| Knowledge lives on disk, not FFN | ON 1.000 / ablated chance | mechanism proof, synthetic facts | knowledge-mode training on a real 73k-fact store (wired, training next) |
What is running / next
A depth sweep on the final wired architecture is finishing now; the first full training run on the pinned public encoder follows, then: (1) the multi-seed substrate ablations above, (2) binding probes with your hard-negative list, (3) a small reproducibility packet — fixed tiny split, exact commands, seeds/logs, a small checkpoint, and the claim/evidence/caveat table maintained in the repo.
If you want to poke at the substrate before any of that lands: `pip install hsl-embedding` (>= 0.5.0), then `from hsl_embedding.ablation import ControlEmbedding`. The four variants are a one-line swap in any byte-LM training loop. A complete runnable comparison (`examples/substrate_ablation.py`) ships in the source distribution on PyPI. Thanks again: the review measurably improved the package within a day of being posted.