Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifgy34dsgzq7jqxydvay7e5677czgydd5sqwej3jcwza7ahjbk2xq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnt3ipobw2e2"
  },
  "path": "/t/holo-hsl-a-100m-change-rate-based-multimodal-toy-model-on-a-single-rtx-4070/176599#post_2",
  "publishedAt": "2026-06-09T01:10:26.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "ByT5",
    "MEGABYTE",
    "Byte Latent Transformer / BLT",
    "BLT",
    "Perceiver",
    "data2vec",
    "CLIP",
    "ImageBind",
    "Chameleon",
    "Unified-IO 2",
    "AnyGPT",
    "4M",
    "Show-o",
    "Model Cards",
    "HF model cards",
    "Datasheets for Datasets"
  ],
  "textContent": "Hmm, I’m not deeply familiar with this area myself, but after digging around a bit, I think the picture is roughly this:\n\n* * *\n\nI think the most useful way to sharpen this is to separate a few claims that can otherwise get blended together.\n\nThe part that seems most distinctive to me is not simply “byte-level modeling.” Byte-level / tokenizer-free modeling already has a real research context — for example ByT5, MEGABYTE, and Byte Latent Transformer / BLT.\n\nWhat seems more specific to HoLo/HSL is the attempt to use a byte-to-signal substrate — change-rate / spectral / phase-like features — as a shared low-level representation across modalities.\n\nSo the question I would try to make maximally clear is:\n\n> under controlled conditions, what part of the result comes from raw bytes, what part comes from dense packing / effective context, and what part comes from the HSL substrate itself?\n\nThat framing might make the project easier for outside readers to discuss, because it separates several different ideas:\n\nLayer | Question\n---|---\nByte-native modeling | Can bytes work as the common input/output unit?\nSequence-length handling | Does dense packing / larger effective context help?\nHSL substrate | Does the change-rate / spectral / phase feature map give a useful inductive bias?\nArchitecture | Does the dense encoder + byte-autoregressive decoder help beyond simpler byte baselines?\nMultimodal binding | Is the model learning cross-modal association, or mostly exploiting simple shortcuts in the toy setup?\nReproducibility | Which parts can outside readers reproduce directly, and which require the private encoder/API?\n\nThat is not a criticism. I think it is a useful way to make the interesting part stand out.\n\n## What I would compare it to conceptually\n\nI would not treat these as direct competitors or required baselines. I would treat them as orientation points: examples of adjacent projects that help clarify what kind of claim is being made.\n\nThread | Useful examples | What they help clarify\n---|---|---\nByte-native / tokenizer-free modeling | ByT5, MEGABYTE, BLT | Bytes remove tokenizer assumptions, but sequence length and compute allocation become central.\nModality-agnostic perception | Perceiver, data2vec | How much modality-specific machinery can be removed before learning becomes unstable or inefficient?\nCross-modal binding | CLIP, ImageBind | Matched/mismatched and retrieval-style probes can make binding evidence easier to interpret.\nUnified multimodal systems | Chameleon, Unified-IO 2, AnyGPT, 4M, Show-o | Many unified systems still rely on tokenized or discrete modality representations; HoLo/HSL seems to be asking a lower-level substrate question.\nReproducible reporting | Model Cards, HF model cards, Datasheets for Datasets | Useful patterns for making “what is claimed / what is not claimed / what is reproducible” easy to read.\n\nThe main distinction I would emphasize is:\n\n> HoLo/HSL is probably not best framed as a small Chameleon or a small Unified-IO. Those systems mostly unify modalities after tokenization or discrete representation design. HoLo/HSL seems to be asking whether a byte-to-signal substrate can sit below modality-specific tokenization.\n\nThat makes the idea easier to place.\n\n## What I would try to isolate next\n\nIf the next goal is to make the mechanism easier for others to evaluate, I think small controlled experiments would be more useful than a larger demo.\n\nThe most useful next table might be something like this:\n\nExperiment | What it isolates\n---|---\nSame-parameter raw-byte baseline | Whether the gain is coming from model size.\nSame-context raw-byte baseline | Whether the gain is coming from larger effective context.\nSame-FLOP baseline | Whether the method is actually more compute-efficient.\nSame architecture + learned byte embedding | Whether the handcrafted substrate helps beyond a learned byte representation.\nSame architecture + simple public byte features | Whether the open architecture works without the private substrate.\nSame architecture + random fixed invertible map | Whether invertibility alone explains anything.\nHSL feature-family ablations | Which parts of the substrate matter: bit-level, first-order change, second-order change, spectral, phase-like, etc.\nMulti-seed small runs | Whether the reported deltas are stable at toy scale.\n\nThe key thing I would want to know is not just:\n\n> does HSL beat raw bytes?\n\nbut rather:\n\n> under matched parameter count, context length, FLOPs, data, and seed budget, which part of the system actually moves the metric?\n\nThat would make the central claim much easier to evaluate.\n\n## Substrate ablations that might be especially informative\n\nSince the substrate is the unusual part, I think ablations around it would be very valuable.\n\nFor example:\n\nAblation / control | What it tests\n---|---\nBit-level only | Are raw byte/bit features already enough?\nBit + first-order change | Does change-rate help?\nRemove second-order change | Does acceleration-like structure matter?\nRemove spectral channels | Are frequency-domain features useful?\nRemove phase-like channels | Are phase-like features doing work?\nShuffle feature channels | Does feature geometry matter, or only feature capacity?\nRandom fixed invertible map | Is “lossless/invertible” enough, or does this specific geometry matter?\nLearned byte projection | Can the model learn an equivalent representation by itself?\nPublic surrogate substrate | Can outside readers reproduce the architecture-side result without the private encoder?\n\nThe “invertible” part is important, but I would not stop there. A random invertible map is also information-preserving. The more interesting question is whether this particular feature geometry gives the model a useful inductive bias.\n\n## Binding probes\n\nThe matched-vs-mismatched bpb results are interesting as binding probes. I would probably strengthen that direction with harder negatives and more intuitive retrieval-style metrics.\n\nPossible additions:\n\nProbe | Why it helps\n---|---\nSame-class wrong-instance negatives | Reduces the chance that the model only learned the class label.\nSame-length caption negatives | Reduces trivial caption-length effects.\nEntropy-matched caption negatives | Reduces byte-distribution shortcuts.\nImage-only / audio-only / image+audio ablations | Shows which modality is actually contributing.\nShifted video windows | Tests whether nearby temporal context explains the gap.\nTop-k retrieval | Easier for readers to interpret than bpb alone.\nCross-dataset transfer | Tests whether the binding survives beyond the original toy distribution.\n\nFor example, in a digit setup, a stronger negative is not just “wrong digit word.” It could be:\n\n  * same digit class, wrong image instance;\n  * same digit class, wrong speaker/audio instance;\n  * same caption length;\n  * similar byte entropy;\n  * image-only vs audio-only vs image+audio.\n\n\n\nThat would help separate “the model found a shortcut” from “the model is actually binding the modalities.”\n\n## A small public reproducibility packet\n\nGiven that the original byte-to-signal encoder/codec is withheld, I think a small public reproducibility packet would help a lot.\n\nI do not think the withheld substrate is a reason to dismiss the project. But it does change what outside readers can verify. A compact evaluation packet could make the open parts much easier to discuss.\n\nSomething like:\n\nPublic artifact | Why it helps\n---|---\nFixed tiny dataset split | Everyone tests the same examples.\nExported feature tensors | People can train/evaluate the open architecture without the private encoder.\nPublic surrogate feature extractor | People can test whether the architecture-side idea survives without the original substrate.\nExpected metric outputs | Easy sanity check for reproduction.\nExact train/eval commands | Lowers reproduction friction.\nConfigs for matched baselines | Makes comparisons less ambiguous.\nSeeds and logs | Helps separate real deltas from toy-scale variance.\nSmall checkpoint | Lets people verify the inference path.\nOne-page claim/evidence/caveat table | Helps forum readers understand the project quickly.\n\nThat might be more useful at this stage than trying to scale the demo immediately.\n\n## A possible “claim table”\n\nOne thing that might help readers is a short table like this in the repo or paper:\n\nClaim | Current evidence | Caveat | Next test\n---|---|---|---\nByte-native pipeline runs end-to-end | Text/image/audio/video-like streams can pass through one pipeline | Does not yet imply strong generation quality | Fixed reproducible demo + small checkpoint\nDense input helps | AsymHSL direction looks promising | Params/context are confounded | Same-param, same-context, same-FLOP baselines\nHSL substrate is useful | Feature-based runs are plausible | Exact substrate is private; effect not fully isolated | Public surrogate + ablations\nCross-modal binding appears | Matched/mismatched bpb gaps | Possible shortcuts in toy setup | Hard negatives + retrieval\nOne architecture can process multiple modalities | Same model family runs on several byte streams | Decoded/normalized streams are not arbitrary raw media formats | Clear preprocessing description\n\nThat kind of table would prevent readers from arguing about the wrong claim.\n\n## Minor documentation clarifications that might help\n\nA few small clarifications could make the project easier to read from the outside:\n\nClarification | Why it helps\n---|---\nWhat exactly counts as “video” in each experiment | Raw MP4 bytes, decoded frames, frame/audio/caption windows, and generated raster frames are different things.\nWhich results require the private encoder | Helps readers know what can be reproduced directly.\nWhich results can run with a substitute feature map | Makes the open architecture easier to test.\nExact preprocessing for image/audio/video streams | Reduces confusion around raw media vs decoded/normalized byte streams.\nWhich datasets are toy proxies vs real-world sources | Helps calibrate the strength of the evidence.\nWhich comparisons are matched vs intentionally not matched | Prevents over-reading of early numbers.\nA short “what this is / what this is not” section | Useful for readers who only skim the repo.\n\nAgain, I do not mean this as criticism. I think the unusual part of the project is interesting enough that making the boundaries extra explicit would help people give better feedback.\n\n## Bottom line\n\nMy bottom-line read is:\n\nHoLo/HSL seems most interesting as a mechanism-level PoC around a lower-level multimodal representation idea.\n\nThe strongest next step, in my opinion, would be to make the core claim easier to isolate:\n\n  1. separate byte-native modeling from the HSL-substrate claim;\n  2. compare against matched raw-byte / learned-byte / public-feature baselines;\n  3. ablate the substrate feature families;\n  4. strengthen matched/mismatched binding tests with hard negatives and retrieval metrics;\n  5. publish a tiny reproducibility packet for the open architecture path.\n\n\n\nThat would make it easier to discuss the idea on its own terms, without forcing it into either “new SOTA model” or “not useful yet.”",
  "title": "HoLo/HSL: a 100M change-rate-based multimodal toy model on a single RTX 4070"
}