{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihoskgapjkjeyaawrupkfsvyz44yjjdue4pqbwdl7b3fdtc6wmmrm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh5wbparhvv2"
  },
  "path": "/t/purpose-of-commit-hash-in-pretrainedmodel-from-pretrained/174304#post_2",
  "publishedAt": "2026-03-16T06:50:09.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub"
  ],
  "textContent": "Oh… Complicated…\n\n* * *\n\n`revision` and `_commit_hash` are related, but they are **not the same thing**.\n\n  * **`revision`** = what the caller asked for. It can be a branch, tag, PR ref, or commit hash. That is the public Hub API. (Hugging Face)\n  * **`_commit_hash`** = the exact immutable commit that `revision` resolved to during loading. Transformers tries to discover it early and then carry it through later steps. (GitHub)\n\n\n\n## Why `_commit_hash` exists at all\n\nThe key background is the Hub cache layout.\n\nHugging Face stores symbolic refs like `main` separately from the actual immutable snapshots. In the cache, `refs/` holds mappings such as `main -> <commit>`, while `snapshots/<commit>/...` holds the actual file tree for that exact revision. So a moving ref like `main` is **not** the same thing as a concrete cached snapshot. (Hugging Face)\n\nThat is why `revision` alone is not always enough once loading has started. If Transformers already knows the exact commit, `_commit_hash` lets it target the precise cached snapshot instead of re-resolving a symbolic ref. (GitHub)\n\n## What `from_pretrained` is doing\n\nYour reading of the code is correct.\n\n`from_pretrained` first tries to resolve the config file specifically to learn the commit hash “as soon as possible”, then extracts that hash and passes it along as `_commit_hash` in later calls. (GitHub)\n\nThat exact commit is also preserved in config-loading code. If `_commit_hash` is found in the loaded config, Transformers keeps propagating it instead of discarding it. (GitHub)\n\nSo `_commit_hash` is not a useless internal leftover. It is deliberate state that gets threaded through the loading pipeline. 
(GitHub)\n\n## What `_commit_hash` is actually used for\n\nInside `cached_file` / `cached_files`, `_commit_hash` is primarily used for **exact cache lookup**.\n\nThe docstring says it is passed when chaining several file loads and that, if files are already cached for that commit hash, Transformers can “avoid calls to head and get from the cache.” The implementation then checks `try_to_load_from_cache(... revision=_commit_hash ...)` before doing any remote download. (GitHub)\n\nSo the main purpose is:\n\n  1. **exact local cache resolution**\n  2. **better offline/cache behavior**\n  3. **provenance propagation** across config/tokenizer/model loading steps (GitHub)\n\n\n\n## Does it pin later remote downloads too?\n\nThis is the subtle part.\n\n### In the normal single-file `cached_file(...)` path:\n\nNo, not directly.\n\nIf the file is **not** already found via the `_commit_hash` cache fast path, Transformers falls back to `hf_hub_download(... revision=revision ...)`. In other words, the actual remote call still uses the original `revision`, not `_commit_hash`. (GitHub)\n\nSo your “it locks all later remote fetches to the same commit” interpretation is **too strong** for that path.\n\n### In the multi-file `snapshot_download(...)` path:\n\nYes, effectively.\n\n`snapshot_download` resolves the requested revision once, gets `repo_info.sha` as `commit_hash`, stores the `refs/<revision>` mapping if needed, and then works from that resolved commit snapshot. (GitHub)\n\nSo the “resolve once, then pin to one exact commit” behavior is real, but it is most clearly implemented in `snapshot_download`, not in every single-file `hf_hub_download` fallback from `cached_file`. (GitHub)\n\n## Direct answers to your questions\n\n### 1. If `_commit_hash` is mainly for local cache resolution, couldn’t `revision` handle that too?\n\nNot as well.\n\n`revision=\"main\"` is only a symbolic name. The cache ultimately needs an exact snapshot directory keyed by commit hash. 
Once Transformers already knows the resolved commit, `_commit_hash` is the stricter and more useful key. (Hugging Face)\n\n### 2. Does `huggingface_hub` use this commit hash to lock the remote fetch?\n\n**Not in the general single-file `hf_hub_download` path used by `cached_file`.** That fallback still passes `revision=revision`. (GitHub)\n\n**Yes in `snapshot_download`.** That code resolves the revision to an exact SHA and then uses that resolved snapshot. (GitHub)\n\n### 3. Does `_commit_hash` serve other purposes?\n\nYes.\n\nBesides exact cache lookup, it also serves as **provenance metadata** that gets propagated through config loading and later file resolution. That way later steps know not just “the user asked for `main`”, but “this load was actually resolved from commit X”. (GitHub)\n\n## Bottom line\n\nThe most accurate summary is:\n\n> `revision` is the user-facing ref. `_commit_hash` is the exact resolved commit.\n> Transformers uses `_commit_hash` mainly to make chained loads more deterministic by targeting the exact cached snapshot and carrying forward the resolved provenance.\n> It is **not** just redundant with `revision`, but it is also **not** a universal remote-download lock in every `from_pretrained` code path. That stronger “pin everything to one commit” behavior is most clearly provided by `snapshot_download`. (GitHub)\n\nIf you want a one-line version:\n\n> `_commit_hash` is mostly an **internal exact-snapshot key**, while `revision` is the **public ref you asked for**.",
  "title": "Purpose of commit_hash in PreTrainedModel.from_pretrained"
}