Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie6y2ce6csienbl5mbjvookpujtajnxrnhbx3qoktdxl4rncm7e24",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh5pkwpvl6a2"
  },
  "path": "/t/purpose-of-commit-hash-in-pretrainedmodel-from-pretrained/174304#post_1",
  "publishedAt": "2026-03-16T05:09:11.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’ve been digging into the source code of the `transformers` library and stumbled upon a detail regarding how files are fetched and cached that I’m hoping someone can clarify.\n\nSpecifically, I am trying to understand the exact role of the `commit_hash` argument within the `PreTrainedModel.from_pretrained` method, and how it differs from `revision`.\n\nMy initial research led me to believe that `commit_hash` is used to pin consecutive file downloads to a specific state. This prevents a race condition where a branch (like `main`) is updated halfway through downloading a multi-file model, which would result in mismatched files.\n\nLooking at the code, it seems to support this. First, it tries to obtain the `commit_hash` of the current revision early on by resolving the config file:\n\n\n    if commit_hash is None:\n        if not isinstance(config, PretrainedConfig):\n            # We make a call to the config file first (which may be absent) to get the commit hash as soon as possible\n            resolved_config_file = cached_file(\n                pretrained_model_name_or_path,\n                CONFIG_NAME,\n                # ... [other args omitted for brevity] ...\n                revision=revision,\n            )\n            commit_hash = extract_commit_hash(resolved_config_file, commit_hash)\n        else:\n            commit_hash = getattr(config, \"_commit_hash\", None)\n\n\n\nThis `commit_hash` is then passed down into `cached_file_kwargs` for subsequent loading code (like fetching the actual model weights):\n\n\n    cached_file_kwargs = {\n        # ... [other args] ...\n        \"revision\": revision,\n        \"_commit_hash\": commit_hash,\n    }\n    resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)\n\n\n\n**Here is my confusion:** When I look inside the `cached_file` function itself, I see that `_commit_hash` appears to be used only for _local cache checks_. If a download from the Hub is actually triggered, it still seems to rely on the `revision` argument. Other loading functions also don’t seem to strictly use the identified `commit_hash` for the remote fetch.\n\n**My questions:**\n\n  1. If `commit_hash` is primarily used just for local cache resolution, couldn’t the `revision` argument handle that on its own?\n\n  2. Does the underlying `huggingface_hub` download logic actually use this `commit_hash` to lock the remote fetch to that specific commit, or is my assumption about preventing mid-download revision changes incorrect?\n\n  3. Does `commit_hash` serve any other purposes?\n\n\n\n\nThank you in advance.",
  "title": "Purpose of commit_hash in PreTrainedModel.from_pretrained"
}