Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihxuf4uncvhlmnkuo7w4myznjrinhlzw7htlqkdz3vgoe3xleczdq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mndrfqajynj2"
  },
  "path": "/t/interest-in-preprocessing-utilities-for-multifile-model-uploads/176211#post_3",
  "publishedAt": "2026-06-02T23:26:08.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Fantastic feedback John, and I really appreciate you taking the time to write that.\n\nThinking over this, I would say the motivating factors are\n\n  * There are some serious deficiencies in dynamic_utils I know of, and likely elsewhere, due to the single file pattern. The one I know of specifically: if you import multiple files which in turn import other files, then save a pretrained model and load it, the load fails when done locally but not when done remotely. Some things are really broken in that package. Incidentally, if anyone wants, I could probably fix it.\n  * In the meantime, it would be really useful to have a mechanism that can attempt to fix the problem I faced: You have a model you cannot, in fact, upload to huggingface for some reason. Having an inliner flag in the save_pretrained mechanism would be a fairly ideal case I think. Its purpose is to let repositories with multiple interdependent files be uploaded if possible. If it cannot be, it would clearly state why and how to fix it. That is the responsibility in a nutshell.\n\n\n\nLet’s assume we provide an additional preprocessor for this. I would insert it in transformers at transformers.modeling_utils in PreTrainedModel using an additional flag on save_pretrained; this flag would also be passed along to push_to_hub via any needed modifications in transformers.utils.hub. In the meantime, we might consider fixing those bugs in dynamic_utils (I could build a more robust import resolution system if anyone is interested), but an explicit opt-in system seems like the right choice for now, given I do not know everything I would be breaking if I rebuilt the resolution system; I would really want to talk to the maintainers first. The longer term fix would likely be to make dynamic_utils recursively parse multilevel imports (that is, from .model.attention.cache works too, not just from .model), and perhaps represent the files in a standardized flat format on the hub. 
All files are individually retained, but their path in the original repo is parsable by file name. This is not, currently, globally supported, unfortunately. But again, before I start doing surgery I really want to know the patient better. And this is an excellent fix for now.\n\nRegarding your design questions\n\n  1. The compiled output is human readable by design. Unlike a pickle dump, a parser moves the files together with all relevant comments retained. I was really annoyed that I could never read anything on the hub, so this preprocessor deliberately retains comments; since I refuse to write bad code, that is what I ended up with.\n  2. I would say in the save_pretrained step; push_to_hub uses save_pretrained as one of its actions.\n  3. This utility fills a niche: I have a nicely organized repo with multiple folders and many files, and want to push it to the hub. The existing system cannot even support that. So I would think of it as a push converter.\n  4. I am not sure I understand the mental model where they can differ. One is an entire diverse project with many interconnected imports. The other is a flat file. I could generate it multiple times. But how exactly would automatic CI testing work? The main problem is the same as tracing; I don’t know what inputs to start from. I could make a CI test and unit test for the code itself.\n  5. There are two levels here. One is the design question: is this inlining transform always going to be semantically equivalent? You can formally verify this by using the idea of a directed acyclic graph. If a dependency graph of imports can be built that is acyclic, then the system can be confirmed to be inlinable without conflicts. In practice, this means keeping track of observed inlined files, and if files are trying to inline each other they are cyclic and not supported. This would also not have been supported in normal Python, where it causes a circular import error.\n  6. That logic is broken. 
That is why I made this in the first place. If people want me to fix it, I know where most of the breaks are.\n7-8: It already does that. It is very readable.\n\n\n\nRegarding the problem, it is not even really aimed at the hub per se, but at uploading to the hub and covering some inconsistencies between huggingface as it is now and as it originated. The problem it solves is \"I have a really complicated project that deserves lots of files and folders, but huggingface has hardwired assumptions that it turns out I need.\"",
  "title": "Interest in preprocessing utilities for multifile model uploads"
}