{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicfipnoxprr75z2ts5ffox2r7msslxn2kbopeuwbaikes6ebutjui",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmrycvbdlpy2"
  },
  "path": "/t/i-shrank-olmo-3-7b-think-by-3b-params-it-still-works/176246#post_1",
  "publishedAt": "2026-05-26T21:33:04.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "The model",
    "PMRA"
  ],
  "textContent": "Hi all, I have created a new (not like, without any debts to prior art new) compression method called Variable Allocation Compression, and I was able to use it to shrink Ai2’s OLMo-3-7B-Think 1.8x. The modelretains instruction following, chat completion, and thinking/reasoning capabilities. Evals are expensive so I’m trying to be judicious about how I test this thing, but it definitely doesn’t feel like a model that’s had 3 billion weights removed from it. (PPL increase of 5.92, my model ended recovery training at 26.97, the original is 21.05)\n\nAi2 has such a wealth of knowledge they give away for free, which has made this possible. If anyone is interested, I’m going to continue to develop and document the compression method, and work on finding a solution which allows me to use ggufs for these (the factorized weights don’t currently support them)\n\nI welcome anyone with a GPU to take a look at the model, and to check out my custom mixed-tensor quanted models (PMRA) which have better NLL at smaller payloads than many others.",
  "title": "I shrank OLMo-3-7B-Think by 3B Params & It still works!"
}