Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreia33knop3dww2kjeygiyaer3l3nvgm6i24sf4s22sbmohbtfgxpy4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlat5i7tzku2"
  },
  "path": "/t/running-out-of-memory-in-the-summary-example/175777#post_2",
  "publishedAt": "2026-05-07T07:05:06.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "torch.mps.current_allocated_memory()",
    "torch.mps.driver_allocated_memory()",
    "torch.mps.empty_cache()",
    "Hugging Face summarization notebook",
    "Hugging Face summarization task guide",
    "Hugging Face course: summarization",
    "T5-small model page",
    "XSum dataset page",
    "PyTorch issue: MPS memory leak in training with transformers Trainer",
    "PyTorch issue: MPS memory leak with variable batch size / sequence length",
    "Hugging Face Transformers issue: Trainer class causes massive memory leak when using MPS",
    "PyTorch forum: MPS backend OOM with small allocated memory and large other allocations",
    "PyTorch issue: MPS LSTM loop leaks despite torch.mps.empty_cache()",
    "PyTorch issue: driver_allocated_memory() grows unrestricted",
    "PYTORCH_MPS_HIGH_WATERMARK_RATIO",
    "Hugging Face Trainer docs",
    "Transformers summarization script",
    "Transformers no-Trainer summarization script",
    "Transformers example scripts guide",
    "PyTorch MPS package docs",
    "PyTorch torch.mps.current_allocated_memory()",
    "PyTorch torch.mps.driver_allocated_memory()",
    "PyTorch torch.mps.empty_cache()",
    "PyTorch MPS environment variables",
    "Apple: Accelerated PyTorch training on Mac",
    "Hugging Face example scripts guide",
    "PyTorch: MPS memory leak in training with transformers Trainer",
    "PyTorch: MPS memory leak with variable batch size / sequence length",
    "Transformers: Trainer class causes massive memory leak when using MPS",
    "PyTorch forum: MPS backend out of memory on Mac M2",
    "PyTorch: MPS LSTM leak despite empty_cache()",
    "PyTorch: driver_allocated_memory() grows unrestricted",
    "PyTorch: MPS memory leak minimal examples",
    "Transformers run_summarization.py",
    "Transformers run_summarization_no_trainer.py"
  ],
  "textContent": "There seems to be an issue related to MPS that looks like a memory leak:\n\n* * *\n\n# MPS out-of-memory while running the Hugging Face summarization notebook on Mac M2\n\n## Direct diagnosis\n\nThis does **not** look like a simple “`t5-small` is too large” problem, and it also does **not** look like something `torch.mps.empty_cache()` is expected to fix.\n\nThe key clue is this part of the error:\n\n\n    RuntimeError: MPS backend out of memory\n    MPS allocated: 4.20 GiB\n    other allocations: 43.49 GiB\n    max allowed: 47.74 GiB\n    Tried to allocate 51.44 MiB on private pool.\n    Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations\n\n\nThe important part is the split:\n\n\n    MPS allocated: 4.20 GiB\n    other allocations: 43.49 GiB\n\n\nThat strongly suggests the model’s **live PyTorch tensor memory** is not the whole problem. The large number is in **MPS/Metal-driver/backend-side allocations**.\n\nPyTorch has separate MPS memory counters:\n\n  * torch.mps.current_allocated_memory() reports current GPU memory occupied by tensors and does **not** include cached allocations in MPSAllocator pools.\n  * torch.mps.driver_allocated_memory() reports total GPU memory allocated by the Metal driver and includes cached MPSAllocator pools plus allocations from MPS/MPSGraph frameworks.\n  * torch.mps.empty_cache() releases unoccupied cached memory held by the caching allocator; it does not promise to clear all MPSGraph, driver, or backend allocations.\n\n\n\nSo the likely diagnosis is:\n\n> A long, variable-shape, sequence-to-sequence Hugging Face `Trainer` run is causing MPS driver/backend memory to grow until it hits the MPS allocation limit.\n\nThat is different from ordinary model OOM. 
Ordinary model OOM usually means “the live tensors for this batch do not fit.” Your error looks more like “backend/driver allocations have grown over time.”\n\n* * *\n\n## Why this notebook is a bad fit for a default Mac M2 MPS run\n\nThe Hugging Face summarization notebook is a teaching notebook, not a carefully constrained Apple Silicon training recipe.\n\nIt uses the XSum summarization task and `t5-small`. Relevant links:\n\n  * Hugging Face summarization notebook\n  * Hugging Face summarization task guide\n  * Hugging Face course: summarization\n  * T5-small model page\n  * XSum dataset page\n\n\n\nThe notebook’s default-style setup is roughly:\n\n\n    model_checkpoint = \"t5-small\"\n    raw_datasets = load_dataset(\"xsum\")\n    metric = load(\"rouge\")\n\n    max_input_length = 1024\n    max_target_length = 128\n    batch_size = 16\n\n    Seq2SeqTrainingArguments(\n        ...,\n        per_device_train_batch_size=batch_size,\n        per_device_eval_batch_size=batch_size,\n        predict_with_generate=True,\n        fp16=True,\n        push_to_hub=True,\n    )\n\n\nThose settings are heavy for local MPS training because summarization is an encoder-decoder sequence-to-sequence task:\n\n  * the input document can be long;\n  * the target summary is generated token by token;\n  * training stores encoder activations, decoder activations, gradients, optimizer state, attention intermediates, labels, and temporary tensors;\n  * evaluation with `predict_with_generate=True` uses generation, which is more memory-heavy than plain loss evaluation;\n  * ROUGE evaluation requires generated predictions and decoded text;\n  * dynamic padding creates changing tensor shapes from batch to batch.\n\n\n\nEven though `t5-small` is small compared with modern LLMs, this workload is not small in the memory-behavior sense. 
A long-input seq2seq model with batch size 16 and source length up to 1024 is quite aggressive for MPS.\n\n* * *\n\n## Why “around 1300 steps” matters\n\nThe fact that memory grows over time and fails after something like 1300 steps is important.\n\nIf the batch were simply too large, I would expect failure very early, often on the first few steps. A late failure suggests one of these:\n\n  1. backend memory accumulation;\n  2. allocator cache growth;\n  3. shape-specific graph/kernel/resource accumulation;\n  4. fragmentation;\n  5. a real MPS backend leak;\n  6. retained objects in a notebook process;\n  7. evaluation/checkpointing side effects if the failure occurs near those events.\n\n\n\nYour specific error strongly points to MPS backend/driver allocations because the `other allocations` number is enormous.\n\nThere are similar public reports:\n\n  * PyTorch issue: MPS memory leak in training with transformers Trainer\nThis is the closest match. It reports `transformers Trainer` on MPS hitting OOM after several hundred iterations, especially with varying data lengths. 
It also notes that MPS allocated memory appears unchanged while backend memory runs out.\n\n  * PyTorch issue: MPS memory leak with variable batch size / sequence length\nThis is relevant because summarization datasets naturally produce variable sequence lengths.\n\n  * Hugging Face Transformers issue: Trainer class causes massive memory leak when using MPS\nThis reports continuously growing process memory with `Trainer` on MPS.\n\n  * PyTorch forum: MPS backend OOM with small allocated memory and large other allocations\nThis has the same error shape: modest `MPS allocated`, huge `other allocations`.\n\n  * PyTorch issue: MPS LSTM loop leaks despite torch.mps.empty_cache()\nThis helps explain why `empty_cache()` does not solve this class of problem.\n\n  * PyTorch issue: driver_allocated_memory() grows unrestricted\nThis shows another MPS case where driver memory grows until an OOM with huge `other allocations`.\n\n\n\n\nThis is why I would treat your issue as likely MPS-backend-related, not just a notebook typo.\n\n* * *\n\n## Why dynamic padding is suspicious\n\nThe notebook intentionally defers padding to the data collator. That is usually good practice because each batch is padded only to the longest example in that batch, not to the global maximum.\n\nThe downside is that every batch may have a different shape.\n\nFor example, step shapes may look conceptually like this:\n\n\n    step 1: input shape [16, 742], labels [16, 68]\n    step 2: input shape [16, 1018], labels [16, 114]\n    step 3: input shape [16, 523], labels [16, 51]\n    step 4: input shape [16, 895], labels [16, 103]\n    ...\n\n\nOn CUDA, dynamic padding is usually a good memory/speed tradeoff. 
On MPS, public issues suggest that changing batch/sequence shapes can contribute to backend memory growth.\n\nThat makes dynamic padding one of the strongest suspects in your case.\n\nThe important tradeoff:\n\nStrategy | Benefit | Risk on MPS\n---|---|---\nDynamic padding | Less padding compute per batch | Many distinct shapes\nFixed padding | Fewer distinct shapes | More padding tokens\nLength bucketing | Fewer distinct shapes with less wasted padding | More setup\n\nFor your specific issue, I would test fixed padding even though it is less elegant.\n\n* * *\n\n## Why `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` is not the right fix\n\nYou are right to avoid this as the main answer.\n\nThe PyTorch MPS environment-variable docs describe:\n\n  * PYTORCH_MPS_HIGH_WATERMARK_RATIO as the hard allocation limit for the MPS allocator.\n  * Setting it to `0.0` disables the high-watermark limit.\n  * The docs warn that disabling the limit may cause system failure if system-wide OOM occurs.\n  * `PYTORCH_MPS_LOW_WATERMARK_RATIO` is the softer limit used for adaptive commit / garbage-collection behavior.\n\n\n\nSo this:\n\n\n    export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0\n\n\nmay postpone the error, but it does not fix the memory-growth slope. It removes the guardrail and can push your whole system into memory pressure or system OOM.\n\nA safer diagnostic, not a real fix, is something like:\n\n\n    export PYTORCH_MPS_HIGH_WATERMARK_RATIO=1.0\n    export PYTORCH_MPS_LOW_WATERMARK_RATIO=0.8\n\n\nThis may make failure happen earlier, but it can help show whether low-watermark cleanup behavior changes the memory curve. 
I would not treat it as the primary solution.\n\n* * *\n\n## Why `torch.mps.empty_cache()` did not help\n\n`torch.mps.empty_cache()` is not a general reset button.\n\nIt can release unoccupied cached memory held by the caching allocator, but it does not necessarily free:\n\n  * live tensors;\n  * retained Python references;\n  * MPSGraph framework allocations;\n  * driver allocations still considered active;\n  * shape-specific backend resources;\n  * command-buffer-related resources;\n  * actual backend leaks.\n\n\n\nThat matches the public MPS issues where memory growth continues even when `empty_cache()` is called repeatedly.\n\nSo this is not surprising:\n\n\n    torch.mps.empty_cache()\n\n\nIt may help in some allocator-cache situations, but it is not expected to fix a long-run MPS backend memory growth issue.\n\n* * *\n\n## What I would do first\n\n### 1. Instrument MPS memory correctly\n\nAdd a callback that logs both tensor memory and driver memory.\n\n\n    import torch\n    from transformers import TrainerCallback\n\n    def gb(x):\n        return x / 1024**3\n\n    def print_mps_memory(tag=\"\"):\n        if torch.backends.mps.is_available():\n            live = torch.mps.current_allocated_memory()\n            driver = torch.mps.driver_allocated_memory()\n            recommended = torch.mps.recommended_max_memory()\n            print(\n                f\"{tag} | \"\n                f\"live_tensors={gb(live):.2f} GiB | \"\n                f\"driver={gb(driver):.2f} GiB | \"\n                f\"recommended={gb(recommended):.2f} GiB\"\n            )\n\n    class MPSMemoryCallback(TrainerCallback):\n        def on_step_end(self, args, state, control, **kwargs):\n            if state.global_step % 50 == 0:\n                print_mps_memory(f\"step={state.global_step}\")\n\n        def on_evaluate(self, args, state, control, **kwargs):\n            print_mps_memory(f\"after_eval step={state.global_step}\")\n\n\nThen:\n\n\n    
trainer.add_callback(MPSMemoryCallback())\n\n\nInterpretation:\n\nObservation | Likely meaning\n---|---\n`live_tensors` grows steadily | real tensor retention, too-large graph, or Python reference retention\n`live_tensors` stable but `driver` grows | MPS allocator / MPSGraph / Metal-driver growth\ngrowth jumps after evaluation | generation / metrics / prediction accumulation\ngrowth appears only after notebook reruns | stale notebook references\nCPU stable but MPS grows | MPS-specific backend issue\nfixed padding flattens driver growth | dynamic-shape churn is probably the trigger\n\nFor your reported error, I would expect live tensor memory to remain much smaller than driver/backend memory.\n\n* * *\n\n### 2. Start from a smaller, MPS-friendly training configuration\n\nDo not start from the original notebook settings. Use a diagnostic configuration first:\n\n\n    from transformers import Seq2SeqTrainingArguments\n\n    args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-mps-debug\",\n\n        # Lower per-step memory.\n        per_device_train_batch_size=1,\n        gradient_accumulation_steps=16,\n\n        # Disable evaluation while diagnosing training memory.\n        eval_strategy=\"no\",\n\n        # Disable saving / pushing while diagnosing memory.\n        save_strategy=\"no\",\n        push_to_hub=False,\n\n        # Remove mixed precision as a variable.\n        fp16=False,\n        bf16=False,\n\n        # Trade speed for lower activation memory.\n        gradient_checkpointing=True,\n\n        # Keep macOS data loading simple.\n        dataloader_num_workers=0,\n        dataloader_pin_memory=False,\n\n        learning_rate=2e-5,\n        weight_decay=0.01,\n        num_train_epochs=1,\n        logging_steps=50,\n    )\n\n\nAlso set:\n\n\n    model.config.use_cache = False\n\n\nWhy these settings:\n\n  * `per_device_train_batch_size=1` greatly reduces per-step memory pressure.\n  * `gradient_accumulation_steps=16` keeps the effective 
batch size near the original batch size 16.\n  * `eval_strategy=\"no\"` answers the question: “Does training alone leak?”\n  * `save_strategy=\"no\"` removes checkpointing as a confounder.\n  * `push_to_hub=False` removes git/upload behavior as a confounder.\n  * `fp16=False` removes mixed-precision ambiguity on MPS.\n  * `gradient_checkpointing=True` reduces activation memory by recomputing activations during backward.\n  * `dataloader_num_workers=0` and `dataloader_pin_memory=False` simplify data loading on macOS.\n\n\n\nRelevant docs:\n\n  * Hugging Face Trainer docs\n  * Hugging Face summarization task guide\n\n\n\n* * *\n\n### 3. Reduce sequence lengths first\n\nChange:\n\n\n    max_input_length = 1024\n    max_target_length = 128\n\n\nto:\n\n\n    max_input_length = 512\n    max_target_length = 64\n\n\nThis reduces:\n\n  * encoder activation memory;\n  * decoder activation memory;\n  * attention memory;\n  * temporary tensors;\n  * generation memory later;\n  * shape variety.\n\n\n\nFor a Mac M2 diagnostic run, 1024/128 is too aggressive as the first attempt.\n\n* * *\n\n### 4. 
Test fixed padding\n\nThis is the most important diagnostic for your case.\n\nInstead of dynamic padding, try fixed padding:\n\n\n    max_input_length = 512\n    max_target_length = 64\n\n    def preprocess_function(examples):\n        inputs = [prefix + doc for doc in examples[\"document\"]]\n\n        model_inputs = tokenizer(\n            inputs,\n            max_length=max_input_length,\n            padding=\"max_length\",\n            truncation=True,\n        )\n\n        labels = tokenizer(\n            text_target=examples[\"summary\"],\n            max_length=max_target_length,\n            padding=\"max_length\",\n            truncation=True,\n        )\n\n        # Mask padded label positions with -100 so the loss ignores them.\n        pad_id = tokenizer.pad_token_id\n        model_inputs[\"labels\"] = [\n            [tok if tok != pad_id else -100 for tok in seq]\n            for seq in labels[\"input_ids\"]\n        ]\n        return model_inputs\n\n\nThen rebuild the tokenized dataset:\n\n\n    tokenized_datasets = raw_datasets.map(\n        preprocess_function,\n        batched=True,\n        load_from_cache_file=False,\n    )\n\n\nIf fixed padding makes memory stable or much flatter, then your main trigger is probably dynamic shape churn on MPS.\n\nIf fixed padding does not help, the issue is more likely a broader MPS backend / Trainer / seq2seq loop memory-growth problem.\n\n* * *\n\n### 5. 
Use a subset first\n\nDo not debug on the full XSum training set.\n\n\n    small_train = tokenized_datasets[\"train\"].select(range(10_000))\n    small_eval = tokenized_datasets[\"validation\"].select(range(500))\n\n\nThen:\n\n\n    trainer = Seq2SeqTrainer(\n        model=model,\n        args=args,\n        train_dataset=small_train,\n        eval_dataset=small_eval,\n        data_collator=data_collator,\n        processing_class=tokenizer,\n    )\n\n\nIf your installed Transformers version does not accept `processing_class`, use:\n\n\n    trainer = Seq2SeqTrainer(\n        model=model,\n        args=args,\n        train_dataset=small_train,\n        eval_dataset=small_eval,\n        data_collator=data_collator,\n        tokenizer=tokenizer,\n    )\n\n\nRun:\n\n\n    trainer.add_callback(MPSMemoryCallback())\n    trainer.train()\n\n\n* * *\n\n## A complete first-pass MPS-safe training cell\n\nThis is the sort of configuration I would try first.\n\n\n    import torch\n    from transformers import (\n        AutoModelForSeq2SeqLM,\n        DataCollatorForSeq2Seq,\n        Seq2SeqTrainingArguments,\n        Seq2SeqTrainer,\n    )\n\n    # Assumes `tokenizer`, `prefix`, `raw_datasets`, and `MPSMemoryCallback`\n    # are already defined in earlier cells.\n\n    model_checkpoint = \"t5-small\"\n\n    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)\n    model.config.use_cache = False\n\n    if torch.backends.mps.is_available():\n        model.to(\"mps\")\n\n    max_input_length = 512\n    max_target_length = 64\n\n    def preprocess_function(examples):\n        inputs = [prefix + doc for doc in examples[\"document\"]]\n\n        model_inputs = tokenizer(\n            inputs,\n            max_length=max_input_length,\n            padding=\"max_length\",\n            truncation=True,\n        )\n\n        labels = tokenizer(\n            text_target=examples[\"summary\"],\n            max_length=max_target_length,\n            padding=\"max_length\",\n            truncation=True,\n        )\n\n        # Mask padded label positions with -100 so the loss ignores them.\n        pad_id = tokenizer.pad_token_id\n        model_inputs[\"labels\"] = [\n            [tok if tok != pad_id else -100 for tok in seq]\n            for seq in labels[\"input_ids\"]\n        ]\n        return 
model_inputs\n\n    tokenized_datasets = raw_datasets.map(\n        preprocess_function,\n        batched=True,\n        load_from_cache_file=False,\n    )\n\n    small_train = tokenized_datasets[\"train\"].select(range(10_000))\n    small_eval = tokenized_datasets[\"validation\"].select(range(500))\n\n    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)\n\n    args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-mps-debug\",\n\n        learning_rate=2e-5,\n        weight_decay=0.01,\n        num_train_epochs=1,\n\n        per_device_train_batch_size=1,\n        gradient_accumulation_steps=16,\n\n        gradient_checkpointing=True,\n\n        eval_strategy=\"no\",\n        save_strategy=\"no\",\n        push_to_hub=False,\n\n        fp16=False,\n        bf16=False,\n\n        dataloader_num_workers=0,\n        dataloader_pin_memory=False,\n\n        logging_steps=50,\n    )\n\n    trainer = Seq2SeqTrainer(\n        model=model,\n        args=args,\n        train_dataset=small_train,\n        eval_dataset=small_eval,\n        data_collator=data_collator,\n        processing_class=tokenizer,\n    )\n\n    trainer.add_callback(MPSMemoryCallback())\n    trainer.train()\n\n\nIf `processing_class=tokenizer` fails because of your Transformers version, replace it with:\n\n\n    tokenizer=tokenizer\n\n\n* * *\n\n## Re-enable evaluation only after training is stable\n\nOnce training alone is stable, add evaluation carefully.\n\nFor summarization, evaluation is expensive because `predict_with_generate=True` runs generation. Hugging Face documents `predict_with_generate` as using `generate()` to calculate generative metrics such as ROUGE/BLEU.\n\nAlso, `eval_accumulation_steps` matters. 
The Trainer docs explain that if it is unset, predictions are accumulated on the accelerator before being moved to CPU, which is faster but uses more accelerator memory.\n\nUse:\n\n\n    eval_args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-mps-eval\",\n\n        per_device_eval_batch_size=1,\n\n        predict_with_generate=True,\n        generation_max_length=64,\n        generation_num_beams=1,\n        eval_accumulation_steps=1,\n\n        fp16=False,\n        bf16=False,\n\n        save_strategy=\"no\",\n        push_to_hub=False,\n    )\n\n\nRecommended eval strategy:\n\n\n    1. train with eval disabled\n    2. restart the Python process\n    3. load the trained model\n    4. evaluate on 100 to 500 validation examples\n    5. only then try larger validation runs\n\n\nThis avoids forcing a long training process with already-grown MPS driver memory to run generation-heavy evaluation afterward.\n\n* * *\n\n## Experiment matrix\n\nRun these in order.\n\nExperiment | Padding | Batch | Lengths | Eval? 
| Purpose\n---|---|---|---|---|---\nA | dynamic | 1 | 512/64 | no | Does reduced training still grow memory?\nB | fixed | 1 | 512/64 | no | Does shape stability fix the issue?\nC | fixed | 2 | 512/64 | no | Can you safely increase speed?\nD | fixed | 1 | 768/96 | no | Can you safely increase length?\nE | fixed | 1 | 512/64 | tiny eval | Does generation/eval trigger memory jumps?\nF | fixed | 1 | 512/64 | larger eval | How far can evaluation scale?\n\nStop as soon as `driver_allocated_memory()` shows a steady upward slope.\n\n* * *\n\n## What each result means\n\n### If fixed padding stabilizes memory\n\nThen dynamic shape churn is probably the main trigger.\n\nUse:\n\n  * fixed padding;\n  * shorter max lengths;\n  * batch size 1 or 2;\n  * gradient accumulation;\n  * separate train/eval processes;\n  * no `fp16` until stable.\n\n\n\n### If fixed padding slows but does not stop memory growth\n\nThen shape churn is one contributor, but there is probably broader MPS backend growth.\n\nUse:\n\n  * shorter runs;\n  * restart process between phases;\n  * checkpoint only model weights;\n  * CPU or CUDA/cloud for full training;\n  * track PyTorch MPS issues.\n\n\n\n### If both dynamic and fixed padding leak similarly\n\nThen this is likely a more general MPS backend / seq2seq / Trainer issue.\n\nTry:\n\n  * the official script instead of the notebook;\n  * a no-Trainer loop;\n  * CPU control run;\n  * newer or older PyTorch version as a test;\n  * cloud CUDA if you need the full notebook behavior.\n\n\n\nRelevant scripts:\n\n  * Transformers summarization script\n  * Transformers no-Trainer summarization script\n  * Transformers example scripts guide\n\n\n\n### If CPU is stable but MPS leaks\n\nThen the problem is almost certainly MPS-specific.\n\nA CPU control run:\n\n\n    args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-cpu\",\n        use_cpu=True,\n        per_device_train_batch_size=1,\n        gradient_accumulation_steps=16,\n        
eval_strategy=\"no\",\n        save_strategy=\"no\",\n        push_to_hub=False,\n    )\n\n\nCPU will be slower, but it is useful as a control experiment.\n\n### If memory grows only in the notebook\n\nThen notebook state is contributing.\n\nBefore rerunning:\n\n\n    import gc\n    import torch\n\n    try:\n        del trainer\n    except NameError:\n        pass\n\n    try:\n        del model\n    except NameError:\n        pass\n\n    gc.collect()\n\n    if torch.backends.mps.is_available():\n        torch.mps.empty_cache()\n\n\nBut the stronger fix is to restart the kernel or run the training as a plain script:\n\n\n    python train_summarization_mps.py\n\n\n* * *\n\n## Python 3.14: probably not the main cause, but simplify it\n\nPython 3.14 is not necessarily the root cause. Current PyTorch installation guidance includes modern Python versions on macOS. Still, for debugging I would use Python 3.11 or 3.12 first because they are more commonly exercised across ML packages.\n\nA cleaner environment:\n\n\n    python3.12 -m venv .venv-summarization-mps\n    source .venv-summarization-mps/bin/activate\n\n    python -m pip install -U pip\n    python -m pip install -U torch torchvision torchaudio\n    python -m pip install -U transformers datasets evaluate accelerate rouge-score nltk\n\n\nThen print versions:\n\n\n    import sys\n    import torch\n    import transformers\n    import datasets\n    import accelerate\n\n    print(\"python:\", sys.version)\n    print(\"torch:\", torch.__version__)\n    print(\"transformers:\", transformers.__version__)\n    print(\"datasets:\", datasets.__version__)\n    print(\"accelerate:\", accelerate.__version__)\n    print(\"mps built:\", torch.backends.mps.is_built())\n    print(\"mps available:\", torch.backends.mps.is_available())\n\n\nThis does not prove Python 3.14 is bad. 
It just removes a variable while investigating a likely MPS backend issue.\n\n* * *\n\n## What I would not do\n\n### Do not rely on `torch.mps.empty_cache()`\n\nIt is not a general leak fix.\n\nUse it for cleanup, but do not expect it to solve driver/backend growth.\n\n### Do not set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` as the solution\n\nThat disables the hard limit and can risk system-wide OOM. It may postpone the crash, but it does not fix the underlying memory-growth pattern.\n\n### Do not start with full XSum + full ROUGE evaluation\n\nUse subsets first.\n\n### Do not debug with `fp16=True`\n\nDisable mixed precision first:\n\n\n    fp16=False\n    bf16=False\n\n\nAfter the memory curve is understood, test mixed precision separately.\n\n### Do not assume `t5-small` means the workload is small\n\nParameter count is only one part of memory behavior. Seq2seq summarization with long inputs is memory-heavy even with a small model.\n\n* * *\n\n## Most practical final local-MPS recipe\n\nFor actually finishing a local run on Mac M2, I would use something like this:\n\n\n    max_input_length = 512\n    max_target_length = 64\n\n    args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-mps\",\n\n        learning_rate=2e-5,\n        weight_decay=0.01,\n        num_train_epochs=1,\n\n        per_device_train_batch_size=1,\n        gradient_accumulation_steps=16,\n\n        gradient_checkpointing=True,\n\n        eval_strategy=\"no\",\n\n        save_strategy=\"epoch\",\n        save_only_model=True,\n        push_to_hub=False,\n\n        fp16=False,\n        bf16=False,\n\n        dataloader_num_workers=0,\n        dataloader_pin_memory=False,\n\n        logging_steps=50,\n    )\n\n\nThen restart the Python process and evaluate separately:\n\n\n    eval_args = Seq2SeqTrainingArguments(\n        output_dir=\"t5-small-xsum-mps-eval\",\n\n        per_device_eval_batch_size=1,\n\n        predict_with_generate=True,\n        generation_max_length=64,\n        
generation_num_beams=1,\n        eval_accumulation_steps=1,\n\n        save_strategy=\"no\",\n        push_to_hub=False,\n\n        fp16=False,\n        bf16=False,\n    )\n\n\n* * *\n\n## Best links to read\n\n### Official docs\n\n  * PyTorch MPS package docs\n  * PyTorch torch.mps.current_allocated_memory()\n  * PyTorch torch.mps.driver_allocated_memory()\n  * PyTorch torch.mps.empty_cache()\n  * PyTorch MPS environment variables\n  * Apple: Accelerated PyTorch training on Mac\n  * Hugging Face Trainer docs\n  * Hugging Face summarization task guide\n  * Hugging Face example scripts guide\n\n\n\n### Closest issues / reports\n\n  * PyTorch: MPS memory leak in training with transformers Trainer\n  * PyTorch: MPS memory leak with variable batch size / sequence length\n  * Transformers: Trainer class causes massive memory leak when using MPS\n  * PyTorch forum: MPS backend out of memory on Mac M2\n  * PyTorch: MPS LSTM leak despite empty_cache()\n  * PyTorch: driver_allocated_memory() grows unrestricted\n  * PyTorch: MPS memory leak minimal examples\n\n\n\n### Useful code references\n\n  * Hugging Face summarization notebook\n  * Transformers run_summarization.py\n  * Transformers run_summarization_no_trainer.py\n\n\n\n* * *\n\n## Bottom line\n\nYour issue is most likely:\n\n> MPS backend / Metal-driver memory growth during a long, variable-shape Hugging Face seq2seq training run.\n\nThe original notebook makes that likely because it combines:\n\n  * XSum summarization;\n  * `t5-small`;\n  * dynamic padding;\n  * `max_input_length=1024`;\n  * `max_target_length=128`;\n  * batch size 16;\n  * `fp16=True`;\n  * generation-based evaluation;\n  * checkpointing / Hub-push behavior;\n  * a long run on Apple Silicon MPS.\n\n\n\nThe best fix is not `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`.\n\nThe best practical workaround is:\n\n  * log both `current_allocated_memory()` and `driver_allocated_memory()`;\n  * use batch size 1 or 2;\n  * use gradient accumulation;\n  * reduce 
lengths to 512/64 first;\n  * disable eval/generation during the first training test;\n  * disable `fp16` while debugging;\n  * use gradient checkpointing;\n  * disable save/push during diagnosis;\n  * test fixed padding against dynamic padding;\n  * use a fresh Python process for long runs;\n  * evaluate separately after restarting;\n  * use CPU or CUDA/cloud if the full long MPS run still leaks.\n\n\n\nShort version:\n\n  * `MPS allocated: 4.20 GiB` means live tensor memory is not enormous.\n  * `other allocations: 43.49 GiB` points to backend/driver allocations.\n  * `empty_cache()` is not expected to fix this.\n  * high-watermark `0.0` only removes a safety guardrail.\n  * dynamic-shape seq2seq training on MPS is the main suspect.\n\n",
  "title": "Running out of memory in the summary example"
}