Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigiovzixqvjqudeoseag2diyyzp4b3b32gf2zwd2esynu24llbnia",
    "uri": "at://did:plc:i3hyx5sw7cz7ofijrwp4tqua/app.bsky.feed.post/3mksf6eepwdh2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibqmlnlvf77ilwsvulprhn4pwg2m2b3s3fiukekbiwupo3vel4p5i"
    },
    "mimeType": "image/png",
    "size": 541803
  },
  "description": "copy fail in kubernetes: when your pod escapes to the host with four bytes\n\nif you thought containers were a security boundary, cve-2026-31431 (\"copy fail\") has some unfortunate news for you. discovered by xint, this linux kernel vulnerability lets an unprivileged local user overwrite four controlled bytes in the page cache of any readable file—and yes, that includes binaries inside your containers. worse: because the page cache is a host-wide resource, corruption in one container can silently p",
  "path": "/en/how-the-linux-kernel-copyfail-vulnerability-impacts-kubernetes-what-you-need-to-know-and-what-you-can-do/",
  "publishedAt": "2026-05-01T14:56:12.000Z",
  "site": "https://www.sredevops.org",
  "tags": [
    "github: Percivalll/Copy-Fail-CVE-2026-31431-Kubernetes-PoC",
    "a664bf3d603d",
    "wiz.io: copy fail vulnerability advisory",
    "xint technical writeup: copy fail",
    "kubernetes poc: cve-2026-31431 container escape",
    "upstream kernel fix: commit a664bf3d603d",
    "kubernetes pod security admission docs",
    "seccomp profiles for kubernetes"
  ],
  "textContent": "## copy fail in kubernetes: when your pod escapes to the host with four bytes\n\nif you thought containers were a security boundary, cve-2026-31431 (\"copy fail\") has some unfortunate news for you. discovered by xint, this linux kernel vulnerability lets an unprivileged local user overwrite four controlled bytes in the page cache of _any readable file_ —and yes, that includes binaries inside your containers. worse: because the page cache is a host-wide resource, corruption in one container can silently propagate to another. the result? a fully unprivileged pod can achieve node-level code execution.\n\nthis isn't a theoretical \"what if.\" a public 732-byte python proof-of-concept demonstrates container escape on every major kubernetes distribution by exploiting shared image layers between an attacker-controlled pod and a privileged daemonset like `kube-proxy`. if your cluster runs linux kernels built between 2017 and april 2026, you should probably stop reading and start patching.\n\n## the container escape primitive: shared page cache, shared fate\n\nthe core vulnerability lives in the kernel's `algif_aead` subsystem, where improper handling of scatter-gather lists during in-place aead decryption allows a controlled 4-byte write into the page cache. the exploit chain is elegantly brutal:\n\n\n    # simplified exploit flow (full PoC: https://github.com/Percivalll/Copy-Fail-CVE-2026-31431-Kubernetes-PoC)\n    import os, socket\n\n    # 1. open AF_ALG socket to vulnerable crypto template\n    s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET)\n    s.bind((\"aead\", \"authencesn(hmac(sha256),cbc(aes))\"))\n    # ... set key, accept request socket ...\n\n    # 2. splice target file (e.g., /usr/sbin/ipset) into crypto operation\n    os.splice(target_fd, pipe_wr, offset=chosen_offset)\n    os.splice(pipe_rd, alg_fd, length=auth_tag_size)\n\n    # 3. 
trigger decrypt → kernel writes 4 controlled bytes into page cache\n    req_socket.recv(1)  # hmac fails, but corruption persists\n\n\nthe magic, and the danger, lies in how linux manages file i/o. when a container reads a file from a shared image layer, the kernel serves it from the _same physical page cache pages_ across all containers on that node. this is a performance optimization, not a bug. but when combined with copy fail, it becomes an escape hatch.\n\n### why overlay filesystems make this worse\n\ncontainer runtimes like `containerd` and `cri-o` use overlayfs to implement copy-on-write semantics. when multiple pods reference the same image layer:\n\n\n    host page cache\n    ├── lowerdir: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<layer>/usr/sbin/ipset\n    ├── upperdir: (container-specific, empty for read-only files)\n    └── merged view: served from shared page cache pages\n\n\nif an unprivileged pod corrupts `/usr/sbin/ipset` in the page cache, _every_ pod on that node that reads the same file from the same layer sees the corrupted in-memory version, without any cross-container communication, without touching disk, and without triggering traditional file integrity monitors.\n\n## the kube-proxy attack vector: a privileged daemonset waiting to happen\n\nthe public kubernetes poc targets `/usr/sbin/ipset`, a binary used by `kube-proxy` to manage iptables/ipset rules. here's why this is a perfect storm:\n\ncharacteristic | why it matters\n---|---\n`kube-proxy` runs as a privileged daemonset | executes with `hostNetwork: true`, full capabilities, and root uid\n`ipset` is invoked periodically | corrupted binary gets executed automatically, no user interaction needed\nimage layer is shared with other pods on the node | any pod built on the same base image (`registry.k8s.io/kube-proxy:v1.35.2`) maps the same page cache pages as `kube-proxy`\nbinary is readable by unprivileged users | satisfies the \"any readable file\" prerequisite for copy fail\n\nthe attack sequence:\n\n  1. 
attacker deploys an unprivileged pod with the poc script (no special capabilities required)\n  2. poc corrupts the page cache for `/usr/sbin/ipset` in the shared image layer\n  3. `kube-proxy` on the same node executes the corrupted binary during its next reconciliation loop\n  4. attacker-controlled shellcode runs with kube-proxy's privileges: root on the node, access to host namespaces, and full cluster control via the node's service account\n\n\n\nthis isn't a \"maybe.\" the poc has been tested and confirmed working on ubuntu, amazon linux, rhel, and suse kernels spanning versions 6.12 through 6.18.\n\n## kubernetes-specific mitigations: patch first, architect second\n\n### immediate actions (today)\n\n**disable the vulnerable kernel module** (temporary)\n\n\n    # node-level mitigation via DaemonSet\n    apiVersion: apps/v1\n    kind: DaemonSet\n    metadata: { name: disable-algif-aead }\n    spec:\n      # apps/v1 DaemonSets require a selector that matches the pod labels\n      selector:\n        matchLabels: { app: disable-algif-aead }\n      template:\n        metadata:\n          labels: { app: disable-algif-aead }\n        spec:\n          hostPID: true\n          containers:\n          - name: mitigator\n            image: alpine:latest\n            # rmmod needs CAP_SYS_MODULE, which unprivileged containers lack\n            securityContext: { privileged: true }\n            command: [\"/bin/sh\", \"-c\"]\n            args:\n            - |\n              echo \"install algif_aead /bin/false\" > /host/etc/modprobe.d/disable-algif.conf\n              chroot /host rmmod algif_aead 2>/dev/null || true\n              # stay alive so the DaemonSet does not restart the container in a loop\n              while true; do sleep 3600; done\n            volumeMounts:\n            - name: host-root\n              mountPath: /host\n          volumes:\n          - name: host-root\n            hostPath: { path: /, type: Directory }\n\n\n**block `af_alg` at the runtime level**\nuse seccomp profiles to prevent `af_alg` socket creation in untrusted pods:\n\n\n    # pod securityContext with seccomp\n    securityContext:\n      seccompProfile:\n        type: Localhost\n        localhostProfile: profiles/block-af-alg.json\n\n\nthe referenced profile, `profiles/block-af-alg.json`, must be plain json without comments (`AF_ALG` is address family 38):\n\n\n    {\n      \"defaultAction\": \"SCMP_ACT_ALLOW\",\n      \"syscalls\": [{\n        \"names\": [\"socket\"],\n        \"action\": \"SCMP_ACT_ERRNO\",\n        \"args\": [{\"index\": 0, \"value\": 38, \"op\": \"SCMP_CMP_EQ\"}]\n      }]\n    }\n\n\n**patch your nodes**\napply a kernel containing the upstream fix (commit a664bf3d603d). for managed kubernetes services:\n\n\n    # EKS: roll the node group onto a new launch template version\n    # ($TEMPLATE_NAME and $NEW_VERSION are placeholders for your own values)\n    aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-ng \\\n      --launch-template name=$TEMPLATE_NAME,version=$NEW_VERSION\n\n    # GKE: enable auto-upgrade or manually upgrade nodes\n    gcloud container clusters upgrade my-cluster --node-pool default-pool \\\n      --cluster-version=1.35.2-gke.100\n\n\n### architectural hardening (this quarter)\n\n  * **isolate image layers for privileged workloads**\nuse distinct base images for daemonsets like `kube-proxy` that aren't shared with user workloads. this breaks the page-cache propagation path.\n  * **adopt pod security admission (psa) or gatekeeper policies**\nenforce that pods cannot request `hostPath` volumes, `privileged` mode, or `af_alg`-capable seccomp exemptions.\n\n\n\n**restrict pod placement with node affinity**\nprevent untrusted workloads from scheduling on nodes running privileged daemonsets with shared base images:\n\n\n    affinity:\n      nodeAffinity:\n        requiredDuringSchedulingIgnoredDuringExecution:\n          nodeSelectorTerms:\n          - matchExpressions:\n            - key: node-role.kubernetes.io/control-plane\n              operator: DoesNotExist\n            - key: workload-trust-level\n              operator: In\n              values: [\"untrusted\"]\n\n\n**enforce read-only root filesystems**\nwhile copy fail bypasses on-disk checks, a read-only rootfs limits post-exploitation persistence options:\n\n\n    securityContext:\n      readOnlyRootFilesystem: true\n      allowPrivilegeEscalation: false\n\n\n## detection 
strategies for kubernetes environments\n\ncopy fail is stealthy by design: the corrupted page is never marked dirty, so on-disk checksums remain valid. detection requires behavioral signals:\n\n  1. **monitor for anomalous `kube-proxy` behavior**\ncorrupted `ipset` execution may cause:\n     * unexpected iptables rule modifications\n     * `kube-proxy` crash loops with unusual stack traces\n     * auth.log entries with missing invoking usernames (see original advisory)\n  2. **watch for poc network artifacts**\nnon-stealthy attackers may fetch exploit code from `https://copy.fail/exp`. alert on egress to this domain from cluster pods.\n\n\n\n**correlate pod scheduling with kernel version**\nlist nodes whose kernel version falls in a suspect range, then review the unprivileged pods scheduled on them:\n\n\n    # quick cluster audit; the pattern is illustrative and matches only kernels 6.0 through 6.17.\n    # adjust it to the fixed versions your distribution publishes (older 4.x/5.x kernels are not matched).\n    kubectl get nodes -o json | jq -r '.items[] |\n      select(.status.nodeInfo.kernelVersion | test(\"^6\\\\.(1[0-7]|[0-9])\\\\.\")) |\n      .metadata.name'\n\n\n**audit `af_alg` socket creation**\nuse auditd or ebpf-based tracing to alert on unexpected `socket(AF_ALG, ...)` calls from containerized processes:\n\n\n    # ebpf trace example (bpftrace)\n    tracepoint:syscalls:sys_enter_socket /args->family == 38/ {\n      printf(\"AF_ALG socket from pid %d (%s)\\n\", pid, comm);\n    }\n\n\n## the uncomfortable truth about container \"isolation\"\n\ncopy fail exposes a fundamental tension in container security: performance optimizations (shared page cache, overlayfs) directly conflict with isolation guarantees. the linux kernel was never designed with multi-tenant container workloads as a primary threat model, and it shows.\n\nthis isn't a call to abandon containers. it's a reminder that \"isolation\" is a spectrum, not a binary. 
defense-in-depth means:\n\n  * assuming local privesc vulnerabilities will exist\n  * minimizing the blast radius when they do\n  * treating kernel patch latency as a first-order risk metric\n\n\n\nbecause when four bytes can buy you the entire node, your pod security policy just became a suggestion.\n\n* * *\n\n## references\n\n  * wiz.io: copy fail vulnerability advisory\n  * xint technical writeup: copy fail\n  * kubernetes poc: cve-2026-31431 container escape\n  * upstream kernel fix: commit a664bf3d603d\n  * kubernetes pod security admission docs\n  * seccomp profiles for kubernetes\n\n\n\n_source: adapted from wiz.io blog post by amitai cohen, merav bar, and shahar dorfman (may 1, 2026) and xint code research (april 29, 2026)_",
  "title": "How the Linux kernel copyfail vulnerability impacts kubernetes: What you need to know and what you can do",
  "updatedAt": "2026-05-01T14:56:13.944Z"
}