Raw Record Source

{
  "$type": "site.standard.document",
  "canonicalUrl": "https://rednafi.com/python/outage-caused-by-eager-loading-file/",
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibz4kwi3yba6ehh7omf5irvknstbjgvvq6g72d4hmkpe55hg2mehu"
    },
    "mimeType": "image/png",
    "size": 121470
  },
  "description": "Learn from a production outage caused by loading large CSV files into memory. Stream process files to prevent OOM errors and crashes.",
  "path": "/python/outage-caused-by-eager-loading-file/",
  "publishedAt": "2022-10-14T00:00:00.000Z",
  "site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
  "tags": [
    "Python",
    "Incident Post-mortem",
    "Django",
    "Performance"
  ],
  "textContent": "Python makes it freakishly easy to load the whole content of any file into memory and\nprocess it afterward. This is one of the first things that's taught to people who are new to\nthe language. While the following snippet might be frowned upon by many, it's definitely not\nuncommon:\n\nAdopting this pattern as the default way of handling files isn't the most terrible thing in\nthe world for sure. Also, this is often the preferred way of dealing with image files or\nblobs. However, overzealously loading file content is only okay as long as the file size is\nsmaller than the volatile memory of the working system.\n\n_Moreover, you'll need to be extra careful if you're accepting files from users and running\nfurther procedures on the content of those files. Indiscriminantly loading up the full\ncontent into memory can be dangerous as it can cause OOM errors and crash the working\nprocess if the system runs out of memory while processing a large file. This simple overlook\nwas the root cause of a major production incident at my workplace today._\n\nThe affected part of our primary Django monolith asks the users to upload a CSV file to a\npanel, runs some procedures on the content of the file, and displays the transformed rows in\na paginated HTML table. Since the application is primarily used by authenticated users and\nwe knew the expected file size, there wasn't any guardrail that'd prevent someone from\nuploading a humongous file and crashing down the whole system. To make things worse, the\nassociated background function in the Django view was buffering the entire file into memory\nbefore starting to process the rows. Buffering the entire file surely makes the process a\nlittle faster but at the cost of higher memory usage.\n\nAlthough we were using background processes to avoid chugging files in the main server\nprocess, that didn't help when the users suddendly started to send large CSV files in\nparallel. 
The workers were hitting OOM errors and getting restarted by the process manager.\nIn our particular case, we didn't have much reason to buffer the whole file before\nprocessing. Apparently, the naive way scaled up pretty gracefully and we didn't pay much\nattention since no one was uploading files that our server instances couldn't handle. We were\nstoring the incoming file in a models.FileField type attribute of a Django model. When a\nuser uploaded a CSV file, we'd:\n\n- Open the file in binary mode via the open(filepath, \"rb\") callable.\n- Buffer the whole file in memory and transform the binary content into a Unicode string.\n- Pass the stringified file-like object to csv.DictReader to load that as a CSV file.\n- Apply transformations to the rows line by line and render the HTML table.\n\nThis is how the code looks:\n\nThe csv.DictReader callable only accepts a file-like object that's been opened in text\nmode. However, Django's FileField type doesn't make any assumptions about the file\ncontent. It mandates us to open the file in binary mode and then decode it if necessary. So,\nwe open the file in binary mode with model_instance.file.open(mode=\"rb\") which returns an\nio.BufferedReader type file object. This file-like object can't be passed directly to the\ncsv.DictReader because a byte stream doesn't have the concept of EOL and the CSV reader\nneeds that to know where a row ends. As a consequence, the csv.DictReader expects a\nfile-like object opened in text mode where the rows are explicitly delineated by\nplatform-specific EOLs like \\n or \\r\\n.\n\nTo solve this, we load the content of the file in memory with f.read() and decode it by\ncalling .decode() on the result of the preceding operation. Then we create an in-memory\ntext file-like buffer by passing the decoded string to io.StringIO. 
Now the CSV reader can\nconsume this transformed file-like object and build dictionaries of rows off of that.\nUnfortunately, this stringified file buffer stays alive in the memory throughout the entire\nlifetime of the processor function. Imagine hundreds of large CSV files getting thrown at the\nworkers that execute the above code snippet. You see, at this point, overwhelming the\nbackground workers doesn't seem too difficult.\n\nWhen our workers started to degrade in production and the alerts went bonkers, we began\ninvestigating the problem. After pinpointing the issue, we immediately responded to it by\nvertically scaling up the machines. The surface area of this issue was quite large and we\ndidn't want to hotfix it for fear of triggering inadvertent regressions. Once we were out of\nthe woods, we started patching the culprit.\n\nThe solution to this is quite simple - convert the binary file-like object into a text\nfile-like object without buffering everything in memory and then pass the file to the CSV\nreader. We were already processing the CSV rows in a lazy manner and just removing\nf.read() fixed the overzealous buffering issue. The corrected code snippet looks like\nthis:\n\nHere, io.TextIOWrapper wraps the binary file-like object in a way that makes it behave as\nif it were opened in text mode. In fact, when you open a file in text mode, the native\nimplementation of open returns a file-like object wrapped in io.TextIOWrapper. You can\nfind more details about the [implementation of open] in [PEP-3116].\n\nThe csv.DictReader callable can consume this transformed file-like object without any\nfurther modifications. 
Since we aren't calling f.read() anymore, no overzealous content\nbuffering is going on here and we can lazily ask for new rows from the reader object as we\nsequentially process them.\n\nFurther reading\n\n- [How to use python csv.DictReader with a binary file?]\n\n\n\n\n[implementation of open]:\n    https://peps.python.org/pep-3116/#the-open-built-in-function\n\n[pep-3116]:\n    https://peps.python.org/pep-3116/\n\n[how to use python csv.dictreader with a binary file?]:\n    https://stackoverflow.com/questions/51152023/how-to-use-python-csv-dictreader-with-a-binary-file-for-a-babel-custom-extract",
  "title": "Dissecting an outage caused by eager-loading file content"
}