{
"$type": "site.standard.document",
"canonicalUrl": "https://rednafi.com/python/read-s3-file-in-memory/",
"description": "Download and process S3 CSV files in memory using boto3 and tempfile.NamedTemporaryFile without cluttering disk with temporary files.",
"path": "/python/read-s3-file-in-memory/",
"publishedAt": "2022-06-26T00:00:00.000Z",
"site": "at://did:plc:fgtm2c26vfcj74rfmeggbyqj/site.standard.publication/3mnl6f7ob462z",
"tags": [
"Python",
"AWS"
],
"textContent": "I frequently have to write ad-hoc scripts that download a CSV file from [AWS S3], do some\nprocessing on it, and then create or update objects in the production database using the\nparsed information from the file. In Python, it's trivial to download any file from S3 via\n[boto3], and then the file can be read with the csv module from the standard library.\nHowever, these scripts are usually run from a separate script server and I prefer not to\nclutter the server's disk with random CSV files. Loading the S3 file directly into memory\nand reading its contents isn't difficult, but the process has some subtleties. I do this\noften enough to justify documenting the workflow here.\n\nAlong with boto3, we can leverage Python's\n[tempfile.NamedTemporaryFile][NamedTemporaryFile] to download the contents of the file\ndirectly to a temporary file. Afterward, we can do the processing, create the objects in\nthe DB, and delete the file once we're done. The NamedTemporaryFile class can be used as a\ncontext manager, and it'll delete the file automatically when the with block ends.\n\nThis is quite straightforward, with one gotcha. Here's how you'd usually download a file\nfrom S3 and save it to a file-like object:\n\nOkay, but the docs remind us about this:\n\n> The download_fileobj method accepts a writeable file-like object. The file object must\n> be opened in binary mode, not text mode.\n\nOpening the file in binary mode is an issue. The CSV reader needs the file to be opened in\ntext mode. This is not an issue when you download the file to disk since you can open the\nfile again in text mode to feed it to the CSV reader. However, we're trying to avoid saving\nthe file to disk and opening it again in text mode. So, you can't do this:\n\nThe above snippet won't work because:\n\n- The file-like object is opened in binary mode but csv.DictReader expects the file\n pointer to be opened in text mode. So, it'll raise an error.\n\n- Even if you fixed that, the CSV reader wouldn't be able to read anything since the file\n currently only allows writing in binary mode, not reading.\n\n- Even if you fixed the second issue, the CSV reader would get no content. That's because\n after boto3 downloads and saves the file to the file object, it leaves the file position\n at the end of the file. So reading from there would yield nothing.\n\nHere's how I fixed all three of these problems:\n\nYou can see that the snippet first opens a temporary file in w+b mode which allows both\nbinary read and write operations. Then it downloads the file from S3 and saves it to the\nfile-like object.\n\nOnce the download is finished, the file position is at the end of the file. So, we'll need\nto call f.seek(0) to move it back to the beginning of the file; otherwise, our read\noperation will yield no content. Also, since the currently opened file object only allows\nbinary read and write operations, we'll need to wrap it in a text file object before\npassing it to the CSV reader. The io.TextIOWrapper class does exactly that. Once the file\nobject is in text mode, we pass it to the CSV reader and do further processing.\n\nFurther reading\n\n- [How to use Python csv.DictReader with a binary file?]\n\n[aws s3]:\n https://aws.amazon.com/s3/\n\n[boto3]:\n https://boto3.amazonaws.com/v1/documentation/api/latest/index.html\n\n[namedtemporaryfile]:\n https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile\n\n[how to use python csv.dictreader with a binary file?]:\n https://stackoverflow.com/questions/51152023/how-to-use-python-csv-dictreader-with-a-binary-file-for-a-babel-custom-extract",
"title": "Read a CSV file from s3 without saving it to the disk"
}