Read a CSV file from s3 without saving it to the disk
I frequently have to write ad-hoc scripts that download a CSV file from AWS S3, do some processing on it, and then create or update objects in the production database using the parsed information from the file. In Python, it's trivial to download any file from s3 via boto3, and then the file can be read with the csv module from the standard library. However, these scripts are usually run from a separate script server and I prefer not to clutter the server's disk with random CSV files. Loading the s3 file directly into memory and reading its contents isn't difficult but the process has some subtleties. I do this often enough to justify documenting the workflow here.
Along with boto3, we can leverage Python's tempfile.NamedTemporaryFile to directly download the contents of the file to a temporary in-memory file. Afterward, we can do the processing, create the objects in the DB, and delete the file once we're done. The NamedTemporaryFile class can be used as a context manager and it'll delete the file automatically when the with block ends.
This is quite straightforward with a simple gotcha. Here's how you'd usually download a file from s3 and save that to a file-like object:
Okay but the doc reminds us about this:
The download_fileobj method accepts a writeable file-like object. The file object must be opened in binary mode, not text mode.
Opening the file in binary mode is an issue. The CSV reader needs the file to be opened in text mode. This is not an issue when you download the file to disk since you can open the file again in text mode to feed it to the CSV reader. However, we're trying to avoid saving the file to disk and opening that again in text mode. So, you can't do this:
The above snippet won't work because:
The file-like object is opened in binary mode but the csv.DictReader expects the file pointer to be opened in text mode. So, it'll raise an error.
Even if you fixed that, the CSV reader wouldn't be able to read anything since the file currently only allows writing in binary mode, not reading.
Even if you fixed the second issue, the content of the CSV file would be empty. That's because after boto3 downloads and saves the file to the file object, it sets the file handle to the end of the file. So loading the content from there would result in an empty file. Here's how I fixed all three of these problems:
You can see that the snippet first opens a temporary file in w+b mode which allows both binary read and write operations. Then it downloads the file from s3 and saves it to the file-like object.
Once the download is finished, the file handle is placed at the bottom of the file. So, we'll need to call f.seek(0) to place the handle at the beginning of the file; otherwise, our read operation will yield no content. Also, since the currently opened file object only allows binary read and write operations, we'll need to convert it to a text file object before passing it to the CSV reader. The io.TextIOWrapper class does exactly that. Once the file object is in text mode, we pass it to the CSV reader and do further processing.
Further reading
Discussion in the ATmosphere