Redowan's Reflections

Disallow large file download from URLs in Python

Redowan Delowar March 23, 2022

I was working on a DRF POST API endpoint where the consumer is expected to submit a URL pointing to a PDF file; the system then downloads the file and saves it to an S3 bucket. While this sounds quite straightforward, there's one big issue. Before I started working on it, the core logic looked like this:
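What follows is a minimal sketch of that kind of logic, not the exact original listing. It assumes the standard library's `urllib`; the function name `download_unbounded` is hypothetical, and the S3 upload step is left out so the example stays self-contained.

```python
import tempfile
from shutil import copyfileobj
from urllib.request import urlopen


def download_unbounded(src_url: str) -> bytes:
    """Download src_url into a temporary file with no size limit.

    Hypothetical helper for illustration only.
    """
    with tempfile.NamedTemporaryFile() as file:
        with urlopen(src_url) as response:
            # Copies until the server stops sending; nothing caps the total size.
            copyfileobj(response, file)
        file.seek(0)
        # An upload to S3 would happen here (assumption). Returning the bytes
        # instead keeps the sketch runnable without any cloud credentials.
        return file.read()
```

Nothing in this function ever asks how much data has arrived so far, which is exactly the gap described next.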

In the above snippet, there's no guardrail against how large the target file can be. You could bring the entire server to its knees by posting a link to a ginormous file. The server would stay busy downloading the file and keep consuming resources the whole time.

I didn't want to use urllib at all for this purpose and went for HTTPx. It exposes a neat API for streaming file downloads. Also, I didn't want to rely on the Content-Length header to assess the file size, since the file server can choose not to send that header. I was looking for something more dependable than that. Here's how I solved it:
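The listing below is a minimal sketch of this approach, not a definitive implementation. It assumes HTTPx's `httpx.stream` and `Response.iter_bytes` for the streaming download. The helper names `write_capped` and `download_capped` are hypothetical, as is the choice to raise `ValueError`; the S3 upload is again left out.

```python
import tempfile
from typing import BinaryIO, Iterable


def write_capped(chunks: Iterable[bytes], file: BinaryIO, max_size: int) -> int:
    """Write chunks to file, raising ValueError once the total exceeds max_size."""
    downloaded_content_length = 0
    for chunk in chunks:
        downloaded_content_length += len(chunk)
        if downloaded_content_length > max_size:
            # Stop immediately; the rest of the response is never read.
            raise ValueError(f"File is larger than the {max_size}-byte limit")
        file.write(chunk)
    return downloaded_content_length


def download_capped(
    src_url: str,
    chunk_size: int = 1024 * 1024,      # read roughly 1 MB at a time
    max_size: int = 10 * 1024 * 1024,   # reject anything above 10 MB
) -> int:
    """Stream src_url into a temporary file and return the number of bytes saved."""
    # Third-party dependency, imported here so write_capped stays usable
    # (and testable) without HTTPx installed.
    import httpx

    with tempfile.NamedTemporaryFile() as file:
        with httpx.stream("GET", src_url) as response:
            response.raise_for_status()
            size = write_capped(response.iter_bytes(chunk_size), file, max_size)
        # An upload to S3 would happen here, while the file is still open (assumption).
        return size
```

Keeping the counting loop in its own function means the size check can be exercised with any iterable of byte chunks, without making a network request.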

The chunk_size parameter sets the size of each chunk read from the response. This means the entire file is never loaded into memory at once while being downloaded. The max_size parameter defines the maximum file size that'll be allowed. In this example, we keep a running total of the bytes downloaded so far in the downloaded_content_length variable and raise an error as soon as it exceeds 10 MB. Sweet!
