Cameron

Getting the Pile

Cameron June 29, 2023

I've been interested in various NLP stuff lately, as one might imagine with all the ChatGPT stuff going on. One thing I've gotten curious about is methods for analyzing large amounts of text. I've been looking at the Pile, which is a commonly used dataset in NLP. I believe ChatGPT has been trained on it, as have many other large foundation models. I'm trying to download it to tinker with discrete normalizing flows for token prediction. It's a big dataset, about 825GB uncompressed. Being a hardo, I wrote my own little cloning script to pull in all the data. It's not very efficient, but it works. I'll probably write a better one later. If you want to use this code, make sure to change the variable to wherever you want to store the data.
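For a rough idea of what a cloning script along these lines could look like, here is a minimal sketch in Python. It is not the script described above, just an illustration under some loud assumptions: `BASE_URL` is a placeholder rather than a real mirror, the shard naming (`00.jsonl.zst`, `01.jsonl.zst`, ...) and the count of 30 are assumed, and `DATA_DIR`, `shard_urls`, `fetch`, and `clone` are hypothetical names made up for this example.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: change this to wherever you want to store the data.
DATA_DIR = Path("pile_data")

# Assumption: placeholder URL, not a real mirror. Point it at whatever host you use.
BASE_URL = "https://example.com/pile/train"


def shard_urls(base_url, n_shards=30):
    """Build URLs for shards named 00.jsonl.zst, 01.jsonl.zst, ... (assumed layout)."""
    return [f"{base_url.rstrip('/')}/{i:02d}.jsonl.zst" for i in range(n_shards)]


def fetch(url, dest):
    """Stream one URL to disk, writing to a .part file first.

    The rename at the end means a file with the final name is always complete,
    so an interrupted run never leaves a truncated shard that looks finished.
    """
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)


def clone(urls, data_dir, fetch_fn=fetch):
    """Download every URL into data_dir, skipping files that already exist."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for url in urls:
        dest = data_dir / url.rsplit("/", 1)[-1]
        if dest.exists():
            continue  # already fetched on an earlier run
        fetch_fn(url, dest)
        downloaded.append(dest)
    return downloaded


# Usage (commented out so nothing hits the network by accident):
# clone(shard_urls(BASE_URL), DATA_DIR)
```

Skipping files that already exist makes the script safe to rerun after a dropped connection, which matters at this size. It downloads one shard at a time, so like the original it is not very efficient; parallel downloads would be the obvious next step.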
