Raw Record Source

{
  "path": "/the-pile",
  "site": "at://did:plc:gfrmhdmjvxn2sjedzboeudef/site.standard.publication/3md7ylshxzk2y",
  "$type": "site.standard.document",
  "title": "Getting the Pile",
  "content": {
    "$type": "site.standard.document#markdown",
    "value": "I've been interested in various NLP stuff lately, as one might imagine with all the ChatGPT stuff going on. Something I've become interested in is methods for analyzing large amounts of text. I've been looking at the [Pile](https://pile.eleuther.ai/) dataset, which is a commonly used dataset in NLP. I believe ChatGPT has been trained on it, as have many other large [foundation models](https://en.wikipedia.org/wiki/Foundation_models). \n\nI'm trying to download it to tinker with [discrete normalizing flows](https://arxiv.org/abs/1905.10347) for token prediction. It's a big dataset -- about 825GB uncompressed. Being a hardo, I wrote my own little cloning script to pull in all the new data. It's not very efficient, but it works. I'll probably write a better one later. \n\nIf you want to use this code, make sure to change the `data_dir` variable to wherever you want to store the data.\n\n```julia\nusing HTTP\nusing ProgressMeter\nimport SHA\nimport Downloads\n\n# Pile root directory\npile_root = \"https://the-eye.eu/public/AI/pile/\"\ndata_dir = \"/data/the-pile/mirror/\"\n\n# Links path\nlinks_path = joinpath(data_dir, \"links.txt\")\n\nfunction update_progress(meter, total, now)\n    meter.n = total\n    if now == total\n        # println(\"Done!\")\n    else\n        update!(meter, now)\n    end\nend\n\n\"\"\"\nExtract the links from a url and return them as a vector. 
Remove any links that include\n\"..\" in the path.\n\"\"\"\nfunction extract_links(url)\n    # Fetch the directory listing at `url`\n    response = HTTP.get(url)\n    body = String(response.body)\n\n    # Extract hrefs from html\n    hrefs = eachmatch(r\"(?<=href=\\\")[^\\\"]+\", body)\n    links = map(x -> url * x.match, hrefs)\n    filter!(x -> basename(dirname(x)) != \"..\", links)\n\n    # # Write links to file\n    # open(links_path, \"w\") do f\n    #     for link in links\n    #         println(f, link)\n    #     end\n    # end\n\n    # Find all the links that are directories\n    dirs = filter(x -> endswith(x, \"/\"), links)\n\n    # Call extract_links on each directory and concatenate the results\n    for dir in dirs\n        links = vcat(links, extract_links(dir))\n    end\n\n    # Remove duplicates\n    return unique(links)\nend\n\n# Make sure the data directory exists before writing anything into it\nmkpath(data_dir)\n\nif !isfile(links_path)\n    # Crawl the pile root directory for links\n    links = extract_links(pile_root)\n\n    # Write links to file\n    open(links_path, \"w\") do f\n        for link in links\n            println(f, link)\n        end\n    end\nend\n\n# Find the link that contains SHA, using the cached links file so this\n# also works on later runs when the crawl above is skipped\nall_links = readlines(links_path)\nsha_link = all_links[findfirst(x -> occursin(\"SHA\", x), all_links)]\n\n# Download the SHA file if it doesn't exist\nddir = joinpath(data_dir, basename(sha_link))\nif !isfile(ddir)\n    Downloads.download(sha_link, ddir)\nend\n\n# Read the SHA file\nsha = open(ddir) do f\n    read(f, String)\nend\n\n# Split the SHA file into lines\nlines = split(sha, \"\\n\")\n\n# Split each line into SHA and file name\nlines = map(x -> split(x, \" \"), lines)\n\n# Filter out empty lines\nlines = filter(x -> length(x) > 0, [filter(x -> length(x) > 0, line) for line in lines])\n\n# Separate into filename and sha\nfilenames = [joinpath(line[2]) for line in lines]\nshas = [line[1] for line in lines]\n\n# Create a dictionary of filenames and shas\nsha_dict = Dict(zip(filenames, 
shas))\n\n# Open links file\nlinks = open(links_path) do f\n    readlines(f)\nend\n\n# Filter out links ending in /\nfilter!(x -> !endswith(x, \"/\"), links)\n\n# For each link, check if it's been downloaded\nfor link in links\n    # Get the filename\n    file_relative = replace(link, pile_root => \"\")\n\n    # Check if the file exists\n    file = joinpath(data_dir, file_relative)\n\n    # Determine whether to re-download the file\n    download_file = if isfile(file)\n        # An empty file means a previous download never got going\n        file_size = filesize(file)\n\n        if file_size == 0\n            true\n        else\n            # Hex-encode the digest so it can be compared with the\n            # text checksums read from the SHA file\n            sha_local = open(file) do f\n                bytes2hex(SHA.sha2_256(f))\n            end\n\n            if haskey(sha_dict, \"./\" * file_relative)\n                sha_local != sha_dict[\"./\" * file_relative]\n            else\n                true\n            end\n        end\n    else\n        true\n    end\n\n    # Download the file if necessary\n    if download_file\n        # Create the directory (and any missing parents) if it doesn't exist\n        mkpath(dirname(file))\n\n        # Make the meter\n        p = ProgressMeter.Progress(1; desc=file_relative, dt=1)\n        update_fun(total, now) = update_progress(p, total, now)\n\n        # Download the file\n        println(\"Downloading $file\")\n        Downloads.download(link, file, progress=update_fun)\n    end\nend\n```"
  },
  "publishedAt": "2023-06-29T07:00:00.000Z",
  "textContent": "I've been interested in various NLP stuff lately, as one might imagine with all the ChatGPT stuff going on. Something I've become interested in is methods for analyzing large amounts of text. I've been looking at the Pile dataset, which is a commonly used dataset in NLP. I believe ChatGPT has been trained on it, as have many other large foundation models. \n\nI'm trying to download it to tinker with discrete normalizing flows for token prediction. It's a big dataset -- about 825GB uncompressed. Being a hardo, I wrote my own little cloning script to pull in all the new data. It's not very efficient, but it works. I'll probably write a better one later. \n\nIf you want to use this code, make sure to change the data_dir variable to wherever you want to store the data."
}