Raw Record Source

{
  "path": "/posts/2015/2015-11-09-pyspark",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "code",
    "spark",
    "python",
    "pyspark"
  ],
  "$type": "site.standard.document",
  "title": "PySpark dependencies",
  "updatedAt": "2015-11-09T07:19:00.000Z",
  "publishedAt": "2015-11-09T07:19:00.000Z",
  "textContent": "Recently, I have been working with the Python API for [Spark][pyspark] to use distrbuted computing techniques to perform analytics at scale. When you write Spark code in Scala or Java, you can bundle your dependencies in the jar file that you submit to Spark. However, when writing Spark code in Python, dependency management becomes more difficult because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed locally.\n\nTypically, Python deals with dependencies using [pip][pip_link] and [virtualenv][virtualenv_link]. However, even if you follow this convention, you will still need to install your Spark code dependencies on each Spark executor machine in the cluster.\n\nA way around this is to bundle the dependencies in a zip file and pass them to Spark when you submit your job using the --py-files flag. The command will look something like this:\n\nBuilding the deps.zip is easiest if you use [virtualenvwrapper][virtualenvwrapper_link]. If you don't have virtualenvwrapper set up already, I like this [guide][venv_guide] to get started. When you install dependencies within a virtualenv via pip, they are placed in the folder $VIRTUAL_ENV'/lib/python2.7/site-packages where $VIRTUAL_ENV is ~/Envs/<env_name>. $VIRTUAL_ENV also becomes available as a bash variable if you are using virtualenvwrapper have used workon <env_name>. To create the deps.zip file, cd into the site-packages folder and run:\n\nThis command will zip all files and folders in site-packages at the top level of deps.zip. This distinction is worth noting because the files and folders must appear at the top level when deps.zip is unzipped. For make sure this works, create file from inside the site-packages folder -- _do not_ zip a _folder_ containing all of the files and folders. If you run the zip command properly, you will see\n\nrather than\n\nNote: OSX obfuscates this distinction when you unzip deps.zip in Finder. 
For the former case, OSX will unzip all of the files and folders to a new folder with the same name as the zip file. For example, if your zip file is named my_deps.zip, OSX will create a folder named my_deps and unzip the contents of my_deps.zip to that folder. For the latter case, also unzipping with Finder, OSX will unzip the contents as they were zipped, yielding a folder named folder_name. The results are similar, but only the former case will work when you are zipping dependencies for Spark. The distinction becomes more obvious if you use zip and unzip, as the former case will extract all files and folders to the current working directory, while the latter case will extract to a folder containing those same files and folders in the current working directory.\n\nYou should now be ready to run PySpark jobs in a \"jarified\" way.\n\nAfternote: I've run into issues getting boto3 to run on a remote Spark cluster using this method.\n\n[pyspark]: http://spark.apache.org/docs/latest/api/python/\n[pip_link]: https://pip.readthedocs.org/en/stable/\n[virtualenv_link]: https://virtualenv.readthedocs.org/en/latest/\n[virtualenvwrapper_link]: https://virtualenvwrapper.readthedocs.org/en/latest/\n[venv_guide]: http://mkelsey.com/2013/04/30/how-i-setup-virtualenv-and-virtualenvwrapper-on-my-mac/",
  "canonicalUrl": "https://www.danielcorin.com/posts/2015/2015-11-09-pyspark"
}