user4601931

PySpark virtual environment archive on S3

I'm trying to deploy PySpark applications, each with its own set of third-party dependencies, to an EMR cluster, and I am following this blog post, which describes a few approaches to packaging a virtual environment and distributing it across the cluster.

So, I've made a virtual environment with virtualenv, used venv-pack to create a tarball of the virtual environment, and I'm trying to pass that as an --archives argument to spark-submit:

spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.pyspark.python=./venv/bin/python \
    --archives s3://path/to/my/venv.tar.gz#venv \
    s3://path/to/my/main.py

This fails with Cannot run program "./venv/bin/python": error=2, No such file or directory. Without the spark.pyspark.python option, the job fails with import errors, so my question is mainly about the syntax of this command when the archive is a remote object.
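
For reference, the archive itself was built roughly like this (the extra dependency shown is just an example; the real set varies per application):

virtualenv venv
source venv/bin/activate
pip install venv-pack requests        # venv-pack plus whatever the application needs
venv-pack -o venv.tar.gz              # tarball of the whole environment
aws s3 cp venv.tar.gz s3://path/to/my/venv.tar.gz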

I can run a job that has no extra dependencies and whose main script lives in S3, remote from the cluster, so I know that at least something on S3 can be referenced (much like a Spark application JAR, which I'm much more familiar with). The problem is the virtual environment. I've found plenty of literature on this, but in all of it the virtual environment archive already sits on the cluster itself. For many reasons, I would like to avoid having to copy virtual environments to the cluster every time a developer writes a new application.
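
For the record, the kind of job that does work looks roughly like this (the script name is a placeholder; it only imports pyspark and the standard library):

spark-submit \
    --deploy-mode cluster \
    --master yarn \
    s3://path/to/my/simple_main.py    # no --archives or spark.pyspark.python needed here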

Can I reference a remote archive? If so, what is the syntax for this and what other configuration options might I need?

I don't think it should matter, but just in case: I'm submitting this job remotely through a Livy client (the Livy equivalent of the spark-submit command above).
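
For completeness, the Livy submission amounts to roughly this REST call (host and port are placeholders):

# same settings as the spark-submit above, expressed as a Livy batch request
curl -X POST http://livy-host:8998/batches \
    -H 'Content-Type: application/json' \
    -d '{
          "file": "s3://path/to/my/main.py",
          "archives": ["s3://path/to/my/venv.tar.gz#venv"],
          "conf": {"spark.pyspark.python": "./venv/bin/python"}
        }'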

apache-spark

pyspark

amazon-emr

livy
