
How to Run Apache Spark from a Container

Apache Spark is a large-scale data analytics engine that can use distributed computing resources. It supports common data science languages such as Python and R; its Python support is provided through the PySpark package. Some advantages of using PySpark over plain Python (NumPy and pandas) are:

  • Speed. Spark distributes work across multiple machines, so you don’t have to write your own parallel computing code. Speedups of up to 100x are commonly claimed.
  • Scale. You can develop code on a laptop and deploy it on a cluster to process data at scale.
  • Robustness. A job won’t crash if some nodes are taken offline during execution.

Other features, such as Spark SQL, Spark ML, and support for streaming data sources, bring additional advantages.
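As a rough illustration of what the DataFrame API and Spark SQL look like in PySpark, here is a minimal sketch (the application name, sample data, and column names are made up for the example):

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# Build a small DataFrame and query it with Spark SQL
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()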

After a quick tryout of the Spark container image from Bitnami, I moved on to another image released by the Jupyter Docker Stacks project, which has good documentation. To run the container, expose the Jupyter notebook, and share the current host directory with the container, use this command:

docker run -d --name spark -p 80:8888 -p 4040:4040 -p 4041:4041 -v ${PWD}:/home/jovyan jupyter/all-spark-notebook
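Inside a notebook served by this image, a local Spark session can be created with something like the following sketch (the master URL and application name are assumptions, not fixed by the image):

from pyspark.sql import SparkSession

# Run Spark locally inside the container, using all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("notebook-session")
    .getOrCreate()
)

print(spark.version)

Once a session is running, the Spark web UI is served on port 4040 (and 4041 for a second session), which is why those ports are forwarded in the command above.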

If you need to install additional packages on top of the provided image, you can either install them inside the running container (“docker exec -it spark /bin/bash”) or modify the original docker-compose.yml file.
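Another option, if you only need a Python package for the current session, is the %pip magic from inside a notebook cell; this is a sketch, and the package name below is just a placeholder:

# Install a Python package into the container's environment from a notebook cell
# (the package is only an example; the install is lost when the container is removed)
%pip install plotly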
