How to run services on Docker Swarm
Apache Spark Cluster on Docker Swarm
This is a follow-up to the tutorial “How to create a docker swarm cluster”. I already introduced you to the Docker Swarm basics and showed you how to set up a swarm. Today it’s all about running services in a swarm. I’ll make you more comfortable with additional Swarm commands, explain what the Docker Hub is and show you how to deploy services to Docker Swarm. The application we deploy to Swarm is Apache Spark, an open-source cluster computing framework from the Apache Software Foundation that is mainly used for data processing.
Spark consists of different components to cover different use cases:
Spark Core – for Big Data Analysis
Spark SQL – for fast, scalable queries on structured data
Spark Streaming – for the processing of data streams
MLlib Machine Learning Library – Machine Learning for Spark Systems
Spark GraphX – Calculations on Graphs
In this tutorial we will mainly deal with the infrastructure part of Docker Swarm. If you want to learn how to do cool data science tasks with Apache Spark, you should have a look at the Spark documentation.
In the near future there will also be an Apache Spark tutorial at gridscale.
Advantages of Spark on Docker Swarm
Running Apache Spark, and applications in general, as Swarm services has several advantages:
Starting services is fast
The services are lightweight and add little overhead to your machines
The services can be scaled flexibly
Via docker pull you can load images for your application without writing complex scripts.
You can generate an overlay network to take advantage of the Docker Network features
Goal of the tutorial
The goal of this tutorial is to deploy Spark in standalone mode to Docker Swarm.
The cluster architecture we will use for this purpose consists of 3 cloud servers: one master and two workers. I put the setup together quickly and with little effort in my gridscale panel.
The replica count is 1 for the master and 2 for the workers. The worker nodes are used to execute the Spark jobs.
In real setups Spark often runs on a cluster manager such as Mesos or Hadoop YARN. For the sake of simplicity we will use Spark in standalone mode, which does not require any additional components and is the easiest to set up. In standalone mode Spark allocates resources in terms of CPU cores and memory. So that our application does not use up all the memory available in the cluster, we limit the resources of the services in our configuration. In this tutorial we will use an Apache Spark image from the Docker Hub.
What is the Docker Hub?
At this point I will give you a few more facts about the Docker Hub.
If you work with Docker Containers and Services, you won’t be able to avoid the Docker Hub. Within the Docker Hub you will find a selection of public images that users or official publishers make available to other users. There is also a non-public area where you can manage your own Docker images.
So the Docker Hub is your first port of call if you want to deploy containers or services. Often you’ll find ready-made and tested images for known applications here. The degree of documentation for the individual images may vary, of course. If you can’t find a finished image that meets your requirements, you can create your own. To deploy, you need to upload it to the Docker Hub and then decide if you want to make it public. This requires registration in the Docker Hub.
For our tutorial we use an existing image that a user has provided for others. You can find the image in the Docker Hub:
Take a look at the link to see how such a Dockerfile is structured.
On all servers that should be part of your Swarm cluster, you need Docker. After you have installed Docker on your cloud servers, you need to initialize your swarm and add the worker nodes.
In these tutorials you will learn everything you need to know to install Docker:
And how to initialize a swarm with Docker is shown here:
Once you’re through with the preparations and your Swarm cluster is set up, we’re ready to go.
I’ve already shown you the Apache Spark Docker image that we’re going to use, so we can start by pulling the image for our cluster. The image ships Spark v2.2.1.
The starting point for the next step is a setup that should look something like this:
Pull Docker Image
Switch to your master host and execute the following command:
docker pull birgerk/apache-spark
You can always find the command to pull a docker image on the respective page under “Docker Pull Command”.
Create Overlay Network
The next step is to create an overlay network for the cluster so that the service containers can communicate directly with each other across hosts at Layer 2 level.
docker network create --driver overlay spark
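If you repeat the tutorial, creating the network a second time will fail because the name is already taken. The following is a minimal sketch of an idempotent variant, assuming you run it on a manager node; the function name ensure_overlay_network is hypothetical and only used for illustration.

```shell
# Create an overlay network only if it does not exist yet.
# Assumption: executed on a Swarm manager node.
# ensure_overlay_network is a hypothetical helper, not a Docker command.
ensure_overlay_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    # The network is already known to this node, nothing to do.
    echo "network $name already exists"
  else
    docker network create --driver overlay "$name"
  fi
}

# Usage on the manager node:
#   ensure_overlay_network spark
#   docker network ls --filter driver=overlay
```

The last usage line lists all overlay networks, so you can confirm that spark shows up next to the default ingress network.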
Next, add labels to your nodes so that you can filter by them later. Give the master node the label type=sparkmaster and your worker nodes the label type=sparkworker.
The syntax is:
docker node update --label-add type=sparkmaster [HOSTNAME of your master]
docker node update --label-add type=sparkworker [HOSTNAME of your worker 1]
docker node update --label-add type=sparkworker [HOSTNAME of your worker 2]
Setting labels is a good habit and makes your swarm easier to manage. With the two different labels you can define placement constraints for services on the swarm.
When defining services, you can choose between an equality constraint, type==sparkworker (is equal), and an inequality constraint, type!=sparkmaster (is not equal):
--constraint "node.labels.type==sparkworker" and --constraint "node.labels.type!=sparkmaster"
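Because a constraint that matches no node leaves a service pending, it can help to check the labels before deploying. The sketch below prints the labels of each node you pass in; it assumes a manager node, and both the helper name show_node_labels and the hostnames in the usage line are placeholders.

```shell
# Print the node labels for every node name given as an argument.
# Assumption: executed on a Swarm manager node.
# show_node_labels is a hypothetical helper, not a Docker command.
show_node_labels() {
  for node in "$@"; do
    printf '%s: ' "$node"
    # .Spec.Labels holds the labels set with "docker node update --label-add".
    docker node inspect --format '{{ .Spec.Labels }}' "$node"
  done
}

# Usage (replace the placeholder hostnames with your own):
#   show_node_labels master-host worker1-host worker2-host
```

For a correctly labeled master you should see something like map[type:sparkmaster] after the hostname.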
Deploy Spark on Docker Swarm
Now you can start creating the Spark Services on Docker Swarm.
The command that initiates a service is:
docker service create
Start with the Master Host first:
docker service create \
  --hostname master \
  --container-label sparkmaster \
  --network spark \
  --constraint node.labels.type==sparkmaster \
  --publish 8080:8080 \
  --publish 7077:7077 \
  --publish 6066:6066 \
  --publish 50070:50070 \
  --replicas 1 \
  --limit-memory 1g \
  --name spark-master \
  birgerk/apache-spark
If you are doing this for the first time, some of these options will probably raise questions, even if you can already guess what many of them do. That’s why we go through the command line by line.
The first option sets the hostname of the container and the second sets a container label. The third option attaches the service to the overlay network you created earlier. The fourth option, --constraint, uses the node labels to make sure that the service is only deployed on nodes labeled type=sparkmaster. Then several ports are published so that Spark can be reached, among them 8080 (Spark Web UI), 7077 (Spark master port) and 6066 (Spark REST server). Next come the replica count of 1 for the master and a memory limit of 1 GB. Note that --limit-memory caps the memory of the whole container, and therefore the memory available to Spark, rather than setting the Spark heap size directly. The --name option sets the service name, and after the options the name of the image is given.
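After creating the service it is worth checking that the task was actually scheduled on the master node and reached the running state. This is a small sketch, assuming the service name spark-master from the command above; check_service is a hypothetical helper name.

```shell
# Show on which node each task of a service runs and its current state.
# Assumption: executed on a Swarm manager node.
# check_service is a hypothetical helper, not a Docker command.
check_service() {
  docker service ps \
    --format '{{ .Name }} on {{ .Node }}: {{ .CurrentState }}' \
    "$1"
}

# Usage:
#   check_service spark-master
```

If the task stays in a pending state, the placement constraint or the memory limit most likely cannot be satisfied by any node.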
Then deploy the Spark service for the worker nodes:
docker service create \
  --constraint node.labels.type==sparkworker \
  --network spark \
  --publish 8081:8081 \
  --replicas 2 \
  --limit-memory 512m \
  --env "SPARK_ROLE=slave" \
  --env "SPARK_MASTER=10.0.0.60" \
  --name spark-worker \
  birgerk/apache-spark
As you can see, this command differs slightly from the previous one. Port 8081 publishes the Spark worker Web UI, the replica count is 2 and the memory limit is set to 512 MB.
The two options with the environment variables are important here. For most images in the Docker Hub you will find additional configuration notes. According to the description of the image we use in this tutorial, its default role is master. It also explains that the environment variable SPARK_ROLE must be set to slave for slaves, i.e. workers.
The second variable, SPARK_MASTER, is needed so that the Spark hosts can communicate with each other. Here you have to specify the IP address that is shown in your master Web UI under REST URL.
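If you prefer the command line over the Web UI, you can also look up the address of the master container on the overlay network. The sketch below assumes that you run it on the node hosting the master task, that the service is named spark-master and that the network is named spark. Whether this address is identical to the one shown under REST URL depends on how the image configures Spark, so treat it as a cross-check. The helper name master_overlay_ip is hypothetical.

```shell
# Print the IP address of the Spark master container on an overlay network.
# Assumptions: executed on the node that runs the spark-master task;
# the network name defaults to "spark" as used in this tutorial.
# master_overlay_ip is a hypothetical helper, not a Docker command.
master_overlay_ip() {
  network="${1:-spark}"
  # Task containers of a service are named <service>.<slot>.<task id>.
  container_id="$(docker ps --quiet --filter "name=spark-master" | head -n 1)"
  if [ -z "$container_id" ]; then
    echo "no spark-master container found on this node" >&2
    return 1
  fi
  docker inspect \
    --format "{{ (index .NetworkSettings.Networks \"$network\").IPAddress }}" \
    "$container_id"
}

# Usage on the master node:
#   master_overlay_ip spark
```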
For more information on all create options, take a look at the help page:
docker service create --help
The finished Swarm
So far so good. If everything worked out, you can now get a list of services on your Master Node with this command:
docker service ls
You can now also see the worker nodes in the Web UI of your Spark master.
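One of the advantages mentioned at the beginning is that services can be scaled flexibly. As a rough sketch, assuming the service name spark-worker from above, you could change the number of workers like this; scale_workers is a hypothetical helper that only validates its argument before calling docker service scale.

```shell
# Scale the Spark worker service to the given number of replicas.
# Assumption: executed on a Swarm manager node, service named "spark-worker".
# scale_workers is a hypothetical helper, not a Docker command.
scale_workers() {
  replicas="$1"
  case "$replicas" in
    ''|*[!0-9]*)
      # Reject empty or non-numeric input before touching the cluster.
      echo "usage: scale_workers <number of replicas>" >&2
      return 2
      ;;
  esac
  docker service scale "spark-worker=$replicas"
}

# Usage:
#   scale_workers 3
#   docker service ls
```

Because of the placement constraint, additional replicas are only scheduled on nodes labeled type=sparkworker, so with two worker nodes a third replica shares a node with an existing one.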
In this tutorial we deployed a real application on Docker Swarm. You practiced how to start a cluster, how to create an overlay network for it and how to define your services correctly. There are still a few steps to a production-ready Spark cluster, e.g. the storage setup is missing, but the beginning is done!
If you want to know more about container management, Kubernetes, Docker and Co., then have a look at the following articles:
- Docker Swarm vs. Kubernetes: Comparison of both container management tools
- Best-practice Kubernetes Cluster with kubeadm
See you soon!