### 1. Spark init order
For simplicity, we set up two roles:
- Spark Admin: maintains the whole Spark cluster
- Spark User: submits applications, e.g. KMeans, PageRank
First, the Admin launches the Spark Master and Workers via scripts (`SPARK_HOME/sbin/start-all.sh`, `start-master.sh`, `start-workers.sh`, etc.). The initialization logs are recorded in the directory `SPARK_HOME/logs/`. Additionally, when a user submits applications to the Spark cluster, logs from the master/worker processes are still appended to the log files in that directory. For more information, refer to the official Spark standalone documentation.

Besides, there are various options for initializing a Spark cluster, e.g. deploying on YARN, on Mesos, or in Standalone mode. In general, this deployment configuration is set up through system environment variables (e.g. in `conf/spark-env.sh`).
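The launch steps above can be sketched as the following shell commands. This is a minimal sketch assuming a standard Spark installation layout with `SPARK_HOME` set; the scripts live under `sbin/` in recent Spark releases.

```shell
# Start the Master first (listens on port 7077 by default).
$SPARK_HOME/sbin/start-master.sh

# Then start the Workers listed in conf/workers (conf/slaves in older releases).
$SPARK_HOME/sbin/start-workers.sh

# Or launch Master and Workers together:
$SPARK_HOME/sbin/start-all.sh

# Initialization logs (one file per master/worker daemon) land here:
ls $SPARK_HOME/logs/
```

The same log files keep growing after startup, since master/worker output from later application submissions is appended to them.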
### 2. Run an application
- As far as I know, there are two approaches to submitting an application:
  - `sbt run` (application configurations, including CPU cores, memory, the Master URL, etc., are set up in the application source code)
  - the `spark-submit` script; we mainly focus on this method in the following paragraphs.
When using `spark-submit` to submit applications, the `--deploy-mode` parameter sets `deployMode` in `SparkSubmit.scala`.

In `SparkSubmit.scala`:

```scala
// deployMode defaults to "client" when --deploy-mode is not given
val deployOnCluster = Option(args.deployMode).getOrElse("client") == "cluster"
```
Thus, in Standalone mode, when `deployMode` is `client`, the `spark-submit` process itself runs as the Driver.

If `deployMode` is `cluster`, for Standalone mode, `spark-submit` launches `org.apache.spark.deploy.Client`. In `Client`, a request to deploy the Driver is sent to the Master actor. The Master then calls the `schedule()` function to assign the Driver to a random Worker node.
Essentially, the Driver is the process running the application's `main` method, which expresses the main logic of the application.
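The two deploy modes above can be compared with a pair of `spark-submit` invocations. This is a hedged sketch: the master URL, application jar, and main class (`com.example.KMeansApp`) are placeholders, not names from this post.

```shell
# Client mode (the default): spark-submit itself runs as the Driver
# on the machine where you invoke it.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.KMeansApp \
  target/kmeans-app.jar

# Cluster mode: spark-submit launches org.apache.spark.deploy.Client,
# which asks the Master to schedule the Driver on one of the Workers.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.KMeansApp \
  target/kmeans-app.jar
```

In client mode the Driver's console output appears in your terminal; in cluster mode it ends up in the chosen Worker's logs, since the Driver runs there.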