How to build and run Scala Spark locally


Building Spark locally, the short answer:

git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile

Going into your question in more detail, what you're actually asking is 'how do I debug a Spark application in Eclipse?'. To debug in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a job with its Spark lib dependency and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the Spark code.
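
As an illustration only, the build definition for such a project could look like the following build.sbt sketch (sbt is also used above to build Spark itself; Maven users would declare the same spark-core artifact in their pom.xml and enable 'download sources' in Eclipse). The project name and version numbers are assumptions, not something from the question:

// build.sbt -- minimal sketch; pick the Scala and Spark versions that match your environment
name := "spark-debug-example"

scalaVersion := "2.10.4"

// depending on spark-core is enough to step into the RDD/shuffle internals
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"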

Then, when creating the SparkContext, set local[1] as the master, like this:

import org.apache.spark.SparkConf

val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkDebugExample")

so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
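
To have something concrete to step into, here is a sketch of a tiny job built on that configuration; the dataset and the sum are arbitrary examples, but a breakpoint on the reduce line lets you step into Spark's RDD.reduce implementation, which is the question's use case:

import org.apache.spark.SparkContext

// Minimal sketch: an arbitrary toy dataset, only meant as something to debug through.
val sc = new SparkContext(conf)
val sum = sc.parallelize(1 to 100).reduce(_ + _)   // set a breakpoint here and "step into"
println("sum = " + sum)
sc.stop()

Because the master is local[1], the reduce executes in the same JVM and thread, so a breakpoint set inside the downloaded spark-core sources will actually be hit.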

If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.


Comments

  • blue-sky over 1 year

    I'm attempting to build Apache Spark locally. The reason for this is to debug Spark methods like reduce. In particular, I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best way of finding out what the issue is.

    So I have cloned the latest from the Spark repo:

    git clone https://github.com/apache/spark.git
    

    Spark appears to be a Maven project, so when I create it in Eclipse this is the structure:

    [screenshot of the imported Spark project structure in Eclipse]

    Some of the top-level folders also have pom files:

    [screenshot of top-level folders containing pom.xml files]

    So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?

    • maasg almost 10 years
      To see Spark internals, you only need core. This should get you there: syndeticlogic.net/?p=311 BTW, SBT is better for getting Spark up and running. I also recommend using IntelliJ instead of Eclipse.
  • RagHaven about 9 years
    Can you elaborate on what you mean by "All you need is to create a job with its Spark lib dependency and ask Maven to 'download sources'"? Currently I have a simple Spark application similar to the one on the Apache Spark website. I'd like to run it from within Eclipse and step through the code, so that I can step into the actual core implementation to get an idea of how certain things work within Spark.