How to run a Spark Java program from the command line

Pick up the word count example, say from: https://github.com/holdenk/fastdataprocessingwithsparkexamples/tree/master/src/main/scala/pandaspark/examples. Then follow these steps to build the jar file:

mkdir example-java-build/; cd example-java-build

mvn archetype:generate \
   -DarchetypeGroupId=org.apache.maven.archetypes \
   -DgroupId=spark.examples \
   -DartifactId=JavaWordCount \
   -Dfilter=org.apache.maven.archetypes:maven-archetype-quickstart

cp ../examples/src/main/java/spark/examples/JavaWordCount.java \
   JavaWordCount/src/main/java/spark/examples/JavaWordCount.java

Add the relevant spark-core and spark-examples dependencies, using the versions that match your Spark installation. I use Spark 1.1.0, so the dependencies section of my pom.xml looks like this:

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-examples_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
  </dependencies>

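Build the jar file with Maven. The path below is relative to the directory that contains example-java-build; if you are still inside example-java-build from the first step, use cd JavaWordCount instead.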

cd example-java-build/JavaWordCount
mvn package

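This creates the jar file (JavaWordCount-1.0-SNAPSHOT.jar) inside the target directory. Note that with the quickstart archetype's default packaging this is a regular jar containing only your own classes rather than a true fat jar; that is sufficient here because spark-submit puts the Spark classes on the classpath at run time. Copy the jar file to any location on the server, then go to the bin folder of your Spark installation (in my case: /root/spark-1.1.0-bin-hadoop2.4/bin).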
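
Before submitting, it can be worth confirming that the jar really contains the class you will pass to --class. The following is only a minimal sketch using the JDK's java.util.jar API; the class name JarEntryCheckSketch and its methods are hypothetical helpers, not part of Spark or Maven.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks whether a jar contains a given class.
class JarEntryCheckSketch {

    // Convert a fully qualified class name to its entry path inside a jar,
    // e.g. spark.examples.JavaWordCount -> spark/examples/JavaWordCount.class
    static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Open the jar and look up the entry for the class.
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: JarEntryCheckSketch <jar> <fully.qualified.ClassName>");
            System.exit(1);
        }
        System.out.println(containsClass(args[0], args[1]) ? "found" : "missing");
    }
}
```

For the jar built here, the arguments would be the path to JavaWordCount-1.0-SNAPSHOT.jar and spark.examples.JavaWordCount.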

Submit the Spark job. My command looks like this:

./spark-submit --class "spark.examples.JavaWordCount" --master yarn://myserver1:8032 /root/JavaWordCount-1.0-SNAPSHOT.jar  hdfs://myserver1:8020/user/root/hackrfoe.txt

Here:

--class: the entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: the master URL for the cluster (e.g. spark://23.195.26.187:7077)

The last argument is any text file of your choice, which is passed to the program as its input.
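
To make the argument order explicit, here is a rough sketch that assembles the same command as a list: spark-submit options first, then the application jar, then the arguments handed to the program's main method. The class SubmitCommandSketch and its build method are hypothetical names for illustration; the values are the ones from my command above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a spark-submit command line.
class SubmitCommandSketch {

    static List<String> build(String sparkSubmit, String mainClass, String master,
                              String jar, String... appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkSubmit);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add(master);
        cmd.add(jar);                       // the application jar follows the options
        cmd.addAll(Arrays.asList(appArgs)); // the rest goes to the program's main(String[] args)
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "./spark-submit",
                "spark.examples.JavaWordCount",
                "yarn://myserver1:8032",
                "/root/JavaWordCount-1.0-SNAPSHOT.jar",
                "hdfs://myserver1:8020/user/root/hackrfoe.txt");
        // Print the command; to launch it from Java instead, one option is
        // new ProcessBuilder(cmd).inheritIO().start()
        System.out.println(String.join(" ", cmd));
    }
}
```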

The output should look like this, giving the count of every word in the text file.

in: 17
sleeping.: 1
sojourns: 1
What: 4
protect: 1
largest: 1
other: 1
public: 1
worst: 1
hackers: 12
detected: 1
from: 4
and,: 1
secretly: 1
breaking: 1
football: 1
answer.: 1
attempting: 2
"hacker: 3

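Tokens such as "and,:" and "sleeping.:" appear in that output because the words are split on spaces only, so punctuation stays attached. As a rough illustration, here is a plain-Java sketch (no Spark involved) of the same counting logic, assuming the job splits each line on single spaces as the bundled JavaWordCount example does. The class name WordCountSketch and its count method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the counting logic; this is not the Spark program itself.
class WordCountSketch {

    // Split each line on single spaces and count every resulting token.
    // Punctuation is not stripped, so "and," and "and" are different words.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: WordCountSketch <file>");
            System.exit(1);
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // Print in the same "word: count" shape as the job output.
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

The Spark job does the same thing, except that the splitting and summing are distributed across the cluster and the order of the printed words is not guaranteed.
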
Hope this helps!

Author: Pooja3101

Updated on June 17, 2022

Comments

  • Pooja3101 almost 2 years

    I am running the wordcount java program in spark. How do I run it from the command line?

  • WestCoastProjects over 9 years
    +1 Well documented answer. I haven't tried it yet, but even if it has any small bugs it will be helpful. I will report back if any details are missing.