How to run concurrent jobs (actions) in Apache Spark using a single SparkContext


Try something like this:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // One SparkContext shared by both threads; submitting jobs from multiple threads is safe.
    final JavaSparkContext sc = new JavaSparkContext("local[2]", "Simple_App");
    ExecutorService executorService = Executors.newFixedThreadPool(2);

    // Submit job 1 from thread 1
    Future<Long> future1 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
            return file1.count();
        }
    });

    // Submit job 2 from thread 2
    Future<Long> future2 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
            return file2.count();
        }
    });

    // Block until job 1 finishes
    System.out.println("File1: " + future1.get());
    // Block until job 2 finishes
    System.out.println("File2: " + future2.get());

    executorService.shutdown();
    sc.stop();

Author: Sporty, updated on May 10, 2020

Comments

  • Sporty, almost 4 years


    The Apache Spark documentation says that "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads". Can someone explain how to achieve this concurrency for the following sample code?

        SparkConf conf = new SparkConf().setAppName("Simple_App");
        JavaSparkContext sc = new JavaSparkContext(conf);
    
        JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
        JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
    
        System.out.println(file1.count());
        System.out.println(file2.count());
    

    These two jobs are independent and must run concurrently.
    Thank You.

    • pzecevic, about 9 years
      You'd need to start a new thread, or two (you can find instructions for that online, I am sure), and then use the same SparkContext from both threads.
    • Sporty, about 9 years
      @pzecevic Thanks for the reply. I have written sample code and am able to run the jobs from separate threads in local mode. However, when running on a YARN cluster, the Spark context finishes with status SUCCEEDED before the threads complete, so I don't get any output. Can you suggest something for that? I can share the code, but there is a length limit in the comment section.
    • void, about 8 years
      @Sporty I am not sure about this, but won't setting this conf help? spark.streaming.concurrentJobs
    • Lars Francke, over 7 years
      @void: As the name suggests, that setting applies only to Spark Streaming jobs. For streaming, however, you are correct: it controls the number of jobs run in parallel and defaults to 1 (at least in Spark 2.0.0 and before).
    • Shirish Kumar, over 7 years
      You can use fair scheduling within an application (a configuration sketch follows these comments). See spark.apache.org/docs/1.6.1/job-scheduling.html
  • void, about 8 years
    Can't we simply use spark.streaming.concurrentJobs conf to set the concurrency level?
  • lightbox142, over 3 years
    Does anyone have a Python implementation of this?
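
A rough sketch of the fair-scheduler approach Shirish Kumar mentions, for readers who want the two jobs to share executor resources rather than queuing behind each other. It assumes the jobs are still submitted from separate threads; the pool names ("pool_a", "pool_b") and file paths are placeholders, and the Java 8 lambdas are just a shorter way of writing the Callables used in the answer above. The spark.scheduler.mode setting and the spark.scheduler.pool local property are described on the linked job-scheduling page.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
            .setAppName("Simple_App")
            .set("spark.scheduler.mode", "FAIR");   // enable the fair scheduler
    final JavaSparkContext sc = new JavaSparkContext(conf);

    ExecutorService pool = Executors.newFixedThreadPool(2);

    Future<Long> count1 = pool.submit(() -> {
        // assign this thread's jobs to a scheduler pool (pool name is a placeholder)
        sc.setLocalProperty("spark.scheduler.pool", "pool_a");
        return sc.textFile("/path/to/test_doc1").count();
    });
    Future<Long> count2 = pool.submit(() -> {
        sc.setLocalProperty("spark.scheduler.pool", "pool_b");
        return sc.textFile("/path/to/test_doc2").count();
    });

    System.out.println("File1: " + count1.get());
    System.out.println("File2: " + count2.get());
    pool.shutdown();
    sc.stop();

Without FAIR mode the jobs still run concurrently when submitted from different threads, but the default FIFO scheduler gives the first submitted job priority over available resources; fair scheduling splits resources between the pools, which is usually what you want for independent jobs of similar size.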