How to run concurrent jobs (actions) in Apache Spark using a single Spark context
Try something like this:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    final JavaSparkContext sc = new JavaSparkContext("local[2]", "Simple_App");
    ExecutorService executorService = Executors.newFixedThreadPool(2);

    // Submit job 1 from its own thread
    Future<Long> future1 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
            return file1.count();
        }
    });

    // Submit job 2 from its own thread
    Future<Long> future2 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
            return file2.count();
        }
    });

    // Block until each job completes
    System.out.println("File1: " + future1.get());
    System.out.println("File2: " + future2.get());

    executorService.shutdown();
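The threading pattern here is ordinary java.util.concurrent usage and can be exercised standalone, without Spark. In the sketch below, the two Callables are stand-ins for the textFile(...).count() actions; the literal values 3L and 5L are arbitrary placeholders:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentActions {
    // Runs two independent tasks on a two-thread pool and returns both results.
    static long[] runBoth(Callable<Long> task1, Callable<Long> task2) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // Both tasks start immediately; neither waits for the other to finish
            Future<Long> f1 = executor.submit(task1);
            Future<Long> f2 = executor.submit(task2);
            // get() blocks only until the corresponding task completes
            return new long[] { f1.get(), f2.get() };
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder tasks standing in for two independent Spark actions
        long[] counts = runBoth(() -> 3L, () -> 5L);
        System.out.println(counts[0] + "," + counts[1]);
    }
}
```

Because each submit returns immediately, the two Spark actions in the answer above are issued from different threads, which is exactly the condition the Spark documentation names for jobs to run concurrently.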
Author: Sporty
Updated on May 10, 2020

Comments
- Sporty, almost 4 years ago: The Apache Spark documentation says that "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads". Can someone explain how to achieve this concurrency for the following sample code?

      SparkConf conf = new SparkConf().setAppName("Simple_App");
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
      JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
      System.out.println(file1.count());
      System.out.println(file2.count());

  These two jobs are independent and must run concurrently. Thank you.
- pzecevic, about 9 years ago: You'd need to start a new thread, or two (you can find instructions for that online, I'm sure), and then use the same SparkContext from both threads.
- Sporty, about 9 years ago: @pzecevic Thanks for the reply. I have written sample code and am able to perform threading in local mode. However, when running on a YARN cluster, the Spark context finishes with status SUCCEEDED before the thread execution is complete, so I am not getting any output. Can you suggest something for that? I can share the code, but there is a length limit in the comment section.
- void, about 8 years ago: @Sporty I am not sure about this, but won't setting the conf spark.streaming.concurrentJobs help?
- Lars Francke, over 7 years ago: @void: As the name says, that setting is only for Spark Streaming jobs. There, however, you are correct: it controls the number of parallel jobs to run and defaults to 1 (at least in Spark 2.0.0 and before).
- Shirish Kumar, over 7 years ago: You can use fair scheduling within an application. See spark.apache.org/docs/1.6.1/job-scheduling.html
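As a sketch of what the fair-scheduling setup described in that link amounts to (the pool name "pool1" below is an arbitrary example, not something the thread specifies):

```java
SparkConf conf = new SparkConf()
    .setAppName("Simple_App")
    .set("spark.scheduler.mode", "FAIR"); // default scheduling mode is FIFO
JavaSparkContext sc = new JavaSparkContext(conf);

// Optionally, inside each worker thread, assign that thread's jobs to a named pool:
sc.setLocalProperty("spark.scheduler.pool", "pool1");
```

With FAIR mode, jobs submitted from different threads share cluster resources round-robin instead of queuing behind one another, which matters when one of the concurrent jobs is long-running.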
- void, about 8 years ago: Can't we simply use the spark.streaming.concurrentJobs conf to set the concurrency level?
- lightbox142, over 3 years ago: Does anyone have a Python implementation of this?