How to run concurrent jobs (actions) in Apache Spark using a single Spark context
Try something like this:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    final JavaSparkContext sc = new JavaSparkContext("local[2]", "Simple_App");
    ExecutorService executorService = Executors.newFixedThreadPool(2);

    // Submit job 1 from its own thread
    Future<Long> future1 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
            return file1.count();
        }
    });

    // Submit job 2 from its own thread
    Future<Long> future2 = executorService.submit(new Callable<Long>() {
        @Override
        public Long call() throws Exception {
            JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
            return file2.count();
        }
    });

    // Block until each job completes
    System.out.println("File1: " + future1.get());
    System.out.println("File2: " + future2.get());

    executorService.shutdown();
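The threading pattern here is ordinary java.util.concurrent usage and can be exercised standalone, without Spark. In the sketch below, the two Callables are stand-ins for the textFile(...).count() actions; the literal values 3L and 5L are arbitrary placeholders:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentActions {
    // Runs two independent tasks on a two-thread pool and returns both results.
    static long[] runBoth(Callable<Long> task1, Callable<Long> task2) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // Both tasks start immediately; neither waits for the other to finish
            Future<Long> f1 = executor.submit(task1);
            Future<Long> f2 = executor.submit(task2);
            // get() blocks only until the corresponding task completes
            return new long[] { f1.get(), f2.get() };
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder tasks standing in for two independent Spark actions
        long[] counts = runBoth(() -> 3L, () -> 5L);
        System.out.println(counts[0] + "," + counts[1]);
    }
}
```

Because each submit returns immediately, the two Spark actions in the answer above are issued from different threads, which is exactly the condition the Spark documentation names for jobs to run concurrently.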
Author: Sporty
Updated on May 10, 2020

Comments
- Sporty, almost 4 years ago: The Apache Spark documentation says that "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads". Can someone explain how to achieve this concurrency for the following sample code?

      SparkConf conf = new SparkConf().setAppName("Simple_App");
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
      JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
      System.out.println(file1.count());
      System.out.println(file2.count());

  These two jobs are independent and must run concurrently. Thank you.
- pzecevic, about 9 years ago: You'd need to start a new thread, or two (you can find instructions for that online, I'm sure), and then use the same SparkContext from both threads.
- Sporty, about 9 years ago: @pzecevic Thanks for the reply. I have written sample code and am able to perform threading in local mode. However, when running on a YARN cluster, the Spark context finishes with status SUCCEEDED before the thread execution is complete, so I am not getting any output. Can you suggest something for that? I can share the code, but there is a length limit in the comment section.
- void, about 8 years ago: @Sporty I am not sure about this, but won't setting the conf spark.streaming.concurrentJobs help?
- Lars Francke, over 7 years ago: @void: As the name says, that setting is only for Spark Streaming jobs. There, however, you are correct: it controls the number of parallel jobs to run and defaults to 1 (at least in Spark 2.0.0 and before).
- Shirish Kumar, over 7 years ago: You can use fair scheduling within an application. See spark.apache.org/docs/1.6.1/job-scheduling.html
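As a sketch of what the fair-scheduling setup described in that link amounts to (the pool name "pool1" below is an arbitrary example, not something the thread specifies):

```java
SparkConf conf = new SparkConf()
    .setAppName("Simple_App")
    .set("spark.scheduler.mode", "FAIR"); // default scheduling mode is FIFO
JavaSparkContext sc = new JavaSparkContext(conf);

// Optionally, inside each worker thread, assign that thread's jobs to a named pool:
sc.setLocalProperty("spark.scheduler.pool", "pool1");
```

With FAIR mode, jobs submitted from different threads share cluster resources round-robin instead of queuing behind one another, which matters when one of the concurrent jobs is long-running.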
- void, about 8 years ago: Can't we simply use the spark.streaming.concurrentJobs conf to set the concurrency level?
- lightbox142, over 3 years ago: Does anyone have a Python implementation of this?