Spark query running very slow
Solution 1
This is normal; don't expect Spark to run in a few milliseconds like MySQL or PostgreSQL do. Spark is low latency compared to other big data solutions like Hive, Impala... You cannot compare it with a classic database: Spark is not a database where data is indexed!
Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk
They clearly put Spark here:
Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files of 2-3 GB, where it is much faster than Spark for SQL queries).
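To make the point about indexing concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a classic database. It is not Spark code; the table name `orders`, the index name `idx_orders_user` and the sample rows are made up for illustration. It only shows how an index changes the query plan from a full scan to a direct lookup, which is the kind of shortcut a plain scan over un-indexed data does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, total REAL)")
# Hypothetical sample data: 1000 orders spread over 100 users
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, float(i)) for i in range(1000)],
)

lookup = "SELECT total FROM orders WHERE user_id = 1"

def plan_detail(sql):
    """Return the human-readable detail text of the first query-plan step."""
    row = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()
    return row[-1]

# Without an index, every row has to be read
plan_before = plan_detail(lookup)

# With an index, the matching rows are located directly
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
plan_after = plan_detail(lookup)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of the table and the second a search that uses the index.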
Solution 2
- Set spark.default.parallelism to 2
- Start Spark with --num-executor-cores 8
- Modify this part
df.registerTempTable('test')
d=sqlContext.sql("""...
to
df.registerTempTable('test')
sqlContext.cacheTable("test")
d=sqlContext.sql("""...
Arpit
A software professional. Currently working on non preferred language java :D Joined this site to help others and keep my knowledge up to date. Visit my blog: CodeWithLogic
Updated on September 14, 2022
Comments
-
Arpit about 1 year
I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark over 4m data coming from Redshift. I fired one query through the pyspark shell:
df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
df.registerTempTable('test')
d = sqlContext.sql("""
    select user_id from (
        select -- (i1)
            sum(total), user_id
        from (
            select -- (i2)
                avg(total) as total, user_id
            from test
            group by order_id, user_id
        ) as a
        group by user_id
        having sum(total) > 0
    ) as b
""")
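For readers who want to see what this nested query computes, here is a small sketch that runs the same SQL against an in-memory SQLite table instead of Spark. The sample rows are invented for illustration, and SQLite is only a convenient stand-in; the timings say nothing about Spark. The inner query (i2) averages `total` per order and user, the middle query (i1) sums those averages per user, and only users whose sum is positive are returned.

```python
import sqlite3

# Hypothetical sample rows: (order_id, user_id, total)
rows = [
    (1, 1, 10.0),
    (1, 1, 20.0),   # order 1 averages to 15.0
    (2, 1, -5.0),   # user 1 sums to 15.0 + (-5.0) = 10.0, so user 1 is kept
    (3, 2, -4.0),
    (3, 2, 2.0),    # order 3 averages to -1.0, so user 2 is dropped
    (4, 3, 0.0),    # user 3 sums to 0.0 and is dropped by "> 0"
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (order_id INTEGER, user_id INTEGER, total REAL)")
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

query = """
select user_id from (
    select -- (i1)
        sum(total), user_id
    from (
        select -- (i2)
            avg(total) as total, user_id
        from test
        group by order_id, user_id
    ) as a
    group by user_id
    having sum(total) > 0
) as b
"""

kept_users = [user_id for (user_id,) in conn.execute(query)]
print(kept_users)
conn.close()
```

With these rows, only user 1 ends up with a positive sum of per-order averages.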
When I do d.count(), the above query takes 30 sec when df is not cached and 17 sec when df is cached in memory. I'm expecting these timings to be closer to 1-2 s.
These are my spark configurations:
spark.executor.memory 6154m
spark.driver.memory 3g
spark.shuffle.spill false
spark.default.parallelism 8
The rest are set to their default values. Can anyone see what I'm missing here?
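When comparing timings like the cached and uncached runs above, it can help to time the same action several times rather than once. The helper below is a minimal sketch in plain Python; the name `time_action` is made up here, and the workload is a stand-in so the snippet runs without a cluster.

```python
from timeit import default_timer

def time_action(action, repeats=3):
    """Call a zero-argument callable `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = default_timer()
        action()
        durations.append(default_timer() - start)
    return durations

# Stand-in workload so the sketch is runnable on its own
durations = time_action(lambda: sum(range(100000)))
print(["%.4f s" % d for d in durations])
```

In the pyspark shell this would presumably be called as time_action(d.count). The first run may include one-off costs such as populating a cache, so the later runs are often the more representative ones.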
-
Geek over 8 years
What is the performance of ONLY the inner query?