Spark query running very slow


Solution 1

This is normal; don't expect Spark to run in a few milliseconds the way MySQL or PostgreSQL do. Spark is low latency compared to other big data solutions like Hive or Impala, but you cannot compare it with a classic database: Spark is not a database where the data are indexed!

Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk

They clearly put Spark in the "not so low latency" category.

Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files of 2–3 GB, and it is much faster than Spark for SQL queries).

Solution 2

  1. Set spark.default.parallelism to 2
  2. Start Spark with --executor-cores 8
  3. Modify this part (a combined sketch follows the two snippets below)

    df.registerTempTable('test')
    d = sqlContext.sql("""...

to

    df.registerTempTable('test')
    sqlContext.cacheTable("test")
    d = sqlContext.sql("""...
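
Putting the three steps together, here is a minimal sketch of a full session, assuming the same Spark 1.4 pyspark shell as in the question (the connection string and table name are placeholders):

    # Launch the shell with the suggested settings, e.g.:
    #   pyspark --executor-cores 8 --conf spark.default.parallelism=2
    df = sqlContext.load(source="jdbc", url="connection_string",
                         dbtable="table_name", user="user", password="pass")
    df.registerTempTable('test')
    sqlContext.cacheTable("test")  # cache so repeated queries skip the JDBC read
    d = sqlContext.sql("select user_id from test")  # your aggregation query here
    d.count()  # the first action materializes the cache; later ones reuse it

Note that cacheTable is lazy: the table is only materialized by the first action, so the first run still pays the full read cost.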



Comments

  • Arpit (over 1 year ago)

    I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark over 4M rows of data coming from Redshift. I fired one query through the pyspark shell:

        df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
        df.registerTempTable('test')
        d = sqlContext.sql("""
            select user_id from (
                select -- (i1)
                    sum(total),
                    user_id
                from
                    (select -- (i2)
                        avg(total) as total,
                        user_id
                    from
                        test
                    group by
                        order_id,
                        user_id) as a
                group by
                    user_id
                having sum(total) > 0
            ) as b
        """)


    When I do d.count(), the above query takes 30 seconds when df is not cached and 17 seconds when df is cached in memory.

    I'm expecting these timings to be closer to 1–2 seconds.

    These are my Spark configurations:

    spark.executor.memory 6154m
    spark.driver.memory 3g
    spark.shuffle.spill false
    spark.default.parallelism 8
    

    The rest is set to its default values. Can anyone see what I'm missing here?

    • Geek (over 8 years ago)
      What is the performance for ONLY the inner query?
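
      To measure only the inner aggregation (the (i2) subquery from the question), a sketch like this can be run in the same pyspark shell; the SQL is taken from the question, and the timing code is just illustrative:

          import time

          # run only the inner (i2) aggregation from the question and time it
          inner = sqlContext.sql("""
              select avg(total) as total, user_id
              from test
              group by order_id, user_id
          """)

          start = time.time()
          inner.count()  # count() forces the query to actually execute
          print("inner query took %.1f s" % (time.time() - start))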