Spark query running very slow


Solution 1

This is normal; don't expect Spark to run in a few milliseconds the way MySQL or PostgreSQL do. Spark is low latency compared to other big data solutions like Hive or Impala, but you cannot compare it with a classic database: Spark is not a database where the data are indexed!

Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk

They clearly put Spark in the "not so low latency" category.

Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files of 2–3 GB, and it is much faster than Spark for SQL queries).

Solution 2

  1. Set spark.default.parallelism to 2
  2. Start Spark with --executor-cores 8
  3. Modify this part (a combined sketch follows the two snippets below)

    df.registerTempTable('test')
    d = sqlContext.sql("""...

to

    df.registerTempTable('test')
    sqlContext.cacheTable("test")
    d = sqlContext.sql("""...
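
Putting the three steps together, here is a minimal sketch of a full session, assuming the same Spark 1.4 pyspark shell as in the question (the connection string and table name are placeholders):

    # Launch the shell with the suggested settings, e.g.:
    #   pyspark --executor-cores 8 --conf spark.default.parallelism=2
    df = sqlContext.load(source="jdbc", url="connection_string",
                         dbtable="table_name", user="user", password="pass")
    df.registerTempTable('test')
    sqlContext.cacheTable("test")  # cache so repeated queries skip the JDBC read
    d = sqlContext.sql("select user_id from test")  # your aggregation query here
    d.count()  # the first action materializes the cache; later ones reuse it

Note that cacheTable is lazy: the table is only materialized by the first action, so the first run still pays the full read cost.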



Comments

  • Arpit (over 1 year ago)

    I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark over 4M rows of data coming from Redshift. I fired one query through the pyspark shell:

        df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
        df.registerTempTable('test')
        d = sqlContext.sql("""
            select user_id from (
                select -- (i1)
                    sum(total),
                    user_id
                from
                    (select -- (i2)
                        avg(total) as total,
                        user_id
                    from
                        test
                    group by
                        order_id,
                        user_id) as a
                group by
                    user_id
                having sum(total) > 0
            ) as b
        """)


    When I do d.count(), the above query takes 30 seconds when df is not cached and 17 seconds when df is cached in memory.

    I'm expecting these timings to be closer to 1–2 seconds.

    These are my Spark configurations:

    spark.executor.memory 6154m
    spark.driver.memory 3g
    spark.shuffle.spill false
    spark.default.parallelism 8
    

    The rest is set to its default values. Can anyone see what I'm missing here?

    • Geek (over 8 years ago)
      What is the performance for ONLY the inner query?
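
      To measure only the inner aggregation (the (i2) subquery from the question), a sketch like this can be run in the same pyspark shell; the SQL is taken from the question, and the timing code is just illustrative:

          import time

          # run only the inner (i2) aggregation from the question and time it
          inner = sqlContext.sql("""
              select avg(total) as total, user_id
              from test
              group by order_id, user_id
          """)

          start = time.time()
          inner.count()  # count() forces the query to actually execute
          print("inner query took %.1f s" % (time.time() - start))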