How to check if a Spark DataFrame is empty?

Solution 1

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.
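
If you want to wrap this check up for reuse (for example, for the asker's "only save the DataFrame if it's not empty" pattern), a minimal Scala sketch could look like the following. The helper name isDataFrameEmpty and the output path are illustrative assumptions, not part of Spark's API.

import org.apache.spark.sql.DataFrame

// Hypothetical helper: true when the DataFrame has no rows.
// head(1) materializes at most one row, so this avoids a full count.
def isDataFrameEmpty(df: DataFrame): Boolean = df.head(1).isEmpty

// Usage sketch: only write the DataFrame out when it is not empty.
// The path "/tmp/output" is just a placeholder.
if (!isDataFrameEmpty(df)) {
  df.write.parquet("/tmp/output")
}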

Solution 2

I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?
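
Under the hood, RDD.isEmpty amounts to roughly the following (a paraphrased sketch of the Spark source; the exact wording may differ between versions):

// Paraphrased sketch of RDD.isEmpty: the RDD is empty if it has no
// partitions at all, or if asking for a single element returns nothing.
def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0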

Solution 3

I had the same question, and I tested 3 main solutions:

  1. (df != null) && (df.count > 0)
  2. df.head(1).isEmpty as @hulin003 suggests
  3. df.rdd.isEmpty() as @Justin Pihony suggests

and of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DF on my machine, in terms of execution time:

  1. it takes ~9366ms
  2. it takes ~5607ms
  3. it takes ~1921ms

therefore I think that the best solution is df.rdd.isEmpty() as @Justin Pihony suggests
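
For context, a comparison like this could be reproduced with something along the lines of the rough Scala sketch below; the time helper is an illustrative assumption, not how these numbers were actually measured, and results will vary with the data source, caching, and cluster.

// Hypothetical timing helper based on System.nanoTime.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
  result
}

time("count > 0")       { (df != null) && (df.count > 0) }
time("head(1).isEmpty") { df.head(1).isEmpty }
time("rdd.isEmpty")     { df.rdd.isEmpty() }

Keep in mind that the order of the runs can matter, since earlier actions may warm up caches or file metadata, so a micro-benchmark like this is only a rough indication.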

Solution 4

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Note that a DataFrame is no longer a class in Scala; it's just a type alias (probably changed with Spark 2.0):

type DataFrame = Dataset[Row]
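
A minimal usage sketch, assuming Spark 2.4.0 or later and a DataFrame named df:

// Works on any Dataset, and therefore on any DataFrame, since DataFrame = Dataset[Row].
if (df.isEmpty) {
  println("DataFrame has no rows")
}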

Solution 5

You can take advantage of the head() (or first()) functions to check whether the DataFrame has at least one row; if the call returns a row, it is not empty.
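
Since these functions throw java.util.NoSuchElementException on an empty DataFrame (see Solution 1), the call needs to be guarded; a minimal Scala sketch using scala.util.Try:

import scala.util.Try

// first() throws java.util.NoSuchElementException when the DataFrame is empty,
// so a successful Try means at least one row exists.
val notEmpty: Boolean = Try(df.first()).isSuccess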


Comments

  • auxdx
    auxdx almost 2 years

    Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?

    PS: I want to check if it's empty so that I only save the DataFrame if it's not empty

  • architectonic
    architectonic over 8 years
    This is surprisingly slower than df.count() == 0 in my case
  • Alok
    Alok about 8 years
    Isn't converting to rdd a heavy task?
  • Justin Pihony
    Justin Pihony about 8 years
    Not really. RDDs are still the underpinning of everything Spark, for the most part.
  • FelixHo
    FelixHo almost 8 years
    if dataframe is empty it throws "java.util.NoSuchElementException: next on empty iterator" ; [Spark 1.3.1]
  • Nandakishore
    Nandakishore over 7 years
    Don't convert the df to an RDD. It slows down the process. If you convert, it will convert the whole DF to an RDD and then check if it's empty. If the DF has millions of rows, converting it to an RDD alone takes a lot of time.
  • TheM00s3
    TheM00s3 over 7 years
    if you run this on a massive dataframe with millions of records, that count method is going to take some time.
  • Raul H
    Raul H over 7 years
    .rdd slows down the process a lot
  • LetsPlayYahtzee
    LetsPlayYahtzee about 7 years
    using df.take(1) when the df is empty results in getting back an empty ROW which cannot be compared with null
  • Vasile Surdu
    Vasile Surdu about 7 years
    i'm using first() instead of take(1) in a try/catch block and it works
  • Nandakishore
    Nandakishore almost 7 years
    @LetsPlayYahtzee I have updated the answer with same run and picture that shows error. take(1) returns Array[Row]. And when Array doesn't have any values, by default it gives ArrayOutOfBounds. So I don't think it gives an empty Row. I would say to observe this and change the vote.
  • AntiPawn79
    AntiPawn79 over 6 years
    For those using pyspark: isEmpty is not a thing. Do len(d.head(1)) > 0 instead.
  • Dan Ciborowski - MSFT
    Dan Ciborowski - MSFT over 6 years
    why is this better then df.rdd.isEmpty?
  • y2k-shubham
    y2k-shubham over 6 years
    won't it require the schema of two dataframes (sqlContext.emptyDataFrame & df) to be same in order to ever return true?
  • Alper t. Turker
    Alper t. Turker about 6 years
    This won't work. eq is inherited from AnyRef and tests whether the argument (that) is a reference to the receiver object (this).
  • Abdul Mannan
    Abdul Mannan about 6 years
    If you call rdd's isEmpty method on a big dataframe, it will slow it down considerably, especially when the dataframe isn't cached.
  • Rakesh Sabbani
    Rakesh Sabbani about 5 years
    df.head(1).isEmpty is taking a huge amount of time. Is there any other optimized solution for this?
  • hulin003
    hulin003 about 5 years
    Hey @Rakesh Sabbani, if df.head(1) is taking a large amount of time, it's probably because your df's execution plan is doing something complicated that prevents spark from taking shortcuts. For example, if you are just reading from parquet files, df = spark.read.parquet(...), I'm pretty sure spark will only read one file partition. But if your df is doing other things like aggregations, you may be inadvertently forcing spark to read and process a large portion, if not all, of your source data.
  • Sandeep540
    Sandeep540 over 4 years
    isEmpty is slower than df.head(1).isEmpty
  • Beryllium
    Beryllium over 4 years
    @Sandeep540 Really? Benchmark? Your proposal instantiates at least one row. The Spark implementation just transports a number. head() is using limit() as well; the groupBy() is not really doing anything, it is only required to get a RelationalGroupedDataset which in turn provides count(). So that should not be significantly slower. It is probably faster in case of a data set which contains a lot of columns (possibly denormalized nested data). Anyway you have to type less :-)
  • Vzzarr
    Vzzarr over 4 years
    just reporting my experience to AVOID: I was using df.limit(1).count() naively. On big datasets it takes much more time than the reported examples by @hulin003 which are almost instantaneous
  • jd2050
    jd2050 about 4 years
    a little remark to this solution: you should avoid using df.head(1).isEmpty OR df.take(1).isEmpty on dataframes with > 100 columns because it can cause org.codehaus.janino.JaninoRuntimeException
  • Pushpendra Jaiswal
    Pushpendra Jaiswal almost 4 years
    All these are bad options taking almost equal time
  • Jordan Morris
    Jordan Morris almost 4 years
    @PushpendraJaiswal yes, and in a world of bad options, we should choose the best bad option
  • aiguofer
    aiguofer almost 4 years
    out of curiosity... what size DataFrames was this tested with?
  • user2441441
    user2441441 over 3 years
    @hulin003 I'm using df.take(1).isEmpty based on your answer, but it takes a very long time (2 mins) for even a couple of hundred rows. Any help?
  • Mark Rajcok
    Mark Rajcok about 2 years
    Beware: I am using .option("mode", "DROPMALFORMED") and df.isEmpty returned false whereas df.head(1).isEmpty returned the correct result of true because... all of the rows were malformed (someone upstream changed the schema on me).
  • Glib Martynenko
    Glib Martynenko almost 2 years
    'DataFrame' object has no attribute 'isEmpty'. Spark 3.0
  • Glib Martynenko
    Glib Martynenko almost 2 years
    I've tested 10 million rows... and got the same time as for df.count() or df.rdd.isEmpty()
  • ZygD
    ZygD almost 2 years
    In PySpark, it's introduced only from version 3.3.0