How to check if a Spark DataFrame is empty?

Solution 1

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.
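
If you want to wrap this check up for reuse (for example, for the asker's "only save the DataFrame if it's not empty" pattern), a minimal Scala sketch could look like the following. The helper name isDataFrameEmpty and the output path are illustrative assumptions, not part of Spark's API.

import org.apache.spark.sql.DataFrame

// Hypothetical helper: true when the DataFrame has no rows.
// head(1) materializes at most one row, so this avoids a full count.
def isDataFrameEmpty(df: DataFrame): Boolean = df.head(1).isEmpty

// Usage sketch: only write the DataFrame out when it is not empty.
// The path "/tmp/output" is just a placeholder.
if (!isDataFrameEmpty(df)) {
  df.write.parquet("/tmp/output")
}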

Solution 2

I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?
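
Under the hood, RDD.isEmpty amounts to roughly the following (a paraphrased sketch of the Spark source; the exact wording may differ between versions):

// Paraphrased sketch of RDD.isEmpty: the RDD is empty if it has no
// partitions at all, or if asking for a single element returns nothing.
def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0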

Solution 3

I had the same question, and I tested 3 main solutions:

  1. (df != null) && (df.count > 0)
  2. df.head(1).isEmpty as @hulin003 suggests
  3. df.rdd.isEmpty() as @Justin Pihony suggests

and of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DF on my machine, in terms of execution time:

  1. it takes ~9366ms
  2. it takes ~5607ms
  3. it takes ~1921ms

therefore I think that the best solution is df.rdd.isEmpty() as @Justin Pihony suggests
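
For context, a comparison like this could be reproduced with something along the lines of the rough Scala sketch below; the time helper is an illustrative assumption, not how these numbers were actually measured, and results will vary with the data source, caching, and cluster.

// Hypothetical timing helper based on System.nanoTime.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
  result
}

time("count > 0")       { (df != null) && (df.count > 0) }
time("head(1).isEmpty") { df.head(1).isEmpty }
time("rdd.isEmpty")     { df.rdd.isEmpty() }

Keep in mind that the order of the runs can matter, since earlier actions may warm up caches or file metadata, so a micro-benchmark like this is only a rough indication.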

Solution 4

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Note that a DataFrame is no longer a class in Scala; it's just a type alias (probably changed with Spark 2.0):

type DataFrame = Dataset[Row]
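
A minimal usage sketch, assuming Spark 2.4.0 or later and a DataFrame named df:

// Works on any Dataset, and therefore on any DataFrame, since DataFrame = Dataset[Row].
if (df.isEmpty) {
  println("DataFrame has no rows")
}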

Solution 5

You can take advantage of the head() (or first()) functions to check whether the DataFrame has at least one row; if the call returns a row, it is not empty.
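
Since these functions throw java.util.NoSuchElementException on an empty DataFrame (see Solution 1), the call needs to be guarded; a minimal Scala sketch using scala.util.Try:

import scala.util.Try

// first() throws java.util.NoSuchElementException when the DataFrame is empty,
// so a successful Try means at least one row exists.
val notEmpty: Boolean = Try(df.first()).isSuccess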


Comments

  • auxdx
    auxdx almost 2 years

    Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?

    PS: I want to check if it's empty so that I only save the DataFrame if it's not empty

  • architectonic
    architectonic over 8 years
    This is surprisingly slower than df.count() == 0 in my case
  • Alok
    Alok about 8 years
    Isn't converting to rdd a heavy task?
  • Justin Pihony
    Justin Pihony about 8 years
    Not really. RDDs are still the underpinning of everything Spark, for the most part.
  • FelixHo
    FelixHo almost 8 years
    if dataframe is empty it throws "java.util.NoSuchElementException: next on empty iterator" ; [Spark 1.3.1]
  • Nandakishore
    Nandakishore over 7 years
    Don't convert the df to an RDD. It slows down the process. If you convert, it will convert the whole DF to an RDD and then check if it's empty. If the DF has millions of rows, converting it to an RDD alone takes a lot of time.
  • TheM00s3
    TheM00s3 over 7 years
    if you run this on a massive dataframe with millions of records, that count method is going to take some time.
  • Raul H
    Raul H over 7 years
    .rdd slows down the process a lot
  • LetsPlayYahtzee
    LetsPlayYahtzee about 7 years
    using df.take(1) when the df is empty results in getting back an empty ROW which cannot be compared with null
  • Vasile Surdu
    Vasile Surdu about 7 years
    i'm using first() instead of take(1) in a try/catch block and it works
  • Nandakishore
    Nandakishore almost 7 years
    @LetsPlayYahtzee I have updated the answer with same run and picture that shows error. take(1) returns Array[Row]. And when Array doesn't have any values, by default it gives ArrayOutOfBounds. So I don't think it gives an empty Row. I would say to observe this and change the vote.
  • AntiPawn79
    AntiPawn79 over 6 years
    For those using pyspark: isEmpty is not a thing. Do len(d.head(1)) > 0 instead.
  • Dan Ciborowski - MSFT
    Dan Ciborowski - MSFT over 6 years
    why is this better then df.rdd.isEmpty?
  • y2k-shubham
    y2k-shubham over 6 years
    won't it require the schema of two dataframes (sqlContext.emptyDataFrame & df) to be same in order to ever return true?
  • Alper t. Turker
    Alper t. Turker about 6 years
    This won't work. eq is inherited from AnyRef and tests whether the argument (that) is a reference to the receiver object (this).
  • Abdul Mannan
    Abdul Mannan about 6 years
    If you call rdd's isEmpty method on a big dataframe, it will slow it down considerably, especially when the dataframe isn't cached.
  • Rakesh Sabbani
    Rakesh Sabbani about 5 years
    df.head(1).isEmpty is taking a huge amount of time. Is there any other optimized solution for this?
  • hulin003
    hulin003 about 5 years
    Hey @Rakesh Sabbani, if df.head(1) is taking a large amount of time, it's probably because your df's execution plan is doing something complicated that prevents spark from taking shortcuts. For example, if you are just reading from parquet files, df = spark.read.parquet(...), I'm pretty sure spark will only read one file partition. But if your df is doing other things like aggregations, you may be inadvertently forcing spark to read and process a large portion, if not all, of your source data.
  • Sandeep540
    Sandeep540 over 4 years
    isEmpty is slower than df.head(1).isEmpty
  • Beryllium
    Beryllium over 4 years
    @Sandeep540 Really? Benchmark? Your proposal instantiates at least one row. The Spark implementation just transports a number. head() is using limit() as well; the groupBy() is not really doing anything, it is only required to get a RelationalGroupedDataset which in turn provides count(). So that should not be significantly slower. It is probably faster in case of a data set which contains a lot of columns (possibly denormalized nested data). Anyway you have to type less :-)
  • Vzzarr
    Vzzarr over 4 years
    just reporting my experience to AVOID: I was using df.limit(1).count() naively. On big datasets it takes much more time than the reported examples by @hulin003 which are almost instantaneous
  • jd2050
    jd2050 about 4 years
    a little remark to this solution: you should avoid using df.head(1).isEmpty OR df.take(1).isEmpty on dataframes with > 100 columns because it can cause org.codehaus.janino.JaninoRuntimeException
  • Pushpendra Jaiswal
    Pushpendra Jaiswal almost 4 years
    All these are bad options taking almost equal time
  • Jordan Morris
    Jordan Morris almost 4 years
    @PushpendraJaiswal yes, and in a world of bad options, we should choose the best bad option
  • aiguofer
    aiguofer almost 4 years
    out of curiosity... what size DataFrames was this tested with?
  • user2441441
    user2441441 over 3 years
    @hulin003 I'm using df.take(1).isEmpty based on your answer, but it takes a very long time (2 mins) for even a couple of hundred rows. Any help?
  • Mark Rajcok
    Mark Rajcok about 2 years
    Beware: I am using .option("mode", "DROPMALFORMED") and df.isEmpty returned false whereas df.head(1).isEmpty returned the correct result of true because... all of the rows were malformed (someone upstream changed the schema on me).
  • Glib Martynenko
    Glib Martynenko almost 2 years
    'DataFrame' object has no attribute 'isEmpty'. Spark 3.0
  • Glib Martynenko
    Glib Martynenko almost 2 years
    I've tested 10 million rows... and got the same time as for df.count() or df.rdd.isEmpty()
  • ZygD
    ZygD almost 2 years
    In PySpark, it's introduced only from version 3.3.0