How to check if a Spark DataFrame is empty?
Solution 1
For Spark 2.1.0, my suggestion would be to use `head(n: Int)` or `take(n: Int)` with `isEmpty`, whichever one has the clearest intent to you.

```scala
df.head(1).isEmpty
df.take(1).isEmpty
```

with the Python equivalent:

```python
len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)
```
Using `df.first()` and `df.head()` will both throw a `java.util.NoSuchElementException` if the DataFrame is empty. `first()` calls `head()` directly, which calls `head(1).head`.
```scala
def first(): T = head()
def head(): T = head(1).head
```
`head(1)` returns an Array, so calling `head` on that Array throws the `java.util.NoSuchElementException` when the DataFrame is empty.
```scala
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
```
So instead of calling `head()`, use `head(1)` directly to get the array, and then you can use `isEmpty`.
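A quick sketch of the difference, assuming a `SparkSession` named `spark`:

```scala
val emptyDf = spark.emptyDataFrame

emptyDf.head(1).isEmpty  // true -- returns an empty Array, no exception thrown
// emptyDf.head()        // throws java.util.NoSuchElementException
// emptyDf.first()       // same exception, since first() delegates to head()
```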
`take(n)` is also equivalent to `head(n)`...

```scala
def take(n: Int): Array[T] = head(n)
```
And `limit(1).collect()` is equivalent to `head(1)` (notice `limit(n).queryExecution` in the `head(n: Int)` method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a `java.util.NoSuchElementException` when the DataFrame is empty.
```scala
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
```
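For the asker's use case (only saving the `DataFrame` when it is non-empty), a minimal sketch; the write format and output path are placeholders, not part of the original question:

```scala
// Hypothetical save-if-non-empty guard; replace the path/format as needed.
if (!df.head(1).isEmpty) {
  df.write.parquet("/tmp/output")  // placeholder path
}
```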
I know this is an older question, so hopefully it will help someone using a newer version of Spark.
Solution 2
I would say to just grab the underlying `RDD`. In Scala:

```scala
df.rdd.isEmpty
```

In Python:

```python
df.rdd.isEmpty()
```
That being said, all this does is call `take(1).length`, so it'll do the same thing as Rohan answered... just maybe slightly more explicit?
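For reference, the underlying `RDD.isEmpty` is itself a thin wrapper around `take(1)`; in Spark's source it looks roughly like this (paraphrased, version-dependent):

```scala
// From org.apache.spark.rdd.RDD (approximate): short-circuits when the RDD
// has no partitions, otherwise checks whether take(1) returned anything.
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
```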
Solution 3
I had the same question, and I tested three main solutions:

1. `(df != null) && (df.count > 0)`
2. `df.head(1).isEmpty` as @hulin003 suggests
3. `df.rdd.isEmpty()` as @Justin Pihony suggests

All three work, of course, but in terms of performance, here is what I found when executing these methods on the same DF on my machine (execution time):

1. takes ~9366 ms
2. takes ~5607 ms
3. takes ~1921 ms

Therefore, I think the best solution is `df.rdd.isEmpty()`, as @Justin Pihony suggests.
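If you want to reproduce this kind of comparison on your own data, here is a minimal timing sketch, assuming an already-loaded DataFrame `df`; absolute numbers will vary with data size, source format, caching, and cluster:

```scala
// Crude wall-clock timer for a single action; for real benchmarks,
// run several iterations and warm up the JVM first.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.0f ms")
  result
}

time("(df != null) && (df.count > 0)") { (df != null) && (df.count > 0) }
time("df.head(1).isEmpty")             { df.head(1).isEmpty }
time("df.rdd.isEmpty()")               { df.rdd.isEmpty() }
```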
Solution 4
Since Spark 2.4.0 there is `Dataset.isEmpty`.

Its implementation is:
```scala
def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }
```
Note that a `DataFrame` is no longer a class in Scala; it's just a type alias (probably changed with Spark 2.0):

```scala
type DataFrame = Dataset[Row]
```
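Usage is then a single method call; a minimal sketch, assuming Spark 2.4.0+ and an existing DataFrame `df`:

```scala
// isEmpty runs one action (limit(1) + count) and throws nothing on empty input.
if (!df.isEmpty) {
  // proceed only when there is at least one row
}
```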
Solution 5
You can take advantage of the `head()` (or `first()`) functions to see if the `DataFrame` has a single row. If so, it is not empty.
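Since `head()` and `first()` throw `java.util.NoSuchElementException` on an empty DataFrame (see Solution 1), one way to apply this idea without try/catch boilerplate is `scala.util.Try`; a minimal sketch, assuming an existing `df`:

```scala
import scala.util.Try

// first() throws java.util.NoSuchElementException when df is empty,
// so a failed Try means the DataFrame has no rows.
val dfIsEmpty = Try(df.first()).isFailure
```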
Comments
- auxdx (almost 2 years): Right now, I have to use `df.count > 0` to check if the `DataFrame` is empty or not. But it is kind of inefficient. Is there any better way to do that? PS: I want to check if it's empty so that I only save the `DataFrame` if it's not empty.
- architectonic (over 8 years): This is surprisingly slower than `df.count() == 0` in my case.
- Alok (about 8 years): Isn't converting to rdd a heavy task?
- Justin Pihony (about 8 years): Not really. RDDs are still the underpinning of everything Spark for the most part.
- FelixHo (almost 8 years): If the dataframe is empty it throws "java.util.NoSuchElementException: next on empty iterator" [Spark 1.3.1].
- Nandakishore (over 7 years): Don't convert the df to RDD. It slows down the process. If you convert, it will convert the whole DF to RDD and check if it's empty. Think: if the DF has millions of rows, it takes a lot of time just converting to RDD.
- TheM00s3 (over 7 years): If you run this on a massive dataframe with millions of records, that `count` method is going to take some time.
- Raul H (over 7 years): `.rdd` slows down the process a lot.
- LetsPlayYahtzee (about 7 years): Using `df.take(1)` when the df is empty results in getting back an empty ROW which cannot be compared with null.
- Vasile Surdu (about 7 years): I'm using `first()` instead of `take(1)` in a try/catch block and it works.
- Nandakishore (almost 7 years): @LetsPlayYahtzee I have updated the answer with the same run and a picture that shows the error. `take(1)` returns `Array[Row]`, and when the Array doesn't have any values, by default it gives ArrayOutOfBounds. So I don't think it gives an empty Row. I would say to observe this and change the vote.
- AntiPawn79 (over 6 years): For those using pyspark: isEmpty is not a thing. Do `len(d.head(1)) > 0` instead.
- Dan Ciborowski - MSFT (over 6 years): Why is this better than `df.rdd.isEmpty`?
- y2k-shubham (over 6 years): Won't it require the `schema` of the two dataframes (`sqlContext.emptyDataFrame` & `df`) to be the same in order to ever return `true`?
- Alper t. Turker (about 6 years): This won't work. `eq` is inherited from `AnyRef` and tests whether the argument (that) is a reference to the receiver object (this).
- Abdul Mannan (about 6 years): If you call rdd's `isEmpty` method on a big dataframe, it will slow it down extremely, especially when the dataframe isn't cached.
- Rakesh Sabbani (about 5 years): `df.head(1).isEmpty` is taking huge time; is there any other optimized solution for this?
- hulin003 (about 5 years): Hey @Rakesh Sabbani, if `df.head(1)` is taking a large amount of time, it's probably because your `df`'s execution plan is doing something complicated that prevents Spark from taking shortcuts. For example, if you are just reading from parquet files, `df = spark.read.parquet(...)`, I'm pretty sure Spark will only read one file partition. But if your `df` is doing other things like aggregations, you may be inadvertently forcing Spark to read and process a large portion, if not all, of your source data.
- Sandeep540 (over 4 years): isEmpty is slower than `df.head(1).isEmpty`.
- Beryllium (over 4 years): @Sandeep540 Really? Benchmark? Your proposal instantiates at least one row. The Spark implementation just transports a number. `head()` is using `limit()` as well, the `groupBy()` is not really doing anything, it is required to get a `RelationalGroupedDataset` which in turn provides `count()`. So that should not be significantly slower. It is probably faster in the case of a data set which contains a lot of columns (possibly denormalized nested data). Anyway, you have to type less :-)
- Vzzarr (over 4 years): Just reporting my experience, to AVOID: I was using `df.limit(1).count()` naively. On big datasets it takes much more time than the reported examples by @hulin003, which are almost instantaneous.
- jd2050 (about 4 years): A little remark on this solution: you should avoid using `df.head(1).isEmpty` OR `df.take(1).isEmpty` on dataframes with > 100 columns because it can cause `org.codehaus.janino.JaninoRuntimeException`.
- Pushpendra Jaiswal (almost 4 years): All these are bad options taking almost equal time.
- Jordan Morris (almost 4 years): @PushpendraJaiswal yes, and in a world of bad options, we should choose the best bad option.
- aiguofer (almost 4 years): Out of curiosity... what size DataFrames was this tested with?
- user2441441 (over 3 years): @hulin003 I'm using `df.take(1).isEmpty` based on your answer, but it takes a very long time (2 mins) for even a couple of hundred rows. Any help?
- Mark Rajcok (about 2 years): Beware: I am using `.option("mode", "DROPMALFORMED")` and `df.isEmpty` returned `false` whereas `df.head(1).isEmpty` returned the correct result of `true` because... all of the rows were malformed (someone upstream changed the schema on me).
- Glib Martynenko (almost 2 years): 'DataFrame' object has no attribute 'isEmpty'. Spark 3.0.
- Glib Martynenko (almost 2 years): I've tested 10 million rows... and got the same time as for `df.count()` or `df.rdd.isEmpty()`.
- ZygD (almost 2 years): In PySpark, it's introduced only from version 3.3.0.