How to find the size or shape of a DataFrame in PySpark?


Solution 1

You can get its shape with:

print((df.count(), len(df.columns)))

Solution 2

Use df.count() to get the number of rows.
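
A minimal sketch pairing the row count with the column count (assuming df is an existing DataFrame); note that count() runs a job that scans the data, while the column list is metadata and essentially free:

rows = df.count()        # runs a job that scans the data
cols = len(df.columns)   # metadata only, no scan
print((rows, cols))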

Solution 3

Add this to your code:

from pyspark.sql import DataFrame

def spark_shape(self):
    return (self.count(), len(self.columns))

# Attach as a method so every DataFrame gains .shape()
DataFrame.shape = spark_shape

Then you can do

>>> df.shape()
(10000, 10)

But keep in mind that .count() can be very slow for a very large table that has not been persisted.
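
If you need the shape more than once, persisting the DataFrame first can help avoid repeated full scans. A minimal sketch, assuming an active SparkSession (the df built with spark.range here is just a stand-in example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000)   # example DataFrame with a single "id" column

df.cache()                               # persist so later actions reuse the cached data
print((df.count(), len(df.columns)))     # (10000, 1)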

Solution 4

print((df.count(), len(df.columns)))

is easier for smaller datasets.

However, if the dataset is huge, an alternative approach is to use pandas with Arrow to convert the DataFrame to a pandas DataFrame and call shape on it:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)

Solution 5

I don't think there is a function similar to data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
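
Both expressions return the same column count, since dtypes has one (name, type) entry per column; columns simply reads more directly. A quick illustration, assuming a DataFrame named data:

row_number = data.count()
column_number = len(data.columns)   # same value as len(data.dtypes)
print((row_number, column_number))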

Author: Admin

Updated on July 08, 2022

Comments

  • Admin
    Admin almost 2 years

    I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

    In Python, I can do this:

    data.shape
    

    Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

    row_number = data.count()
    column_number = len(data.dtypes)
    

    The computation of the number of columns is not ideal...

  • JanLauGe
    JanLauGe almost 7 years
    that just gives you number of columns. What about number of rows?
  • ponadto
    ponadto about 4 years
    Isn't .toPandas an action? Meaning: isn't this going to collect the data to your master, and then call shape on it? If so, it would be inadvisable to do that, unless you're sure it's going to fit in master's memory.
  • Melkor.cz
    Melkor.cz over 3 years
    If the dataset is huge, collecting to Pandas is exactly what you do NOT want to do. Btw: Why do you enable cross join for this? And does the arrow configuration help collecting to pandas?
  • Purushothaman Srikanth
    Purushothaman Srikanth almost 3 years
    Will this work fine for larger datasets spread across nodes?
  • Artur
    Artur over 2 years
    this is exactly what @Louis Yang wrote 3 years back
  • THIS USER NEEDS HELP
    THIS USER NEEDS HELP about 2 years
    Why doesn't Pyspark Dataframe simply store the shape values like pandas dataframe does with .shape? Having to call count seems incredibly resource-intensive for such a common and simple operation.