Describe a DataFrame in PySpark


Solution 1

What stats do you need? Spark has a similar built-in feature:

file.summary().show()
+-------+----+
|summary|test|
+-------+----+
|  count|   3|
|   mean| 2.0|
| stddev| 1.0|
|    min|   1|
|    25%|   1|
|    50%|   2|
|    75%|   3|
|    max|   3|
+-------+----+
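
If you only need a subset of these statistics, summary() also accepts the desired statistics as string arguments, so the rest are not computed (a minimal sketch, run against the same file DataFrame as above):

# Request only specific statistics; percentiles are passed as strings like '25%'
file.summary('count', 'min', '25%', '75%', 'max').show()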

Solution 2

In Spark you can use df.describe() or df.summary() to check statistical information.

The difference is that df.summary() returns the same information as df.describe() plus quartile information (25%, 50% and 75%).
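
describe() can also be limited to particular columns by passing their names; a minimal sketch, assuming the test column from Solution 1:

# Basic statistics (count, mean, stddev, min, max) for the selected column only
file.describe('test').show()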

If you want to exclude string columns, you can use a list comprehension over df.dtypes, which returns a list of ('column_name', 'column_type') tuples, filter out the string-typed entries, and pass the remaining column names to df.select().

Command example:

df.select([col[0] for col in df.dtypes if col[1] != 'string']).describe().show()
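
As a self-contained sketch of the same idea (the DataFrame and column names here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a mix of numeric and string columns
df = spark.createDataFrame(
    [(1, 10.0, 'a'), (2, 20.0, 'b'), (3, 30.0, 'c')],
    ['id', 'value', 'label'],
)

# df.dtypes -> [('id', 'bigint'), ('value', 'double'), ('label', 'string')]
numeric_cols = [col for col, dtype in df.dtypes if dtype != 'string']

# Describe only the non-string columns
df.select(numeric_cols).describe().show()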

Comments

  • Tokyo, almost 2 years ago

    I have a fairly large Parquet file, which I am loading using:

    file = spark.read.parquet('hdfs/directory/test.parquet')
    

    Now I want to get some statistics (similar to the pandas describe() function). What I tried was:

    file_pd = file.toPandas()
    file_pd.describe()
    

    but obviously this requires loading all the data into memory, and it will fail. Can anyone suggest a workaround?