PySpark: Is there an equivalent method to pandas info()?
Solution 1
There is also a summary method, which returns the row count and some other descriptive statistics. It is similar to the describe method already mentioned.
From the PySpark manual:
df.summary().show()
+-------+------------------+-----+
|summary| age| name|
+-------+------------------+-----+
| count| 2| 2|
| mean| 3.5| null|
| stddev|2.1213203435596424| null|
| min| 2|Alice|
| 25%| 2| null|
| 50%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+------------------+-----+
or
df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
| count| 2| 2|
+-------+---+----+
Solution 2
To get type information about a data frame, you could try df.schema:
spark.read.csv('matchCount.csv',header=True).schema
StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))
For summary statistics, you could also have a look at the describe method in the documentation.
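For the row and column counts the question asks about, the schema can be paired with a row count. The following is a minimal sketch rather than a built-in API: `shape_and_types` is a hypothetical helper name, and it relies only on the `count()`, `columns`, and `dtypes` members of a PySpark DataFrame.

```python
def shape_and_types(df):
    """Return ((n_rows, n_cols), {column: dtype}) for a PySpark DataFrame."""
    n_rows = df.count()        # triggers a Spark job and scans the data
    n_cols = len(df.columns)   # read from the schema, no job needed
    # df.dtypes is a list of (column name, type string) pairs
    return (n_rows, n_cols), dict(df.dtypes)


# Possible usage, assuming an active SparkSession named `spark`:
# shape, types = shape_and_types(spark.read.csv('matchCount.csv', header=True))
```

Note that `count()` scans the whole dataset, so on large inputs it may be worth caching the data frame first if it will be reused.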
Solution 3
I could not find a good answer, so I use this slight cheat, which only works when the data fits in driver memory:
dataFrame.toPandas().info()
Solution 4
The following counts the NaN and null values in each column.
from pyspark.sql.functions import isnan, when, count, col
import numpy as np
df = spark.createDataFrame(
[(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
('session', "timestamp1", "id2"))
df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# | 1| 1|null|
# | 1| 2| 5.0|
# | 1| 3| NaN|
# | 1| 4|null|
# | 1| 5|10.0|
# | 1| 6| NaN|
# | 1| 6| NaN|
# +-------+----------+----+
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# | 0| 0| 3|
# +-------+----------+---+
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# | 0| 0| 5|
# +-------+----------+---+
df.describe().show()
# +-------+-------+------------------+---+
# |summary|session| timestamp1|id2|
# +-------+-------+------------------+---+
# | count| 7| 7| 5|
# | mean| 1.0| 3.857142857142857|NaN|
# | stddev| 0.0|1.9518001458970662|NaN|
# | min| 1| 1|5.0|
# | max| 1| 6|NaN|
# +-------+-------+------------------+---+
There is no equivalent to pandas.DataFrame.info() that I know of. printSchema() is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() I often look at the null values.
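The column types, row count, and null counts can be gathered in a single aggregation to approximate what info() reports. This is only a sketch under stated assumptions: `format_info` and `spark_info` are hypothetical helper names, not part of PySpark, and NaN is counted as missing for float and double columns to mirror pandas.

```python
from typing import List, Tuple


def format_info(n_rows: int, columns: List[Tuple[str, str, int]]) -> str:
    """Render (name, dtype, non_null_count) tuples as an info()-style report."""
    lines = [f"rows: {n_rows}, columns: {len(columns)}"]
    width = max((len(name) for name, _, _ in columns), default=0)
    for name, dtype, non_null in columns:
        lines.append(f"{name:<{width}}  {non_null} non-null  {dtype}")
    return "\n".join(lines)


def spark_info(df) -> str:
    """Gather row count, dtypes and non-null counts in one aggregation."""
    # Imported lazily so format_info can be used without Spark installed.
    from pyspark.sql import functions as F

    exprs = [F.count(F.lit(1))]  # position 0: total row count
    for name, dtype in df.dtypes:
        c = F.col(f"`{name}`")  # backticks guard against dots in column names
        valid = c.isNotNull()
        if dtype in ("float", "double"):
            # Assumption: NaN is treated as missing, as pandas does.
            valid = valid & ~F.isnan(c)
        # count() skips nulls, and when() without otherwise() yields null
        exprs.append(F.count(F.when(valid, 1)))
    row = df.agg(*exprs).first()
    columns = [(name, dtype, row[i + 1])
               for i, (name, dtype) in enumerate(df.dtypes)]
    return format_info(row[0], columns)
```

Calling `print(spark_info(df))` on the data frame above would run one Spark job instead of one per column. Memory usage is not covered, since Spark does not expose it through the DataFrame API in the way pandas does.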
Brian Waters
Updated on July 12, 2022
Is there an equivalent to the pandas info() method in PySpark?
I am trying to get basic statistics about a dataframe in PySpark, such as the number of columns and rows, the number of nulls, and the size of the dataframe.
The info() method in pandas provides all of these statistics.