Pyspark: display a spark data frame in a table format

Solution 1

The show method does what you're looking for.

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)

which yields:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+
only showing top 2 rows

Solution 2

As @Brent mentioned in a comment on @maxymoo's answer, you can try

df.limit(10).toPandas()

to get a prettier table in Jupyter. But this can take some time to run if you are not caching the Spark DataFrame beforehand. Also, .limit() will not preserve the order of the original Spark DataFrame.
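
For example, caching the DataFrame first avoids recomputing it on each preview, and an explicit orderBy makes the preview deterministic. This is only a minimal sketch; the DataFrame df and the column name k are borrowed from the example in Solution 1:

# Cache so that repeated previews don't recompute the full lineage
df.cache()

# limit() gives no ordering guarantee, so sort explicitly if row order matters
df.orderBy("k").limit(10).toPandas()

In a notebook, leaving the resulting pandas DataFrame as the last expression in a cell renders it as an HTML table.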

Solution 3

Let's say we have the following Spark DataFrame:

df = sqlContext.createDataFrame(
    [
        (1, "Mark", "Brown"), 
        (2, "Tom", "Anderson"), 
        (3, "Joshua", "Peterson")
    ], 
    ('id', 'firstName', 'lastName')
)

There are typically three ways to print the contents of the dataframe:

Print Spark DataFrame

The most common way is to use the show() function:

>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Print Spark DataFrame vertically

Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically. For example, the following command will print the top two rows, vertically, without any truncation.

>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
 id        | 1        
 firstName | Mark     
 lastName  | Brown    
-RECORD 1-------------
 id        | 2        
 firstName | Tom      
 lastName  | Anderson 
only showing top 2 rows

Convert to Pandas and print Pandas DataFrame

Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.

>>> df_pd = df.toPandas()
>>> print(df_pd)
   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

Note that this is not recommended for fairly large dataframes, as Pandas needs to load all the data into the driver's memory. If that is the case, the following configuration (which enables Apache Arrow) will speed up the conversion of a large Spark DataFrame to a pandas one:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

For more details, you can refer to my blog post, Speeding up the conversion between PySpark and Pandas DataFrames.
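
Putting the pieces together, here is a minimal sketch of the Arrow-enabled conversion (assuming an active SparkSession named spark; the sample data simply mirrors the DataFrame above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar transfers for toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(
    [(1, "Mark", "Brown"), (2, "Tom", "Anderson"), (3, "Joshua", "Peterson")],
    ("id", "firstName", "lastName"),
)

# The conversion now goes through Arrow, which is considerably faster for large data
df_pd = df.toPandas()
print(df_pd)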

Solution 4

Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!
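
For example (a small sketch, reusing the df from Solution 3):

pdf = df.toPandas()

print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>
print(pdf.head())  # prints in the familiar pandas table layout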

Solution 5

If you are using Jupyter, this is what worked for me:

[1] df = spark.read.parquet("s3://df/*")

[2] dsp = df

[3] %%display dsp

This shows a well-formatted HTML table; you can also draw some simple charts on it straight away. For more documentation on %%display, type %%help.
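
If the %%display magic is not available in your environment, a simple alternative (just a sketch) is to convert a small slice to pandas and render it with IPython's display(), which Jupyter also shows as an HTML table:

from IPython.display import display

# Convert only a small slice so the driver's memory isn't overloaded;
# Jupyter renders the pandas DataFrame as an HTML table
display(df.limit(10).toPandas())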

Comments

  • Edamame over 2 years

    I am using pyspark to read a parquet file like below:

    my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
    

    Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame.

    Is it possible to display the data frame in a table format like pandas data frame? Thanks!

  • Edamame over 7 years
    I tried to do: my_df.toPandas().head(). But got the error: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 301 in stage 2.0 failed 1 times, most recent failure: Lost task 301.0 in stage 2.0 (TID 1871, localhost): java.lang.OutOfMemoryError: Java heap space
  • David Arenburg about 7 years
    This is dangerous as this will collect the whole data frame into a single node.
  • deepelement over 6 years
    It should be emphasized that this will quickly cap out memory in traditional Spark RDD scenarios.
  • Brent about 6 years
    It should be used with a limit, like this df.limit(10).toPandas() to protect from OOMs
  • WestCoastProjects about 6 years
    It is v primitive vs pandas: e.g. for wrapping it does not allow horizontal scrolling
  • M PAUL almost 6 years
    Using .toPandas(), I am getting the following error: An error occurred while calling o86.get. : java.util.NoSuchElementException: spark.sql.execution.pandas.respectSessionTimeZone How do I deal with this?
  • technazi over 5 years
    There should be a method like fromPandas.
  • Giorgos Myrianthous over 3 years
    If you are using toPandas() consider enabling PyArrow optimisations: medium.com/@giorgosmyrianthous/…
  • sotmot about 3 years
    Thank you for the answer! But, the link seems to be broken.
  • eddies about 3 years
    Thanks for the heads up. Updated the link to point to the new docs location
  • bhargav3vedi about 2 years
    display is not a function; PySpark provides functions like head, tail, and show to display a data frame.
  • Admin almost 2 years
    Please re-read the question. The answer serves it very well.