'GroupedData' object has no attribute 'show' when doing doing pivot in spark dataframe

35,307

Solution 1

The pivot() method returns a GroupedData object, just like groupBy(). You cannot use show() on a GroupedData object without using an aggregate function (such as sum() or even count()) on it before.

See this article for more information

Solution 2

Let's create some test data that resembles your dataset:

data = [
    ("123", "McDonalds"),
    ("123", "Starbucks"),
    ("123", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "Dunkin")
]
df = spark.createDataFrame(data, ["customer_id", "name"])
df.show()
+-----------+---------+
|customer_id|     name|
+-----------+---------+
|        123|McDonalds|
|        123|Starbucks|
|        123|McDonalds|
|        777|McDonalds|
|        777|McDonalds|
|        777|   Dunkin|
+-----------+---------+

Let's pivot the dataset so the customer_ids are columns:

df.groupBy("name").pivot("customer_id").count().show()

+---------+----+----+
|     name| 123| 777|
+---------+----+----+
|McDonalds|   2|   2|
|Starbucks|   1|null|
|   Dunkin|null|   1|
+---------+----+----+

Now let's pivot the DataFrame so the restaurant names are columns:

df.groupBy("customer_id").pivot("name").count().show()

+-----------+------+---------+---------+
|customer_id|Dunkin|McDonalds|Starbucks|
+-----------+------+---------+---------+
|        777|     1|        2|     null|
|        123|  null|        2|        1|
+-----------+------+---------+---------+

Code like df.groupBy("name").show() errors out with the AttributeError: 'GroupedData' object has no attribute 'show' message. You can only call methods defined in the pyspark.sql.GroupedData class on instances of the GroupedData class.

Share:
35,307
Nabih Bawazir
Author by

Nabih Bawazir

Data Scientist at ADA

Updated on July 27, 2020

Comments

  • Nabih Bawazir
    Nabih Bawazir almost 4 years

    I want to pivot a spark dataframe, I refer pyspark documentation, and based on pivot function, the clue is .groupBy('name').pivot('name', values=None). Here's my dataset,

     In[75]:  spDF.show()
     Out[75]:
    
    +-----------+-----------+
    |customer_id|       name|
    +-----------+-----------+
    |      25620| MCDonnalds|
    |      25620|  STARBUCKS|
    |      25620|        nan|
    |      25620|        nan|
    |      25620| MCDonnalds|
    |      25620|        nan|
    |      25620| MCDonnalds|
    |      25620|DUNKINDONUT|
    |      25620|   LOTTERIA|
    |      25620|        nan|
    |      25620| MCDonnalds|
    |      25620|DUNKINDONUT|
    |      25620|DUNKINDONUT|
    |      25620|        nan|
    |      25620|        nan|
    |      25620|        nan|
    |      25620|        nan|
    |      25620|   LOTTERIA|
    |      25620|   LOTTERIA|
    |      25620|  STARBUCKS|
    +-----------+-----------+
    only showing top 20 rows
    

    And then I try to di pivot the table name

    In [96]:
    spDF.groupBy('name').pivot('name', values=None)
    Out[96]:
    <pyspark.sql.group.GroupedData at 0x7f0ad03750f0>
    

    And when I try to show them

    In [98]:
    spDF.groupBy('name').pivot('name', values=None).show()
    Out [98]:
    
        ---------------------------------------------------------------------------
    AttributeError                       Traceback (most recent call last)
    <ipython-input-98-94354082e956> in <module>()
    ----> 1 spDF.groupBy('name').pivot('name', values=None).show()
    AttributeError: 'GroupedData' object has no attribute 'show'
    

    I don't know why 'GroupedData' can't be shown, what should I do to solve the issue?

  • Nabih Bawazir
    Nabih Bawazir over 5 years
    Thanks for the explanation, upvoted, still waiting for best answer
  • ech0
    ech0 over 5 years
    So what is the best answer that you are looking for? This is the only answer as to why you can't show it. The object that is being returned in fact does not have the attribute show and that is why the error is prompted. Think of it this way, lets say that you create a class and create a method for that class. Lets call this method getName. Now, try calling that method on a different type of object, for example, a string. It won't work, because the string doesn't have the method. That IS the only and best explanation.
  • ech0
    ech0 over 5 years
    Or were you hoping for me to type the code that you can just copy-paste in and get it working?
  • Nabih Bawazir
    Nabih Bawazir over 5 years
    working code is the best answer, still reading the article, basically I only give best answer when my problem comp-letely solved
  • ech0
    ech0 over 5 years
    That is actually incorrect. The best answer points you to the direction. The easiest answer will give you the code off the bat. Now that you know what the problem is (HINT: you have to use an aggregate function) you can learn by solving this and not ever get this problem again. If I give you the functioning code off the bat you have no idea whats going on and will keep on tripping over this, as this is a quite common thing.
  • Nabih Bawazir
    Nabih Bawazir over 5 years
    @cromaerror, Agree with your point, The best answer points you to the direction to write the working code, best answer by clue will be given if the questioner can apply the advice, still try the code from the article
  • Yamur
    Yamur over 4 years
    Could you please add an example? Thanks
  • Leyth G
    Leyth G about 4 years
    Do a .agg function on the grouped data and take what you want from that df. For example, .agg(max("userid")).select("userId").show().