'GroupedData' object has no attribute 'show' when doing doing pivot in spark dataframe
Solution 1
The pivot()
method returns a GroupedData
object, just like groupBy()
. You cannot use show()
on a GroupedData
object without using an aggregate function (such as sum()
or even count()
) on it before.
See this article for more information
Solution 2
Let's create some test data that resembles your dataset:
data = [
("123", "McDonalds"),
("123", "Starbucks"),
("123", "McDonalds"),
("777", "McDonalds"),
("777", "McDonalds"),
("777", "Dunkin")
]
df = spark.createDataFrame(data, ["customer_id", "name"])
df.show()
+-----------+---------+
|customer_id| name|
+-----------+---------+
| 123|McDonalds|
| 123|Starbucks|
| 123|McDonalds|
| 777|McDonalds|
| 777|McDonalds|
| 777| Dunkin|
+-----------+---------+
Let's pivot the dataset so the customer_ids are columns:
df.groupBy("name").pivot("customer_id").count().show()
+---------+----+----+
| name| 123| 777|
+---------+----+----+
|McDonalds| 2| 2|
|Starbucks| 1|null|
| Dunkin|null| 1|
+---------+----+----+
Now let's pivot the DataFrame so the restaurant names are columns:
df.groupBy("customer_id").pivot("name").count().show()
+-----------+------+---------+---------+
|customer_id|Dunkin|McDonalds|Starbucks|
+-----------+------+---------+---------+
| 777| 1| 2| null|
| 123| null| 2| 1|
+-----------+------+---------+---------+
Code like df.groupBy("name").show()
errors out with the AttributeError: 'GroupedData' object has no attribute 'show'
message. You can only call methods defined in the pyspark.sql.GroupedData
class on instances of the GroupedData
class.
Comments
-
Nabih Bawazir almost 4 years
I want to pivot a spark dataframe, I refer pyspark documentation, and based on
pivot
function, the clue is.groupBy('name').pivot('name', values=None)
. Here's my dataset,In[75]: spDF.show() Out[75]: +-----------+-----------+ |customer_id| name| +-----------+-----------+ | 25620| MCDonnalds| | 25620| STARBUCKS| | 25620| nan| | 25620| nan| | 25620| MCDonnalds| | 25620| nan| | 25620| MCDonnalds| | 25620|DUNKINDONUT| | 25620| LOTTERIA| | 25620| nan| | 25620| MCDonnalds| | 25620|DUNKINDONUT| | 25620|DUNKINDONUT| | 25620| nan| | 25620| nan| | 25620| nan| | 25620| nan| | 25620| LOTTERIA| | 25620| LOTTERIA| | 25620| STARBUCKS| +-----------+-----------+ only showing top 20 rows
And then I try to di pivot the table name
In [96]: spDF.groupBy('name').pivot('name', values=None) Out[96]: <pyspark.sql.group.GroupedData at 0x7f0ad03750f0>
And when I try to show them
In [98]: spDF.groupBy('name').pivot('name', values=None).show() Out [98]: --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-98-94354082e956> in <module>() ----> 1 spDF.groupBy('name').pivot('name', values=None).show() AttributeError: 'GroupedData' object has no attribute 'show'
I don't know why
'GroupedData'
can't be shown, what should I do to solve the issue? -
Nabih Bawazir over 5 yearsThanks for the explanation, upvoted, still waiting for best answer
-
ech0 over 5 yearsSo what is the best answer that you are looking for? This is the only answer as to why you can't show it. The object that is being returned in fact does not have the attribute
show
and that is why the error is prompted. Think of it this way, lets say that you create a class and create a method for that class. Lets call this methodgetName
. Now, try calling that method on a different type of object, for example, a string. It won't work, because the string doesn't have the method. That IS the only and best explanation. -
ech0 over 5 yearsOr were you hoping for me to type the code that you can just copy-paste in and get it working?
-
Nabih Bawazir over 5 yearsworking code is the best answer, still reading the article, basically I only give best answer when my problem comp-letely solved
-
ech0 over 5 yearsThat is actually incorrect. The best answer points you to the direction. The easiest answer will give you the code off the bat. Now that you know what the problem is (HINT: you have to use an aggregate function) you can learn by solving this and not ever get this problem again. If I give you the functioning code off the bat you have no idea whats going on and will keep on tripping over this, as this is a quite common thing.
-
Nabih Bawazir over 5 years@cromaerror, Agree with your point, The best answer points you to the direction to write the working code, best answer by clue will be given if the questioner can apply the advice, still try the code from the article
-
Yamur over 4 yearsCould you please add an example? Thanks
-
Leyth G about 4 yearsDo a .agg function on the grouped data and take what you want from that df. For example, .agg(max("userid")).select("userId").show().