Column is not iterable in pySpark


You're using the wrong `sum`: Python's builtin instead of the one from `pyspark.sql.functions`:

from pyspark.sql.functions import sum

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)

In practice you'll probably want an alias or a module import:

from pyspark.sql.functions import sum as sql_sum

# or

import pyspark.sql.functions as F
F.sum(...)
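The builtin `sum` fails because it tries to iterate its argument, and `Column.__iter__` raises exactly the error in the traceback. A PySpark-free sketch of the mechanism (the `Column` class below is a minimal stand-in, not the real one):

```python
class Column:
    # Minimal stand-in for pyspark.sql.Column; the real class likewise
    # defines __iter__ to raise this exact TypeError.
    def __iter__(self):
        raise TypeError("Column is not iterable")

try:
    total = sum(Column())  # builtin sum() iterates its argument
except TypeError as e:
    print(e)  # prints: Column is not iterable
```

Importing `sum` from `pyspark.sql.functions` (or using `F.sum`) replaces the builtin in that namespace with a function that returns a `Column`, which supports `.over(...)`.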
Author: toddysm

Cloud computing, Open Source, Azure. Interested in making people's lives easier. Read more about me on my blog.

Updated on July 29, 2022

Comments

  • toddysm
    toddysm over 1 year

    So, we are a bit puzzled. In Jupyter Notebook we have the following data frame:

    +--------------------+--------------+-------------+--------------------+--------+-------------------+ 
    |          created_at|created_at_int|  screen_name|            hashtags|ht_count|     single_hashtag|
    +--------------------+--------------+-------------+--------------------+--------+-------------------+
    |2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|         containers|
    |2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|               cool|
    |2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|         automation|
    |2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|             future|
    |2017-03-05 00:00:...|    1488672002|    IBMDevOps|            [DevOps]|       1|             devops|
    |2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
    |2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
    |2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|             leader|
    |2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
    |2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|managedfiletransfer|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|         asaservice|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|   interconnect2017|
    |2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|                hmi|
    |2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|              cloud|
    |2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|managedfiletransfer|
    |2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|         asaservice|
    |2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|   interconnect2017|
    +--------------------+--------------+-------------+--------------------+--------+-------------------+
    only showing top 20 rows
    
    root
     |-- created_at: timestamp (nullable = true)
     |-- created_at_int: integer (nullable = true)
     |-- screen_name: string (nullable = true)
     |-- hashtags: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- ht_count: integer (nullable = true)
     |-- single_hashtag: string (nullable = true)
    

    We are trying to get the count of hashtags per hour. The approach we are taking is to use Window to partition by single_hashtag. Something like this:

    # create WindowSpec
    from pyspark.sql import Window

    hashtags_24_winspec = Window.partitionBy(hashtags_24.single_hashtag) \
        .orderBy(hashtags_24.created_at_int) \
        .rangeBetween(-3600, 3600)
    

    However, when we try to do the sum of the ht_count column using:

    sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)
    

    we get the following error:

    Column is not iterable
    Traceback (most recent call last):
      File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 240, in __iter__
        raise TypeError("Column is not iterable")
    TypeError: Column is not iterable
    

    The error message is not very informative, and we are puzzled as to which column exactly to investigate. Any ideas?

  • toddysm
    toddysm about 7 years
    Wow, that was a silly mistake. Thanks for pointing out the obvious!