How to convert rows into a list of dictionaries in pyspark?

Solution 1

I think you can try row.asDict(). This code runs directly on the executors, so you don't have to collect the data on the driver.

Something like:

df.rdd.map(lambda row: row.asDict())
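
For example (a minimal sketch, not part of the original answer; the DataFrame below is hypothetical sample data mirroring the question's table), you could build the DataFrame and materialize the mapped result locally with collect():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the question's two columns
df = spark.createDataFrame(
    [("person1", ["google", "msn", "yahoo"]),
     ("person2", ["fb.com", "airbnb", "wired.com"]),
     ("person3", ["fb.com", "google.com"])],
    ["Name", "URL visited"],
)

# asDict() runs on the executors; collect() only brings the resulting dicts to the driver
list_of_dicts = df.rdd.map(lambda row: row.asDict()).collect()
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']}, ...]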

Solution 2

How about using the pyspark Row.asDict() method? Combined with DataFrame.collect(), this stays within the DataFrame API (which I understand is the "recommended" API at the time of writing) and does not require you to use the RDD API at all.

df_list_of_dict = [row.asDict() for row in df.collect()]

type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)

df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
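
If you want the {Name: URLs} shape shown in the question (a small tweak, not part of the original answer, assuming the column names Name and URL visited), key each dict by the name instead:

df_list_of_dict = [{row['Name']: row['URL visited']} for row in df.collect()]
#[{'person1': ['google', 'msn', 'yahoo']},
# {'person2': ['fb.com', 'airbnb', 'wired.com']},
# {'person3': ['fb.com', 'google.com']}]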

Solution 3

If you want your results in a Python dictionary, you can use collect()¹ to bring the data into local memory and then massage the output as desired.

First collect the data:

df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]

This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:

df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]

¹ Be advised that for large data sets this operation can be slow and may fail with an out-of-memory error. Consider first whether this is really what you want to do, since bringing the data into local memory forfeits the parallelization benefits of Spark.
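
Not part of the original answer, but if collect() is too memory-hungry, DataFrame.toLocalIterator() is one way to stream the rows to the driver one partition at a time instead (a sketch under the same column-name assumptions):

# Pulls one partition at a time to the driver rather than the whole result set
for row in df.toLocalIterator():
    d = {row['Name']: row['URL visited']}
    # process each small dict here instead of keeping the full list in memory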

Solution 4

Given:

+++++++++++++++++++++++++++++++++++++++++++
|  Name    |    URL visited               |
+++++++++++++++++++++++++++++++++++++++++++
|  person1 | [google,msn,yahoo]           |
|  person2 | [fb.com,airbnb,wired.com]    |
|  person3 | [fb.com,google.com]          |
+++++++++++++++++++++++++++++++++++++++++++

This should work:

df_dict = df \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()

df_dict

#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]

This way you just collect after processing.
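
If you would rather end up with a single dictionary keyed by name instead of a list of one-entry dicts (not part of the original answer, a sketch assuming the same two columns), collectAsMap() on a pair RDD does that in one step:

url_by_name = df.rdd.map(lambda row: (row[0], row[1])).collectAsMap()
#{'person1': ['google', 'msn', 'yahoo'],
# 'person2': ['fb.com', 'airbnb', 'wired.com'],
# 'person3': ['fb.com', 'google.com']}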

Please, let me know if that works for you :)

Comments

  • user8946942 almost 3 years

    I have a DataFrame(df) in pyspark, by reading from a hive table:

    df=spark.sql('select * from <table_name>')
    
    
    +++++++++++++++++++++++++++++++++++++++++++
    |  Name    |    URL visited               |
    +++++++++++++++++++++++++++++++++++++++++++
    |  person1 | [google,msn,yahoo]           |
    |  person2 | [fb.com,airbnb,wired.com]    |
    |  person3 | [fb.com,google.com]          |
    +++++++++++++++++++++++++++++++++++++++++++
    

    When I tried the following, I got an error:

    df_dict = dict(zip(df['name'],df['url']))
    "TypeError: zip argument #1 must support iteration."
    

    type(df.name) is 'pyspark.sql.column.Column'

    How do I create a dictionary like the following, which can be iterated over later on?

    {'person1':'google','msn','yahoo'}
    {'person2':'fb.com','airbnb','wired.com'}
    {'person3':'fb.com','google.com'}
    

    Appreciate your thoughts and help.

  • pault almost 5 years

    Note this will produce rows of the form {"Name": "person1", "URL Visited": ["google","msn","yahoo"]}, which is not the output that the OP asked for, though it's quite easy to change the map function to address this.