How to select multiple non-contiguous columns from a list into another dataframe in Python

10,424

For example like this:

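# `sc` is the SparkContext that the PySpark shell creates.
rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]  # zero-based positions of the columns to keep

df = rdd.toDF()
# Look up the column names by position and select only those columns.
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()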

##  +---+---+
##  | _1| _4|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+

or like this:

# Keep only the wanted fields of each row before building the DataFrame.
df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()

##  +---+---+
##  | _1| _4|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+
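
The per-row indexing used in the second approach is ordinary Python, so it can be sketched without Spark. The snippet below is only an illustration: `rows` is a plain list of tuples standing in for the RDD's rows, and `picked_rows` and `picked_columns` are hypothetical names. It shows that indexing the outer list picks whole rows, while indexing inside each row picks columns.

```python
# Plain-Python stand-in for the rows of the RDD above (no Spark needed).
rows = [("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")]
columns_num = [0, 3]

# Indexing the outer list selects whole rows, not columns.
picked_rows = [rows[i] for i in [0, 1]]

# Indexing inside each row selects the wanted columns from every row.
picked_columns = [[row[i] for i in columns_num] for row in rows]

print(picked_rows)
print(picked_columns)
```

The inner comprehension is the same expression that the `map` call applies to each row on the cluster.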

On a side note, you should never collect data just to reshape it, because collect() pulls the entire dataset onto the driver. In the best case it will be slow; in the worst case it will simply crash.

Author by

Jason Donnald

Updated on June 12, 2022

Comments

  • Jason Donnald
    Jason Donnald almost 2 years

    I am working with IPython and Spark, and I have an RDD from which I build a list. From this list I want to build a dataframe that contains several columns of the parent list, but these columns are not contiguous. I wrote the following, but it does not work as intended:

    list1 = rdd.collect()
    columns_num = [1,8,11,17,21,24]
    df2 = [list1[i] for i in columns_num]
    

    The above code selects only 6 rows from the parent list, with only the column 1 data, and forms the new dataframe from those 6 rows of column 1 data.

    How can I form a new dataframe with multiple non-contiguous columns from another list?

  • zero323
    zero323 over 8 years
    I am glad to hear that, and thanks for accepting the answer. Still, I would be grateful if you could provide an example that fails with the first approach, if you have time :)
  • Jason Donnald
    Jason Donnald over 8 years
    There's one thing I noticed: this does not form a pandas dataframe, right? How can I form a pandas dataframe?
  • zero323
    zero323 over 8 years
    No, it doesn't. Do you want a local data structure? You can simply call toPandas() on a data frame.