How to select multiple non-contiguous columns from a list into another dataframe in Python

python apache-spark apache-spark-sql pyspark

10,424

For example like this:

rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]

df = rdd.toDF()
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()

##  +---+---+
##  | _1| _4|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+

or like this:

df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()

##  +---+---+
##  | _1| _2|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+
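
The first snippet works because `df.columns` is an ordinary Python list of names, so each position in `columns_num` is simply looked up in it before `select` is called. A minimal sketch of that lookup in plain Python, assuming the default `_1`, `_2`, ... names that `toDF()` gives a five-field tuple (the `default_names` and `selected_names` variables are made up for illustration):

```python
# Stand-in for df.columns on a DataFrame built from five-field tuples
# without an explicit schema (assumed default names: _1 ... _5).
default_names = ["_%d" % (i + 1) for i in range(5)]

# Zero-based positions to keep, as in the answer above.
columns_num = [0, 3]

# Same lookup as (df.columns[i] for i in columns_num)
selected_names = [default_names[i] for i in columns_num]

print(selected_names)  # ['_1', '_4']
```

Because the positions are zero-based, position 3 refers to the fourth column, `_4`.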

On a side note, you should never collect data just to reshape it, because `collect()` pulls every row onto the driver. In the best case it will be slow; in the worst case it will simply crash.
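
Collecting also makes it easy to index the wrong axis. `collect()` returns a list of rows, so indexing that list by position picks rows, not columns. A small plain-Python sketch of the difference, using made-up sample rows in place of a real collected RDD:

```python
# Hypothetical stand-in for rdd.collect(): a list with one tuple per row.
list1 = [
    ("a", 1, 2, 4.0, "foo"),
    ("b", 3, 4, 5.0, "bar"),
    ("c", 5, 6, 7.0, "baz"),
]
columns_num = [0, 2]

# Indexing the outer list selects whole rows 0 and 2.
picked_rows = [list1[i] for i in columns_num]

# Indexing inside each row selects columns 0 and 2 for every row.
picked_columns = [[row[i] for i in columns_num] for row in list1]

print(picked_rows)
print(picked_columns)
```

The second comprehension is the local equivalent of the `rdd.map(...)` approach above, which applies the same per-row selection without bringing the data to the driver.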

Author by

Jason Donnald

Updated on June 12, 2022

Comments

Jason Donnald almost 2 years
I am working with IPython and Spark, and I have an RDD from which I form a list. From this list I want to form a dataframe that contains multiple columns from the parent list, but these columns are not contiguous. I wrote this, but it does not seem to work correctly:
```
list1 = rdd.collect()
columns_num = [1,8,11,17,21,24]
df2 = [list[i] for i in columns_num]
```
The code above selects only 6 rows from the parent list, each with only column 1 data, and forms the new dataframe from those 6 rows.

How can I form a new dataframe with multiple non-contiguous columns from another list?
zero323 over 8 years

I am glad to hear that, and thanks for accepting the answer. Still, I would be grateful if you could provide an example which fails with the first approach, if you have time :)
Jason Donnald over 8 years

There's one thing I noticed. This is not forming a pandas dataframe, right? How can I form a pandas dataframe?
zero323 over 8 years

No, it doesn't. You want a local data structure? You can simply call `toPandas()` on a data frame.
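
Once the data is local, the same positional selection can be done in pandas with `iloc`. A minimal sketch, assuming pandas is installed. The frame here is built by hand as a stand-in for what `toPandas()` would return, and the `pdf` and `selected` names are made up for illustration:

```python
import pandas as pd

# Hand-built stand-in for the result of df.toPandas() (assumption for
# illustration; a real call needs a running Spark session).
pdf = pd.DataFrame(
    [("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")],
    columns=["_1", "_2", "_3", "_4", "_5"],
)

# Zero-based positions of the non-contiguous columns to keep.
columns_num = [0, 3]

# iloc[:, positions] keeps all rows and only the listed column positions.
selected = pdf.iloc[:, columns_num]

print(selected)
```

Since `toPandas()` brings the whole data frame to the driver, it is usually better to select the columns in Spark first and convert only the reduced result.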