How to select multiple non-contiguous columns from a list into another dataframe in Python
For example, like this:
rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]
df = rdd.toDF()
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()
## +---+---+
## | _1| _4|
## +---+---+
## |  a|4.0|
## |  b|5.0|
## +---+---+
or like this:
df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()
## +---+---+
## | _1| _4|
## +---+---+
## |  a|4.0|
## |  b|5.0|
## +---+---+
On a side note, you should never collect data just to reshape it. In the best case it will be slow; in the worst case it will simply crash.
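If the positional selection from the first approach is needed in several places, it could be wrapped in a small helper. This is only a sketch: `select_by_position` is a hypothetical name, not part of PySpark, and it assumes nothing beyond the `df.columns` and `df.select` calls already used above. The range check is an addition for illustration.

```python
def select_by_position(df, positions):
    """Return a new DataFrame holding the columns at the given 0-based positions.

    Hypothetical helper: it relies only on `df.columns` and `df.select`.
    """
    columns = df.columns  # column names, in positional order
    out_of_range = [i for i in positions if not 0 <= i < len(columns)]
    if out_of_range:
        raise IndexError(f"column positions out of range: {out_of_range}")
    # The selection is expressed on the DataFrame itself,
    # so no rows are collected to the driver.
    return df.select(*[columns[i] for i in positions])
```

With the indices from the question, the call would look like `df2 = select_by_position(rdd.toDF(), [1, 8, 11, 17, 21, 24])`, assuming the rows have at least 25 fields.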
Author: Jason Donnald
Updated on June 12, 2022
Comments
-
Jason Donnald almost 2 years
I am working in IPython with Spark, and I have an RDD from which I form a list. Now from this list I want to form a dataframe which has multiple columns from the parent list, but these columns are not contiguous. I wrote this, but it seems to be working wrong:
list1 = rdd.collect()
columns_num = [1,8,11,17,21,24]
df2 = [list[i] for i in columns_num]
The above code only selects 6 rows, with only column 1 data, from the parent list and forms the new dataframe with those 6 column-1 values. How can I form a new dataframe with multiple non-contiguous columns from another list?
-
zero323 over 8 years
I am glad to hear that, and thanks for accepting the answer. Still, I would be grateful if you could provide an example which fails with the first approach, if you have time :)
-
Jason Donnald over 8 years
There's one thing I noticed. This is not forming a pandas dataframe, right? How can I form a pandas dataframe?
-
zero323 over 8 years
No, it doesn't. You want a local data structure? You can simply call toPandas() on a data frame.
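As a rough illustration of that last comment: since `toPandas()` brings every row to the driver, it may help to narrow the Spark data frame to the wanted columns first. The sketch below assumes a PySpark DataFrame as input; `to_local_pandas` is a hypothetical helper name, not a PySpark function.

```python
def to_local_pandas(spark_df, positions):
    """Select columns by 0-based position, then convert to a local pandas DataFrame.

    Hypothetical helper: it uses only `columns`, `select` and `toPandas`
    on the Spark DataFrame that is passed in.
    """
    wanted = [spark_df.columns[i] for i in positions]
    narrowed = spark_df.select(*wanted)  # still distributed at this point
    # toPandas() collects all remaining rows to the driver,
    # so the result has to fit in local memory.
    return narrowed.toPandas()
```

With the example data from the answer, `to_local_pandas(rdd.toDF(), [0, 3])` would be expected to give a pandas data frame with the columns `_1` and `_4`.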