How to delete columns in a PySpark DataFrame
Solution 1
Reading the Spark documentation, I found an easier solution.
Since Spark version 1.4 there is a function drop(col)
which can be used on a PySpark DataFrame.
You can use it in two ways:
df.drop('age')
df.drop(df.age)
Solution 2
Adding to @Patrick's answer, you can use the following to drop multiple columns
columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)
Solution 3
An easy way to do this is to use select, and to realize that you can get a list of all columns for the DataFrame, df, with df.columns:
drop_list = ['a column', 'another column', ...]
df.select([column for column in df.columns if column not in drop_list])
Solution 4
You can do this in two ways:
1: You just keep the necessary columns:
drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])
2: This is the more elegant way.
df = df.drop("col_name")
You should avoid any collect()-based version, because it sends the complete dataset to the driver, which takes a lot of computing effort!
Solution 5
You could either explicitly name the columns you want to keep, like so:
keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]
Or, in a more general approach, you'd include all columns except for a specific one via a list comprehension. For example like this (excluding the id column from b):
keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
Finally you make a selection on your join result:
d = a.join(b, a.id==b.id, 'outer').select(*keep)
xjx0524
Updated on April 03, 2022

Comments
-
xjx0524 about 2 years:
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
There are two id: bigint columns and I want to delete one. How can I do that?
-
deusxmach1na about 9 years: I think I got the answer. select needs to take a list of strings, NOT a list of columns. So do this:
keep = [c for c in a.columns] + [c for c in b.columns if c != 'id']
d = a.join(b, a.id==b.id, 'outer').select(*keep)
-
karlson about 9 years: Well, that should do exactly the same thing as my answer, as I'm pretty sure that select accepts either strings OR columns (spark.apache.org/docs/latest/api/python/…). Btw, in your line keep = ... there's no need to use a list comprehension for a: a.columns + [c for c in b.columns if c != 'id'] should achieve the exact same thing, as a.columns is already a list of strings.
-
karlson almost 9 years: @deusxmach1na Actually the column selection based on strings cannot work for the OP, because that would not solve the ambiguity of the id column. In that case you have to use the Column instances in select.
-
deusxmach1na almost 9 years: All good points. I tried your solution in Spark 1.3 and got errors, so what I posted actually worked for me. And to resolve the id ambiguity, I renamed my id column before the join, then dropped it after the join using the keep list. HTH anyone else that was stuck like I was.
-
Shane Halloran over 6 years: Thank you, this works great for me for removing duplicate columns with the same name as another column, where I use df.select([df.columns[column_num] for column_num in range(len(df.columns)) if column_num != 2]), where the column I want to remove has index 2.
-
mnis.p over 5 years: When the data size is large, collect() might cause a heap space error. You can also create a new DataFrame dropping the extra field with
ndf = df.drop('age')
-
DefiniteIntegral over 5 years: I had to reassign the drop result back to the DataFrame: df = df.drop(*columns_to_drop)
-
seufagner almost 5 years: Spark 2.4 (and earlier versions) doesn't accept more than one column name.
-
Guido about 4 years: Note that you will not get an error if the column does not exist.
-
DataBach about 4 years: Is it possible to drop columns by index?
-
frlzjosh almost 4 years: I get an error saying TreeNodeException: Binding attribute, tree: _gen_alias_34#34 after I drop a column and use .show()
-
Topde over 3 years: @seufagner it does; just pass it as a list.
-
Juan-Kabbali about 3 years: What does the asterisk * mean in *columns_to_drop?
-
Clock Slave about 3 years: The * is to unpack the list. (*[a, b, c]) becomes (a, b, c).
qwr over 2 years: There is absolutely no reason to use collect for this operation, so I removed it from this answer.