Multiple columns with the same name in Pandas
Solution 1
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, all of your 'a'
columns get named 'a.0'...'a.N'
as specified above.
if you used mangle_dupe_cols=False
, importing this csv
would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from StringIO import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
Solution 2
That's what I usually do with my genes expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
- create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
duplicated_columns_list.append(column)
duplicated_columns_list
- Use the function
.index()
that helps me to find the first element that is duplicated on each iteration and underscore it:
for column in duplicated_columns_list:
list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for
loop helps me to underscore all of the duplicated columns and now every column has a distinct name.
This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.
- Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
Solution 3
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id'). Hence, calling
df['id']
returns 2 columns. You can use
df.iloc[:,ind]
where ind corresponds to the index of the column according how they are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
Solution 4
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
vks
Still lost!!!!!!!!!!!! SOreadytohelp :) You can reach me at [email protected] :) String to dict python Using \G Add keys to dict iteratively from two lists Another \G Re for even number of 0's and 1's Setting Password Conditions Using re.sub with function Flattening list of tuples to string Use Alternation for conditional split Trick to get some while ignore some Elements from list if in different list Match +text and not ++text Split on 0 width assertion Python
Updated on July 09, 2022Comments
-
vks almost 2 years
I am creating a
dataframe
from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names saya
.So after forming
dataframe
and when I dodf['a']
which value will it return? It does not return all values.Also only one of the values will have a string rest will be
None
. How can I get that column?