Multiple columns with the same name in Pandas

48,487

Solution 1

the relevant parameter is mangle_dupe_cols

from the docs

mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'

by default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.

if you used mangle_dupe_cols=False, importing this csv would produce an error.

you can get all of your columns with

df.filter(like='a')

demonstration

from StringIO import StringIO
import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df

enter image description here

df.filter(like='a')

enter image description here

Solution 2

That's what I usually do with my genes expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:

  1. create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
  1. Use the function .index() that helps me to find the first element that is duplicated on each iteration and underscore it:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'

This for loop helps me to underscore all of the duplicated columns and now every column has a distinct name.

This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.

  1. Finally, rename your columns with the underscored elements:

df.columns = list_of_all_columns

That's it, I hope it helps :)

Solution 3

Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id'). Hence, calling

df['id']

returns 2 columns. You can use

df.iloc[:,ind]

where ind corresponds to the index of the column according how they are ordered in the df. You can find the indices using:

indices = [i for i,x in enumerate(df.columns) if x == 'id']

where you replace 'id' with the name of the column you are searching for.

Solution 4

I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.

In : df.columns
Out: 
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')

In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']

In : df.columns
Out: 
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')

From here, I was able to call 'id1' or 'id2' to get just the column I wanted.

Share:
48,487
vks
Author by

vks

Still lost!!!!!!!!!!!! SOreadytohelp :) You can reach me at [email protected] :) String to dict python Using \G Add keys to dict iteratively from two lists Another \G Re for even number of 0's and 1's Setting Password Conditions Using re.sub with function Flattening list of tuples to string Use Alternation for conditional split Trick to get some while ignore some Elements from list if in different list Match +text and not ++text Split on 0 width assertion Python

Updated on July 09, 2022

Comments

  • vks
    vks almost 2 years

    I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a.

    So after forming dataframe and when I do df['a'] which value will it return? It does not return all values.

    Also only one of the values will have a string rest will be None. How can I get that column?