sum values of columns starting with the same string in pandas dataframe
Solution 1
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
'a_a': [1, 2, 3, 4],
'a_b': [2, 3, 4, 5],
'b_a': [1, 2, 3, 4],
'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-'
character.
Solution 2
Using brilliant DSM's idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
def correct_categories(cols):
return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of "Histo/History" problematic..
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
'Histo': 'History',
'Economics': 'Economics',
'English': 'English',
'Literature': 'Literature'
}
def correct_categories(cols):
return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
PS You may want to add missing categories to categories
map/dictionary
Solution 3
You can use these to create sum of columns starting with specific name,
df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
Amanda
Updated on June 05, 2022Comments
-
Amanda almost 2 years
I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \ 0 56 1 1 0 1 0 0 1 11 0 0 0 0 1 0 2 6 0 0 1 0 0 1 3 43 0 0 0 1 0 1 4 14 0 1 0 0 1 0 Histo Economics-51 Literature-re Literatureu4 0 1 0 1 0 1 0 0 0 1 2 0 0 0 0 3 0 1 1 0 4 1 0 0 0
My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature 0 56 1 1 2 1 1 11 1 0 0 1 2 6 0 1 1 0 3 43 2 0 1 1 4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t') df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t') filter_col = [col for col in list(df) if col.startswith('History')] df['History'] = 0 # initialize value, otherwise throws KeyError for c in df[filter_col]: df['History'] = df[filter_col].sum(axes=1) print df['History', df[filter_col]]
However, both gives the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is either: how can I debug this error or is there another solution for my problem. Notice that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using
loc
in pandas. -
DSM about 8 years
groupby
accepts an axis=1 argument, so something likedf.groupby(df.columns.str.split("-").str[0],axis=1).sum()
might work. (ThatHisto
column makes me wonder if we'd have to use a different grouping condition, but you get the idea.) -
Amanda about 8 yearsThanks for your answer, awesome solution, but I get an issue with accessing the columns : with your methods I have always keyError, even after trying different solutions of encoding while reading CSV. Do even when I try to take
df['History']
I have an error, but no error withdf.get('history')
-
MaxU - stop genocide of UA about 8 years@Amanda, one of your columns had 'Histo' as a name, so i decided to use "Histo" as a common name, otherwise you would have both: Histo and History. So try df['Histo'] instead
-
MaxU - stop genocide of UA about 8 years@Amanda, what would you like to do with Histo column? Do you want to have both "Histo" and "History" in the result data frame?
-
Amanda about 8 yearsthank you for your answer, actually the problem is not only in history, but in accessin the column name in general : stackoverflow.com/questions/35764172/… ! your answer is good, but when there a multiple dataframes that should be treated, the KeyError blocks everything
-
MJK about 3 yearsThis is a much more elegant solution. Thanks for contributing to this 5 year old post
-
ksbawpn over 2 yearsAdding to this great answer that one can also incorporate RegEx here in case the column names include multiple
_
characters:import re
then[re.split('_\d{2}', s)[0] for s in df.T.index.values]