Aggregate all dataframe row pair combinations using pandas
Solution 1
I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest thing that makes sense. In this case, I might set_index("Gene") and then use loc to pick out the rows:
>>> import pandas as pd
>>> from itertools import combinations
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes, 2))
>>> out = pd.DataFrame([df.loc[list(c)].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
Solution 2
Before going too far, you should keep in mind that your data gets big pretty fast. With 5 rows, the output has C(5,2) = 4+3+2+1 = 10 rows; in general, n rows produce n(n-1)/2 pairs, which grows quadratically.
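As a quick sanity check on those counts (a small illustrative snippet, not part of either answer), `math.comb` from Python 3.8+ gives the pair count directly:

```python
import math

# C(n, 2) = n * (n - 1) / 2: number of unordered row pairs for n rows
for n in (5, 100, 1700):
    print(n, math.comb(n, 2))
```

At n = 1700 (the size mentioned in the comments below), that is already about 1.4 million pairs.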
That said, I'd think about doing this in numpy for speed (you may want to add a numpy tag to your question btw). Anyway, this isn't as vectorized as it might be, but ought to be a start at least:
import numpy as np
import pandas as pd

df2 = df.set_index('Gene').loc[mygenes].reset_index()

sz = len(df2)
sz2 = sz * (sz - 1) // 2            # number of pairs, C(sz, 2)
gene = df2['Gene'].tolist()
abc = df2.iloc[:, 1:].values        # numeric columns only

arr = np.zeros([sz2, 4], dtype=int)
gene2 = []
k = 0
for i in range(sz):
    for j in range(i + 1, sz):      # only pairs with i < j
        gene2.append(gene[i] + gene[j])
        arr[k] = abc[i] + abc[j]
        k += 1

pd.concat([pd.DataFrame(gene2), pd.DataFrame(arr)], axis=1)
Out[1780]:
          0  0  1  2  3
0  ABC1ABC2  1  2  0  1
1  ABC1ABC3  1  2  1  1
2  ABC1ABC4  0  1  1  2
3  ABC2ABC3  2  2  1  0
4  ABC2ABC4  1  1  1  1
5  ABC3ABC4  1  1  2  1
Depending on size/speed issues, you may need to separate the string and numerical parts and vectorize the numerical piece. This code is not likely to scale all that well if your data is big, and if it is, that may determine what sort of answer you need (you may also need to think about how you store the results).
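For what it's worth, here is one possible sketch of that vectorization (an illustration, not tested at scale), using np.triu_indices to generate all i < j index pairs at once in place of the Python double loop. It assumes df and mygenes exactly as defined in the question:

```python
import numpy as np
import pandas as pd

# same data as in the question
df = pd.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                   'case1': [0, 1, 1, 0, 0],
                   'case2': [1, 1, 1, 0, 1],
                   'control1': [0, 0, 1, 1, 1],
                   'control2': [1, 0, 0, 1, 0]})
mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']

sub = df.set_index('Gene').loc[mygenes]
vals = sub.to_numpy()
i, j = np.triu_indices(len(sub), k=1)          # all index pairs with i < j
out = pd.DataFrame(vals[i] + vals[j],          # pairwise row sums, vectorized
                   index=[(sub.index[a], sub.index[b]) for a, b in zip(i, j)],
                   columns=sub.columns)
print(out)
```

The tuple index is still built in a Python comprehension; for really large inputs the n(n-1)/2 output rows dominate memory long before loop speed matters.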
alexhli
Updated on September 15, 2022

Comments
-
alexhli over 1 year
I use python pandas to perform grouping and aggregation across data frames, but I would now like to perform specific pairwise aggregation of rows (n choose 2, statistical combination). Here is the example data, where I would like to look at all pairs of genes in [mygenes]:
import pandas
import itertools

mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pandas.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                       'case1': [0, 1, 1, 0, 0],
                       'case2': [1, 1, 1, 0, 1],
                       'control1': [0, 0, 1, 1, 1],
                       'control2': [1, 0, 0, 1, 0]})
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0
The final product should look like this (applying np.sum by default is fine):
                case1  case2  control1  control2
'ABC1', 'ABC2'      1      2         0         1
'ABC1', 'ABC3'      1      2         1         1
'ABC1', 'ABC4'      0      1         1         2
'ABC2', 'ABC3'      2      2         1         0
'ABC2', 'ABC4'      1      1         1         1
'ABC3', 'ABC4'      1      1         2         1
The set of gene pairs can easily be obtained with itertools (itertools.combinations(mygenes, 2)), but I can't figure out how to perform aggregation of specific rows based on their values. Can anyone advise? Thank you
JohnE almost 9 yearsAh, that is a very nice use of loc and quite a bit simpler than mine. I just realized he's only asking for the subset in mygenes, so if it's only a few at a time then simplest is definitely best. FWIW I don't think it would be too hard to vectorize the numpy code I used (if it was worth the bother).
-
alexhli almost 9 yearsThis approach looks easy enough to code, I will go ahead and test how well it scales this afternoon. The real data set has about 6k columns and 1700 rows, which should lead to a little over 1 million combinations (output rows). Thanks for the response
-
DSM almost 9 years@alexhli: ehh, that's a lot of rows. :-/ I don't think this will perform very well in that regime.
-
DSM almost 9 years.. hey, wait. 1.4 million rows * 6000 columns is 8.4 billion numbers. Even if you only used 1 byte per value that's 8G right there. That's going to be a little unmanageable (I was working on vectorizing it when I realized I didn't have the memory when using floats.)