Calculate pvalue from pandas DataFrame

15,329

Let's suppose you have the same X genes for the Y samples. I try my method with X=3 and Y=2 but I guess you can generalize. I started with:

df1 = 
             count       mean        std       min
sample gene                                       
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970

Note that the genes need to be in the same order. First reset_index() with df_reindex = df1.reset_index(), I'm not sure what I'm doing is possible with multiindex:

df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970

I create a rolled DF and join it to df_reindex:

nb_genes = 3
df_rolled = pd.DataFrame(pd.np.roll(df_reindex,nb_genes,0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join

Now on a same row, I have all data you needto calculate pvalue and create the column with apply:

df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)

Finally, I create a DF with the data you want and rename columns:

df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})

You ends up with data:

df_output = 
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298

That you can reindex as you need.

If you want to do it each sample against each other, I think a loop for could do it.

EDIT: Using pivot_table, I think there is a easier way.

With your input stats as multiindex table for only ARG/16S (not sure how to handle this level), so I start with (which might be your stats['ARG/16S']):

df=
               count       mean           std       min
sample gene                                            
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970

With the function pivot_table, you can rearrange your data such as:

df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene', 
                               columns = 'sample', fill_value = 0)

In this df_pivot (I don't print it here for readability but at the end with the new column), you can create a column for each couple (sample1, sample2) using itertools and apply:

import itertools
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_pivot[sample1+ '_' + sample2 ] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

I think this method is independent of the number of samples, genes and if genes are not all the same, you ends up with df_pivot like:

        count            mean                      std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel             
gene                                                                          
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  7.894879e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000

Let me know if it works

EDIT2: to reply to the comment, I think you can do this:

No change for df_pivot and then you create a multiindex DF df_multi to write your results in:

df_multi = pd.DataFrame(index = df.index.levels[1], 
                        columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)

Then you use the loop for to implement the data in this df_multi:

for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_multi.loc[:,(sample1,sample2)] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

Finally, you can use transpose and unstack on level 1 to get the way you ask (or close if I misunderstood)

df_output = df_multi.transpose().unstack(level=[1]).fillna(1)

You will see that you don't have the last sample in indexes and first sample in columns (because they don't exist how I built everything) if you want them, you need to replace itertools.combinations by itertools.combinations_with_replacement in both the creation of df_multi and in the loop for ( I didn't try it but it should work)

Share:
15,329
Gabriela Catalina
Author by

Gabriela Catalina

Updated on June 15, 2022

Comments

  • Gabriela Catalina
    Gabriela Catalina almost 2 years

    I have a DataFrame stats with a Multindex and 8 samples (only two shown here) and 8 genes for each sample.

     In[13]:stats
        Out[13]: 
                           ARG/16S                                            \
                             count          mean           std           min   
        sample      gene                                                       
        Arnhem      IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03   
                    Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00   
                    Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03   
                    TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04   
                    TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05   
                    blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06   
                    ermB       4.0  4.111081e-05  7.894879e-06  3.288865e-05   
                    ermF       4.0  2.335210e-05  4.519758e-06  1.832037e-05   
        Basel       Aph3a      4.0  7.815592e-06  1.757242e-06  5.539389e-06   
                    IC        11.0  5.095161e-03  5.639278e-03  1.302205e-03   
                    Int1      12.0  1.333068e+01  1.872207e+01  4.988048e-02   
                    Sul1      11.0  1.618617e-02  1.988817e-02  2.970397e-03   
    

    I'm trying to calculate the p-value (Students t-test) for each of these samples, comparing each of the genes between them.

    I've used scipy.stats.ttest_ind_from_stats but I managed to get the p-values for the different samples for one gene and only those of the samples neighboring each other.

    Experiments = list(values1_16S['sample'].unique())
    for exp in Experiments:
        if Experiments.index(exp)<len(Experiments)-1:
            second = Experiments[Experiments.index(exp)+1]
        else:
            second = Experiments[0]
        tstat, pvalue = scipy.stats.ttest_ind_from_stats(stats.loc[(exp,'Sul1')]['ARG/16S','mean'],
                                        stats.loc[(exp,'Sul1')]['ARG/16S','std'],
                                        stats.loc[(exp,'Sul1')]['ARG/16S','count'],
                                        stats.loc[(second,'Sul1')]['ARG/16S','mean'],
                                        stats.loc[(second,'Sul1')]['ARG/16S','std'],
                                        stats.loc[(second,'Sul1')]['ARG/16S','count'])
        d.append({'loc1':exp, 'loc2':second, 'pvalue':pvalue})
    
    
    stats_Sul1 = pd.DataFrame(d)
    stats_Sul1
    

    How can I get the pvalues between ALL samples? And is there a way to do this for all genes at once without running the code one by one for each gene?

  • Gabriela Catalina
    Gabriela Catalina almost 6 years
    Thanks for the edit ! The first solution did not work for me. Trying your edit now, but I have to get rid of that first MultiIndex line first ('ARG/16S'). If I delete is using ,drop() it also deletes everything below. I'll let you know if I get it to work
  • Ben.T
    Ben.T almost 6 years
    if you do df = stats['ARG/16S'] should do it?
  • Gabriela Catalina
    Gabriela Catalina almost 6 years
    It worked! Thanks so much! Do you know if there is also a way to have the samples that are compared as column and row indexes and a columns Multiindex with the genes? (Just to make the pvalues easier to find)
  • Ben.T
    Ben.T almost 6 years
    @GabrielaCatalina yes there is way, I was thinking about what was the best way to create the output data. I'll give a try in my EDIT2