Merging DataFrames on multiple conditions - not specifically on equal values

python pandas merge pandasql

10,790

Solution 1

I've just thought of a way to solve this by combining my two methods.

First, split the data by chromosome, then loop through the genes within each of these smaller dataframes. This avoids SQL queries entirely. I've also included a step that immediately discards redundant genes, meaning those that cannot contain any SNPs because they lie entirely outside the range of SNP positions on that chromosome. This uses a double for-loop, which I normally try to avoid, but in this case it works quite well.

import pandas as pd

# snp_df and gene_df are the DataFrames described in the question
all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    # Keep genes overlapping [min_bp, max_bp]; bounds are inclusive to match the SNP test below
    this_genes  = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
            (this_genes['chr_stop'] >= min_bp)]

    for _, info in this_genes.iterrows():
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

This doesn't run spectacularly quickly, but it does finish, so I can actually get some answers. I'd still like to hear any tips for making it run more efficiently.
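One possible way to speed up the inner loop, sketched here under assumptions rather than taken from the answer above: sort each chromosome's SNPs by `BP` once, then use `numpy.searchsorted` to find, for every gene, the contiguous slice of SNPs whose position lies in `[chr_start, chr_stop]`. The function name `genic_snps_searchsorted` and the demo frames are hypothetical; the column names are the ones from the question.

```python
import numpy as np
import pandas as pd


def genic_snps_searchsorted(snp_df, gene_df):
    """Return (SNP, feature_id) pairs where BP lies in [chr_start, chr_stop]."""
    snp_parts, gene_parts = [], []
    for chromosome, genes in gene_df.groupby('chromosome'):
        # Sort this chromosome's SNPs by position so binary search applies
        snps = snp_df.loc[snp_df['chromosome'] == chromosome].sort_values('BP')
        bp = snps['BP'].to_numpy()
        snp_ids = snps['SNP'].to_numpy()
        # lo: index of the first SNP with BP >= chr_start
        # hi: one past the last SNP with BP <= chr_stop
        lo = np.searchsorted(bp, genes['chr_start'].to_numpy(), side='left')
        hi = np.searchsorted(bp, genes['chr_stop'].to_numpy(), side='right')
        for feature_id, l, h in zip(genes['feature_id'], lo, hi):
            if h > l:
                snp_parts.append(snp_ids[l:h])
                gene_parts.append(np.full(h - l, feature_id, dtype=object))
    if not snp_parts:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.DataFrame({'SNP': np.concatenate(snp_parts),
                         'feature_id': np.concatenate(gene_parts)})


# Small demo frames (hypothetical values, shaped like the ones in this thread)
snp_demo = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_demo = pd.DataFrame({
    'chromosome': [1, 1, 1],
    'chr_start': [10954, 30366, 34611],
    'chr_stop': [11507, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100302278', 'GeneID:645520'],
})
result = genic_snps_searchsorted(snp_demo, gene_demo)
```

The loop over genes remains, but each iteration only slices an array instead of scanning every SNP on the chromosome, and overlapping genes are handled naturally because each gene gets its own slice.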

Solution 2

You can use the following to accomplish what you're looking for:

merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]

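Note: the example dataframes in your question contain no rows that meet your join criteria. Here is an example using modified dataframes, in which the BP of rs3131972 has been changed to 30400 so that it falls inside GeneID:100302278: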

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278
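If the full merge is too large to hold in memory, one variation to consider is merging a slice of SNPs at a time and filtering each slice straight away, so that only the matching rows are kept. This is a minimal sketch, not part of the answer above: the function name `genic_snps_chunked`, the `chunk_size` parameter and its default are assumptions, and the total amount of comparison work is unchanged.

```python
import pandas as pd


def genic_snps_chunked(snp_df, gene_df, chunk_size=1000):
    """Merge on chromosome one slice of SNPs at a time, filtering each slice immediately."""
    pieces = []
    for start in range(0, len(snp_df), chunk_size):
        chunk = snp_df.iloc[start:start + chunk_size]
        # The intermediate frame has at most chunk_size * (genes per chromosome) rows
        merged = chunk.merge(gene_df, on='chromosome', how='inner')
        in_gene = ((merged['BP'] >= merged['chr_start']) &
                   (merged['BP'] <= merged['chr_stop']))
        pieces.append(merged.loc[in_gene, ['SNP', 'feature_id']])
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)


# Demo using the modified dataframes shown above
snp_demo = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_demo = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})
result = genic_snps_chunked(snp_demo, gene_demo, chunk_size=2)
```

A smaller `chunk_size` lowers peak memory at the cost of more merge calls, so the right value depends on how many genes share a chromosome and how much memory is available.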


Author by

Tim Vivian-Griffiths

Updated on June 12, 2022

Comments

Tim Vivian-Griffiths almost 2 years
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.

I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.

The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:

SNP DataFrame (snp_df):
```
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972  752721
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144
```
This shows the SNP reference IDs and their locations. 'BP' stands for the 'Base-Pair' position.

Gene DataFrame (gene_df):
```
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520
```
This dataframe shows the locations of all the genes of interest.

What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.

If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
```
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
```
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.

I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.

SQL method
```
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q           = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes  = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps  = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps  = pd.concat(all_dfs)
```
Gene iteration method
```
all_dfs = []
for line in gene_df.iterrows():
    info    = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
            (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)


all_genic_snps = pd.concat(all_dfs)
```
Can anyone give any suggestions of a more effective way of doing this?
Tim Vivian-Griffiths almost 9 years

I did actually think of using this method - the problem is that the merge operation on the full dataframes creates an enormous output. If I give an example - just for chromosome 1, there are 3511 entries in gene_df, and 528381 in snp_df. So an inner join on this chromosome alone results in 1855145691 entries! Also, the dataframes I showed in the original question were only the result of the head() method. So while there aren't any that match there, there should be plenty in the full dataframes.
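The size of such a merge can be estimated before running it: an inner join on `chromosome` alone yields, for each chromosome, the number of SNP rows multiplied by the number of gene rows. The helper below is only an illustrative sketch; its name `inner_join_rows_per_chromosome` and the demo frames are assumptions, with the row counts taken from the chromosome 1 figures quoted above.

```python
import pandas as pd


def inner_join_rows_per_chromosome(snp_df, gene_df):
    """Number of rows an inner merge on 'chromosome' would produce, per chromosome."""
    snp_counts = snp_df.groupby('chromosome').size()
    gene_counts = gene_df.groupby('chromosome').size()
    # Chromosomes missing from either frame give NaN here and contribute no rows
    return (snp_counts * gene_counts).dropna().astype('int64')


# Demo frames holding only the chromosome column, sized like chromosome 1 above
snp_demo = pd.DataFrame({'chromosome': [1] * 528381})
gene_demo = pd.DataFrame({'chromosome': [1] * 3511})
rows = inner_join_rows_per_chromosome(snp_demo, gene_demo)
print(rows.loc[1])  # 1855145691
```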