Merging DataFrames on multiple conditions - not specifically on equal values
Solution 1
I've just thought of a way to solve this by combining my two methods:
First, split the data by chromosome, and then loop through the genes in these smaller dataframes. This doesn't need any SQL queries. I've also included a step that immediately discards redundant genes that cannot have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
import pandas as pd

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes: drop genes lying entirely outside the
    # SNP range. Strict comparisons keep genes that start exactly at max_bp
    # or stop exactly at min_bp, since the test below is inclusive.
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] > max_bp) &
                                ~(this_genes['chr_stop'] < min_bp)]
    for line in this_genes.iterrows():
        info = line[1]  # the gene row as a Series
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
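One possible way to speed this up, which is not part of the original answer, is to replace the inner loop over genes with a binary search on the sorted SNP positions. The sketch below keeps the per-chromosome split but uses `np.searchsorted` to find, for every gene at once, the slice of sorted SNPs inside its range. The function name `genic_snps_searchsorted` is made up for illustration, and the sketch assumes `chr_start <= chr_stop` for every gene and the column names shown in the question.

```python
import numpy as np
import pandas as pd


def genic_snps_searchsorted(snp_df, gene_df):
    """Pair each SNP with every gene whose [chr_start, chr_stop] range contains it.

    Hypothetical helper: assumes the columns 'chromosome', 'SNP', 'BP' in snp_df
    and 'chromosome', 'chr_start', 'chr_stop', 'feature_id' in gene_df.
    """
    pieces = []
    for chromosome, genes in gene_df.groupby('chromosome'):
        snps = snp_df.loc[snp_df['chromosome'] == chromosome].sort_values('BP')
        if snps.empty:
            continue
        bp = snps['BP'].to_numpy()
        # For each gene, the SNPs at sorted positions lo..hi-1 lie inside its range.
        lo = np.searchsorted(bp, genes['chr_start'].to_numpy(), side='left')
        hi = np.searchsorted(bp, genes['chr_stop'].to_numpy(), side='right')
        # Guard against malformed genes where chr_stop < chr_start.
        counts = np.maximum(hi - lo, 0)
        total = int(counts.sum())
        if total == 0:
            continue
        # Expand every (lo, hi) pair into explicit positions without a Python loop.
        offsets = np.cumsum(counts) - counts
        snp_pos = np.arange(total) - np.repeat(offsets, counts) + np.repeat(lo, counts)
        pieces.append(pd.DataFrame({
            'SNP': snps['SNP'].to_numpy()[snp_pos],
            'feature_id': np.repeat(genes['feature_id'].to_numpy(), counts),
        }))
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)
```

The output only contains the matching (SNP, gene) pairs, so memory use scales with the number of matches rather than with the product of the two table sizes. Row order differs from the loop version, so sort before comparing the two results.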
Solution 2
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144
gene_df
Out[194]:
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520
merged_df
Out[195]:
         SNP        feature_id
8  rs3131972  GeneID:100302278
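For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the two modified dataframes by hand and applies the same merge-then-filter steps. The values are copied from the output above; nothing else is assumed.

```python
import pandas as pd

# Modified example data: the second SNP is moved to BP 30400 so that
# it falls inside the range of GeneID:100302278 (30366 to 30503).
snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})

# Step 1: pair every SNP with every gene on the same chromosome (5 x 5 = 25 rows).
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')

# Step 2: keep only the pairs where the SNP lies inside the gene range.
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
print(merged_df)
```

The intermediate frame after step 1 has one row per (SNP, gene) pair on the same chromosome, which is why this approach is simple but grows quickly with the size of the inputs.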
Tim Vivian-Griffiths
Updated on June 12, 2022
Comments
-
Tim Vivian-Griffiths almost 2 years
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972  752721
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
-
Tim Vivian-Griffiths almost 9 years
I did actually think of using this method - the problem is that the merge operation on the full dataframes creates an enormous output. If I give an example - just for chromosome 1, there are 3511 entries in gene_df, and 528381 in snp_df. So an inner join on this chromosome alone results in 1855145691 entries! Also, the dataframes I showed in the original question were only the result of the head() method. So while there aren't any that match there, there should be plenty in the full dataframes.
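The size of that intermediate result can be estimated before running the merge. An inner join on 'chromosome' alone produces, for each chromosome, the number of SNP rows times the number of gene rows. The sketch below computes that product per key; the helper name `estimate_merge_rows` is made up for illustration and is not a pandas function.

```python
import pandas as pd


def estimate_merge_rows(snp_df, gene_df, key='chromosome'):
    """Rows an inner merge on `key` alone would produce, per key value.

    Hypothetical helper for illustration; it only counts rows and never
    builds the merged frame.
    """
    snp_counts = snp_df[key].value_counts()
    gene_counts = gene_df[key].value_counts()
    # Multiplication aligns on the key; keys missing from either side become
    # NaN and are dropped, since they contribute nothing to an inner join.
    return (snp_counts * gene_counts).dropna().astype('int64').sort_index()


# The chromosome 1 figures quoted in the comment above:
print(528381 * 3511)  # 1855145691
```

Summing the returned Series gives the total row count of the full merge, which helps decide whether the merge-then-filter approach fits in memory.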