Join dataframes based on partial string-match between columns
Solution 1
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, first concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
               FILM  Votes
4  Max Steel (2016)    560
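As a self-contained sketch of this approach, the frames below are rebuilt by hand from the question's sample data. The construction and the names keys and trimmed are illustrative assumptions, not part of the original answer. The last step strips the trailing (year) to approximate the output the question asks for.

```python
import pandas as pd

# Frames rebuilt by hand from the question's sample data (assumption:
# year and Votes are integer columns, as they appear in the question).
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Build "title (year)" keys in the same format as df2['FILM'].
keys = df1["movie"] + " (" + df1["year"].astype(str) + ")"

# Keep only the df2 rows whose FILM string is one of those keys.
res = df2[df2["FILM"].isin(keys)]

# Optional: drop the trailing " (year)" so FILM holds the bare title.
trimmed = res.assign(FILM=res["FILM"].str.replace(r"\s*\(\d{4}\)$", "", regex=True))
print(trimmed)
```

Because isin compares whole strings, this only matches when both the title and the year agree exactly.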
Solution 2
There are two ways:
- Get the row indices for partial matches, using either FILM.str.startswith(title) or FILM.str.contains(title). Either of:
df1[df1['movie'].apply(lambda title: df2['FILM'].str.startswith(title)).any(axis=1)]
df1[df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(axis=1)]
movie year ratings
106 Max Steel 2016 3.5
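One caveat with str.contains is that it treats the title as a regular expression by default, so titles containing characters such as ?, ( or + may not behave as plain text. A minimal sketch of a literal-match variant, assuming the question's sample data (the list-comprehension form and the name mask are my own choices, not the answer's):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# For each title in df1, check whether any FILM string contains it as
# plain text (regex=False disables regular-expression interpretation).
mask = pd.Series(
    [bool(df2["FILM"].str.contains(title, regex=False).any()) for title in df1["movie"]],
    index=df1.index,
)

matches = df1[mask]
print(matches)
```

This still compares every title against every FILM string, so it is only practical for small frames.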
- Alternatively, you can use merge() if you first split the compound string column df2['FILM'], which is formatted as movie_title (year), into its two component columns:
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
movie year Votes ratings
0 Max Steel 2016 560 3.5
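If you also want to see which df1 movies found no partner, a left merge with indicator=True is one option. The following is a sketch under the assumption that every FILM string ends in a four-digit year in parentheses. The named-group pattern and the names parts, df2_split and merged are illustrative, not from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Split "title (year)" into two columns; named groups become column names.
parts = df2["FILM"].str.extract(r"^(?P<movie>.*?)\s*\((?P<year>\d{4})\)$")
df2_split = parts.assign(year=parts["year"].astype(int), Votes=df2["Votes"])

# Left merge keeps every df1 row; the _merge column reports whether a
# matching (movie, year) pair existed in df2. Note that merge() does not
# preserve df1's original index.
merged = df1.merge(df2_split, on=["movie", "year"], how="left", indicator=True)
print(merged)
```

Rows with _merge equal to left_only are the movies that have no entry in df2.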
(Acknowledging much help from @user3483203 here and in Python chat room)
Code to recreate dataframes:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from newer pandas releases
# columns are separated by two or more spaces so that titles containing single spaces stay intact
dat1 = """movie  year  ratings
108  Mechanic: Resurrection  2016  4.0
206  Warcraft  2016  4.0
106  Max Steel  2016  3.5
107  Me Before You  2016  4.5"""
dat2 = """FILM  Votes
0  Avengers: Age of Ultron (2015)  4170
1  Cinderella (2015)  950
2  Ant-Man (2015)  3000
3  Do You Believe? (2015)  350
4  Max Steel (2016)  560"""
df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')
Comments
-
Sai Kumar almost 2 years
I have a dataframe of movies, and I want to check whether they are present in another df.
after_h.sample(10, random_state=1)
                      movie  year  ratings
108  Mechanic: Resurrection  2016      4.0
206                Warcraft  2016      4.0
106               Max Steel  2016      3.5
107           Me Before You  2016      4.5
I want to check whether the above movies are present in another df.
                             FILM  Votes
0  Avengers: Age of Ultron (2015)   4170
1               Cinderella (2015)    950
2                  Ant-Man (2015)   3000
3          Do You Believe? (2015)    350
4                Max Steel (2016)    560
I want something like this as my final output:
        FILM  votes
0  Max Steel    560
-
smci over 5 years
You can join with pd.merge() if you convert the compound string column df2['FILM'], formatted as movie_title (year), into its two component columns.
-
user3483203 over 5 years
df2[df1['movie'].apply(lambda movie_title: df2['FILM'].str.contains(movie_title)).any(axis=0)]
-
jpp over 5 years
Partial matches may not be appropriate for sequels :)
-
smci over 5 years
@jpp: Teehee. Complain to Tom Cruise or George Lucas... Yes, strictly, df2['FILM'] is formatted and contains both the title and the (year) in parentheses. I should show how, if we fixed that up, we could do a simple join on title.
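The sequel concern raised in the comments can be illustrated with a small sketch. The film list below is hypothetical and chosen only to show the effect. It is not taken from the question's data.

```python
import pandas as pd

# Hypothetical FILM strings: an original and its sequel.
films = pd.Series(["Cars 2 (2011)", "Cars (2006)"])
title, year = "Cars", 2006

# A partial match on the title alone also picks up the sequel.
partial = films[films.str.contains(title, regex=False)]

# An exact match on the full "title (year)" key does not.
exact = films[films == f"{title} ({year})"]

print(partial.tolist())
print(exact.tolist())
```

This is one reason to prefer matching on both title and year, as the isin and merge() approaches do, when the year is available.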