Finding the intersection between two series in Pandas

120,930

Solution 1

Place both series in Python's set container, then use the set intersection method:

set(s1).intersection(set(s2))

and then convert the result back to a list if needed.

Just noticed pandas in the tag. Can translate back to that:

pd.Series(list(set(s1).intersection(set(s2))))

From comments I have changed this to a more Pythonic expression, which is shorter and easier to read:

pd.Series(list(set(s1) & set(s2)))

should do the trick, except if the index data is also important to you.

Have added the list(...) to translate the set before going to pd.Series as pandas does not accept a set as direct input for a Series.
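As a minimal sketch of the set approach, using the same sample values that appear in the other answers: the sorted() call is an addition made here only so the output order is deterministic, since a set has no defined order.

```python
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# set() iterates over the values of a Series, not its index
common = set(s1) & set(s2)

# A set is unordered, so sort before building the Series
# (sorting is an assumption added for reproducible output)
result = pd.Series(sorted(common))
print(result.tolist())  # [5, 42]
```

Note that the original index of s1 and s2 is lost; the new Series gets a fresh 0..n-1 index.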

Solution 2

Setup:

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])

Timings:

%%timeit
pd.Series(list(set(s1).intersection(set(s2))))
10000 loops, best of 3: 57.7 µs per loop

%%timeit
pd.Series(np.intersect1d(s1,s2))
1000 loops, best of 3: 659 µs per loop

%%timeit
pd.Series(np.intersect1d(s1.values,s2.values))
10000 loops, best of 3: 64.7 µs per loop

So the NumPy solution can be comparable to the set solution even for small series, if one passes the underlying values explicitly.
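Outside IPython, roughly the same comparison could be reproduced with the standard timeit module. This is only a sketch: the dictionary of candidates and the loop count of 1000 are choices made here, and the absolute numbers will differ by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# The three expressions timed above, wrapped as callables
candidates = {
    "set": lambda: pd.Series(list(set(s1).intersection(set(s2)))),
    "intersect1d(Series)": lambda: pd.Series(np.intersect1d(s1, s2)),
    "intersect1d(values)": lambda: pd.Series(np.intersect1d(s1.values, s2.values)),
}

loops = 1000  # arbitrary; the same count is used for every candidate
timings = {}
for name, fn in candidates.items():
    total_seconds = timeit.timeit(fn, number=loops)
    timings[name] = total_seconds / loops * 1e6  # microseconds per loop
    print(f"{name}: {timings[name]:.1f} us per loop")

# Sanity check: all candidates agree on the intersection itself
results = {name: sorted(fn().tolist()) for name, fn in candidates.items()}
```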

Solution 3

If you are using pandas, I assume you are also using NumPy. NumPy has a function, intersect1d, that works with a pandas Series.

Example:

pd.Series(np.intersect1d(pd.Series([1,2,3,5,42]), pd.Series([4,5,6,20,42])))

will return a Series with the values 5 and 42.
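One property worth knowing is that np.intersect1d returns the sorted, unique common values, so duplicates and the original ordering are not preserved. A small sketch with made-up inputs (a and b are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Illustrative inputs containing duplicates and unsorted values
a = pd.Series([42, 5, 5, 3, 1])
b = pd.Series([5, 42, 42, 20])

# intersect1d returns the sorted, de-duplicated common values
common = pd.Series(np.intersect1d(a.values, b.values))
print(common.tolist())  # [5, 42]
```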

Solution 4

Python

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])

s1[s1.isin(s2)]

R

s1  <- c(4,5,6,20,42)
s2 <- c(1,2,3,5,42)

s1[s1 %in% s2]

Edit: This doesn't handle duplicates; values repeated in s1 are kept in the result.
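To illustrate the duplicate caveat on the Python side, here is a small sketch with inputs chosen for this example. isin keeps every matching element of s1 together with its original index, and drop_duplicates (mentioned in the comments) can be chained if unique values are wanted.

```python
import pandas as pd

# Inputs chosen here to include repeated values
s1 = pd.Series([4, 5, 5, 42, 6])
s2 = pd.Series([1, 5, 42, 42])

# Boolean mask: True where the s1 value also occurs in s2
matches = s1[s1.isin(s2)]
print(matches.tolist())        # [5, 5, 42]
print(matches.index.tolist())  # [1, 2, 3]

# Keep only the first occurrence of each value
unique_matches = matches.drop_duplicates()
print(unique_matches.tolist())  # [5, 42]
```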

Solution 5

You could also use an inner merge, as follows:

pd.merge(df1, df2, how='inner')
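Since the question is about Series rather than DataFrames, one way to apply this is to convert each Series to a single-column frame first. This is a sketch under assumptions: the column name "value" is made up for the example, and the explicit to_frame() conversion is used here because the comments disagree on whether a Series can be passed directly.

```python
import pandas as pd

# The column name "value" is hypothetical, chosen for this example
s1 = pd.Series([4, 5, 6, 20, 42], name="value")
s2 = pd.Series([1, 2, 3, 5, 42], name="value")

# Inner merge keeps only the rows whose "value" appears in both frames
merged = pd.merge(s1.to_frame(), s2.to_frame(), how="inner", on="value")
print(merged["value"].tolist())  # [5, 42]
```

If either input contains repeated values, an inner merge produces one row per matching pair, so the result may need drop_duplicates() afterwards.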
Author: user7289

Updated on July 05, 2022

Comments

  • user7289
    user7289 almost 2 years

    I have two series s1 and s2 in pandas and want to compute the intersection i.e. where all of the values of the series are common.

    How would I use the concat function to do this? I have been trying to work it out but have been unable to (I don't want to compute the intersection on the indices of s1 and s2, but on the values).

  • Andy Hayden
    Andy Hayden almost 11 years
    also, you can use & operator for set intersection.
  • Andy Hayden
    Andy Hayden almost 11 years
    FYI This is orders of magnitude slower than set. :(
  • Andy Hayden
    Andy Hayden almost 11 years
    Actually, you can't just apply Series to a set (which is annoying): TypeError: Set value is unordered. Seems an unnecessary restriction/not very duck.
  • Joop
    Joop almost 11 years
    Mmm. Used the same logic a while ago, but I probably moved it to a list first... short calc, so performance was not a major constraint. What is the syntax for using the & operator to do the set?
  • Andy Hayden
    Andy Hayden almost 11 years
    set(s1) & set(s2) :)
  • Joop
    Joop almost 11 years
    ahh.. thought the & was in pandas
  • Jeff
    Jeff almost 11 years
    FYI, can also do: s1[s1.isin(s2)], my timings show about the same perf
  • Jeff
    Jeff almost 11 years
    You need Series(list(set(s1) & set(s2))) as the set result is unordered
  • Jeff
    Jeff almost 11 years
    isin keeps the ordering the same as s1, and somewhat faster on really large series
  • Andy Hayden
    Andy Hayden almost 11 years
    @Jeff that was considerably slower for me on the small example, but may make up for it with larger... drop_duplicates is really slow.
  • Jeff
    Jeff almost 11 years
    it's better for > 100k elements (but depends on the density of matches too for some reason)....
  • Phillip Cloud
    Phillip Cloud almost 11 years
    For shame. @AndyHayden Is there a reason we can't add set ops to Series objects?
  • jbn
    jbn almost 11 years
    Thanks, @AndyHayden. I had just naively assumed numpy would have faster ops on arrays. A quick %timeit test shows you to be mostly correct. My method had an average of 775 us per loop on two Series of 100 randomly generated elements whereas @joop's method had 120 us per loop. However, for larger data sets, this relationship is reversed. On two sets of 100000 elements, my method showed 1.32 ms per loop and @joop's method showed 14.9 ms per loop.
  • Andy Hayden
    Andy Hayden almost 11 years
    very interesting, fyi @cpcloud opened an issue here github.com/pydata/pandas/issues/4480
  • eldad-a
    eldad-a over 10 years
    @jbn see my answer for how to get the numpy solution with comparable timing for short series as well.
  • Joop
    Joop over 9 years
    Redid the test with the newest numpy (1.8.1) and pandas (0.14.1); it looks like your second example is now comparable in timing to the others. With larger data your last method is a clear winner, 3 times faster than the others.
  • jangorecki
    jangorecki about 8 years
    It won't handle duplicates correctly, at least the R code, don't know about python. In R there is intersect function, and for data.frame/data.table use fintersect.
  • keshr3106
    keshr3106 almost 7 years
    But the series must be converted to dataframes before one can use pd.merge
  • cs95
    cs95 over 5 years
    Not anymore, as of v0.24.
  • JP Zhang
    JP Zhang about 5 years
    Only Dataframes can be joined
  • MikolajM
    MikolajM almost 5 years
    for anyone interested - in Dask it won't work, this solution will return AttributeError: 'Series' object has no attribute 'columns'
  • toto_tico
    toto_tico over 4 years
    no need to convert s2 to a set; pd.Series(list(set(s1).intersection(s2))) works (at least in pandas 0.25)
  • asiehh
    asiehh over 3 years
    It's because the second one is 1000 loops and the rest are 10000 loops
  • ludog
    ludog over 2 years
    you don't need the second line in this function