How do I find the closest values in a Pandas series to an input number?
Solution 1
You could use argsort()
like
Say, input = 3
In [198]: input = 3
In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
Out[199]:
num
2 4
4 2
df_sort
is the dataframe with 2 closest values.
In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]
For index,
In [201]: df_sort.index.tolist()
Out[201]: [2, 4]
For values,
In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]
Detail, for the above solution df
was
In [197]: df
Out[197]:
num
0 1
1 6
2 4
3 5
4 2
Solution 2
Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).
However, it is possible to achieve the same results in ~N. This approach separates the dataframe in two subsets, one smaller and one larger than the desired value. The lower neighbour is than the largest value in the smaller dataframe and vice versa for the upper neighbour.
This gives the following code snippet:
def find_neighbours(value, df, colname):
exactmatch = df[df[colname] == value]
if not exactmatch.empty:
return exactmatch.index
else:
lowerneighbour_ind = df[df[colname] < value][colname].idxmax()
upperneighbour_ind = df[df[colname] > value][colname].idxmin()
return [lowerneighbour_ind, upperneighbour_ind]
This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets and complexity becomes an issue.
Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.
The code to replicate this plot can be seen below:
from matplotlib import pyplot as plt
import pandas
import numpy
import timeit
value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)
sort_results, partition_results=[],[]
for size in sizes:
df=pandas.DataFrame({"num":100*numpy.random.random(size)})
sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
partition_results.append(timeit.Timer('find_neighbours(df,value)',
globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]
plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')
Solution 3
If the series is already sorted, an efficient method of finding the indexes is by using bisect functions. An example:
idx = bisect_left(df['num'].values, 3)
Let's consider that the column col
of the dataframe df
is sorted.
- In the case where the value
val
is in the column,bisect_left
will return the precise index of the value in the list andbisect_right
will return the index of the next position. - In the case where the value is not in the list, both
bisect_left
andbisect_right
will return the same index: the one where to insert the value to keep the list sorted.
Hence, to answer the question, the following code gives the index of val
in col
if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.
from bisect import bisect_left, bisect_right
def get_closests(df, col, val):
lower_idx = bisect_left(df[col].values, val)
higher_idx = bisect_right(df[col].values, val)
if higher_idx == lower_idx: #val is not in the list
return lower_idx - 1, lower_idx
else: #val is in the list
return lower_idx
Bisect algorithms are very efficient to find the index of the specific value "val" in the dataframe column "col", or its closest neighbours, but it requires the list to be sorted.
Solution 4
If your series is already sorted, you could use something like this.
def closest(df, col, val, direction):
n = len(df[df[col] <= val])
if(direction < 0):
n -= 1
if(n < 0 or n >= len(df)):
print('err - value outside range')
return None
return df.ix[n, col]
df = pd.DataFrame(pd.Series(range(0,10,2)), columns=['num'])
for find in range(-1, 2):
lc = closest(df, 'num', find, -1)
hc = closest(df, 'num', find, 1)
print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))
df: num
0 0
1 2
2 4
3 6
4 8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.
Solution 5
The most intuitive way I've found to solve this sort of problem is to use the partition approach suggested by @ivo-merchiers but use nsmallest and nlargest. In addition to working on unsorted series, a benefit of this approach is that you can easily get several close values by setting k_matches to a number greater than 1.
import pandas as pd
source = pd.Series([1,6,4,5,2])
target = 3
def find_closest_values(target, source, k_matches=1):
k_above = source[source >= target].nsmallest(k_matches)
k_below = source[source < target].nlargest(k_matches)
k_all = pd.concat([k_below, k_above]).sort_values()
return k_all
find_closest_values(target, source, k_matches=1)
Output:
4 2
2 4
dtype: int64
Steve
Updated on September 16, 2021Comments
-
Steve over 2 years
I have seen:
- how do I find the closest value to a given number in an array?
- How do I find the closest array element to an arbitrary (non-member) number?.
These relate to vanilla python and not pandas.
If I have the series:
ix num 0 1 1 6 2 4 3 5 4 2
And I input 3, how can I (efficiently) find?
- The index of 3 if it is found in the series
- The index of the value below and above 3 if it is not found in the series.
Ie. With the above series {1,6,4,5,2}, and input 3, I should get values (4,2) with indexes (2,4).
-
Steve about 9 yearsdoes this find the closest below and above, or just the two closest?
-
Zero about 9 yearsWhat do you mean by below and above? Closest values are picked by absolute difference between them and the given input.
-
Steve about 7 yearsI needed to find a) the cloest number above, b) the closest number below. So on absolute difference wouldn't achieve this in all cases.
-
amc over 6 yearsThis gives the incorrect answer. I tried this on a more complex dataset. You must use .iloc instead of .ix and it works well (see @op1)
-
Ivo Merchiers over 5 yearsWouldn't this approach have a complexity of N log(N), since you require sorting? This does not seem extremely efficient, as it should be possible to perform the search in ~O(N)?
-
Joël over 5 yearsI like the idea! However, it seems to me there is a great chance that
lower..._id
andupper..._id
are not next to each other if the index is not monotonic; this is what the other algorithms try to provide by sorting. -
Ivo Merchiers over 5 years@Joël I think that I don't understand your point. They would have to be next to eachother, since
lower...
is the largest value smaller thanvalue
andupper...
is the smallest value larger than lower. I don't see how there can be another value inbetween there? Or do you mean something else? -
Joël over 5 yearsOK understood, indeed considering the example serie provided
(1, 6, 5, 4, 2)
, your proposal returns the expected indexes:(2, 4)
. So we correctly get non-neighbour indexes, and I then have misunderstood the other answers --> +1 -
9769953 almost 5 yearsI think this answer is more correct, since it gives the closest lower- and upperbounds, not just the two closest values. It does have two (minor) mistakes: indentation on line 3, and the use of an unknown
traversed
variable (which probably should bevalue
); as I'm about 90% sure on the errors, not 100%, I'm still a bit hesitant to edit and fix the answer. -
Ivo Merchiers almost 5 years@0 0 You are completely correct, I added your fixes. Thanks!
-
Claudiu Creanga over 4 yearsusing timeit, this solution is much slower than
df.iloc[(df['num']-input).abs().argsort()[:2]]
-
Ivo Merchiers over 4 years@ClaudiuCreanga That depends on the size of
df
. I added a plot and some code to illustrate this behaviour. -
Ivo Merchiers over 4 years@Isaac Thanks for the suggestion, I approved the edit. The comments were helpful to understand your edit though.
-
Isaac over 4 yearsI've made a number of further tweaks to my own version of this function, but I'm not sure how useful they are in general. 1st - ensure exactmatch returns one value. 2nd, handle situations where the value is larger or smaller than any other value in num (i.e. preventing np.nans entering the returned array).
-
Isaac over 4 yearsAs an answer to the specific question (in the text of the question, rather than the title) this is incorrect. Given the data [1,2,3,10] and the input 4, it returns [2,3]. Ivo Merchier's answer below is correct.
-
Michael Chen almost 4 yearsIf the input does not exactly match an element in the column, then indLeft and indRight are equal i think. The request was for the two closest indices.
-
ijoseph over 3 years@isaac (good name, btw ; ) ): I edited to make the snippet actually compile. Those fixes sound even better, though -- maybe you can add to a gist and we can update further?
-
zylatis almost 3 yearsI think you're better off using argmin() here.