Optimize Binary Search in sorted array find number of occurences

202

Here's a performance issue. In the main while loop, you aren't breaking out fo the loop when you find the target value.

while(low<=high){
    int mid=(low+high)/2;
    if(a[mid]==k)  {
          result=mid;  // you need to break out of the loop here
       if(searchfirst)
          high=mid-1; 
        else
          low=mid+1;
}

But that's not the only issue. The searchfirst value shouldn't be what dictates adjusting high or low as you are doing it. In a classic binary search, adjust the high or low parameter based on how a[mid] compares to k`. Probably closer to this:

while(low<=high) {
    int mid=(low+high)/2;
    if(a[mid]==k)  {
          result=mid;  // you need to break out of the loop here
          break;
    }
    if(a[mid] > k)
        high=mid-1; 
    else
       low=mid+1;
}

You have the right idea. Binary search to find the element. But let me suggest the simpler solution after the initial binary search is to just "scan to the left" and "scan to the right" to count duplicate elements.

Let me know what you think of this:

int binarySearch(int* arr, int start, int end, int value) {
    while (start <= end) {
        int mid = (start + end) / 2;
        if (arr[mid] == value) {
            return mid;
        }
        start = (arr[mid] < value) ? (mid + 1) : start;
        end = (arr[mid] > value) ? (mid - 1) : end;
    }
    return -1;
}

int countSide(int* arr, int length, int index, int value, int step) {
    int count = 0;
    while (index >= 0 && index <= (length - 1) && arr[index] == value) {
        count++;
        index += step;
    }
    return count;
}

int main() {
    int a[] = { 1,1,1,2,2,3,3,3,6,6,6,6,6,7,7 };
    int n = sizeof(a) / sizeof(a[0]);
    int x = 6;
    int firstindex = binarySearch(a, 0, n - 1, x);
    printf("%d\n", firstindex);
    if (firstindex == -1) {
        printf("elment not found in the array:\n ");
    }
    else {
        int count = countSide(a, n, firstindex, x, -1);
        count += countSide(a, n, firstindex, x, 1);
        count--; // because we counted the middle element twice
        printf("count is = %d\n", count);
    }
}

Updated

Here's a solution that does two binary searches to find the lower and upper bounds of the target value in the array and simply measures the distance between the indices to get the count:

int bound(int* arr, int length, int value, bool isUpperBound) {

    int best = -1;
    int start = 0;
    int end = start + length - 1;

    while (start <= end) {
        int mid = (start + end) / 2;
        if (arr[mid] == value) {
            best = mid;

            if (isUpperBound) {
                start = mid + 1;
            }
            else {
                end = mid - 1;
            }
        }
        else if (arr[mid] < value) {
            start = mid + 1;
        }
        else if (arr[mid] > value) {
            end = mid - 1;
        }
    }
    return best;
}


int main() {
    int a[] = { 1,1,1,2,2,3,3,3,6,6,6,6,6,7,7 };
    int n = sizeof(a) / sizeof(a[0]);
    int x = 6;
    int firstindex = bound(a, n, x, false);
    int lastindex = bound(a, n, x, true);
    printf("%d - %d\n", firstindex, lastindex);
    if (firstindex == -1) {
        printf("elment not found in the array:\n ");
    }
    else {
        int count = lastindex-firstindex + 1;
        printf("count is = %d\n", count);
    }
}
Share:
202

Related videos on Youtube

Olivia22
Author by

Olivia22

Updated on January 04, 2023

Comments

  • Olivia22
    Olivia22 over 1 year

    Im trying to do the smallest number of operations possible to find the number of occurences of an element in the array. Save even 1 operation if possible. So far this is the best binary search version I know. I cant use vectors or any other std:: functions

    int modifiedbinsearch_low(int* arr, int low, int high , int key){   
        if(low==high) return high ; 
    
        int mid = low + (high-low) /2;
    
        if(key >  arr[mid] ) { modifiedbinsearch_low(arr,mid + 1 , high,key);  } 
        else  { modifiedbinsearch_low(arr,low,mid,key);  }  
    }
    
    int modifiedbinsearch_high(int* arr, int low, int high , int key){   
        if(low==high) return high ; 
    
        int mid = low + (high-low) /2;
    
        if(key <  arr[mid] ) { modifiedbinsearch_high(arr,low,mid,key);  } 
        else  { modifiedbinsearch_high(arr,mid+1,high,key);  } 
    
    } 
    
    int low = modifiedbinsearch_low( ...)
    int high = modifiedbinsearch_high( ...)
    

    This version collapsed both functions into just one but it takes almost double the time. Im wondering the idea is good for it to become the fastest but the implementation is wrong.

    #include<stdio.h>
    int binarysearch(int a[],int n,int k,bool searchfirst){
        int result=-1;
        int low=0,high=n-1;
        while(low<=high){
            int mid=(low+high)/2;
            if(a[mid]==k)  {
                  result=mid; 
               if(searchfirst)
                  high=mid-1; 
                else
                  low=mid+1;
        }
        else if(k<a[mid])  high=mid-1;
        else low=mid+1;
        }
        return result;
    }
    
    int main(){
        int a[]={1,1,1,2,2,3,3,3,6,6,6,6,6,7,7};
        int n=sizeof(a)/sizeof(a[0]);
        int x=6;
        int firstindex=binarysearch(a,n,x,true);
        printf("%d\n",firstindex);
        if(firstindex==-1){
            printf("elment not found in the array:\n ");
        }
        else {
            int lastindex=binarysearch(a,n,x,false);
            printf("%d\n",lastindex);
            printf("count is = %d", lastindex-firstindex+1);
        }
    
    }
    

    Shorter version

          int  binmin(int a[], int start, int end,int val ) {
             if(start<end) {
                int mid = (start+end)/2;
                if(a[mid]>=val) 
                    binmin(a,start,mid-1,val);
                else if(a[mid]<val)
                    binmin(a,mid+1,end,val);
    
          }
          else if(start>end)
               return start;
    }
    
    • Pete Becker
      Pete Becker almost 2 years
      After the search has found firstindex the upper bound can’t be below a+firstindex+1, so the second search should search the array pointed at by a+firstinde+1. And, depending on what you know about the data, a linear search for the upper bound could be faster.
    • PaulMcKenzie
      PaulMcKenzie almost 2 years
      So far this is the best binary search version I know. -- Did you try std::binary_search. In other words, can you beat the professional library writers? Then you have std::lower_bound and std::upper_bound.
    • Olivia22
      Olivia22 almost 2 years
      @its a search in the millions of elements in the array but the number of elements found is in the 2000.
    • Pete Becker
      Pete Becker almost 2 years
      @PaulMcKenzie — std::binary_search finds an element with the given value. It doesn’t find the full range.
    • PaulMcKenzie
      PaulMcKenzie almost 2 years
      @Olivia22 Shorter version but its not working properly. -- What is a debugger?
    • PaulMcKenzie
      PaulMcKenzie almost 2 years
      I cant use vectors or the std::binarysearch -- You didn't mention std::equal_range. Also, these STL algorithms mentioned work with regular arrays, not just vector.
    • PaulMcKenzie
      PaulMcKenzie almost 2 years
      @Olivia22 I cant use vectors or the std::binarysearch -- If you go to those pages that describe these functions (including lower_bound and upper_bound), many of those functions have "possible implementations" listed, and most of those do not use vector, binary search, etc. Just very good, and sometimes tricky logic that in all likelihood, beats your version. Nothing stops you from, at the very least, looking to see how these functions accomplish their goal.
    • Olivia22
      Olivia22 almost 2 years
      @PaulMcKenzie Ive already taken a look at it. I tried to implement some of the ways but its rather complicated for a beginner like me. thanks though.
    • PaulMcKenzie
      PaulMcKenzie almost 2 years
      @Olivia22 FYI -- usage of equal_range. You say you're a beginner, but the task given to you is not a beginner task. If your goal is to minimize the number of comparisons, you need to know something way beyond the beginner level to accomplish this. Then you have this, for example: int mid=(low+high)/2; -- this will overflow if (low + high) exceeds the limits of an int type. These are the edge cases you didn't consider.
    • Fozi
      Fozi almost 2 years
      This question seems to be academic/homework. In a real use case, given an array with "millions of elements" and "number of elements found ... in the 2000" I would say a) use std::lower_bound and std::upper_bound like suggested and b) don't store 2000 equal elements, just store one + their count.
    • Goswin von Brederlow
      Goswin von Brederlow almost 2 years
      int is too small for an array index in general. And int mid should use std::midpoint. You can search for low and high at the same time, until you can't anymore and then for small ranges a linear search will be faster than binary search, so you have to switch algorithms along the way as you narrow the range. But if you know you have duplicates and will be counting them then why don't you put all duplicate values in a bin while creating the array and then you just have to find the element in a much smaller array and return the bin size?
    • Goswin von Brederlow
      Goswin von Brederlow almost 2 years
      And if you truly want a faster search then use a dictionary search O(log log n) on average or hashtable O(1)`.
    • Toby Speight
      Toby Speight over 1 year
      If you "cant use vectors or any other std:: functions", don't you think it's time you learnt??? Or go back to writing C?
  • Goswin von Brederlow
    Goswin von Brederlow almost 2 years
    start + end risks UB. use std::midpoint.
  • selbie
    selbie almost 2 years
    @GoswinvonBrederlow - OP says, "I cant use vectors or any other std:: functions". And by risking UB, you mean the possibility of start+end overflowing? That would have to be a VERY large array. Is there another edge case to consider?
  • Goswin von Brederlow
    Goswin von Brederlow almost 2 years
    If using std::midpoint is truly forbidden then one can easily copy the code for it. And yes, I mean start + end overflowing. A test case of 1 billion items isn't that big. And if you compile for e.g. arduino it will be anything above 16384.
  • Olivia22
    Olivia22 almost 2 years
    @selbie Thanks for the feedback. However finding the upper and lower bounds its still faster than iterating left and right. Specially if the occurences are in the thousands.
  • selbie
    selbie almost 2 years
    @Olivia22 - I won't believe that unless you show me a benchmark that measures an optimized build between both implementations. Put something together on godbolt.org or equivalent and share the results. And make sure you compile with optimizations (e.g. /O2) on - otherwise, all bets are off.
  • selbie
    selbie almost 2 years
    @Olivia22 - I can see how upper/lower bounds traversal - which is similar to searching for X-1 and X+1 and computing the diff between indicies could be faster for some range of inputs. However, that is largely dependent on the size of the array, the number of iterations to find the target value, and the actual count of repeating elements.
  • selbie
    selbie almost 2 years
    @Olivia22 - I thought about it and coded a solution that does a pair of binary searches to find the upper and lower bounds. I updated the answer above.
  • Olivia22
    Olivia22 almost 2 years
    @selbie super cool. Also More info for the future programmers :)
  • starball
    starball over 1 year
    If this is asking for help to optimize the code, I think it might not be a great fit for this site, and isn't very likely to be helpful others who aren't willing spend the time to understand this code and then understand the answers. Maybe this would fit better on codereview.stackexchange.com, but I'm not sure. check the guide.