Image comparison - fast algorithm


Solution 1

Below are three approaches to solving this problem (and there are many others).

  • The first is a standard approach in computer vision, keypoint matching. This may require some background knowledge to implement, and can be slow.

  • The second method uses only elementary image processing; it is potentially faster than the first approach and straightforward to implement. However, what it gains in understandability it lacks in robustness: matching fails on scaled, rotated, or discolored images.

  • The third method is both fast and robust, but is potentially the hardest to implement.

Keypoint Matching

Better than picking 100 random points is picking 100 important points. Certain parts of an image have more information than others (particularly at edges and corners), and these are the ones you'll want to use for smart image matching. Google "keypoint extraction" and "keypoint matching" and you'll find quite a few academic papers on the subject. These days, SIFT keypoints are arguably the most popular, since they can match images under different scales, rotations, and lighting. Some SIFT implementations can be found here.

One downside to keypoint matching is the running time of a naive implementation: O(n^2m), where n is the number of keypoints in each image, and m is the number of images in the database. Some clever algorithms might find the closest match faster, like quadtrees or binary space partitioning.
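
As a concrete illustration of why the naive approach is slow, here is a minimal brute-force matching sketch using OpenCV's SIFT (assuming a recent opencv-python build; the helper names are mine, not from the original answer):

import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)  # brute-force: every descriptor vs every descriptor

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors

def best_match(needle_path, haystack_paths):
    # O(n^2) descriptor comparisons per image pair, repeated for all m database images
    needle = sift_descriptors(needle_path)
    best_path, best_count = None, -1
    for path in haystack_paths:
        count = len(matcher.match(needle, sift_descriptors(path)))
        if count > best_count:
            best_path, best_count = path, count
    return best_path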


Alternative solution: Histogram method

Another less robust but potentially faster solution is to build feature histograms for each image and choose the image with the histogram closest to the input image's histogram. I implemented this as an undergrad, and we used 3 color histograms (red, green, and blue) and two texture histograms (direction and scale). I'll give the details below, but I should note that this only worked well for matching images VERY similar to the database images. Re-scaled, rotated, or discolored images can fail with this method, but small changes like cropping won't break the algorithm.

Computing the color histograms is straightforward -- just pick the range for your histogram buckets, and for each range, tally the number of pixels with a color in that range. For example, consider the "green" histogram, and suppose we choose 4 buckets for our histogram: 0-63, 64-127, 128-191, and 192-255. Then for each pixel, we look at the green value, and add a tally to the appropriate bucket. When we're done tallying, we divide each bucket total by the number of pixels in the entire image to get a normalized histogram for the green channel.
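
For example, a minimal sketch of the normalized green-channel histogram with PIL and NumPy (the 4-bucket choice just mirrors the example above):

import numpy as np
from PIL import Image

def green_histogram(path, buckets=4):
    pixels = np.asarray(Image.open(path).convert("RGB"))
    green = pixels[:, :, 1].ravel()                       # green channel values, 0-255
    hist, _ = np.histogram(green, bins=buckets, range=(0, 256))
    return hist / green.size                              # normalize by total pixel count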

For the texture direction histogram, we started by performing edge detection on the image. Each edge point has a normal vector pointing in the direction perpendicular to the edge. We quantized the normal vector's angle into one of 6 buckets between 0 and PI (since edges have 180-degree symmetry, we converted angles between -PI and 0 to be between 0 and PI). After tallying up the number of edge points in each direction, we have an un-normalized histogram representing texture direction, which we normalized by dividing each bucket by the total number of edge points in the image.
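
A rough sketch of that direction histogram; the answer doesn't name its edge detector, so Canny edges plus Sobel gradients stand in here:

import cv2
import numpy as np

def direction_histogram(path, buckets=6):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200) > 0                 # boolean edge mask
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    angles = np.mod(np.arctan2(gy, gx), np.pi)            # fold onto [0, PI) -- 180-degree symmetry
    hist, _ = np.histogram(angles[edges], bins=buckets, range=(0, np.pi))
    return hist / max(edges.sum(), 1)                     # normalize by number of edge points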

To compute the texture scale histogram, for each edge point, we measured the distance to the next-closest edge point with the same direction. For example, if edge point A has a direction of 45 degrees, the algorithm walks in that direction until it finds another edge point with a direction of 45 degrees (or within a reasonable deviation). After computing this distance for each edge point, we dump those values into a histogram and normalize it by dividing by the total number of edge points.
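
And a rough, simplified sketch of the scale histogram, reusing an edge mask and a full-resolution angle image like the ones computed above; it steps pixel by pixel along each edge point's direction until it finds another edge point with a similar direction (the bucket count, step limit, and angle tolerance are arbitrary choices here):

import numpy as np

def scale_histogram(edges, angles, buckets=8, max_steps=100, tolerance=np.pi / 12):
    # edges: boolean edge mask; angles: per-pixel edge direction in [0, PI)
    height, width = edges.shape
    ys, xs = np.nonzero(edges)
    distances = []
    for y, x in zip(ys, xs):
        theta = angles[y, x]
        dy, dx = np.sin(theta), np.cos(theta)
        for step in range(1, max_steps):
            yy, xx = int(round(y + dy * step)), int(round(x + dx * step))
            if not (0 <= yy < height and 0 <= xx < width):
                break
            if edges[yy, xx] and abs(angles[yy, xx] - theta) < tolerance:
                distances.append(step)                    # distance to the next similar edge point
                break
    hist, _ = np.histogram(distances, bins=buckets, range=(0, max_steps))
    return hist / max(len(ys), 1)                         # normalize by number of edge points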

Now you have 5 histograms for each image. To compare two images, you take the absolute value of the difference between each histogram bucket, and then sum these values. For example, to compare images A and B, we would compute

|A.green_histogram.bucket_1 - B.green_histogram.bucket_1| 

for each bucket in the green histogram, and repeat for the other histograms, and then sum up all the results. The smaller the result, the better the match. Repeat for all images in the database, and the match with the smallest result wins. You'd probably want to have a threshold, above which the algorithm concludes that no match was found.
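
In code, the comparison is just an L1 distance over the buckets of all five histograms. A minimal sketch (the threshold value is arbitrary, and `database` is assumed to map image names to their precomputed histogram lists):

import numpy as np

def histogram_distance(hists_a, hists_b):
    # sum of absolute bucket differences across all histograms (smaller = better match)
    return sum(np.abs(a - b).sum() for a, b in zip(hists_a, hists_b))

def find_match(needle_hists, database, threshold=1.0):
    name, distance = min(((name, histogram_distance(needle_hists, hists))
                          for name, hists in database.items()), key=lambda pair: pair[1])
    return name if distance <= threshold else None        # None means "no match found"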


Third Choice - Keypoints + Decision Trees

A third approach that is probably much faster than the other two is using semantic texton forests (PDF). This involves extracting simple keypoints and using a collection of decision trees to classify the image. This is faster than simple SIFT keypoint matching, because it avoids the costly matching process, and the keypoints are much simpler than SIFT, so keypoint extraction is much faster. However, it preserves the SIFT method's invariance to rotation, scale, and lighting, an important feature that the histogram method lacks.

Update:

My mistake -- the Semantic Texton Forests paper isn't specifically about image matching, but rather region labeling. The original paper that does matching is this one: Keypoint Recognition using Randomized Trees. Also, the papers below continue to develop the ideas and represent the state of the art (c. 2010):

Solution 2

The best method I know of is to use a Perceptual Hash. There appears to be a good open source implementation of such a hash available at:

http://phash.org/

The main idea is that each image is reduced down to a small hash code or 'fingerprint' by identifying salient features in the original image file and hashing a compact representation of those features (rather than hashing the image data directly). This means that the false positives rate is much reduced over a simplistic approach such as reducing images down to a tiny thumbprint sized image and comparing thumbprints.

phash offers several types of hash and can be used for images, audio or video.
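
To make the 'fingerprint' idea concrete, here is a minimal average-hash style sketch; the real pHash works on a DCT of a reduced image rather than a plain mean, so treat this only as an illustration of the principle:

from PIL import Image

def tiny_fingerprint(path):
    # reduce to an 8x8 grayscale thumbnail and threshold each pixel against the mean
    img = Image.open(path).convert("L").resize((8, 8), Image.LANCZOS)
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (p > mean)                   # 64-bit fingerprint
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")                          # small distance => likely the same image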

Solution 3

This post was the starting point of my solution; lots of good ideas here, so I thought I would share my results. The main insight is that I've found a way to get around the slowness of keypoint-based image matching by exploiting the speed of phash.

For the general solution, it's best to employ several strategies. Each algorithm is best suited for certain types of image transformations and you can take advantage of that.

At the top are the fastest algorithms; at the bottom, the slowest (though more accurate). You might skip the slow ones if a good match is found at a faster level; a sketch of this tiered lookup follows the list.

  • file-hash based (md5,sha1,etc) for exact duplicates
  • perceptual hashing (phash) for rescaled images
  • feature-based (SIFT) for modified images
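
A rough sketch of that tiered lookup; `file_hash_index`, `phash_lookup`, and `sift_lookup` are placeholders for the cheaper and slower stages described in this answer:

import hashlib

def find_duplicate(path, file_hash_index, phash_lookup, sift_lookup):
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if digest in file_hash_index:                         # exact byte-for-byte duplicate
        return file_hash_index[digest]
    match = phash_lookup(path)                            # rescaled / recompressed duplicates
    if match is not None:
        return match
    return sift_lookup(path)                              # modified images (crop, rotate, mirror, ...)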

I am having very good results with phash. The accuracy is good for rescaled images. It is not good for (perceptually) modified images (cropped, rotated, mirrored, etc). To deal with the hashing speed we must employ a disk cache/database to maintain the hashes for the haystack.

The really nice thing about phash is that once you build your hash database (which for me is about 1000 images/sec), the searches can be very, very fast, in particular when you can hold the entire hash database in memory. This is fairly practical since a hash is only 8 bytes.

For example, if you have 1 million images, they require an array of 1 million 64-bit hash values (8 MB). On some CPUs this fits in the L2/L3 cache! In practical usage I have seen a Core i7 compare at over 1 giga-hamming-distances/sec; it is only a question of memory bandwidth to the CPU. A 1-billion-image database is practical on a 64-bit CPU (8 GB of RAM needed) and searches will not exceed 1 second!
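
A minimal sketch of that in-memory scan over 64-bit hashes with NumPy (popcount via a byte lookup table; actual throughput will of course vary by machine):

import numpy as np

POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_scan(needle_hash, database_hashes):
    # needle_hash: 64-bit int; database_hashes: np.uint64 array (8 bytes per image)
    xor = np.bitwise_xor(database_hashes, np.uint64(needle_hash))
    return POPCOUNT[xor.view(np.uint8).reshape(len(database_hashes), 8)].sum(axis=1)

# best_index = int(np.argmin(hamming_scan(needle_hash, database_hashes)))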

For modified/cropped images, a transform-invariant feature/keypoint detector like SIFT seems the way to go. SIFT will produce good keypoints that survive crop/rotate/mirror etc. However, the descriptor compare is very slow compared to the Hamming distance used by phash. This is a major limitation. There are a lot of compares to do, since there are at most IxJxK descriptor compares to look up one image (I = number of haystack images, J = target keypoints per haystack image, K = target keypoints per needle image).

To get around the speed issue, I tried using phash around each found keypoint, using the feature size/radius to determine the sub-rectangle. The trick to making this work well is to grow/shrink the radius to generate different sub-rect levels (on the needle image). Typically the first level (unscaled) will match, but often it takes a few more. I'm not 100% sure why this works, but I can imagine it enables features that are too small for phash to work on (phash scales images down to 32x32).
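
A rough sketch of that idea, assuming the imagehash library and a list of (x, y, size) keypoints; the scale factors, minimum radius, and helper names are illustrative, not the exact code used here:

import imagehash

def keypoint_hashes(img, keypoints, scales=(1.0, 1.5, 2.0)):
    # img: PIL image; keypoints: iterable of (x, y, size) tuples
    hashes = []
    for x, y, size in keypoints:
        for scale in scales:
            r = max(int(size * scale), 16)                # avoid sub-rects too small for phash's 32x32 rescale
            box = (int(x - r), int(y - r), int(x + r), int(y + r))
            hashes.append(imagehash.phash(img.crop(box)))
    return hashes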

Another issue is that SIFT will not distribute the keypoints optimally. If there is a section of the image with a lot of edges the keypoints will cluster there and you won't get any in another area. I am using the GridAdaptedFeatureDetector in OpenCV to improve the distribution. Not sure what grid size is best, I am using a small grid (1x3 or 3x1 depending on image orientation).
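
Note that GridAdaptedFeatureDetector only exists in OpenCV 2.x; on newer OpenCV a similar effect can be approximated by detecting per grid cell, roughly like this sketch (the per-cell feature count is an arbitrary choice):

import cv2

def grid_detect(gray, rows=1, cols=3, per_cell=150):
    detector = cv2.SIFT_create(nfeatures=per_cell)
    height, width = gray.shape
    keypoints = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            for kp in detector.detect(gray[y0:y1, x0:x1], None):
                kp.pt = (kp.pt[0] + x0, kp.pt[1] + y0)    # shift back to full-image coordinates
                keypoints.append(kp)
    return keypoints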

You probably want to scale all the haystack images (and the needle) to a smaller size prior to feature detection (I use 210px along the maximum dimension). This reduces noise in the image (always a problem for computer vision algorithms) and also focuses the detector on more prominent features.

For images of people, you might try face detection and use it to determine the image size to scale to and the grid size (for example largest face scaled to be 100px). The feature detector accounts for multiple scale levels (using pyramids) but there is a limitation to how many levels it will use (this is tunable of course).

The keypoint detector probably works best when it returns fewer than the number of features you asked for. For example, if you ask for 400 and get 300 back, that's good. If you get 400 back every time, probably some good features had to be left out.

The needle image can have fewer keypoints than the haystack images and still get good results. Adding more doesn't necessarily get you huge gains; for example, with J=400 and K=40 my hit rate is about 92%. With J=400 and K=400 the hit rate only goes up to 96%.

We can take advantage of the extreme speed of the hamming function to solve scaling, rotation, mirroring etc. A multiple-pass technique can be used. On each iteration, transform the sub-rectangles, re-hash, and run the search function again.
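
For example, a minimal sketch of that multi-pass idea with imagehash, re-hashing mirrored and rotated variants of the needle and keeping the best distance per candidate (the transform set here is only an example):

import imagehash
from PIL import Image

def multi_pass_distance(needle_img, candidate_hash):
    # needle_img: PIL image; candidate_hash: a precomputed imagehash value from the database
    variants = [needle_img,
                needle_img.transpose(Image.FLIP_LEFT_RIGHT),
                needle_img.transpose(Image.ROTATE_90),
                needle_img.transpose(Image.ROTATE_180),
                needle_img.transpose(Image.ROTATE_270)]
    return min(imagehash.phash(v) - candidate_hash for v in variants)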

Solution 4

My company receives about 24 million images from manufacturers every month. I was looking for a fast solution to ensure that the images we upload to our catalog are new images.

I want to say that I have searched the internet far and wide to attempt to find an ideal solution. I even developed my own edge detection algorithm.
I have evaluated the speed and accuracy of multiple models. My images, which have white backgrounds, work extremely well with phashing. Like redcalx said, I recommend phash or ahash. DO NOT use MD5 hashing or any other cryptographic hash unless you only want EXACT image matches. Any resizing or manipulation that occurs between images will yield a different hash.

For phash/ahash, check this out: imagehash

I wanted to extend redcalx's post with my code and my accuracy numbers.

What I do:

from PIL import Image
from PIL import ImageFilter
import imagehash

img1 = Image.open(r"C:\yourlocation")
img2 = Image.open(r"C:\yourlocation")

# Resize whichever image is wider down to the other's dimensions so the hashes are comparable
if img1.width < img2.width:
    img2 = img2.resize((img1.width, img1.height))
else:
    img1 = img1.resize((img2.width, img2.height))

# A light box blur suppresses noise and compression artifacts before hashing
img1 = img1.filter(ImageFilter.BoxBlur(radius=3))
img2 = img2.filter(ImageFilter.BoxBlur(radius=3))

# Subtracting two imagehash values gives their Hamming distance (lower = more similar)
phashvalue = imagehash.phash(img1) - imagehash.phash(img2)
ahashvalue = imagehash.average_hash(img1) - imagehash.average_hash(img2)
totalaccuracy = phashvalue + ahashvalue  # combined distance; smaller means a closer match

Here are some of my results:

item1   item2   totalaccuracy (lower = closer match)
desk1   desk1        3
desk1   phone1      22
chair1  desk1       17
phone1  chair1      34

Hope this helps!

Solution 5

As cartman pointed out, you can use any kind of hash value for finding exact duplicates.

One starting point for finding close images could be here. This is a tool used by CG companies to check if revamped images are still showing essentially the same scene.


Comments

  • meade
    meade about 3 years

    I'm looking to create a base table of images and then compare any new images against that to determine if the new image is an exact (or close) duplicate of the base.

    For example: if you want to reduce storage of the same image 100's of times, you could store one copy of it and provide reference links to it. When a new image is entered you want to compare to an existing image to make sure it's not a duplicate ... ideas?

    One idea of mine was to reduce to a small thumbnail and then randomly pick 100 pixel locations and compare.

  • meade
    meade almost 15 years
    So (trying to understand the Bloom filter) - does that mean you select random pixel points on the base image, randomly get either a red/green/blue value of the pixel - then compare to the new image? and then use a probability level (90% match) to determine how similar the two images are?
  • jdigital
    jdigital almost 15 years
    This isn't a similarity check, it's an equivalence check. If you need similarity, then hashing is not the right approach. The idea behind Bloom is to use multiple hash algorithms to increase the likelihood of unique identification. Selecting random points isn't the best approach for a hashing algorithm because it will yield different results each time.
  • meade
    meade almost 15 years
    The Histogram approach seems to make the most sense. I'm assuming you can rotate the image to perform this on all sides just in case the image being compared to was turned (treating the same image as 4) - thanks
  • Kyle Simek
    Kyle Simek almost 15 years
    @meade That's right. Something else to consider: depending on your problem, you might not need to use all 5 histograms in your algorithm. Discarding the texture direction histogram will allow you to match rotated versions of the picture. Discarding the texture scale histogram will allow you to match re-scaled versions of the image. You'll lose some ability to compare similarity, but this might not be a problem, depending on your situation. Also, since computing texture information is the most costly part of the algorithm, this will make your algorithm speedy, too.
  • dynamic
    dynamic almost 12 years
@redmoskito: I have a question. How do you get the numeric value of the green histogram, for example, so you can subtract it from the other image's histogram? Say we have a green histogram with 3 pixels in the 0-63 bucket and 5 pixels in the 64-127 bucket. What is the value?
  • Ikaso
    Ikaso about 11 years
Maybe my question is dumb, but I have no image processing background. Suppose I have a folder under which all my images are stored and I have duplicates (exactly the same images). Is the histogram approach sufficient? What kind of hash is appropriate for my case?
  • reox
    reox almost 11 years
@Ikaso If it's exactly the same image, you probably don't want to use anything like that; consider a simple CRC or MD5 comparison. If that is not sufficient, e.g. single pixels differ or the metadata has changed, the histogram method is also sufficient. If your images are the same but rotated or scaled, a histogram-based method can be sufficient but may fail. If your images have changed colors, you need to use interest-point-based algorithms.
  • GilLevi
    GilLevi almost 10 years
    I'd like to add that nowadays, many fast alternatives to SIFT exist, such as the FAST detector and binary descriptors (BRIEF, BRISK, ORB, FREAK, BinBoost) to name a few. A tutorial on binary descriptors can be found here: gilscvblog.wordpress.com/2013/08/26/…
  • Ambika
    Ambika about 8 years
@KyleSimek How can I compute the "texture direction histogram" in the histogram technique?
  • Alexey Voitenko
    Alexey Voitenko almost 7 years
Anyone interested in this method can find an Objective-C perceptual hash implementation at github.com/ameingast/cocoaimagehashing
  • Tezar
    Tezar over 6 years
    Original link for BRIEF is down, alternative: cs.ubc.ca/~lowe/525/papers/calonder_eccv10.pdf
  • Michael
    Michael about 6 years
    @AlexeyVoitenko Is this compatible with the hashes produced by phash.org in its default configuration?
  • figtrap
    figtrap about 6 years
    Just a question, why not reduce the images to greyscale first to avoid simple differences like color temp?
  • Rena
    Rena almost 5 years
    By "walks in that direction" you mean follows a line from the point, at 45°, looking for another point? Or looks through the list of points for the next one that's 45°? And does distance between them mean spatial distance in the image, or positional distance in the list of edge points? (In other words are you following the edge or scanning 2D space?)
  • Rena
    Rena almost 5 years
    In my experience phash works well for finding different sizes of the same image, but not for similar images. Eg two different photos of the same object might have very different hashes.
  • Rena
    Rena almost 5 years
    If you're looking for exact duplicates but with different formats/metadata, you can do a hash (eg MD5) of the actual pixel values. Imagemagick calls this a signature (not related to cryptographic signing). You could also reduce it first, eg truncating to 4 bits per pixel to reduce impact of JPEG artifacts, or convert to grayscale to match slightly recolored images.
  • Mizux
    Mizux about 3 years
SIFT was under a US patent until March 7th, 2020, so I suggest using ORB in OpenCV instead! Ref: patents.google.com/patent/US6711293B1/en
  • Prashant Akerkar
    Prashant Akerkar almost 3 years
Thanks. Could the use case linked below be a good fit? edaboard.com/threads/… Thanks & regards.