Compare two audio files

Solution 1

Copying from that answer:

This is the exact same question that people at the old AudioScrobbler, and currently at MusicBrainz, have been working on for a long time. For the time being, the Python project that can aid in your quest is Picard, which will tag audio files (not only MPEG-1 Layer 3 files) with a GUID (actually, several of them); from then on, matching the tags is quite simple.

If you prefer to do it as a project of your own, libofa might be of help. The documentation for the Python wrapper will perhaps help you the most.
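
To illustrate why matching becomes simple once files carry an identifier, here is a minimal sketch. It assumes the identifiers have already been read out of the tags by some tag reader and collected in a dictionary; the file names, the ID strings, and the function name are all hypothetical.

```python
from collections import defaultdict


def group_by_recording_id(tagged_files):
    """Group file paths that share the same identifier.

    tagged_files maps a file path to the identifier found in its tags,
    or to None when the tagger could not identify the file.
    """
    groups = defaultdict(list)
    for path, recording_id in tagged_files.items():
        if recording_id:  # skip unidentified files
            groups[recording_id].append(path)
    return dict(groups)


# Hypothetical input: paths and IDs are made up for illustration.
tagged = {
    "original.mp3": "id-aaa",
    "low_quality_copy.mp3": "id-aaa",
    "edited_version.mp3": "id-bbb",
    "unknown.mp3": None,
}
print(group_by_recording_id(tagged))
```

Files that land in the same group as the original are candidates for "the same song"; files with a different or missing identifier are not.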

Solution 2

This is actually not a trivial task. I do not think any off-the-shelf library can do it. Here is a possible approach:

  1. Decode the MP3 to PCM.
  2. Ensure that the PCM data has a specific sample rate, which you choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.
  3. Normalize the PCM data (i.e. find the maximum sample value and rescale all samples so that the sample with the largest amplitude uses the entire dynamic range of the data format; e.g. if the sample format is signed 16-bit, then after normalization the maximum-amplitude sample should have the value 32767 or -32767).
  4. Split the audio data into frames of a fixed number of samples (e.g. 1000 samples per frame).
  5. Convert each frame to the frequency domain (FFT).
  6. Calculate the correlation between the sequences of frames representing the two songs. If the correlation is greater than a certain threshold, assume the songs are the same.
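
Steps 3 to 6 can be sketched roughly as follows. This is only an illustration under several assumptions: decoding and resampling (steps 1 and 2) are taken as already done, so the input is a plain list of samples; a naive DFT from the standard library stands in for a real FFT (in practice something like NumPy's `numpy.fft.rfft` would be used); Pearson correlation over the concatenated frame spectra is one possible reading of "correlation between sequences of frames"; and all function names and the frame size of 64 are made up for the example.

```python
import cmath
import math


def normalize(samples, full_scale=32767):
    """Step 3: rescale so the loudest sample reaches full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0] * len(samples)
    gain = full_scale / peak
    return [s * gain for s in samples]


def split_frames(samples, frame_size):
    """Step 4: cut into fixed-size frames, dropping any short tail."""
    count = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(count)]


def magnitude_spectrum(frame):
    """Step 5: magnitudes of a naive DFT for bins 0..n/2 (slow, O(n^2))."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def pearson(a, b):
    """Pearson correlation of two sequences, truncated to equal length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def similarity(samples_a, samples_b, frame_size=64):
    """Step 6: correlate the concatenated frame spectra of two signals."""
    def features(samples):
        feats = []
        for frame in split_frames(normalize(samples), frame_size):
            feats.extend(magnitude_spectrum(frame))
        return feats
    return pearson(features(samples_a), features(samples_b))


def tone(freq, n=1024, rate=16000, amplitude=10000):
    """Synthetic test signal: a sine wave at the given frequency."""
    return [amplitude * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


original = tone(440)
quieter = [0.3 * s for s in original]        # same content, lower volume
print(similarity(original, quieter))         # close to 1.0
print(similarity(original, tone(3000)))      # noticeably lower
```

The threshold that separates "same" from "different" is not something this sketch can supply; it would have to be tuned on real recordings.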

An additional complication: your songs may have different lengths of silence at the beginning. To avoid false negatives, you may need an additional step:

3.1. Scan the PCM data from the beginning until the sound energy exceeds a predefined threshold (e.g. calculate the RMS with a sliding window of 10 samples and stop when it exceeds 1% of the dynamic range). Then discard all data up to this point.
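
A minimal sketch of step 3.1, assuming signed 16-bit samples held in a plain list. The function name and parameter names are hypothetical; the defaults follow the example figures above (a 10-sample window and 1% of full scale).

```python
def trim_leading_silence(samples, window=10, threshold_ratio=0.01, full_scale=32767):
    """Drop samples before the first window whose RMS exceeds the threshold.

    Returns an empty list when no window is loud enough, and the input
    unchanged when it is shorter than one window.
    """
    if len(samples) < window:
        return samples
    threshold = threshold_ratio * full_scale
    # RMS > threshold  is equivalent to  sum of squares > threshold^2 * window,
    # which avoids a square root on every step.
    limit = threshold * threshold * window
    energy = sum(s * s for s in samples[:window])
    start = 0
    while True:
        if energy > limit:
            return samples[start:]
        if start + window >= len(samples):
            return []
        # Slide the window by one sample: add the new one, drop the old one.
        energy += samples[start + window] ** 2 - samples[start] ** 2
        start += 1


padded = [0] * 50 + [5000] * 20
trimmed = trim_leading_silence(padded)
print(len(padded), len(trimmed))
```

Note that the cut lands at the start of the first loud window, so up to `window - 1` quiet samples can remain in front of the first loud one.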

Solution 3

First, you will have to change your domain of comparison. Analyzing raw samples from the uncompressed files will get you nowhere. Your distance measure will be based on one or more features that you extract from the audio samples. Wikipedia lists the following features as commonly used for Acoustic Fingerprinting:

Perceptual characteristics often exploited by audio fingerprints include average zero crossing rate, estimated tempo, average spectrum, spectral flatness, prominent tones across a set of bands, and bandwidth.
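
Two of the features named in that list, zero crossing rate and spectral flatness, are simple enough to sketch. The code below is an illustration only, not a fingerprinting system: the function names are made up, a naive DFT stands in for a proper FFT, and spectral flatness is computed in its usual form as the geometric mean of the power spectrum divided by its arithmetic mean.

```python
import cmath
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


def power_spectrum(frame):
    """Power per DFT bin, skipping the DC and Nyquist bins (naive, O(n^2))."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(1, n // 2)
    ]


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Near 0 for a pure tone, larger for noise-like signals. eps keeps
    log() defined when a bin is exactly zero.
    """
    power = [p + eps for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


rate, n = 16000, 256
low_tone = [math.sin(2 * math.pi * 500 * t / rate) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * t / rate) for t in range(n)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

print(zero_crossing_rate(low_tone), zero_crossing_rate(high_tone))
print(spectral_flatness(low_tone), spectral_flatness(noise))
```

A distance measure would then compare vectors of such features computed per frame, rather than the raw samples.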

I don't have programmatic solutions for you but here's an interesting attempt at reverse engineering the YouTube Audio ID system. It is used for copyright infringement detection, a similar problem.

Author: Sasha Chedygov

Updated on October 09, 2020

Comments

  • Sasha Chedygov over 3 years

    Basically, I have a lot of audio files representing the same song. However, some of them are worse quality than the original, and some are edited to where they do not match the original song anymore. What I'd like to do is programmatically compare these audio files to the original and see which ones match up with that song, regardless of quality. A direct comparison would obviously not work because the quality of the files varies.

    I believe this could be done by analyzing the structure of the songs and comparing to the original, but I know nothing about audio engineering so that doesn't help me much. All the songs are of the same format (MP3). Also, I'm using Python, so if there are bindings for it, that would be fantastic; if not, something for the JVM or even a native library would be fine as well, as long as it runs on Linux and I can figure out how to use it.

  • Sasha Chedygov almost 14 years
    I ended up using Picard, at least for now. Thanks. :)
  • LINGS over 10 years
    The PCM data is a byte array, right? In step 3, since normalizing requires amplitudes up to 32767, I believe you would be converting it to an integer/double array. Please correct me if I'm wrong. Also, do we need to calculate correlation in step 6? What if we just compare the FFT values and see if they fall within a threshold?
  • atzz over 10 years
    @LINGS (3) assumes that the PCM data from step (1) is an array of an appropriate type (e.g. int16 or float32). But if the decoder of choice returns raw bytes, then yes, a conversion step is needed.
  • atzz over 10 years
    @LINGS re step (6): a simple difference won't work if your solution has to tolerate noise, because some noise, such as clicks or claps, causes a large difference in the FFT. An integrated difference might work, though. I'm not sure correlation is the best comparison method here; I didn't research it as much as I probably should have, but it worked OK when I implemented something similar.
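
On the raw-bytes case discussed in the comments above: if the decoder hands back a byte string, the conversion step could look like the sketch below. It assumes signed 16-bit little-endian mono PCM, which is common but not guaranteed for every decoder; the function names are hypothetical.

```python
import struct


def pcm16le_to_ints(raw):
    """Interpret raw bytes as signed 16-bit little-endian samples."""
    count = len(raw) // 2  # a trailing odd byte, if any, is ignored
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))


def ints_to_floats(samples):
    """Map int16 samples to floats in the range [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


raw = b"\x00\x00\xff\x7f\x01\x80\xff\xff"
ints = pcm16le_to_ints(raw)
print(ints)                  # [0, 32767, -32767, -1]
print(ints_to_floats(ints))
```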