Algorithms for determining the key of an audio sample


Solution 1

It's worth being aware that this is a very tricky problem, and if you don't have a background in signal processing (or an interest in learning about it) then you'll have a very frustrating time ahead of you. If you're expecting to throw a couple of FFTs at the problem then you won't get very far. I hope you do have the interest, as it is a really fascinating area.

Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic sources (e.g. voice) using a method such as autocorrelation or the harmonic sum spectrum (e.g. see Paul R's link). However, you'll often find that this gives the wrong results: you'll often get half or double the pitch you were expecting. This is called pitch period doubling, or an octave error, and it occurs essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If an instrument is played by a human there will always be some variation.
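As a rough illustration (my own sketch, not part of the original answer), a minimal autocorrelation pitch estimator might look like the following. Note how the lag search is restricted to a plausible pitch range; without that constraint, the octave errors described above show up readily:

```python
import numpy as np

def estimate_pitch_autocorr(signal, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency of a monophonic signal
    from the strongest peak in its autocorrelation."""
    # Full autocorrelation; keep only non-negative lags
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]
    # Constrain the lag search to the plausible pitch range --
    # this is what keeps half/double-pitch peaks out of the running
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / lag

# A clean 440 Hz sine should come out very close to 440 Hz
sr = 44100
t = np.arange(4096) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
print(estimate_pitch_autocorr(tone, sr))  # close to 440
```

On real instrument recordings the autocorrelation peaks are far less clean than on a synthetic sine, which is exactly where the doubling errors creep in.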

Some people approach the problem of key recognition by doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. And even if you do have a monophonic sequence, there's still no clear-cut method of determining the key: how do you deal with chromatic notes, for instance, or decide whether the key is major or minor? So you'd need to use a method similar to Krumhansl's key-finding algorithm.

So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you're going to have a rich spectral soup of many sinusoids playing at once. Each individual note comprises multiple harmonics of a fundamental frequency, so an A (at 440 Hz) consists of sinusoids at 440, 880, 1320... Hz. Furthermore, if you play an E (see this diagram for pitches), that is 659.26 Hz, which is almost one and a half times the frequency of A (actually 1.498 times). This means that every 3rd harmonic of A coincides with every 2nd harmonic of E. This is the reason that chords sound pleasant: they share harmonics. (As an aside, the whole reason that Western harmony works is the quirk of fate that the twelfth root of 2 raised to the power 7 is nearly 1.5.)
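That "quirk of fate" is easy to check numerically (my own illustration):

```python
# The equal-tempered fifth is 2**(7/12) ~ 1.4983, very close to the
# just ratio 3/2 -- which is why harmonics of A and E nearly coincide.
fifth = 2 ** (7 / 12)
a4 = 440.0
e5 = a4 * fifth            # ~659.26 Hz, the E above A4

# Every 3rd harmonic of A lands almost exactly on every 2nd harmonic of E:
print(3 * a4)              # 1320.0
print(2 * e5)              # ~1318.5
```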

If you look beyond this interval of a fifth to major, minor and other chords, you'll find other ratios. I think that many key-finding techniques enumerate these ratios and then fill a histogram for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760 and 1977 Hz. For B5 it'll be 494, 988, 741 Hz, etc. So create a frequency histogram, and for every sinusoidal peak in the signal (e.g. from the FFT power spectrum) increment the corresponding histogram entry. Then for each key A-G tally up the bins in your histogram; the one with the most entries is most likely your key.

That's just a very simple approach, but it may be enough to find the key of a strummed or sustained chord. You'd also have to chop the signal into small intervals (e.g. 20 ms) and analyse each one to build up a more robust estimate.
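A minimal sketch of this histogram idea might look like the following. It is my own illustration, not the answerer's code: it folds spectral energy into the 12 pitch classes and then scores each candidate key against the published Krumhansl-Kessler profiles (the key-finding method mentioned above); the function names are hypothetical.

```python
import numpy as np

# Krumhansl-Kessler key profiles: perceptual weight of each scale degree
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def chroma_histogram(signal, sample_rate, frame_size=4096):
    """Accumulate FFT magnitude into 12 pitch-class bins (0 = A)."""
    hist = np.zeros(12)
    for start in range(0, len(signal) - frame_size, frame_size):
        frame = signal[start:start + frame_size] * np.hanning(frame_size)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_size, 1.0 / sample_rate)
        for f, mag in zip(freqs, spectrum):
            if 55.0 < f < 2000.0:          # ignore rumble and very high bins
                semitones = 12 * np.log2(f / 440.0)
                hist[int(round(semitones)) % 12] += mag
    return hist

def estimate_key(hist):
    """Correlate the histogram with each rotated major/minor profile."""
    best = (-np.inf, None)
    for tonic in range(12):
        for name, profile in (("major", MAJOR_PROFILE), ("minor", MINOR_PROFILE)):
            score = np.corrcoef(np.roll(profile, tonic), hist)[0, 1]
            if score > best[0]:
                best = (score, f"{NOTE_NAMES[tonic]} {name}")
    return best[1]

# Try it on a synthetic A major triad (A4 + C#5 + E5 sine waves)
sr = 44100
t = np.arange(sr) / sr
chord = sum(np.sin(2 * np.pi * 440.0 * 2 ** (s / 12) * t) for s in (0, 4, 7))
print(estimate_key(chroma_histogram(chord, sr)))
```

On real recordings you would want proper peak picking rather than dumping every bin into the histogram, but the shape of the approach - chroma accumulation followed by profile matching - is the same.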

EDIT:
If you want to experiment then I'd suggest downloading a package like Octave or CLAM which makes it easier to visualise audio data and run FFTs and other operations.

Other useful links:

  • My PhD thesis on some aspects of pitch recognition -- the maths is a bit heavy going, but chapter 2 is (I hope) quite an accessible introduction to the different approaches to modelling musical audio
  • http://en.wikipedia.org/wiki/Auditory_scene_analysis -- Bregman's Auditory Scene Analysis, which, though not about music specifically, has some fascinating findings about how we perceive complex scenes
  • Dan Ellis has done some great papers in this and similar areas
  • Keith Martin has some interesting approaches

Solution 2

I worked on the problem of transcribing polyphonic CD recordings into scores for more than two years at university. The problem is notoriously hard. The first scientific papers related to it date back to the 1940s, and to this day there are no robust solutions for the general case.

All the basic assumptions you usually read about are not exactly right, and most of them are wrong enough to become unusable for everything but very simple scenarios.

The frequencies of overtones are not exact multiples of the fundamental frequency - there are non-linear effects that make the high partials drift away from their expected frequencies - and not just by a few hertz; it is not unusual to find the 7th partial where you expected the 6th.

Fourier transforms do not play nicely with audio analysis because the frequencies one is interested in are spaced logarithmically, while the Fourier transform yields linearly spaced frequencies. At low frequencies you need high frequency resolution to separate neighboring pitches - but this costs time resolution, and you lose the ability to separate notes played in quick succession.
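A quick back-of-the-envelope calculation (my own, for illustration) shows how bad the mismatch is at the low end:

```python
# With a 4096-point FFT at 44.1 kHz, every bin has the same width:
sample_rate, n = 44100, 4096
bin_width = sample_rate / n                  # ~10.77 Hz, everywhere

# But the gap between neighbouring semitones shrinks with pitch:
def semitone_gap(f):
    """Distance in Hz from frequency f to the next semitone up."""
    return f * (2 ** (1 / 12) - 1)

print(semitone_gap(82.41))    # low E on a guitar: ~4.9 Hz -> less than 1 bin!
print(semitone_gap(659.26))   # E5: ~39.2 Hz -> several bins apart
```

So at the bottom of the guitar's range adjacent semitones fall into the same FFT bin, and the only linear-FFT fix - a longer window - destroys the time resolution needed for fast passages. This is exactly the problem the constant-Q transform addresses.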

An audio recording does (probably) not contain all the information needed to reconstruct the score. A large part of our music perception happens in our ears and brain. That is why some of the most successful systems are expert systems with large knowledge repositories about the structure of (Western) music, which rely on signal processing only to a small extent to extract information from the audio recording.

When I am back home I will look through the papers I have read, pick the 20 or 30 most relevant ones, and add them here. I really suggest reading them before you decide to implement something - as stated before, most common assumptions are somewhat incorrect, and you really don't want to rediscover while implementing and testing all the things that have been found and analyzed over more than 50 years.

It's a hard problem, but it's a lot of fun, too. I would really like to hear what you tried and how well it worked.


For now you may have a look at the constant-Q transform, the cepstrum, and the Wigner–Ville distribution. There are also some good papers on how to extract frequency from shifts in the phase of short-time Fourier spectra - this allows the use of very short window sizes (for high time resolution) because the frequency can be determined with a precision several thousand times finer than the frequency resolution of the underlying Fourier transform.
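The phase-shift trick can be sketched in a few lines (my own illustration of the general technique, as used in phase vocoders; the function name is hypothetical). The idea: take two overlapping short-time frames, and use how much the phase of the peak bin advanced between them - beyond what a sinusoid exactly at the bin centre would have produced - to refine the frequency estimate far below the bin spacing:

```python
import numpy as np

def refined_frequency(signal, sample_rate, n=1024, hop=256):
    """Refine the frequency of the dominant sinusoid using the phase
    shift between two overlapping short-time Fourier frames."""
    win = np.hanning(n)
    x1 = np.fft.rfft(signal[:n] * win)
    x2 = np.fft.rfft(signal[hop:hop + n] * win)
    k = np.argmax(np.abs(x1))                  # coarse estimate: the peak bin
    # Phase advance a sinusoid exactly at the bin centre would produce
    expected = 2 * np.pi * k * hop / n
    # Measured advance minus expected, wrapped into [-pi, pi)
    deviation = np.angle(x2[k]) - np.angle(x1[k]) - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi
    return (k + deviation * n / (2 * np.pi * hop)) * sample_rate / n

sr = 44100
t = np.arange(2048) / sr
f_est = refined_frequency(np.sin(2 * np.pi * 440.0 * t), sr)
# Bin spacing here is ~43 Hz, yet the phase-based estimate lands
# within a small fraction of a hertz of the true 440 Hz.
print(f_est)
```

The catch is that the refinement is only unambiguous while the true frequency stays within roughly half the wrap range of the chosen hop, which is why real phase-vocoder analyses use small hops.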

All these transformations fit the problem of audio processing much better than the ordinary Fourier transform. To improve the results of the basic transformations, have a look at the concept of energy reassignment.

Solution 3

You can use the Fourier transform to calculate the frequency spectrum of an audio sample. From this output, you can use the frequency values of particular notes to turn it into a list of the notes heard during the sample. Choosing the strongest notes heard per sample over a series of samples should give you a decent map of the different notes used, which you can compare against the various musical scales to get a list of the possible scales that contain that combination of notes.

To help decide which particular scale is being used, make a note (no pun intended) of the most frequently heard notes. In Western music, the root of the scale is typically the most common note heard, followed by the fifth, and then the fourth. You can also look for patterns such as common chords, arpeggios, or progressions.
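The scale-matching step above is straightforward set arithmetic. A small sketch (my own, with hypothetical names) that finds which major scales contain all the detected pitch classes:

```python
# Which major scales contain every note we heard?
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]     # semitone pattern of the major scale

def candidate_major_scales(heard):
    """Return the roots of the major scales that are a superset of `heard`."""
    heard_idx = {NOTES.index(n) for n in heard}
    candidates = []
    for root in range(12):
        scale = {(root + step) % 12 for step in MAJOR_STEPS}
        if heard_idx <= scale:
            candidates.append(NOTES[root])
    return candidates

# A major pentatonic subset is ambiguous between several keys --
# which is where note-frequency statistics (root, fifth, fourth) come in:
print(candidate_major_scales(["C", "D", "E", "G", "A"]))
```

As the answer notes, this usually leaves several candidates (here C, F and G major), so the tally of how often each note occurs is what breaks the tie.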

Sample size will probably be important here. Ideally, each sample will cover a single note (so that you don't get two chords in one sample). If you filter out everything but the low frequencies, you may be able to use the volume spikes ("clicks") normally associated with percussion instruments to determine the song's tempo and "lock" your algorithm to the beat of the music. Start with samples that are a half-beat in length and adjust from there. Be prepared to throw out samples that don't contain much useful data (such as a sample taken in the middle of a slide).

Solution 4

As far as I can tell from this article, various keys each have their own common frequencies, so it likely analyzes the audio sample to detect what the most common notes and chords are. After all, multiple keys can share the same configuration of sharps and flats, the difference being the note the key starts on and thus the chords those keys use; so counting how often the significant notes and chords appear seems to be the only real way you could figure that sort of thing out. I don't really think you can get a layman's explanation of the actual mathematical formulas without leaving out a lot of information.

Do note that this is coming from somebody who has absolutely no experience in this area, with his first exposure being the article linked in this answer.

Solution 5

It's a complex topic, but a simple algorithm for determining a single note (a single piano key, say) would look like this:

Do a Fourier transform on, let's say, 4096 samples (the exact size depends on your resolution demands) of the part of the recording that contains the note. Determine the power peak in the spectrum - this is the frequency of the note.
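That two-step recipe translates almost directly into code (my own sketch of it, with a hypothetical function name; as the comments above warn, this only works for a clean single note):

```python
import numpy as np

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def detect_note(samples, sample_rate):
    """Find the strongest spectral peak and name the nearest note."""
    windowed = samples * np.hanning(len(samples))     # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), 1.0 / sample_rate)
    peak = freqs[np.argmax(spectrum)]                 # frequency of the note
    semitones = int(round(12 * np.log2(peak / 440.0)))
    return NOTE_NAMES[semitones % 12]

sr = 44100
t = np.arange(4096) / sr
print(detect_note(np.sin(2 * np.pi * 523.25 * t), sr))   # a C5 sine -> "C"
```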

Things get harder if you have a chord, different instruments/effects, or a polyphonic musical texture.

Author: Alex

Updated on July 23, 2020

Comments

  • Alex
    Alex almost 4 years

    I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about trying to approximate the key of a musical audio sample?

    Antares Autotune and Melodyne are two pieces of software that do this sort of thing.

    Can anyone give a bit of a layman's explanation of how this would work - mathematically deducing the key of a song by analysing the frequency spectrum for chord progressions, etc.?

    This topic interests me a lot!

    Edit - brilliant sources and a wealth of information to be found from everyone who contributed to this question.

    Especially from: the_mandrill and Daniel Brückner.

  • Alex
    Alex almost 14 years
    Yes I think you'd need a fairly clean sample to work with. Plus one that fits with Western tonal structures too of course. Good answer, many thanks.
  • Paul R
    Paul R almost 14 years
    It's not that easy to extract pitch from a power spectrum - there are much better pitch detection algorithms.
  • Paul R
    Paul R almost 14 years
    Frequency of a peak != pitch, at least for musical instruments. Better to use one of the popular pitch detection algorithms.
  • Alex
    Alex almost 14 years
    The whole process is a complex one. But very interesting. Chords, I think, create much of the complexity, as they generate their own resonances and harmonic frequencies that must be very difficult to account for in an algorithm!
  • Alex
    Alex almost 14 years
    I'm not sure this could work on chords as you are hearing many pitches at once.
  • bta
    bta almost 14 years
    @AlexW- Yes harmonic resonance is present, but it appears at a much lower magnitude than the chord itself. If you know the chord, you can predict the harmonics that might be heard and filter them out of your results accordingly.
  • Alex
    Alex almost 14 years
    @bta yes that's true. Going by the material generated from this page, it's an all round tricky task. Maybe if you can strip away unnecessary artefacts from music, it would be easier to determine the key (to add a bandpass filter first to get rid of high- and low-frequencies).
  • Alex
    Alex almost 14 years
    @Paul R - yes I've seen that the perception of volume of a pitch is determined by its frequency, not by some other measure. This also confuses me a bit though.
  • bta
    bta almost 14 years
    @AlexW- I would recommend starting with something recorded as a series of electronic tones (from an electronic keyboard, perhaps). Simple tones are much easier to work with, and once you get the hang of that you can slowly move to more and more complex sounds. Real-world instruments (and to a greater degree, voices) are a complex combination of sounds and are much tougher to crack; if you are targeting a specific instrument, it helps if you can filter out anything outside that instrument's range.
  • SigTerm
    SigTerm almost 14 years
    A good pitch detection algorithm should not be detecting chords or determining "major or minor". It should be detecting individual notes. This is how an ear with absolute pitch works (I do have the ability + musical education) - I do not hear "C major chord". I hear C+E+G and then determine that it is, indeed, a C major chord. Even if you sit on the piano keyboard or press a combo of random keys (like C+Cis+D+Fis+G+Bes+B), I will still be able to name every note, although it will not be a "chord". This is because (my) ear does not operate on chords or tonalities. It operates on notes.
  • SigTerm
    SigTerm almost 14 years
    The ear detects individual frequencies (to be more precise, the brain does the analysis) and maps them to chromatic note names (C, Cis/Des, D, etc.). After that, the combination of notes can be analyzed and recognized as some kind of chord, and you'll be able to guess the tonality. I believe that computer tone detection should work in a similar way. And another thing - the easiest way to detect keys or chords will probably be processing the signal histogram, because every key or chord will be visible as a "blip" on the histogram at certain frequencies.
  • Paul R
    Paul R almost 14 years
    @AlexW: pitch is a percept rather than an actual physical quantity, but it's usually quite close to the fundamental frequency of the note being played. In some instruments though the fundamental frequency may be of quite low amplitude, or even missing altogether, hence the need to use a proper pitch detection algorithm rather than a power spectrum.
  • Paul R
    Paul R almost 14 years
    @AlexW: yes, chords are going to be tricky - you are going to want to sample the more melodic and monophonic parts of the music.
  • the_mandrill
    the_mandrill almost 14 years
    @SigTerm: the problem isn't as clear cut as you make out. When there are multiple instruments playing (and in particular for orchestral scores) it's simply not possible to hear every single note, but yet it can be simple to hear the chord. From a signal processing point of view the problem is ambiguous since you have several instruments playing the same pitch, or at (almost) integer multiples thereof. Therefore the signal from each instrument isn't orthogonal. I think it was one of Tangian's papers who showed that a complex tone can be indistinguishable from a chord. (see above for link)
  • the_mandrill
    the_mandrill almost 14 years
    Besides, polyphonic pitch recognition is incredibly difficult (there are a handful of systems in the world that perform well) and is therefore unsuited to being a front end to a chord/key finding system.
  • Alex
    Alex almost 14 years
    That would be a 'Mechanical Turk With Pitch Perfect Musical Understanding'... good luck finding your source!
  • Alex
    Alex almost 14 years
    At the moment this is, to be honest, a rather vague notion. The maths is intimidating, but it's important to remember that tools exist to help with 'boilerplate' Fourier transforms. It's a case of understanding this data and experimenting with algorithms.
  • Alex
    Alex almost 14 years
    I would start by going from a single note, recording frequency bands and analysing the frequency of all the 12 semi-tones. This would build a database of associated harmonics as well as the actual base frequency for that particular pitch. Once you have built an extensive database of harmonic and base note frequencies, you might be able to take measurements from chords and approximate the note combinations based upon earlier readings. This method may not work in a good enough time frame for real-time analysis, but it could work once some extensive reference analysis of the music has been performed.
  • Alex
    Alex almost 14 years
    You could model the waveform used for frequency measurements using acoustically shaped oscillators, to mimic more natural harmonic structures and complexity.
  • Alex
    Alex almost 14 years
    +1. For me though as things stand, I do not have the mathematical knowledge to fully comprehend Constant Q transform like you. What I can do though is try and think of practical solutions based on my not particularly extensive knowledge of computing and programming.
  • SigTerm
    SigTerm almost 14 years
    @the_mandrill: "not possible to hear every single note" it IS possible to hear every single one. As I said, I'm not an expert in audio processing, but I had more than enough musical training, and I can name note by hearing. With very complex chord (12+ notes) picking every note will take more time. With simple chord(4+), it is instant, but you always can name them all. Also chords can be positioned differently. C major chord (C+E+G) can be C5+E5+G5, E5+G5+C6, C5+G5+E3, C4+C5+G5+E6, etc. Different notes, same harmony. This makes chord detection useless. You need to pick individual notes.
  • SigTerm
    SigTerm almost 14 years
    @the_mandrill: With complex-sounding harmonics, when you're recognizing pitch by ear (and when you can't name all notes instantly), it goes like this: You concentrate on sound of one instrument, then for all currently "active" sounds of every instrument, you concentrate on individual notes and "name" them. Recognition of one note (ear) is instant. Not sure how brain does it, "concentrating" is probably equivalent to setting up filter sensitivity, and picking up individual notes probably equals to histogram scanning. Also, don't forget that it may be possible to use trained neural networks.
  • SigTerm
    SigTerm almost 14 years
    @the_mandrill: "polyphonic pitch recognition is incredibly difficult" difficult or not, it is the proper way to do it. Chord detection system will screw up on non-standard music with dissonant chords. Dodecaphonic music, for example, maybe even jazz.
  • the_mandrill
    the_mandrill almost 14 years
    @AlexW - if you are only wanting to recognise one specific instrument then building up a harmonic database may work well enough. There was some work to do this for piano transcription that made explicit use of the fact that there is slightly inharmonicity in the higher harmonics (as Daniel mentions).
  • the_mandrill
    the_mandrill almost 14 years
    @SigTerm: It's not always possible (or necessary) to hear every single note. A chord composed of C4+C5 may be indistinguishable from a complex tone at C4. The only reason you may be able to hear it as two notes is that you have a prior expectation of the harmonic structure of that particular instrument. If you construct it out of sine waves (which intrinsically is what you're detecting) then it can be impossible to detect. Similarly C4+C5+G5 sounds just like a complex tone at C4. So the whole problem of chord recognition is ambiguous. See Terhardt's virtual pitch theory for more.
  • SigTerm
    SigTerm almost 14 years
    @the_mandrill: "A chord of C4+C5 may be indistinguishable from a complex tone at C4" Although it is possible theoretically, it doesn't match my musical experience. I haven't ever heard a sound like this from a real instrument. "expectation of the harmonic". In this case you'll need a solution that will rapidly train itself, classify frequency patterns, and associate them with notes. This obviously won't be a simple numeric algorithm, it will be slow, results will be probabilistic, not precise, but I think it can be done at least for one polyphonic instrument at a time. Or a harmonic database.
  • Alex
    Alex almost 14 years
    @SigTerm it's interesting that you have those differences in audible recognition of tones. The human brain has an incredible advantage with this sort of ability, to recognise the note from training and from musicality generally. Transferring this skill to a computer seems to test the best minds in the field.
  • Tomer Vromen
    Tomer Vromen almost 14 years
    +1 for the 2^(7/12) ~= 1.5 bit. I've been wondering about that for some time.
  • hotpaw2
    hotpaw2 over 13 years
    @SigTerm : How you think you perceive tones and how you perceive tones are not the same thing. The human brain can easily fool a person into thinking they are hearing by one method when some other process is going on in the brain stem. Auditory illusions can demonstrate some of this.
  • hotpaw2
    hotpaw2 over 13 years
    +1 for mentioning inharmonicity of the overtone series. I can actually see this in some of my instruments using multiple simultaneous strobe tuners tuned to an overtone series. Notes can also "bend" in frequency as they evolve in time.
  • the_mandrill
    the_mandrill over 13 years
    +1 for mentioning Wigner-Ville. If I was looking at this problem again now then I would certainly be looking at Time-Frequency methods that can trade off time against space. This is also a better model of how we perceive pitch.
  • supercat
    supercat over 11 years
    @SigTerm: On many pipe organs, there are stops labeled 2 2/3' and 1 3/5' which produce pitches that represent the third and fifth harmonics of a stop at 8'. If one draws just those two stops and plays a melody in C major, and on the other manual one draws a quiet 8' stop and plays the appropriate chords in C major, the perception will be of a melody played in C major, even if what one is actually hearing will be the melody played in G major and E major.
  • woojoo666
    woojoo666 over 7 years
    any examples on "good papers on how to extract the frequency from shifts in the phase of short time Fourier spectra"? Not sure what to google search here
  • Meekohi
    Meekohi over 6 years
    Would love to see some specific papers you suggest reading to get started!