Counting Syllables In A Word


Solution 1

Ambiguity is a huge issue in natural language processing, but some tasks can be handled with good accuracy despite it. It turns out syllabification is one of them, so don't listen to the other answers. :)

Syllabification

Heuristic-based

You could come up with rule-based algorithms that syllabify virtually the whole English vocabulary correctly, but they seem complicated to program correctly.

Corpus-based

As always, when hand-made algorithms don't help much, natural language processing researchers turn to hand-tagged corpora containing the correct answers for given words. Learning algorithms are then trained on these corpora and often provide great accuracy. You can use LingPipe's syllabification (see "English syllabification"), which follows this approach.

Exhaustive list

English only has so many words, which is how we came up with dictionaries. Such dictionaries often contain the correct syllabification. You could scrape reference.com. For example, the undulate entry contains « un·du·late », which is enough to know there are three syllables.

Other such dictionaries include Answers.com, The Free Dictionary, Merriam-Webster, and so on. Do read the Terms and Conditions, since automated retrieval may not be allowed. Also note that different dictionaries don't always agree with each other.

It won't help with new words or proper nouns, but I'd say it's going to be the most accurate method.
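As a minimal sketch of the lookup idea, assuming you already have the syllabified headword as a string (how you obtain it is up to you, and no scraping is shown here), counting is just a split on the separator. The function names and the tiny SYLLABIFIED table are hypothetical, made up for illustration.

```python
from typing import Optional

# The middle dot that dictionaries use between syllables (U+00B7).
MIDDLE_DOT = "\u00b7"

# Hypothetical in-memory table: word -> syllabified headword.
# In practice this would be filled from whatever dictionary source you may use.
SYLLABIFIED = {
    "undulate": "un\u00b7du\u00b7late",
}


def count_syllables_from_entry(entry: str, separator: str = MIDDLE_DOT) -> int:
    """Count the separator-delimited chunks in a syllabified headword."""
    return len([part for part in entry.strip().split(separator) if part])


def lookup_syllables(word: str, table: dict = SYLLABIFIED) -> Optional[int]:
    """Return the syllable count for a known word, or None if it is not listed."""
    entry = table.get(word.lower())
    if entry is None:
        return None  # new words and proper nouns end up here
    return count_syllables_from_entry(entry)
```

With this table, lookup_syllables("undulate") gives 3, and an unknown word gives None, so the caller can fall back to another method.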

About hyphenation

Another related problem got a lot more exposure: hyphenation. But don't use that! It is used in typesetting programs such as LaTeX, but it only aims to provide some of the correct hyphens without ever providing an incorrect one (high precision, low recall). It's interesting to note that there are only 14 exceptions, e.g. project, which is hyphenated differently depending on the part of speech (verb or noun).

Hyphenation programs

If you decide that it's enough for your needs, note that a few implementations of the TeX hyphenation algorithm exist in other languages, such as Python, Perl or Ruby.

Solution 2

I'm looking for a fully accurate statement of an algorithm to count syllables in words

There isn't one. Period. Whatever algorithm you invent, I promise to find a counterexample. In certain languages (Armenian and Russian come to mind) the algorithm is pretty straightforward: count the number of vowels. In other languages, such as German, it's not as straightforward but still doable. In English, I am afraid, the mapping between letters and sounds is absolutely irregular.

For example,

coincidence: the oi is to be counted as two syllables, but in boil it's only one syllable. Also, not counting the final vowel is not always accurate. Consider the names Penelope or Hermione. Or banana.

Another curious case is when a syllable exists without a printed vowel. For example, table is a bisyllabic word, but the second syllable is generated by the invisible sound between b and l. Also, don't forget about words of Greek origin, which can have a lot of consecutive vowels, e.g. onomatopoeia.

So, there is no accurate algorithm. The only way you can go is to try to find an algorithm which works in many (I am avoiding the word most) cases. But in this case you should redefine your requirements.
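To make the counterexamples concrete, here is a rough sketch of a typical naive heuristic: count vowels, count a run of adjacent vowels only once, and ignore a word-final vowel. This is only one possible reading of such rules. The function name is made up, y is assumed not to be a vowel, and the "conventional" counts in the table are the usual dictionary values, not something computed.

```python
VOWELS = set("aeiou")  # assumption: 'y' is not treated as a vowel


def naive_syllable_count(word: str) -> int:
    """Count vowel runs, after dropping a single word-final vowel."""
    letters = word.lower()
    # Ignore a final vowel ("side" should come out as 1).
    if letters and letters[-1] in VOWELS:
        letters = letters[:-1]
    count = 0
    previous_was_vowel = False
    for ch in letters:
        is_vowel = ch in VOWELS
        # A run of adjacent vowels ("rain") is counted once.
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    return count


# Conventional syllable counts for the words discussed above.
CONVENTIONAL = {
    "boil": 1,
    "coincidence": 4,
    "Penelope": 4,
    "Hermione": 4,
    "banana": 3,
    "table": 2,
    "onomatopoeia": 6,
}

if __name__ == "__main__":
    for word, expected in CONVENTIONAL.items():
        got = naive_syllable_count(word)
        flag = "ok" if got == expected else "MISMATCH"
        print(f"{word:14} heuristic={got} conventional={expected} {flag}")
```

Under this reading only boil comes out right; every other word in the table is undercounted, which is the point of the examples.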

Solution 3

Old question, but still, people probably read it once in a while and it is an open question.

Words aren't built up out of discrete, well-defined, agreed-upon syllables. You try your best to separate language into syllables, and the way you do it depends on the purpose: some methods are more phonetic, others rely more on spelling.

Phonetic methods produce different results depending on the accent or dialect of the speaker, and/or how clearly each individual is speaking at a particular time. In some phonetic methods, syllables share sounds - i.e. the last sound in one syllable can be the first in the next, and this can cross word boundaries.

What is taught in schools (if the school bothers at all) is often a mixture of spelling and phonetic rules designed to help children spell. They try to have a few memorable rules that work a lot of the time; they aren't meant to be 100% correct or exhaustive.

With any particular method, you'll likely find things that don't sound right to you.

Now the answer: for a readability metric, it won't matter much which method is used. Even just counting the letters (or vowels) in the words can work. If you are trying to match someone else's results, then you need to know their method.
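A small sketch of what that means in practice, assuming the readability formula only needs an "average per word" figure: keep the per-word measure pluggable, so you can swap in whichever counting method the original author used. The function names and the simple regex tokenizer are illustrative assumptions, not part of any particular formula.

```python
import re
from typing import Callable

# Assumption: a "word" is a run of ASCII letters, optionally with internal apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_vowel_letters(word: str) -> int:
    """Crude stand-in for a syllable counter: count the letters a, e, i, o, u."""
    return sum(1 for ch in word.lower() if ch in "aeiou")


def average_per_word(text: str, measure: Callable[[str], int]) -> float:
    """Average of measure(word) over all words in text (0.0 for empty text)."""
    words = WORD_PATTERN.findall(text)
    if not words:
        return 0.0
    return sum(measure(word) for word in words) / len(words)


if __name__ == "__main__":
    sample = "The rain fell"
    print(average_per_word(sample, count_vowel_letters))  # vowels per word
    print(average_per_word(sample, len))                  # characters per word
```

Passing a different measure (a dictionary lookup, a heuristic, or plain len) changes only the second argument, which makes it easy to test which method reproduces someone else's published numbers.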


Author: Glenn1234

Updated on June 04, 2022

Comments

  • Glenn1234
    Glenn1234 almost 2 years

    I'm looking for a fully accurate statement of an algorithm to count syllables in words. What I find when I research is either inconsistent or something I know generates incorrect results. Does anyone have any suggestions of how to accomplish this? Thanks.

    The algorithm I'm using now:

    1. Count the number of vowels in the word.
    2. Do not count double-vowels ("rain" has 2 vowels but is only 1 syllable)
    3. If last letter in word is vowel do not count ("side" is 1 syllable)

    Are there any more rules I'm missing? While testing my incorrect results, I'm trying to determine whether the algorithm I'm using is wrong or my implementation of it is.

    • wildplasser
      wildplasser about 12 years
      ad 2: "doable" ? Ouch!
  • Glenn1234
    Glenn1234 about 12 years
    If it helps to know, what I'm using this for is to implement readability formulas. The two that I've selected have a variable that equates to "average number of syllables per word", which means I need to count syllables. What I am noticing, however, is that some of my results match the examples in the paper I got these formulas from and some don't. So I'm trying to track down how my results differ from the paper's author's, and this seems like the likely problem, since my word counts are accurate.
  • Quentin Pradet
    Quentin Pradet about 12 years
    It's complicated, but solutions exist.
  • Armen Tsirunyan
    Armen Tsirunyan about 12 years
    You downvoted me for stating that there exists no 100% accurate algorithm, yet the only one you provided is the exhaustive list...
  • Quentin Pradet
    Quentin Pradet about 12 years
    I downvoted you for stating that there was no accurate algorithm, and saying that based only on a few examples. How do you define "accurate"? 100% is definitely not what we aim for in natural language processing, since inter-annotator agreement is never that high.
  • Armen Tsirunyan
    Armen Tsirunyan about 12 years
    I didn't say that based on the examples. I stated that based on my claim that I promised to provide a counterexample to any existing algorithm. The examples were illustrative.
  • Quentin Pradet
    Quentin Pradet about 12 years
    Hmm, I can only cancel my downvote if your answer is edited. I still think it's misleading, but I would cancel the downvote if I could.
  • Armen Tsirunyan
    Armen Tsirunyan about 12 years
    I am not bitching about a downvote. I am conducting a constructive dialogue :)