Detecting syllables in a word

Solution 1

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's doctoral thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate and includes a small exception dictionary for cases where the patterns do not work.
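
If you just want to experiment with Liang-style patterns rather than reimplement them, a minimal Python sketch using the third-party pyphen package (also mentioned in the comments below) might look like this; keep in mind that hyphenation points are not always identical to syllable boundaries:

import pyphen

# Pyphen applies Liang-style hyphenation pattern dictionaries.
dic = pyphen.Pyphen(lang='en_US')
print(dic.inserted('hyphenation'))   # e.g. 'hy-phen-ation'
print(dic.positions('invisible'))    # indices where a hyphen may be inserted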

Solution 2

I stumbled across this page looking for the same thing and found implementations of Liang's algorithm here: https://github.com/mnater/hyphenator and in its successor: https://github.com/mnater/Hyphenopoly.

That is, unless you're the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Solution 3

Here is a solution using NLTK:

from nltk.corpus import cmudict

d = cmudict.dict()  # requires the cmudict corpus: nltk.download('cmudict')

def nsyl(word):
    # count vowel phonemes (they end in a stress digit: 0, 1 or 2),
    # one count per pronunciation listed in cmudict
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
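
Note that d[word.lower()] raises a KeyError for words that are not in cmudict, so in practice you may want a fallback. A rough sketch (the vowel-group heuristic and the nsyl_safe name are only illustrative, not part of the original answer):

import re

def nsyl_safe(word):
    # use the first cmudict pronunciation when available, otherwise fall
    # back to counting runs of consecutive vowel letters (very approximate)
    try:
        return nsyl(word)[0]
    except KeyError:
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))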

Solution 4

I'm trying to tackle this problem for a program that will calculate the Flesch-Kincaid grade level and Flesch reading-ease score of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I've found it gets in the ballpark for my purposes.
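
For reference, the standard Flesch formulas only need totals for sentences, words, and syllables, so the syllable count is the only tricky input. A minimal sketch (in Python; the function names are mine):

def flesch_reading_ease(sentences, words, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(sentences, words, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59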

The approach has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or silent; it's a gamble, but I decided to remove the "es" in my algorithm.

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        // compare in lower case so capitalized words like "End" or "I" are counted
        string currentWord = word.ToLower();
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                // don't count diphthongs (consecutive vowels count once)
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            // if the full cycle found no vowel, reset lastWasVowel
            if (!foundVowel)
                lastWasVowel = false;
        }
        // remove trailing "es"; it's usually silent
        if (currentWord.Length > 2 &&
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent trailing "e"
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        // every word has at least one syllable
        if (numVowels < 1)
            numVowels = 1;

        return numVowels;
    }

Solution 5

This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

Comments

  • user50705 about 3 years

    I need to find a fairly efficient way to detect syllables in a word. E.g.,

    Invisible -> in-vi-sib-le

    There are some syllabification rules that could be used:

    V CV VC CVC CCV CCCV CVCC

    where V is a vowel and C is a consonant. E.g.,

    Pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

    I've tried a few methods, among which were using regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not result in anything useful).

    The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.

    I would appreciate any tips on alternative ways to solve this problem besides my previous approaches.

    I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

    • Adrian McCarthy almost 12 years
      Do you actually want the actual division points or just the number of syllables in a word? If the latter, consider looking up the words in a text-to-speech dictionary and count the phonemes that encode vowel sounds.
    • Brōtsyorfuzthrāx almost 10 years
      The most efficient way (computation-wise, not storage-wise), I would guess, would be just to have a Python dictionary with words as keys and the number of syllables as values. However, you'd still need a fallback for words that didn't make it into the dictionary. Let me know if you ever find such a dictionary!
  • Karl over 15 years
    I like that you've cited a doctoral thesis on the subject; it's a little hint to the original poster that this might not be an easy question.
  • user50705 over 15 years
    I read the dissertation and found it very helpful. The problem with the approach was that I did not have any patterns for the Albanian language, although I found some tools that could generate those patterns. Anyway, for my purpose I wrote a rule-based app, which solved the problem...
  • user50705 over 15 years
    ... My approach is a bit slow (~20 sec on a 50K-word file), but I think the results are reasonably accurate (I don't have any useful stats yet).
  • Wouter Lievens almost 14 years
    Maybe it has to work for words that don't appear in dictionaries, such as names?
  • hoju over 13 years
    Agreed - much more convenient to just use an existing implementation.
  • Gourneau over 13 years
    Hey, thanks. There was a tiny error; the function should be: def nsyl(word): return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
  • Dan Gayle almost 13 years
    What would you suggest as a fallback for words that aren't in that corpus?
  • allenporter almost 13 years
    I wrote up a quick post doing some tests of this approach, including stats: allenporter.tumblr.com/post/9776954743/syllables -- While the hyphenation approach was promising, an ad-hoc approach of counting vowels seemed more accurate, since the hyphenation algorithm errs on the side of under-hyphenating. Definitely not a solved problem, as far as I can tell.
  • Jean-François Corbett over 12 years
    @WouterLievens: I don't think names are anywhere near well-behaved enough for automatic syllable parsing. A syllable parser for English names would fail miserably on names of Welsh or Scottish origin, let alone names of Indian and Nigerian origins, yet you might find all of these in a single room somewhere in e.g. London.
  • Warren about 12 years
    @allenporter I read your webpage. According to your statistics, the hyphenation approach is not accurate. I also read two articles: eprints.soton.ac.uk/264285/1/MarchandAdsettDamper_ISCA07.pdf and web.cs.dal.ca/~adsett/publications/AdsMar_CompSyllMeth_2009.pdf. Do you know about the SbA method in those articles? They claim hyphenation is as high as about 95% correct. What is that big dict (1 m size) you used for evaluation? Can you please let me know where and how I can get it for such a test?
  • Adrian McCarthy almost 12 years
    Note that the TeX algorithm is for finding legitimate hyphenation points, which is not exactly the same as syllable divisions. It's true that hyphenation points fall on syllable divisions, but not all syllable divisions are valid hyphenation points. For example, hyphens aren't (usually) used within a letter or two of either end of a word. I also believe the TeX patterns were tuned to trade off false negatives for false positives (never put a hyphen where it doesn't belong, even if that means missing some legitimate hyphenation opportunities).
  • Ezequiel about 10 years
    I don't believe hyphenation is the answer either.
  • Darren Ringer almost 9 years
    One must keep in mind that it is not reasonable to expect better performance than a human could provide considering this is a purely heuristic approach to a sketchy domain.
  • josefnpat almost 9 years
    I added two more test cases, "End" and "I". The fix was to compare strings case-insensitively. Pinging @joe-basirico and tihamer in case they suffer from the same problem and would like to update their functions.
  • josefnpat almost 9 years
    @tihamer American is 4 syllables!
  • IKavanagh over 8 years
    See Syntax Highlighting. There is a help button (question mark) in the SO editor which will get you to the linked page.
  • billy_chapters over 8 years
    @Pureferret cmudict is a pronouncing dictionary for North American English words. It splits words into phonemes, which are shorter than syllables (e.g. the word 'cat' is split into three phonemes: K - AE - T). But vowels also have a "stress marker": either 0, 1, or 2, depending on the pronunciation of the word (so AE in 'cat' becomes AE1). The code in the answer counts the stress markers and therefore the number of vowels - which effectively gives the number of syllables (notice how in OP's examples each syllable has exactly one vowel).
  • Adam Michael Wood about 7 years
    This returns the number of syllables, not the syllabification.
  • Nico Haase over 6 years
    How is that a generic syllable parser? It looks like this code is only looking up syllables in a dictionary
  • Norman H over 6 years
    For my simple scenario of finding syllables in proper names this seems to be initially working well enough. Thanks for putting it out here.
  • Abe Voelker about 4 years
    But Liang's hyphenation algorithm isn't equivalent to breaking into syllables. E.g. applying it to "hyphenation" returns "hy-phen-ation", but breaking into syllables it should be "hy-phen-a-tion" (4 syllables, not 3). "Project" isn't hyphenated at all, but broken into syllables it would be "pro-ject" (2 syllables, not 1). There are many such cases.
  • dacort almost 3 years
    SpacySyllables is pretty decent, just be aware that it's unfortunately not perfect. "eighty" returns ['eighty'] and "universal" returns ['uni', 'ver', 'sal']. This is due to the underlying library (Pyphen) having a default of 2 characters for the first and last syllables.
  • Aidan almost 3 years
    The link is dead and the library does not seem to be available anymore.
  • Aidan almost 3 years
    It's a decent try, but even after some simple testing it does not seem very accurate. E.g. "anyone" returns 1 syllable instead of 3, "Minute" returns 3 instead of 2, and "Another" returns 2 instead of 3.