Determine the difficulty of an English word


Solution 1

Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text with its Flesch-Kincaid score and run the analysis again: words that show up frequently but mostly in "difficult" texts will get a score boost, which is what you want.
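A minimal sketch of that weighting idea in Python, assuming plain-text files on disk; the syllable counter is a crude heuristic, and the rest follows the standard Flesch-Kincaid grade-level formula:

```python
import re
from collections import Counter
from pathlib import Path

def count_syllables(word):
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

def weighted_frequencies(paths):
    scores = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        grade = flesch_kincaid_grade(text)
        for word in re.findall(r"[a-z']+", text.lower()):
            # A word's score grows with how often it appears and with
            # how "difficult" the texts it appears in are.
            scores[word] += grade
    return scores

# Usage (hypothetical directory): higher score = frequent in difficult texts.
# scores = weighted_frequencies(Path("gutenberg").glob("*.txt"))
```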

If all you have is 10000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.

Solution 2

I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...

Try explaining to a sane human being learning English as a second language the subtle difference between slaughter and laughter.

Solution 3

I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.

Solution 4

Crowd-source the answer.

  • Create an online 'game' that lists 10 words at random.
  • Get the player to drag and drop them in order from easiest to hardest, and tick a box to indicate whether they have ever heard of each word.
  • Apply a ranking algorithm (e.g. Elo) to the result of each experiment (see the sketch after this list).
  • Repeat.

It might even be fun to play; you could get a language-proficiency score at the end.
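A minimal sketch of that Elo update in Python; the K-factor of 32 and the 1500 starting rating are conventional defaults assumed here, not part of the answer. Each drag-and-drop ordering of 10 words decomposes into 45 pairwise comparisons:

```python
def elo_update(rating_a, rating_b, a_harder, k=32.0):
    """One pairwise update: a_harder is True if players ranked word A above word B."""
    # Expected score of A under the standard Elo logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_harder else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Start every word at 1500 and feed each of the 45 pairs from one
# ten-word ordering through elo_update; ratings converge over many games.
```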

Solution 5

Word frequency is an obvious choice (of course not perfect). You can download Google n-grams V2, which is licensed under the Creative Commons Attribution 3.0 Unported License.

Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Example: (screenshot of sample rows in the format above omitted)

Corpus used (from Lin, Yuri, et al. "Syntactic annotations for the google books ngram corpus." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.):

(table of corpora from the paper omitted)
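A minimal sketch of turning files in this format into per-word totals, summing match_count across years; the tab-separated layout follows the format line above, and the file name in the usage comment is a hypothetical placeholder:

```python
from collections import Counter

def total_counts(path):
    """Sum match_count across years for each 1-gram in one data file."""
    totals = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # ngram TAB year TAB match_count TAB page_count TAB volume_count
            fields = line.rstrip("\n").split("\t")
            totals[fields[0].lower()] += int(fields[2])
    return totals

# totals = total_counts("eng-all-1gram-a.tsv")  # placeholder file name
```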


Comments

  • Techtwaddle, almost 2 years ago

    I am working on a word-based game. My word database contains around 10,000 English words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult ones, relatively speaking.

    I need to divide the 10,000-word list into 5 levels, from the easiest words to the most difficult ones. I am looking for a program to do this for me.

    Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an English word?

    I have some thoughts revolving around using "word length" and "word frequency" as factors, and coming up with a formula that accomplishes this.
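
    A minimal sketch of one such formula, assuming each word comes with a relative frequency (e.g. from one of the sources in the answers); the length weight of 0.5 is an arbitrary assumption to tune:

    ```python
    import math

    def difficulty(word, freq, length_weight=0.5):
        # Rare words contribute a large -log(frequency); longer words add
        # a length term. The 0.5 weight is an assumption to tune by hand.
        return -math.log(freq) + length_weight * len(word)

    def split_into_levels(word_freqs, levels=5):
        """word_freqs maps word -> relative frequency; returns `levels` lists,
        easiest bucket first."""
        ranked = sorted(word_freqs, key=lambda w: difficulty(w, word_freqs[w]))
        size = math.ceil(len(ranked) / levels)
        return [ranked[i * size:(i + 1) * size] for i in range(levels)]
    ```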