Where do the words in /usr/share/dict/words come from?

You're asking multiple questions, but I think the main one is:

Is there any standard dictating what it must contain?

To my knowledge, no.

Given that, your related questions:

How is this list generated? Are its contents the same across different Unices?

can only be answered with “it depends on the particular Unix”.

The convention of including a word list as part of the operating system comes from the spell(1) utility, which uses it for a primitive spell-checking procedure.

That spell-checking procedure is described in the academic paper “Development of a Spelling List”, by M. D. McIlroy of Bell Labs, 1982.
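To give a feel for what "primitive" means here, the sketch below flags every word of its input that does not appear in a word list. It is only a plain lookup, not McIlroy's actual procedure from the paper; the function name misspelled is made up for this example, and the default path assumes your system installs the list at /usr/share/dict/words.

```shell
# Hypothetical helper: print each distinct word read from stdin that is
# missing from the given word list (default: /usr/share/dict/words).
misspelled() {
    dict="${1:-/usr/share/dict/words}"
    # Split the input into one word per line, lowercase it, drop blank
    # lines and duplicates, then keep only lines matching no list entry.
    tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort -u |
        grep -Fxiv -f "$dict"
}
```

For example, `misspelled < notes.txt` would list the unknown words in a (hypothetical) file notes.txt. Note that grep exits with a non-zero status when every word is found.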

You should check your operating system's package manager for where the spelling list comes from, how it is generated, and what alternatives are available.

On Debian GNU/Linux, for example:

  • The /usr/share/dict/words file is a symbolic link managed using the Debian “alternatives” system.
  • A common word list package providing that link is the wamerican package.
  • The package documentation for wamerican states its word list comes from the SCOWL (Spell Checker Oriented Word Lists) project.
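If you want to check this on your own machine, something like the following may help. It is a rough sketch: wordlist_origin is a name invented for this example, it assumes a readlink that supports -f (GNU coreutils does), and the ownership lookup only runs where dpkg is installed.

```shell
# Hypothetical helper: follow a word list path through its symlinks and,
# on Debian-family systems, ask dpkg which package owns the real file.
wordlist_origin() {
    path="${1:-/usr/share/dict/words}"
    real="$(readlink -f "$path")" || return 1
    printf 'resolved: %s\n' "$real"
    # dpkg -S searches installed packages for a file; skip it elsewhere.
    if command -v dpkg >/dev/null 2>&1; then
        dpkg -S "$real" 2>/dev/null || printf 'owner: unknown to dpkg\n'
    fi
}
```

Running `wordlist_origin` with no argument inspects /usr/share/dict/words; what it prints depends entirely on which word list packages you have installed.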

Many other word list packages can be installed; they each have the “Provides: wordlist” field:

$ aptitude search '?provides(wordlist)' | wc -l
34
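To see which lists are actually installed rather than merely available, you can look in the directory itself. A minimal sketch, assuming the lists live as plain files under /usr/share/dict (list_wordlists is a made-up name, and any other regular file in that directory, such as a README, would be counted too):

```shell
# Hypothetical helper: print the line count and name of every regular
# file in a word list directory (default: /usr/share/dict).
list_wordlists() {
    dir="${1:-/usr/share/dict}"
    for f in "$dir"/*; do
        # Skip symlinks such as "words" itself so each list is counted once.
        [ -f "$f" ] && [ ! -L "$f" ] || continue
        printf '%d\t%s\n' "$(wc -l < "$f")" "${f##*/}"
    done
}
```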

On other Unices, you'll need to consult the package system and its documentation to learn where the word list comes from and what alternatives exist.

Author: Mark Amery

Updated on September 18, 2022

Comments

  • Mark Amery, over 1 year ago

    /usr/share/dict/words contains lots of words. How is this list generated? Are its contents the same across different Unices? Is there any standard dictating what it must contain?

    All I've been able to turn up so far is that on Ubuntu/Debian the list comes from the wordlist packages, but their descriptions offer no clue on how the lists were actually generated.

  • Admin, over 5 years ago
    FWIW: On a minimal CentOS 7 x64 install (where the words file is absent), yum install words did the trick for me.