Separate word lists for nouns, verbs, adjectives, etc.

Solution 1

See Kevin's word lists, particularly the "Part Of Speech Database." You'll have to do some minimal text processing on your own to split the database into multiple files, but that can be done easily with a few grep commands.

The license terms are available on the "readme" page.
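The exact layout of that database may differ, but assuming (hypothetically) that each line has the form word<TAB>codes, where the codes are Moby-style one-letter part-of-speech symbols like those listed in Solution 4 below (N for noun, V for verb, A for adjective, v for adverb), the split could be sketched like this; the file names part-of-speech.txt and words.CODE.txt are placeholders, not the real ones:

```shell
# Hypothetical sketch: split a tab-delimited part-of-speech database into
# one word list per part of speech. Assumes lines of the form
#   word<TAB>codes
# where "codes" uses Moby-style letters (N=noun, V=verb, A=adjective, v=adverb).
# The file name part-of-speech.txt is an assumption, not the real one.
if [ -f part-of-speech.txt ]; then
    for code in N V A v; do
        # Keep every word whose code field contains this letter.
        awk -F '\t' -v c="$code" '$2 ~ c { print $1 }' part-of-speech.txt \
            > "words.$code.txt"
    done
fi
```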

Solution 2

If you download just the database files from wordnet.princeton.edu/download/current-version, you can extract the words by running these commands:

grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb

Or, if you only want single words (no underscores):

grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
grep -E -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb
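Either way, note that a word can appear in more than one synset line, so the extracted lists will contain duplicates. A sort -u pass deduplicates (and sorts) each list in place:

```shell
# Deduplicate and sort each extracted word list in place.
for f in conv.data.adj conv.data.adv conv.data.noun conv.data.verb; do
    if [ -f "$f" ]; then
        sort -u -o "$f" "$f"   # -o lets sort safely overwrite its own input
    fi
done
```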

Solution 3

As others have suggested, the WordNet database files are a great source for parts of speech. That said, the commands shown above for extracting the words aren't entirely correct. Each line is actually a "synonym set" (synset) consisting of multiple synonyms and their definition. Around 30% of words appear only as non-first synonyms in a set, so extracting just the first word misses a large amount of data.

The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:

NNNNNNNN NN a NN word N [word N ...]

These correspond to:

  • Byte offset within file (8 character integer)
  • File number (2 character integer)
  • Part of speech (1 character)
  • Number of words (2 characters, hex encoded)
  • N occurrences of...
    • Word, with spaces replaced by underscores and an optional comment in parentheses
    • Word lexical ID (a unique occurrence ID)

For example, from data.adj:

00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
  • Byte offset within the file is 4614
  • File number is 0
  • Part of speech is s, corresponding to adjective (wnutil.c, function getpos)
  • Number of words is 2 (hex 02)
    • First word is cut with lexical ID 0
    • Second word is shortened with lexical ID 0

A short Perl script to simply dump the words from the data.* files:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    # If no 8-digit byte offset is present, skip this line
    if ( $line !~ /^[0-9]{8}\s/ ) { next; }
    chomp($line);

    my @tokens = split(/ /, $line);
    shift(@tokens); # Byte offset
    shift(@tokens); # File number
    shift(@tokens); # Part of speech

    my $word_count = hex(shift(@tokens));
    foreach ( 1 .. $word_count ) {
        my $word = shift(@tokens);
        $word =~ tr/_/ /;
        $word =~ s/\(.*\)//;
        print $word, "\n";

        shift(@tokens); # Lexical ID
    }
}

A gist of the above script can be found here.
A more robust parser which stays true to the original source can be found here.

Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.

Solution 4

http://icon.shef.ac.uk/Moby/mpos.html

Each part-of-speech vocabulary entry consists of a word or phrase field, followed by a field delimiter of × (ASCII 215), and then the part-of-speech field, which is coded using the following ASCII symbols (case is significant):

Noun                            N
Plural                          p
Noun Phrase                     h
Verb (usu participle)           V
Verb (transitive)               t
Verb (intransitive)             i
Adjective                       A
Adverb                          v
Conjunction                     C
Preposition                     P
Interjection                    !
Pronoun                         r
Definite Article                D
Indefinite Article              I
Nominative                      o
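As a sketch, the nouns (code N) could be pulled out with awk by using that ASCII 215 byte as the field separator; the file name mobyposi.i is an assumption here, so adjust it to match your download:

```shell
# Extract every word whose part-of-speech field contains N (Noun) from a
# Moby mpos file. Fields are separated by the single byte 0xD7 (ASCII 215).
# The file name mobyposi.i is an assumption; adjust to your download.
if [ -f mobyposi.i ]; then
    LC_ALL=C awk -F "$(printf '\327')" '$2 ~ /N/ { print $1 }' mobyposi.i > nouns.txt
fi
```

Setting LC_ALL=C keeps awk from choking on the 0xD7 byte in UTF-8 locales, where it is not a valid character on its own.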
Updated on January 03, 2022

Comments

  • polygenelubricants: Usually word lists are one file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.? I need them for English specifically.

  • jacobian: Not sure of the equivalent of cut on Windows, so I did it in Notepad++ instead. Search: ^[^a-z]*?[a-z][^a-z]*?([a-zA-Z]+).*?$ Replace: \1