Separate word lists for nouns, verbs, adjectives, etc.
Solution 1
See Kevin's word lists, particularly the "Part Of Speech Database." You'll have to do some minimal text processing of your own to split the database into multiple files, but that is easily done with a few grep commands.
The license terms are available on the "readme" page.
Solution 2
If you download just the database files from wordnet.princeton.edu/download/current-version you can extract the words by running these commands:
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb
Or, if you only want single words (no underscores):
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb
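As a quick sanity check of what these commands extract, the same logic can be exercised in Python on a single sample line (the line below is made up, but follows the data-file format): the regular expression matches the line header, and the fifth space-separated field is the synset's first word.

```python
import re

# Same extraction as the egrep/cut pipeline above: match the fixed-width
# header of a data-file line, then take the fifth space-separated field.
pattern = re.compile(r"^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s")

line = "00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed"
m = pattern.match(line)
if m:
    print(m.group(0).split(" ")[4])  # prints: cut
```

Note that `\s` in the egrep commands is a GNU grep extension; on other grep implementations you may need `[[:space:]]` instead.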
Solution 3
As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" consisting of multiple synonyms and their definition. Around 30% of words only appear as synonyms, so simply extracting the first word misses a large amount of data.
The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:
NNNNNNNN NN a NN word N [word N ...]
These correspond to:
- Byte offset within file (8 character integer)
- File number (2 character integer)
- Part of speech (1 character)
- Number of words (2 characters, hex encoded)
- N occurrences of...
- Word with spaces replaced with underscores, optional comment in parentheses
- Word lexical ID (a unique occurrence ID)
For example, from data.adj:
00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
- Byte offset within the file is 4614
- File number is 0
- Part of speech is s, corresponding to adjective (wnutil.c, function getpos)
- Number of words is 2
- First word is cut with lexical ID 0
- Second word is shortened with lexical ID 0
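The walkthrough above can be turned into a few lines of code. This is a minimal sketch of extracting just the words from one synset line, using the field layout described earlier (offset, file number, part of speech, hex-encoded word count, then alternating word/lexical-ID pairs); the function name is my own:

```python
# Minimal sketch: pull the words out of one WordNet data-file synset line.
def synset_words(line):
    tokens = line.split(" ")
    word_count = int(tokens[3], 16)   # word count is hex-encoded
    words = []
    for i in range(word_count):
        word = tokens[4 + 2 * i]      # words alternate with lexical IDs
        words.append(word.replace("_", " "))
    return words

line = ("00004614 00 s 02 cut 0 shortened 0 001 "
        "& 00004412 a 0000 | with parts removed; \"the drastically cut film\"")
print(synset_words(line))  # ['cut', 'shortened']
```

Unlike the egrep approach, this yields every word in the synset, not just the first.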
A short Perl script to simply dump the words from the data.* files:
#!/usr/bin/perl

while (my $line = <>) {
    # If no 8-digit byte offset is present, skip this line
    next if $line !~ /^[0-9]{8}\s/;
    chomp($line);

    my @tokens = split(/ /, $line);
    shift(@tokens);    # Byte offset
    shift(@tokens);    # File number
    shift(@tokens);    # Part of speech

    my $word_count = hex(shift(@tokens));
    foreach (1 .. $word_count) {
        my $word = shift(@tokens);
        $word =~ tr/_/ /;      # Underscores separate the words of a phrase
        $word =~ s/\(.*\)//;   # Strip the optional parenthesized comment
        print $word, "\n";
        shift(@tokens);        # Lexical ID
    }
}
A gist of the above script can be found here.
A more robust parser which stays true to the original source can be found here.
Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE
Solution 4
http://icon.shef.ac.uk/Moby/mpos.html
Each part-of-speech vocabulary entry consists of a word or phrase field, followed by a field delimiter of × (ASCII 215), and the part-of-speech field, which is coded using the following ASCII symbols (case is significant):
Noun N
Plural p
Noun Phrase h
Verb (usu participle) V
Verb (transitive) t
Verb (intransitive) i
Adjective A
Adverb v
Conjunction C
Preposition P
Interjection !
Pronoun r
Definite Article D
Indefinite Article I
Nominative o
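Given that layout, a Moby entry splits cleanly on the × delimiter. This is a sketch under the assumptions above (the code table is taken from the list, and the sample entry "abandon×Vt" is made up for illustration):

```python
# Moby part-of-speech codes, as listed above.
POS_CODES = {
    "N": "Noun", "p": "Plural", "h": "Noun Phrase",
    "V": "Verb (usu participle)", "t": "Verb (transitive)",
    "i": "Verb (intransitive)", "A": "Adjective", "v": "Adverb",
    "C": "Conjunction", "P": "Preposition", "!": "Interjection",
    "r": "Pronoun", "D": "Definite Article", "I": "Indefinite Article",
    "o": "Nominative",
}

def parse_entry(entry):
    # The word and its POS codes are separated by the × character (ASCII 215).
    word, codes = entry.split("\u00d7")
    return word, [POS_CODES[c] for c in codes]

print(parse_entry("abandon\u00d7Vt"))
# ('abandon', ['Verb (usu participle)', 'Verb (transitive)'])
```

When reading the actual file, mind the encoding: ASCII 215 is a Latin-1 byte, so opening the file with encoding="latin-1" is likely needed.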
Updated on January 03, 2022

Comments
-
polygenelubricants over 2 years
Usually word lists are 1 file that contains everything, but are there separately downloadable noun list, verb list, adjective list, etc?
I need them for English specifically.
-
Mephy over 9 years
This doesn't seem to add much to what was said 4 years ago.
-
John Dorean over 9 years
Speak for yourself, this is exactly what I needed. Thanks Chilly!
-
Jonny Henly about 7 years
Thank you so much for adding this useful answer to this older question. You have definitely made my life a lot easier. I'd upvote 99 more times if I could.
-
Ketil Malde over 5 years
Link is broken; I think it should be: wordnet.princeton.edu/download/current-version
-
digitaldavenyc over 5 years
You da real MVP!
-
jacobian about 5 years
Not sure of the cut command on Windows, so I did it in Notepad++. Search: ^[^a-z]*?[a-z][^a-z]*?([a-zA-Z]+).*?$ Replace: \1
-
artfulrobot almost 3 years
Link is dead now