"Stop words" list for English?

language-agnostic indexing filtering stop-words nlp

Solution 1

The magic word to put into Google is "stop words". This turns up a reasonable-looking list.

MySQL also has a built-in list of stop words, but it is far too comprehensive for my taste. For example, at our university library we had problems because "third" in "third world" was treated as a stop word.

Solution 2

These are called stop words; check this sample.

Solution 3

Depending on the subdomain of English you are working in, you may want (or need) to compile your own stop word list. Some generic stop words can be meaningful in a particular domain; for example, the word "are" could be an abbreviation or acronym. Conversely, depending on your application, you may want to ignore some domain-specific words that you would not ignore in general English. For example, if you are analyzing a corpus of hospital reports, you may wish to ignore words like 'history' and 'symptoms', since they appear in every report and may not be useful (from a plain vanilla inverted index perspective).

Otherwise, the lists returned by Google should be fine. The Porter Stemmer uses this and the Lucene search engine implementation uses this.
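The domain adjustment described in this answer can be sketched in a few lines of Python. This is only an illustration: `GENERIC_STOP_WORDS` is a tiny made-up list standing in for a published one, and `build_stop_words` and `remove_stop_words` are hypothetical helper names, not part of any library.

```python
# Tiny hypothetical generic list; a real one would come from a published source.
GENERIC_STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def build_stop_words(generic, keep=(), extra=()):
    """Return a domain stop list: the generic list minus `keep`, plus `extra`."""
    kept = {w.lower() for w in keep}
    added = {w.lower() for w in extra}
    return (set(generic) - kept) | added


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


# Assumed hospital-report domain: "are" is a meaningful acronym there, while
# "history" and "symptoms" appear in every report.
hospital_stop_words = build_stop_words(
    GENERIC_STOP_WORDS,
    keep=["are"],
    extra=["history", "symptoms"],
)

report = "History of asthma with symptoms in the ARE unit"
print(remove_stop_words(report, hospital_stop_words))
```

Keeping the generic list and the domain overrides separate makes it easy to swap in a different base list later without losing the domain-specific choices.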

Solution 4

Compute word frequencies over a large text corpus, then ignore every word whose frequency exceeds a threshold you choose.
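A minimal sketch of that idea in Python, assuming "frequency" means a word's share of all tokens in the corpus (an absolute count would work the same way). The function name `frequent_words`, the regular expression used for tokenizing, and the three-sentence corpus are all illustrative choices, not something prescribed by this answer.

```python
import re
from collections import Counter


def frequent_words(documents, threshold):
    """Return the words whose share of all tokens exceeds `threshold`."""
    counts = Counter()
    for doc in documents:
        # Crude tokenizer: runs of letters and apostrophes, lowercased.
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    total = sum(counts.values())
    if total == 0:
        return set()
    return {word for word, n in counts.items() if n / total > threshold}


# Toy corpus; a real run would use a much larger collection of text.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird watched the dog and the cat.",
]

stop_words = frequent_words(corpus, threshold=0.2)
print(sorted(stop_words))
```

On a corpus this small the threshold matters a great deal: lowering it to 0.15 would also flag "cat", which is a content word here. With a large, varied corpus the most frequent words are much more likely to be genuine function words.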

Solution 5

I think I used the stopword list for German from here when I built a search application with Lucene.NET a while ago. The site contains a list for English as well, and the lists on the site are apparently the ones that the Lucene project uses as its defaults.

Author by

Mark Harrison

Updated on July 09, 2022

Comments

Mark Harrison almost 2 years
I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".
- Where can I find some lists of these uninteresting words?
- Is a list of these words the same as a list of the most frequently used words in English?
update: these are apparently called "stop words" and not "skip words".
Mark Harrison over 14 years

lol, this is just the work I'm trying to avoid!
bobobobo over 14 years

Your link is out, archive: web.archive.org/web/20080501010608/http://www.dcs.gla.ac.uk/‌…
alexis over 11 years

NLTK (the Natural Language Toolkit, a Python library) comes with a bunch of resources, including a stopword corpus (Porter et al.), "2,400 stopwords for 11 languages". You can use the stopword list independently of the toolkit.
Hamman Samuel almost 9 years

How do I access this corpus of 2,400 stopwords in NLTK?
Keyur Padalia almost 9 years

nltk.org/nltk_data
gidim over 7 years

The English stop words in NLTK are tokenized, so instead of "shouldn't" the list contains "shouldn".
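If a stop list is stored in tokenized form as this comment describes, the text has to be split the same way before filtering, or contractions will slip through. The sketch below uses a hand-picked four-word set purely for illustration; it is not the actual NLTK list, and `tokenize` and `content_words` are hypothetical helpers.

```python
import re

# Hand-picked subset for illustration only, NOT the actual NLTK list.
TOKENIZED_STOP_WORDS = {"shouldn", "t", "the", "be"}


def tokenize(text):
    """Lowercase and keep only runs of letters, so "shouldn't" -> ["shouldn", "t"]."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(text, stop_words=TOKENIZED_STOP_WORDS):
    """Return the tokens of `text` that are not in the stop list."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


sentence = "The cache shouldn't be shared"

# Splitting on whitespace leaves "shouldn't" intact, so it never matches
# the tokenized entry "shouldn".
naive = [w for w in sentence.lower().split() if w not in TOKENIZED_STOP_WORDS]
print(naive)

# Splitting on the apostrophe matches the tokenized list.
print(content_words(sentence))
```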