Determine if text is in English?

python scikit-learn nlp nltk

44,405

Solution 1

There is a library called langdetect. It is a port of Google's language-detection library and is available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
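
As a rough sketch of how this could be applied to the list of documents in the question: `langdetect.detect` returns an ISO 639-1 code such as `en`, and setting `DetectorFactory.seed` makes its otherwise non-deterministic results repeatable. The helper name `keep_english` and the decision to drop any string the detector cannot classify are assumptions made for illustration; they are not part of the library.

```python
from typing import Callable, Iterable, List


def keep_english(docs: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Return the documents that `detect` labels as English ('en')."""
    english = []
    for doc in docs:
        try:
            if detect(doc) == "en":
                english.append(doc)
        except Exception:
            # Assumption: a string the detector cannot classify (langdetect
            # raises LangDetectException for text with no usable features)
            # is treated as non-English and dropped.
            continue
    return english


try:
    from langdetect import DetectorFactory, detect
    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
except ImportError:
    detect = None  # langdetect is not installed

docs = [
    "this is some text written in English",
    "this is some more text written in English",
    "Ce n'est pas en anglais",
]

if detect is not None:
    print(keep_english(docs, detect))
```

Very short strings give the detector little to work with, so results on one- or two-word inputs are likely to be less reliable than on full sentences.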

Solution 2

You might be interested in my paper "The WiLI benchmark dataset for written language identification". I also benchmarked a couple of tools.

TL;DR:

CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
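
If you want to call the same command from Python, one option is to wrap it with `subprocess`. This is only a sketch built on the shell session above: the function names `lidtk_predict` and `is_english` are made up for illustration, and the parsing assumes the language code is the last non-empty line of the command's output, as it is in the examples shown.

```python
import subprocess
from typing import Callable


def lidtk_predict(text: str, run: Callable = subprocess.run) -> str:
    """Return the language code printed by `lidtk cld2 predict --text ...`."""
    cmd = ["lidtk", "cld2", "predict", "--text", text]
    result = run(cmd, capture_output=True, text=True, check=True)
    # Assumption: the code (e.g. 'eng', 'fra') is the last non-empty line.
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def is_english(text: str, run: Callable = subprocess.run) -> bool:
    """True if lidtk reports 'eng', the code shown in the session above."""
    return lidtk_predict(text, run) == "eng"
```

With lidtk installed, `is_english("this is some text written in English")` should mirror the first command above. Starting a new process per string is slow, so for a large corpus it is probably better to call a detection library directly.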

Solution 3

Pretrained fastText Model Worked Best for My Similar Needs

I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.

After experimenting to find what worked best for my needs, which meant verifying that 60,000+ text files were in English, I found that fasttext was an excellent tool.

With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        #    that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below just clean up the lines in fla.
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Predict the language of each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Every time that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D:
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # A file with no non-empty lines yields no predictions at all.
        if not self.this_D:
            return False

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate a relative confidence of the max confidence to all
        #    confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is English or not.
        return max_key == 'en'

Below is the application / instantiation and use of the above class for my needs.

file_list = []  # fill in with your own list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
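
The tallying step in the class above (sum the confidence of each line's top prediction per language, then take the language with the largest total) can also be looked at in isolation, without loading a model. The following is a minimal sketch of that idea only; the function name `top_language` and the sample predictions are hypothetical, and the `__label__` prefix handling mirrors the code above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def top_language(
    predictions: Iterable[Tuple[str, float]]
) -> Optional[Tuple[str, float]]:
    """Sum confidences per language; return (language, share of total) or None."""
    totals: Dict[str, float] = defaultdict(float)
    for label, confidence in predictions:
        totals[label.replace("__label__", "")] += confidence
    if not totals:
        return None  # nothing was predicted, e.g. an empty file
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


# Hypothetical per-line predictions for a three-line file.
sample = [("__label__en", 0.98), ("__label__en", 0.91), ("__label__fr", 0.40)]
print(top_language(sample))
```

Returning the share alongside the language makes it possible to require, say, a minimum share before accepting a file as English; whether that is worthwhile depends on how mixed your files are.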

Solution 4

This is what I used some time ago. With the settings below, it accepts texts of at least 3 words that contain no more than 4 unrecognized words. Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

Solution 5

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.

dictionary.check("Hello")  # returns True
dictionary.check("Helo")  # returns False

This example is taken directly from their website.
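
Since `check` works on single words, applying it to a whole sentence does mean looping over the words. One possible approach, sketched here under assumptions, is to compute the fraction of words the dictionary accepts and compare it with a threshold of your choosing. The helper name `english_ratio` and the simple ASCII-only word pattern are illustrative choices, not part of enchant.

```python
import re
from typing import Callable


def english_ratio(text: str, check: Callable[[str], bool]) -> float:
    """Return the fraction of words in `text` that `check` accepts."""
    # Assumption: a "word" is a run of ASCII letters, optionally with one
    # internal apostrophe (e.g. "n'est"); accented letters split words.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        return 0.0  # no alphabetic words at all
    return sum(1 for word in words if check(word)) / len(words)


try:
    import enchant
    check = enchant.Dict("en_US").check
except Exception:
    check = None  # pyenchant or the en_US dictionary is not available

if check is not None:
    for sentence in ["this is some text written in English",
                     "Ce n'est pas en anglais"]:
        print(sentence, english_ratio(sentence, check))
```

A sensible cut-off has to be found by trying it on your own data, because many words are valid in more than one language.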



Author by

ocean800

Updated on June 02, 2021

Comments

ocean800 almost 3 years
I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
```
[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 
```
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether strings are in English or not. Is this functionality not offered by either NLTK or scikit-learn?

EDIT: I've seen questions like this and this, but both are about individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so libraries that are in Python would be preferable, but I can switch languages if needed, just thought that Python would be the best for this.