Java or Python for Natural Language Processing

Solution 1

Choosing between Java and Python for NLP is largely a matter of preference or necessity. Depending on the company or project, you'll need to use one or the other, and often there isn't much of a choice unless you're heading the project.

Other than NLTK (www.nltk.org), there are other libraries for text processing in Python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=natural+language+processing&submit=search)

For Java, there are tons of others, but here's another list:

For a nice comparison of basic string processing, see http://nltk.googlecode.com/svn/trunk/doc/howto/nlp-python.html

For a useful comparison of GATE vs UIMA vs OpenNLP, see https://www.assembla.com/spaces/extraction-of-cost-data/wiki/Gate-vs-UIMA-vs-OpenNLP?version=4

If you're uncertain which language to go for in NLP, personally I'd say "any language that will give you the desired analysis/output"; see Which language or tools to learn for natural language processing?

Here's a pretty recent (2017) list of NLP tools: https://github.com/alvations/awesome-community-curated-nlp

An older list of NLP tools (2013): http://web.archive.org/web/20130703190201/http://yauhenklimovich.wordpress.com/2013/05/20/tools-nlp


Other than language processing tools, you would very much need machine learning tools to incorporate into NLP pipelines.

There's a whole range in Python and Java, and once again it's up to preference and whether the libraries are user-friendly enough:

Machine learning libraries in Python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search)
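To make "incorporating machine learning into an NLP pipeline" a bit more concrete, here is a minimal, dependency-free sketch: text is tokenized into a bag of words and fed to a tiny naive Bayes classifier with add-one smoothing. The training sentences, the labels, and the function names (`tokenize`, `train`, `predict`) are all made up for illustration; in a real project you would most likely reach for one of the machine learning libraries instead of hand-rolling this.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, invented for this example.
TRAIN = [
    ("selling dell laptop 8gb ram", "laptop"),
    ("used macbook laptop for sale", "laptop"),
    ("samsung galaxy phone unlocked", "mobile"),
    ("iphone mobile phone cheap", "mobile"),
]

if __name__ == "__main__":
    model = train(TRAIN)
    print(predict(model, "cheap unlocked phone"))
```

The same three stages (tokenize, count features, classify) are what the dedicated NLP and ML libraries give you in a more robust form.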


With the recent (2015) deep learning tsunami in NLP, you could also consider the tools compared at https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software

I'll avoid listing individual deep learning tools to stay neutral.


Other Stack Overflow questions that also asked for NLP/ML tools:

Solution 2

The question is very open-ended. That said, rather than choosing one, below is a comparison organized by the language you would like to use (since there are good libraries available in both languages).

Python

In terms of Python, the first place you should look at is the Python Natural Language Toolkit. As they note in their description, NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
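As a rough sketch of two of the steps named above, tokenization and stemming, the snippet below uses NLTK's `RegexpTokenizer` and `PorterStemmer`. It assumes NLTK is installed (for example via `pip install nltk`); these two classes were picked because they need no extra corpus downloads. The helper name `tokenize_and_stem` and the sample sentence are made up for illustration.

```python
# Assumes NLTK is installed; neither class below needs corpus downloads.
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer


def tokenize_and_stem(text):
    """Split text into lowercase word tokens and pair each with its stem."""
    tokenizer = RegexpTokenizer(r"\w+")  # keep runs of word characters
    stemmer = PorterStemmer()
    tokens = tokenizer.tokenize(text.lower())
    return [(token, stemmer.stem(token)) for token in tokens]


if __name__ == "__main__":
    # Hypothetical sample sentence, invented for this example.
    for token, stem in tokenize_and_stem("Selling two used laptops"):
        print(token, "->", stem)
```

Tagging and parsing work along the same lines, but they rely on models or corpora that NLTK downloads separately.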

There is also some excellent Python-based code that originated from Google's Natural Language Toolkit project. You can find a link to that code here on GitHub.

Java

The first place to look would be Stanford's Natural Language Processing Group. All of the software distributed there is written in Java. All recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

Another great, more general-purpose option that you see in a lot of machine learning environments is Weka. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Author: Jin Ling

Updated on July 30, 2022

Comments

  • Jin Ling
    Jin Ling almost 2 years

    I would like to know which programming language is better for natural language processing: Java or Python? I have found lots of questions and answers about it, but I am still lost as to which one to choose.

    I also want to know which NLP library to use for Java, since there are lots of libraries (LingPipe, GATE, OpenNLP, StanfordNLP). For Python, most programmers recommend NLTK.

    But if I am to do some text processing or information extraction from unstructured data (just free-form plain English text) to get some useful information, what is the best option? Java or Python? Which library is suitable?

    Updated

    What I want to do is extract useful product information from unstructured data (e.g. users post advertisements for mobiles or laptops in many different forms, often in non-standard English).

    • L0j1k
      L0j1k about 10 years
      I hate that these kinds of questions are not welcome here on SO. I think the intent was to prevent holy wars, but this contributes to the content IMO.
    • Scott Smith
      Scott Smith almost 8 years
      If it were worded to say "What are the leading Java and Python NLP libraries and their relative strengths?" maybe that solves it? The answer changes over time, but I also find questions like this very useful.
    • Ksofiac
      Ksofiac almost 7 years
      I also wish these sorts of questions were welcomed on SO. I recently tried to survey NLP strengths in Python vs R, and it was immediately shot down. Not bueno for those trying to frame their projects in the right language.
  • L0j1k
    L0j1k about 10 years
    Awesome answer. I really don't understand why these kinds of questions are looked down on here. +1
  • Nathaniel Payne
    Nathaniel Payne about 10 years
    I agree completely. The question that was asked is general. That said, these are precisely the types of questions that I often find myself facing, particularly when I am new to an area.
  • Nathaniel Payne
    Nathaniel Payne about 10 years
    In terms of Java based libraries and tools, another great one that you might look at is LingPipe. alias-i.com/lingpipe
  • Jin Ling
    Jin Ling about 10 years
    @NathanielPayne: Thank you so much for your suggestions. That gives me some guide to start NLP.
  • Jin Ling
    Jin Ling about 10 years
    Thanks for giving lots of information about NLP and ML tools