Sentiment analysis for Twitter in Python

51,259

Solution 1

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that easily cooperate with it.

You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. This has been used for classification tasks in various areas of natural language processing. You also have a pick of a number of different models. I'd suggest starting with Maximum Entropy classification so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.

The University of Texas at Austin computational linguistics groups have held classes where most of the projects coming out of them have used this great tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.

Another great tool which works in the same vein is Mallet. The difference between Mallet is that there's a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical stuff, but it's really mostly meant for pedagogical purposes, and isn't really something I'd put into production.

Good luck with your task. The real difficult part will probably be the amount of knowledge engineering required up front for you to classify the 'seed set' off of which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run some tenfold or remove-one tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.

Solution 2

Good luck with that.

Sentiment is enormously contextual, and tweeting culture makes the problem worse because you aren't given the context for most tweets. The whole point of twitter is that you can leverage the huge amount of shared "real world" context to pack meaningful communication in a very short message.

If they say the video is bad, does that mean bad, or bad?

A linguistics professor was lecturing to her class one day. "In English," she said, "A double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative."

A voice from the back of the room piped up, "Yeah . . .right."

Solution 3

Thanks everyone for your suggestions, they were indeed very useful! I ended up using a Naive Bayesian classifier, which I borrowed from here. I started by feeding it with a list of good/bad keywords and then added a "learn" feature by employing user feedback. It turned out to work pretty nice.

The full details of my work as in a blog post.

Again, your help was very useful, so thank you!

Solution 4

I have constructed a word list labeled with sentiment. You can access it from here:

http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip

You will find a short Python program on my blog:

http://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/

This post displays how to use the word list with single sentences as well as with Twitter.

Word lists approaches have their limitations. You will find a investigation of the limitations of my word list in the article "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs". That article is available from my homepage.

Please note a unicode(s, 'utf-8') is missing from the code (for paedagogic reasons).

Solution 5

I think you may find it difficult to find what you're after. The closest thing that I know of is LingPipe, which has some sentiment analysis functionality and is available under a limited kind of open-source licence, but is written in Java.

Also, sentiment analysis systems are usually developed by training a system on product/movie review data which is significantly different from the average tweet. They are going to be optimised for text with several sentences, all about the same topic. I suspect you would do better coming up with a rule-based system yourself, perhaps based on a lexicon of sentiment terms like the one the University of Pittsburgh provide.

Check out We Feel Fine for an implementation of similar idea with a really beautiful interface (and twitrratr).

Share:
51,259
Ran
Author by

Ran

Updated on July 08, 2022

Comments

  • Ran
    Ran almost 2 years

    I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source implementation I can use?

    I'm writing an application that searches twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets. I'm using Google's appengine, so it's in python. I'd like to be able to classify the returned search results from twitter and I'd like to do that in python. I haven't been able to find such sentiment analyzer so far, specifically not in python. Are you familiar with such open source implementation I can use? Preferably this is already in python, but if not, hopefully I can translate it to python.

    Note, the texts I'm analyzing are VERY short, they are tweets. So ideally, this classifier is optimized for such short texts.

    BTW, twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately, the classification provided by them isn't that great, so I figured I might give this a try myself.

    Thanks!

    BTW, an early demo is here and the code I have so far is here and I'd love to opensource it with any interested developer.

  • Ran
    Ran over 15 years
    Thanks. I'm only doing this at nights, so... it'll take some time, but I'll post an update when I have something ready
  • amit kumar
    amit kumar about 14 years
    NLTK code is available under Apache License 2.0 as per nltk.org/faq
  • Scott Weinstein
    Scott Weinstein about 12 years
    I think the quote was "yeah yeah" - from Sidney Morgenbesser
  • Swapnil
    Swapnil over 11 years
    Why do you say Weka is for pedagogical purposes? Isn't it part of pentaho BI suite? And pentaho does serve enterprises.
  • andilabs
    andilabs almost 11 years
    "Posterous Spaces is no longer available" Could you post python code somewhere?
  • Finn Årup Nielsen
    Finn Årup Nielsen almost 11 years
    Thanks for noting it. I have now changed the posterous link to a Wordpress link where I moved my blog.
  • andilabs
    andilabs over 10 years
    Could you say something about any experiments with your sentiment wordslit? I mean what was precission, recall of classification.
  • Finn Årup Nielsen
    Finn Årup Nielsen over 10 years
    I have links to a few evaluations here: neuro.compute.dtu.dk/wiki/AFINN#Evaluation I have not myself evaluated its performance in terms of precision, recall and classification. What I did was rank correlation with Mislove's Amazon Mechanical Turk labeling of tweets.
  • Petrutiu Mihai
    Petrutiu Mihai about 10 years
    blog post link is not working anymore, could you update it?
  • Ran
    Ran about 10 years
    Hi @PetrutiuMihai indeed that blog was taken down. But it's pretty old stuff, not at the front of research as of today, so you won't be missing much ;(