How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

java python r machine-learning classification

16,815

Solution 1

I would do it as follows:

Split the sentence into words, normalise them, build a dictionary
With each word, store how many times it occurred in tweets about the company and how many times it occurred in tweets about the fruit - these tweets must be confirmed by a human
When a new tweet comes in, find every word in the tweet in the dictionary, calculate a weighted score - words that are used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score.
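
A minimal sketch of the three steps above, assuming a tiny hand-labelled sample. The answer does not specify a scoring formula, so the one used here (difference of the two counts divided by their sum plus one) is only an illustrative choice, and `train_counts` and `score_tweet` are hypothetical names:

```python
from collections import defaultdict

def train_counts(labelled_tweets):
    """Count, per normalised word, occurrences in company and fruit tweets."""
    counts = defaultdict(lambda: {"company": 0, "fruit": 0})
    for text, label in labelled_tweets:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts

def score_tweet(text, counts):
    """Positive score leans towards the company, negative towards the fruit."""
    score = 0.0
    for word in text.lower().split():
        if word not in counts:
            continue  # unseen words carry no evidence either way
        c = counts[word]["company"]
        f = counts[word]["fruit"]
        # Words seen equally in both classes contribute 0; rare words stay small
        score += (c - f) / (c + f + 1)
    return score

# Illustrative, made-up training data (labels assigned by hand)
sample = [
    ("apple releases new iphone", "company"),
    ("apple stock rises after iphone launch", "company"),
    ("baked an apple pie today", "fruit"),
    ("apple pie recipe with cinnamon", "fruit"),
]
counts = train_counts(sample)
company_score = score_tweet("new iphone from apple", counts)
fruit_score = score_tweet("apple pie for dessert", counts)
```

Here "apple" itself scores zero because it appears equally in both classes, so the decision rests on the surrounding words.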

Solution 2

What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.

Essentially, it looks at the content and context of the word, (looking back and forward a few words), to estimate the probability that the word is a named entity.

Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "Vowel-consonant-vowel")
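
To make the idea of shape and context features concrete, here is a small sketch. It is not Stanford NER's actual feature set; `word_shape` and `token_features` are hypothetical helpers, and the two-word window is an assumption:

```python
def word_shape(word):
    """Map each character to a class: V/v vowel, C/c consonant, d digit."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isalpha():
            symbol = "v" if ch.lower() in "aeiou" else "c"
            shape.append(symbol.upper() if ch.isupper() else symbol)
        else:
            shape.append(ch)  # keep punctuation as-is
    return "".join(shape)

def token_features(tokens, i, window=2):
    """Describe token i by its own content and by its neighbours."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "shape": word_shape(word),
        "length": len(word),
    }
    for offset in range(1, window + 1):
        if i - offset >= 0:
            features["prev_%d" % offset] = tokens[i - offset].lower()
        if i + offset < len(tokens):
            features["next_%d" % offset] = tokens[i + offset].lower()
    return features

tokens = ["eating", "an", "apple", "at", "Apple", "headquarters"]
features = token_features(tokens, 4)
```

A sequence model such as a CRF would then be trained on feature dictionaries like this one, one per token.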

A very good library (GPL) is Stanford's NER

Here's the demo: http://nlp.stanford.edu:8080/ner/

Some sample text to try:

I was eating an apple over at Apple headquarters and I thought about Apple Martin, the daughter of the Coldplay guy

(the 3class and 4class classifiers get it right)

Solution 3

I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat-competitive with existing solutions and (crucially) is customisable.

There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!

I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall. Basically they can be precise (when they say "This is Apple Inc" they're typically correct) but have low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code: https://github.com/ianozsvald/social_media_brand_disambiguator

I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe that this straightforward approach will work.

I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).
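
To show what binary 1-3 n-gram features look like, here is a plain-Python sketch (not the scikit-learn vectorizer the project uses; `ngrams` is a hypothetical helper). Using a set means a repeated term is recorded once, which is the "binary" part:

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return the set of all n-grams with n_min <= n <= n_max."""
    found = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[start:start + n]))
    return found

# "apple" and "pie" each occur twice but appear once in the feature set
features = ngrams("apple pie apple pie".split())
```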

I use scikit-learn's default tokenizer, which is very basic but surprisingly useful. It ignores @ and # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).

The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.

This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.

I have a series of blog posts describing this project including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).

If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.
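
A sketch of that trade-off, assuming you already have one probability per tweet for the "company" class (for example, one column of a classifier's probability output). The 0.8 threshold and the name `confident_predictions` are illustrative assumptions:

```python
def confident_predictions(probabilities, threshold=0.8):
    """Return (index, label) only for rows where one class clears the threshold."""
    kept = []
    for index, p_company in enumerate(probabilities):
        if p_company >= threshold:
            kept.append((index, "company"))
        elif p_company <= 1.0 - threshold:
            kept.append((index, "other"))
        # anything in between is left unclassified
    return kept

probs = [0.95, 0.55, 0.10, 0.81, 0.40]
kept = confident_predictions(probs)
```

Raising the threshold leaves more tweets unclassified (lower recall) in exchange for fewer wrong labels (higher precision).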

Here's a possible algorithmic approach using scikit-learn:

Use a Binary CountVectorizer (I don't think term-counts in short messages add much information as most words occur only once)
Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
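
A condensed sketch of the logistic regression part of these steps (the Decision Tree step is omitted for brevity). It is not the code from learn1.py; the eight tweets are made up and `bag_of_words` is a hypothetical helper for inspecting what survives vectorization:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, hand-labelled examples: 1 = Apple Inc, 0 = other uses of "apple"
tweets = [
    "apple launches new iphone today",
    "apple ios update is out",
    "new macbook from apple announced",
    "apple stock rises after earnings",
    "baked an apple pie today",
    "green apple juice is tasty",
    "apple crumble with custard",
    "picked a fresh apple from the tree",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Binary presence/absence features over 1-3 n-grams
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)
clf = LogisticRegression()
clf.fit(X, labels)

# Rank terms by coefficient: most negative lean to class 0, most positive to class 1
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
ranked = sorted(range(X.shape[1]), key=lambda j: clf.coef_[0][j])
most_other = [index_to_term[j] for j in ranked[:5]]
most_company = [index_to_term[j] for j in ranked[-5:]]

def bag_of_words(tweet):
    """Show which known terms remain after vectorizing a tweet."""
    row = vectorizer.transform([tweet])
    return sorted(index_to_term[j] for j in row.nonzero()[1])

probabilities = clf.predict_proba(vectorizer.transform(["apple pie with iphone"]))
```

Calling `bag_of_words` on a mis-classified tweet shows how few tokens the classifier had to work with.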

Things to consider:

You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
Improve the tokeniser (very easy with scikit-learn) to keep # @ in tokens, and maybe add a capitalised-brand detector (as user @user2425429 notes)
Consider a non-linear classifier (like @oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
Try a tweet-specific part-of-speech tagger (in my humble opinion not Stanford's as @Neil suggests - it performs poorly on poor Twitter grammar in my experience)
Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)
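
As a sketch of the tokeniser suggestion, the regular expression below keeps a leading # or @ attached to its word. The pattern and the helper names are assumptions for illustration; a pattern like this could be supplied to scikit-learn's CountVectorizer through its `token_pattern` parameter:

```python
import re

# An optional # or @ followed by word characters stays together as one token
TOKEN_PATTERN = re.compile(r"[#@]?\w+")

def tweet_tokens(text):
    """Split a tweet into tokens, preserving hashtags and mentions."""
    return TOKEN_PATTERN.findall(text)

def mentions_capitalised_brand(text, brand="Apple"):
    """Crude capitalised-brand detector: exact, case-sensitive token match."""
    return brand in tweet_tokens(text)

tokens = tweet_tokens("@Skittles green apple #notafan")
```

The capitalisation check is only a weak signal, since tweets that start a sentence with "Apple" the fruit will also match.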

Re. overfitting. In my dataset with 2000 items I have a 10 minute snapshot from Twitter of 'apple' tweets. About 2/3 of the tweets are for Apple Inc, 1/3 for other-apple-uses. I pull out a balanced subset (about 584 rows I think) of each class and do five-fold cross validation for training.
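
The balancing and five-fold split can be sketched in plain Python as follows. The row counts are made up, the helper names are hypothetical, and the folds here are not stratified by class:

```python
import random

def balanced_subset(rows, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for group in by_label.values():
        subset.extend(rng.sample(group, smallest))
    rng.shuffle(subset)
    return subset

def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

# Roughly a 2:1 class imbalance, as in the snapshot described above
rows = [("company tweet %d" % i, 1) for i in range(20)]
rows += [("other tweet %d" % i, 0) for i in range(10)]
subset = balanced_subset(rows)
splits = list(k_fold_indices(len(subset)))
```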

Since I only have a 10 minute time-window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshot, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.

Solution 4

You can do the following:

Make a dict of words containing their count of occurrence in fruit and company related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.
Using enough previous data, we can estimate the probability of a word occurring in a tweet about Apple Inc.
Multiply individual probabilities of words to get the probability of the whole tweet.

A simplified example:

p_f = Probability of fruit tweets.

p_w_f = Probability of a word occurring in a fruit tweet.

p_t_f = Combined probability of all words in the tweet occurring in a fruit tweet = p_w1_f * p_w2_f * ...

p_f_t = Probability of fruit given a particular tweet.

p_c, p_w_c, p_t_c, p_c_t are respective values for company.

Laplace smoothing with K = 1 is applied so that new words which are not in our database do not produce a zero probability.

from __future__ import division  # true division on Python 2 as well as Python 3

# Hand-labelled training tweets: '1' = company, '0' = fruit
old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets = total_company_words = total_fruit_words = 0

for tweet in old_tweets:
    company = old_tweets[tweet]
    for word in tweet.lower().split(" "):
        if word not in known_words:
            known_words[word] = {"company": 0, "fruit": 0}
        if company == "1":
            known_words[word]["company"] += 1
            total_company_words += 1
        else:
            known_words[word]["fruit"] += 1
            total_fruit_words += 1

    if company == "1":
        total_company_tweets += 1
    else:
        total_fruit_tweets += 1
total_tweets = len(old_tweets)

def predict_tweet(new_tweet, K=1):
    # Class priors with Laplace smoothing (two classes, hence K*2)
    p_f = (total_fruit_tweets + K) / (total_tweets + K * 2)
    p_c = (total_company_tweets + K) / (total_tweets + K * 2)
    new_words = new_tweet.lower().split(" ")

    p_t_f = p_t_c = 1
    for word in new_words:
        # Words never seen in training get zero counts; smoothing keeps them non-zero
        wordFound = known_words.get(word, {'fruit': 0, 'company': 0})
        p_w_f = (wordFound['fruit'] + K) / (total_fruit_words + K * len(known_words))
        p_w_c = (wordFound['company'] + K) / (total_company_words + K * len(known_words))
        # Accumulate inside the loop so every word contributes, not just the last one
        p_t_f *= p_w_f
        p_t_c *= p_w_c

    # Applying Bayes' rule
    p_f_t = p_f * p_t_f / (p_t_f * p_f + p_t_c * p_c)
    p_c_t = p_c * p_t_c / (p_t_f * p_f + p_t_c * p_c)
    if p_c_t > p_f_t:
        return "Company"
    return "Fruit"
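
One practical caveat with multiplying many small probabilities is floating-point underflow on long inputs. A common variant, sketched here as a self-contained example with made-up training data and hypothetical function names, sums log-probabilities instead; it uses the same Laplace smoothing idea but is not the answer's own code:

```python
import math
from collections import Counter

def train(labelled_tweets):
    """Collect per-class word counts, per-class tweet counts and the vocabulary."""
    word_counts = {"company": Counter(), "fruit": Counter()}
    tweet_counts = Counter()
    for text, label in labelled_tweets:
        tweet_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["company"]) | set(word_counts["fruit"])
    return word_counts, tweet_counts, vocabulary

def log_scores(text, model, k=1):
    """Return log(prior) + sum of log(smoothed word likelihoods) per class."""
    word_counts, tweet_counts, vocabulary = model
    total_tweets = sum(tweet_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log((tweet_counts[label] + k) / (total_tweets + 2 * k))
        for word in text.lower().split():
            count = word_counts[label][word]  # 0 for unseen words
            score += math.log((count + k) / (total_words + k * len(vocabulary)))
        scores[label] = score
    return scores

def predict(text, model):
    """Pick the class with the highest log score."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)

model = train([
    ("apple launches new iphone", "company"),
    ("apple ios update released", "company"),
    ("apple pie is sweet", "fruit"),
    ("fresh apple juice and pie", "fruit"),
])
first = predict("new ios iphone", model)
second = predict("sweet apple pie", model)
```

Because only the comparison between classes matters, the normalising denominator from Bayes' rule can be dropped.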

Solution 5

If you don't have an issue using an outside library, I'd recommend scikit-learn since it can probably do this better & faster than anything you could code by yourself. I'd just do something like this:

Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:

def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
    # Concatenate both sets of tweets; label company tweets 1 and fruit tweets 0
    corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
    labels = [1 for _ in apple_inc_tweets] + [0 for _ in apple_fruit_tweets]
    return (corpus, labels)

The important thing is you end up with two lists that look like this:

(['apple inc tweet i love ios and iphones', 'apple iphones are great', 'apple fruit tweet i love pie', 'apple pie is great'], [1, 1, 0, 0])

The [1, 1, 0, 0] represent the positive and negative labels.

Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:

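from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train(corpus, labels):
    pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
                     ('tfidf', TfidfTransformer(norm='l2')),
                     ('clf', LinearSVC())])
    # The last step is a classifier, so fit (not fit_transform) trains the whole chain
    pipe.fit(corpus, labels)
    return pipe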

Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.
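
One way to decide whether to keep the TfidfTransformer is to cross-validate the pipeline with and without it. The sketch below assumes a made-up eight-tweet corpus and a hypothetical `build_pipeline` helper; with so little data the scores are meaningless, so it only shows the mechanics:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_pipeline(use_tfidf):
    """Same steps as the answer's pipeline, with the TF-IDF step optional."""
    steps = [("vect", CountVectorizer(ngram_range=(1, 3), stop_words="english"))]
    if use_tfidf:
        steps.append(("tfidf", TfidfTransformer(norm="l2")))
    steps.append(("clf", LinearSVC()))
    return Pipeline(steps)

# Made-up, hand-labelled examples: 1 = Apple Inc, 0 = fruit
corpus = [
    "apple launches new iphone today",
    "apple ios update released",
    "new macbook from apple announced",
    "apple stock rises after earnings",
    "baked an apple pie today",
    "green apple juice tastes great",
    "apple crumble with custard",
    "picked a fresh apple from the tree",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Mean accuracy per fold; compare the two variants on your real data
scores_with = cross_val_score(build_pipeline(True), corpus, labels, cv=2)
scores_without = cross_val_score(build_pipeline(False), corpus, labels, cv=2)
```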

Finally, you predict the category of tweets:

def predict(pipe, tweet):
    prediction = pipe.predict([tweet])
    return prediction

Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.

Put all those into a class or whatever, and you're done. At least, with this very basic example.

I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.

EDIT: tried to explain the steps in more detail.


SAL

Programmer who often dabbles in things way over his head. Mainly use PHP, jQuery and JS.

Updated on January 13, 2020

Comments

SAL over 4 years
See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.

Here are a couple of lines:
```
1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay
```
Here is the total data set: http://pastebin.com/eJuEb4eB

I need to build a model that classifies "Apple" (Inc.) from the rest.

I'm not looking for a general overview of machine learning; rather, I'm looking for an actual model in code (Python preferred).
- eddi almost 11 years
  
  You basically want this: en.wikipedia.org/wiki/Bayesian_spam_filtering
- dan almost 11 years
  
  You hand label your data, but want libraries that scale. Is this supervised or unsupervised?
- SAL almost 11 years
  
  It would start out as supervised, with the goal being to allow it to go unsupervised.
- SAL almost 11 years
  
  Eddi, thanks for the comment. Seeing the mail filtering example really helped something click in my brain. I was able to see a real life example of what I was trying to do, just applied differently.
- Neil McGuigan almost 11 years
  
  Named Entity Recognition: nlp.stanford.edu/software/CRF-NER.shtml .
- Ryan almost 11 years
  
  Fascinating @NeilMcGuigan. I pasted in some of the text on their demo (nlp.stanford.edu:8080/ner/process) and was impressed with how different models classified the words.
- Dan Albert almost 11 years
  
  You can actually get rather remarkably good predictions with even the most Naïve of Bayesian algorithms. If you need a push in the right direction, you can take a look at a movie ratings analyzer I wrote for an AI class.
- Vadim Ponomarenko almost 11 years
  
  A brainless approach that works on almost all the test data is to test for the presence of any other words for food.
SAL almost 11 years

Thank you for your answer on this. Your answer in conjunction with a comment above really helped me get towards a solution. Can you help me hone this solution?
SAL almost 11 years

In the function I did a strtolower to filter out the capital letters. A little crude, but it worked.
user2425429 almost 11 years

@SAL I didn't expect it to be very useful, but if you have a time limit, then...
Ryan almost 11 years

Thanks Manetheran. I'm not the original poster, but I'm also interested in the answer. For the bounty I'm looking for some code (even using nltk) that can help get me started in the right direction with a "hello world" machine learning task. The Apple (inc) vs. apple (fruit) seems like a perfect assignment.
Ryan almost 11 years

Thanks Fawar. I was hoping for some code on this "hello world" for this exact purpose — to learn how ML works. I will look up the class though. Looks good.
Ryan almost 11 years

That was really interesting. Is it possible to view the code for english.conll.4class.distsim.crf.ser.gz? I'd love to see how one builds something like this.
sanityinc almost 11 years

This is an informal description of Bayesian classification.
AMADANON Inc. almost 11 years

I prefer "pseudo-code implementation of Bayesian classification" :)
Neil McGuigan almost 11 years

The code for NER is open source, but the data that they used in the CONLL conferences is not. However, you can find the Reuters Corpus online at NIST.
Ryan almost 11 years

I haven't had the pleasure of looking through your code and trying to duplicate/emulate/educate, but I do owe you an apology for not awarding the full 50pts of the bounty. I was away from SO over the weekend and missed the deadline to award it. Thankfully the SO community stepped in and saw fit to award you 25pts.
senthilkumar almost 11 years

No problem :-) The code, README and blog posts should give you an idea about my approach. It is deliberately simple but Seems To Work Ok.
Ryan almost 11 years

how does this work? I don't see your "chosen features" in your code. Does it automatically choose features based on training set? Or is it stored in dict() somewhere else? I think if one's training set is large enough, shouldn't a computer be able to figure out the features itself? (unsupervised?)
Paul Dubs almost 11 years

The features are extracted using the tweet_features function. It basically removes the urls from the tweets, and then creates a feature dict whose entries read something like 'hasBigram(foo,bar)' = True.
Ryan almost 11 years

So 'hasBigram(foo,bar)' = True where tweet string includes foo bar? So it builds bigrams and trigrams for each tweet and flags it in the positive feature dict()? Therefore given the tweet, "alpha beta gamma delta", it will build dict() bigrams for alpha,beta; beta,gamma; and gamma,delta; and trigrams for alpha,beta,gamma and beta,gamma,delta? And from the given positive and negative bi and tri grams, the decisiontree or bayes classifiers can do their magic?
Paul Dubs almost 11 years

Exactly. When using the bayes classifier you can also get the most useful features by calling "show_most_informative_features()" on it.
Ryan almost 11 years

Paul, I built a crude PHP version of this and you're absolutely correct. This is a super efficient way to build a weighted dictionary. I think this could easily scale without having to manually build all of the keywords. I look forward to learning more about how to do this inside standard machine learning libraries.
Szymon Maszke over 5 years

At the same time, spaCy has an NER pipeline component; wouldn't it be beneficial for this classification? I assume their model can recognize Apple (as it's one of the biggest and best-known companies in the world) much better than a model you can come up with in one day.
Dim over 5 years

@Szymon: NER may or may not help. As I understand, you want to use named entities (the fact that they are present in the text) to be a feature for the main classification task. Apparently, NER won't have 100% accuracy as there is a high level of ambiguity. So the main classification model will decide under which circumstances will it trust this feature. It may turn out (I think it is very likely) that a basic classification model will give a very low weight to the results of the NER model. And this means that you will spend time on NER, which is (almost) not used.
Szymon Maszke over 5 years

Not what I meant. Just create a spacy.Doc from each text, iterate over its named entities with doc.ents and check whether any entity has a .text attribute equal to Apple. Fun fact, their first example consists of Apple.
Szymon Maszke over 5 years

And if someone wanted to make a model, it would most probably involve RNNs/CNNs and the like, tune them accordingly, find architecture, cell types etc., I don't think easier models would handle disambiguation and context well. Why make your life easier (unless you want to learn something along the way), if someone's already done it for you?
Dim over 5 years

@SzymonMaszke your model is more complicated and more difficult to train. For your model to work for the mentioned purpose you have not only to find a NE, but also to find it at the correct place (token). With the categorization model that I suggest, you optimize the model for your core goal - identifying whether it is Apple the company or apple the fruit. That is easier to train and therefore most likely it will be more accurate.
Szymon Maszke over 5 years

I think you have misunderstood my point of view. spaCy comes with pretrained NER model, you don't have to train it. I suppose it would easily distinguish Apple and apple as fruit. I see no point in custom model if the one provided would probably suffice in his use case. If it doesn't he could move on to constructing new model/retraining spaCy's, I doubt it would be needed though.
Dim over 5 years

@SzymonMaszke the default spacy model is not good enough for specific tasks. It always needs to be trained.