Forming Bigrams of words in list of sentences with Python

Solution 1

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
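
If you instead want each sentence's bigrams kept in its own list (as Joe's comment below also suggests), a minimal variation moves the zip inside a nested comprehension:

>>> [list(zip(l.split(" ")[:-1], l.split(" ")[1:])) for l in text]
[[('this', 'is'), ('is', 'a'), ('a', 'sentence')], [('so', 'is'), ('is', 'this'), ('this', 'one')]]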

Solution 2

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    token = word_tokenize(line)
    # the 2 means bigrams; change it to get n-grams of a different size
    bigram = list(ngrams(token, 2))
    print(bigram)
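
For example, passing 3 instead of 2 yields trigrams; lines shorter than n simply produce an empty list. A minimal sketch (assumes NLTK's punkt tokenizer data is installed):

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
# collect each line's n-grams instead of overwriting a variable on every pass
trigrams = [list(ngrams(word_tokenize(line), 3)) for line in text]
print(trigrams)  # [[('cant', 'railway', 'station')], [], []]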

Solution 3

Rather than turning your text into lists of strings, start with each sentence as a separate string. I've also removed punctuation and stopwords; just drop those parts if they're irrelevant to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join(stemmer.stem(w).lower() for w in x.split())
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, do something like this:

for line in sentences:  # sentences is your list of sentence strings
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).
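
If you want the scores themselves rather than just the top-ranked pairs, the finder's score_ngrams method returns (bigram, score) tuples sorted best-first; a minimal sketch:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = ['cant', 'railway', 'station', 'citadel', 'hotel', 'police', 'stn']
finder = BigramCollocationFinder.from_words(tokens)
# each entry is ((w1, w2), chi-squared score)
for bigram, score in finder.score_ngrams(BigramAssocMeasures.chi_sq):
    print(bigram, score)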

Solution 4

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans)
# prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
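
The nested single-element lists above mirror the format sketched in the question; if plain word tuples are enough, the same loop can append tuples instead:

ans = []
text = ['cant railway station', 'citadel hotel', ' police stn']
for line in text:
    arr = line.split()  # split() with no argument also drops the stray leading space
    for i in range(len(arr) - 1):
        ans.append((arr[i], arr[i + 1]))

print(ans)  # [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]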

Solution 5

>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text  for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

This uses enumerate and split to pair each word with its successor within a sentence.
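
On Python 3.10+, itertools.pairwise expresses the same idea without the index bookkeeping:

>>> from itertools import pairwise
>>> text = ['cant railway station', 'citadel hotel', ' police stn']
>>> [pair for tex in text for pair in pairwise(tex.split())]
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]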

Author: Hypothetical Ninja

Updated on May 02, 2022

Comments

  • Hypothetical Ninja about 2 years

    I have a list of sentences:

    text = ['cant railway station', 'citadel hotel', ' police stn']

    I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:

    text2 = [[word for word in line.split()] for line in text]
    bigrams = nltk.bigrams(text2)
    print(bigrams)
    

    which yields

    [(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]
    

    Here, 'cant railway station' and 'citadel hotel' form one bigram. What I want is

    [([cant], [railway]), ([railway], [station]), ([citadel], [hotel]), and so on...
    

    The last word of the first sentence should not merge with the first word of the second sentence. What should I do to make this work?

  • Hypothetical Ninja about 10 years
    Are they bigrams by default? Because I'll need them for spell correction.
  • Nir Alfasi about 10 years
    @Sword you can see that it generates only bigrams from the last line (before the print). Play with it, try different sentences and see for yourself ;)
  • dashesy over 6 years
    The stemmer changes apple to appl, so I get ['appl basket'].
  • Dan over 6 years
    Yeah, there are some limitations with the Porter stemmer.
  • Thomas Decaux almost 6 years
    But this is not per sentence; you should use from_documents.
  • Joe almost 4 years
    And if you want to keep each sentence's bigrams in its own list (here n is the n-gram size and x the list of sentences): [[b for b in zip(l.split(" ")[:-(n-1)], l.split(" ")[(n-1):])] for l in x]
  • Ender about 3 years
    This is a wonderful approach for the general case and solves the OP's question straightforwardly, but it is worth mentioning that it is sometimes useful to treat punctuation marks as separate words, e.g. if the intent is to train an n-gram language model in order to calculate the grammaticality of a sentence, so .split(" ") may not be ideal here. It may be best to use nltk.word_tokenize along with nltk.sent_tokenize instead (see the sketch after this list).
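
Following Ender's suggestion, a minimal sketch that splits text into sentences first so punctuation becomes its own token (assumes NLTK's punkt tokenizer data has been downloaded):

import nltk

raw = "This is a sentence. So is this one!"
# tokenize sentence by sentence so bigrams never cross a sentence boundary
bigrams = [b for sent in nltk.sent_tokenize(raw)
           for b in nltk.bigrams(nltk.word_tokenize(sent))]
print(bigrams)
# [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.'),
#  ('So', 'is'), ('is', 'this'), ('this', 'one'), ('one', '!')]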