Forming Bigrams of words in list of sentences with Python
Solution 1
Using list comprehensions and zip:
>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
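One caveat worth keeping in mind: split(" ") splits on every single space, so a sentence with leading or repeated spaces (such as ' police stn' from the question) produces empty-string tokens. Below is a minimal sketch of the same zip idea using split() with no argument, which collapses runs of whitespace. The helper name sentence_bigrams is made up for illustration.

```python
def sentence_bigrams(sentences):
    """Return word bigrams for each sentence, never crossing sentence boundaries."""
    result = []
    for sentence in sentences:
        words = sentence.split()  # no argument: splits on any run of whitespace
        result.extend(zip(words, words[1:]))
    return result

text = ["cant railway station", "citadel hotel", " police stn"]
print(sentence_bigrams(text))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

# For comparison, split(" ") keeps an empty token for the leading space:
print(" police stn".split(" "))
# ['', 'police', 'stn']
```

Pairing words with words[1:] avoids slicing off the last element explicitly, because zip stops at the shorter sequence.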
Solution 2
from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
bigrams = []
for line in text:
    token = word_tokenize(line)
    # the '2' requests bigrams; change it to get n-grams of a different size
    bigrams.extend(ngrams(token, 2))
print(bigrams)
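If NLTK is not at hand, the effect of changing the size argument can be sketched in plain Python. This is a stand-in for illustration, not NLTK's implementation: sliding_ngrams is a hypothetical helper, and str.split replaces word_tokenize here.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete windows at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

text = ['cant railway station', 'citadel hotel', 'police stn']
for n in (2, 3):
    grams = [g for line in text for g in sliding_ngrams(line.split(), n)]
    print(n, grams)
# 2 [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
# 3 [('cant', 'railway', 'station')]
```

Sentences shorter than n simply contribute nothing, which is why only the three-word sentence yields a trigram.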
Solution 3
Rather than turning your text into lists of strings, start with each sentence as a separate string. I've also removed punctuation and stopwords; drop those parts if they are irrelevant to you:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    # Score candidate bigrams with chi-squared and keep up to the 500 best
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    # Append each bigram to the token list as a single "word1 word2" string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    # Keep entries longer than 8 characters that are not stopwords, then stem and lowercase them
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
To use it (here sentence is your list of sentence strings):
for line in sentence:
    features = get_bigrams(line)
    # train set here
Note that this goes a little further and statistically scores the bigrams (here with the chi-squared measure), which will come in handy when training the model.
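To make "statistically scores" a bit more concrete, here is a rough sketch of a chi-squared score computed from a 2x2 contingency table of bigram counts. It is a simplified illustration, not NLTK's code, and its numbers need not match BigramAssocMeasures.chi_sq exactly. The name chi_squared_scores is hypothetical.

```python
from collections import Counter

def chi_squared_scores(tokens):
    """Score each adjacent word pair by how strongly its two words co-occur."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)   # times a word starts a pair
    second_counts = Counter(w2 for _, w2 in pairs)  # times a word ends a pair
    scores = {}
    for (w1, w2), both in pair_counts.items():
        only_first = first_counts[w1] - both    # w1 followed by some other word
        only_second = second_counts[w2] - both  # w2 preceded by some other word
        neither = n - both - only_first - only_second
        denom = ((both + only_first) * (only_second + neither)
                 * (both + only_second) * (only_first + neither))
        # A zero denominator means the table is degenerate; score it as 0.0
        scores[(w1, w2)] = (
            n * (both * neither - only_first * only_second) ** 2 / denom
            if denom else 0.0
        )
    return scores

scores = chi_squared_scores("a b a b c".split())
print(scores[("a", "b")])  # 4.0
```

A higher score means the pair occurs together more often than its individual word counts would suggest, which is the idea behind ranking bigrams with nbest.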
Solution 4
Without nltk:
ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])
print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
Solution 5
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
This uses the enumerate and split functions.
Comments
-
Hypothetical Ninja about 2 years
I have a list of sentences:
text = ['cant railway station','citadel hotel',' police stn'].
I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:
text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)
which yields
[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]
Here 'cant railway station' and 'citadel hotel' form one bigram. What I want is
[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...
The last word of the first sentence should not merge with the first word of second sentence. What should I do to make it work?
-
Hypothetical Ninja about 10 years
Are they bigrams by default? I'll be needing them for spell correction.
-
Nir Alfasi about 10 years
@Sword you can see that it generates only bigrams from the last line (before the print). Play with it, try different sentences and see for yourself ;)
-
dashesy over 6 years
The stemmer changes apple to appl, so I get ['appl basket'].
-
Dan over 6 years
Yeah, there are some limitations with the Porter stemmer.
-
Thomas Decaux almost 6 years
But this is not by sentences; you should use from_documents.
-
Joe almost 4 years
And if you want to keep each sentence's bigrams in its own list:
[[b for b in zip(l.split(" ")[:-(n-1)], l.split(" ")[(n-1):])] for l in x]
-
Ender about 3 years
This is a wonderful approach for the general case and solves the OP's question straightforwardly. It is also worth mentioning that it is sometimes useful to treat punctuation marks as separate words, e.g. when training an n-gram language model to score the grammaticality of a sentence, so .split(" ") may not be ideal here. It may be better to use nltk.word_tokenize along with nltk.sent_tokenize instead.
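As a rough illustration of that point, the sketch below treats punctuation marks as separate tokens using only the standard library. The regular expressions are simple stand-ins for nltk.sent_tokenize and nltk.word_tokenize, not equivalents, and the function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def bigrams_per_sentence(text):
    """Build bigrams within each sentence, keeping punctuation tokens."""
    result = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        result.append(list(zip(tokens, tokens[1:])))
    return result

print(bigrams_per_sentence("Is this one? Yes, it is."))
# prints (wrapped here for readability):
# [[('Is', 'this'), ('this', 'one'), ('one', '?')],
#  [('Yes', ','), (',', 'it'), ('it', 'is'), ('is', '.')]]
```

A real tokenizer handles abbreviations, contractions and quotes far better than these patterns, so this is only meant to show how punctuation tokens end up inside the bigrams.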