UPDATE – 24/03/2018: I’m in the process of rewriting this article. If you can follow some non-trivial Python, take a look at my GitHub repository for a more elegant implementation.
OUTDATED information from here…
I’ve written a very small code snippet that generates n-grams. I’ve also added a small tweak that counts the number of times each n-gram appears in the document.
The example I’ve considered is one of Shakespeare’s plays (All’s Well That Ends Well). I’ll be generating the most common 3-, 4-, 5-, or 6-word phrases Shakespeare used in this particular play.
The first thing to do is to clean up the document: removing stuff like ACT 1, SCENE 1, [To Derpina] and so on. The next step is tokenising the document (splitting it into tokens by stripping punctuation and whitespace), as in the sketch below.
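Something like the following would do for that step, assuming the play is saved as a plain-text file. The filename allswell.txt and the regular expressions here are my own choices for this sketch, not part of the original code:

import re

# read the raw play text (allswell.txt is an assumed filename)
with open("allswell.txt") as f:
    text = f.read().lower()

# drop bracketed stage directions such as [To Derpina]
text = re.sub(r"\[.*?\]", " ", text)

# keep only runs of letters and apostrophes as tokens; headings like
# ACT 1 / SCENE 1 still come through as ordinary words, so they are
# easiest to strip from the source file by hand beforehand
word_list = re.findall(r"[a-z']+", text)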
Now we get into action:
import operator

# By now word_list should hold the words of the file, with no stray
# punctuation or whitespace left attached to them.

# n for n-gram; change it to whatever the requirement is
n = 6
ngrams = dict()

# count every n-gram by sliding a window of size n over the word list
for i in range(len(word_list) - n + 1):
    gram = tuple(word_list[i:i+n])
    if gram in ngrams:
        ngrams[gram] += 1
    else:
        ngrams[gram] = 1

# now ngrams maps every n-gram of the play to its frequency;
# sort by count, most frequent first (items() is the Python 3
# replacement for the old Python 2 iteritems())
sorted_ngrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
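From here, seeing the winners is a one-line loop over the sorted list. A quick sketch (the cutoff of ten is arbitrary):

# show the ten most frequent n-grams along with their counts
for gram, count in sorted_ngrams[:10]:
    print(" ".join(gram), "-", count)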
Okay! That is the only part of the program that really needs explaining, and I believe the code is self-explanatory if you know a bit of Python.
The source code can be found in my repository.