UPDATE – 24/03/2018: I’m in the process of rewriting this article. If you can follow some non-trivial Python, take a look at my GitHub repository for a more elegant implementation.
OUTDATED information from here…
The example I’ve considered is a Shakespeare play, All’s Well That Ends Well. I’ll be generating the most common 3-, 4-, 5-, or 6-word phrases that Shakespeare used in this particular play.
The first thing to do is to clean up the document, removing headings and stage directions like ACT I, SCENE 1, [To Derpina], etc. The next step is tokenising the document (splitting it into tokens by stripping punctuation and white space).
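As a rough sketch of that cleanup step (the regular expressions and the sample string here are my own assumptions, not the code used for the actual play text):

```python
import re

def tokenise(text):
    # Strip headings and stage directions such as "ACT I", "SCENE 1", "[To Derpina]"
    text = re.sub(r'ACT\s+[IVX\d]+|SCENE\s+\d+', '', text)
    text = re.sub(r'\[.*?\]', '', text)
    # Lowercase and keep only runs of letters/apostrophes as tokens
    return re.findall(r"[a-z']+", text.lower())

word_list = tokenise("ACT I SCENE 1 [To Derpina] All's well, that ends well.")
# word_list is now: ["all's", 'well', 'that', 'ends', 'well']
```

Real play text has more noise (character names, line numbers), so you would extend the patterns to match your source file.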
Now we get into action:
# By now you should have a list of the words in the file.
# There should be no unnecessary punctuation marks at the end
# of the words, or any unnecessary white space either.
# word_list now contains that list; generate the n-grams from it.
# print(word_list)

import operator

# n for n-gram; change it to whatever the requirement is
n = 6
ngrams = dict()

# count every n-gram in the word list
for i in range(len(word_list) - n + 1):
    gram = tuple(word_list[i:i+n])
    if gram in ngrams:
        ngrams[gram] += 1
    else:
        ngrams[gram] = 1

# now ngrams contains all the n-grams of the play,
# sorted here from most to least frequent
sorted_ngrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
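For reference, the standard library's collections.Counter does the same counting more compactly and gives you the top phrases directly (the word_list here is a toy example of my own, not the play text):

```python
from collections import Counter

word_list = "to be or not to be or not to be".split()
n = 3

# Count every n-gram in one pass
ngrams = Counter(tuple(word_list[i:i+n]) for i in range(len(word_list) - n + 1))

# Print the most common 3-word phrases with their counts
for gram, count in ngrams.most_common(3):
    print(' '.join(gram), count)
```

Counter is a dict subclass, so the manual loop above and this version produce the same counts; most_common replaces the explicit sorted call.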
Okay! This is the only part of the program that needs explaining, and I believe the code is self-explanatory if you know a bit of Python.
The source code can be found in my repository.