Word Prediction Algorithm

Jeremy Voisey
19th May 2017

Executive Summary

Using a training sample (25%) of data collected from Twitter, blogs and news sources, n-gram counts were analysed to develop an efficient algorithm to predict the next word in a phrase.

Three different models were examined and compared using a validation sample (1%). The final model chosen was extremely simple, yet effective for the purpose required.

The final algorithm took around 25 ms to predict the next word on a modern computer. The success rate on a test sample of 2000 phrases was 34.5%, where “success rate” is defined as the percentage of times the next word is among the five words suggested.
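The success-rate definition above can be sketched as a top-five hit rate. This is a minimal illustration in Python rather than the R used for the app; `success_rate`, `toy_predict` and the four test cases are hypothetical names and data invented for the example, not the author's code or test sample.

```python
def success_rate(predict, test_cases, top_k=5):
    """Percentage of cases whose true next word is among the top_k suggestions.

    predict    -- function mapping a list of context words to a ranked word list
    test_cases -- list of (context_words, actual_next_word) pairs
    """
    hits = 0
    for context, actual_next in test_cases:
        if actual_next in predict(context)[:top_k]:
            hits += 1
    return 100.0 * hits / len(test_cases)


# Hypothetical stand-in predictor: always suggests the same five words.
def toy_predict(context):
    return ["the", "to", "a", "and", "of"]


cases = [
    (["going"], "to"),
    (["in"], "the"),
    (["thank"], "you"),   # miss: "you" is not among the suggestions
    (["one"], "of"),
]
rate = success_rate(toy_predict, cases)
print(rate)  # 75.0 (three hits out of four)
```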

The Shiny app developed was written using HTML and JavaScript, allowing greater control over the client-server communication.

Model Building

N-gram counts (n = 1 to 5) were collected from the training data. Three potential n-gram models were considered.
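Collecting the counts might look like the following sketch. It is written in Python for illustration, and it assumes the text has already been split into word tokens; `count_ngrams` and the toy sentence are hypothetical and do not reflect the author's actual preprocessing.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Return a dict mapping n to a Counter of n-gram tuples, for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens)

print(counts[1][("the",)])        # 3
print(counts[2][("the", "cat")])  # 2
```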

Simple Back Off

Predicted words for each preceding n-gram (n = 1 to 4) were stored in descending order of probability. Priority was given to predictions based on higher order n-grams.
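A minimal sketch of this lookup, assuming a table that maps each preceding n-gram to its most frequent following words. The code is Python for illustration only; `build_table`, `predict` and the toy corpus are hypothetical, and the author's stored format is not described in the source.

```python
from collections import Counter, defaultdict


def build_table(tokens, max_context=4, top_k=5):
    """Map each context (1..max_context words) to its top_k followers, most frequent first."""
    followers = defaultdict(Counter)
    for n in range(1, max_context + 1):
        for i in range(len(tokens) - n):
            followers[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return {ctx: [w for w, _ in c.most_common(top_k)] for ctx, c in followers.items()}


def predict(table, phrase_tokens, max_context=4, top_k=5):
    """Collect suggestions from the longest matching context first, then back off."""
    suggestions = []
    for n in range(min(max_context, len(phrase_tokens)), 0, -1):
        context = tuple(phrase_tokens[-n:])
        for word in table.get(context, []):
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == top_k:
                return suggestions
    return suggestions


corpus = "i like tea i like tea i like milk you like juice".split()
table = build_table(corpus)

print(predict(table, ["i", "like"]))    # ['tea', 'milk', 'juice']
print(predict(table, ["you", "like"]))  # ['juice', 'tea', 'milk']
```

In the second call, "juice" ranks first because it comes from the longer context ("you like"), even though "tea" is more frequent after "like" alone.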

Stupid Back Off

The probability of the next word was interpolated from the probabilities of all n-grams, with higher-order n-grams given a higher weighting.
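Read literally, the description above is a weighted sum of relative frequencies across context lengths. The Python sketch below illustrates that reading only; the weights `[0.1, 0.3, 0.6]`, the function names and the toy corpus are assumptions for the example, since the source does not state the weights or the exact scoring formula used.

```python
from collections import Counter, defaultdict


def follower_counts(tokens, max_context=2):
    """Count followers of every context of length 0..max_context (length 0 = unigrams)."""
    followers = defaultdict(Counter)
    for n in range(0, max_context + 1):
        for i in range(len(tokens) - n):
            followers[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return followers


def interpolated_scores(followers, phrase_tokens, weights):
    """Sum weight * relative frequency over context lengths; weights[n] is for length n."""
    scores = Counter()
    for n, weight in enumerate(weights):
        if n > len(phrase_tokens):
            break
        # Last n words of the phrase (empty tuple when n == 0).
        context = tuple(phrase_tokens[len(phrase_tokens) - n:])
        counts = followers.get(context)
        if not counts:
            continue
        total = sum(counts.values())
        for word, c in counts.items():
            scores[word] += weight * c / total
    return scores


corpus = "i like tea i like tea i like milk you like juice".split()
followers = follower_counts(corpus)

# Hypothetical weights: longer contexts count for more.
scores = interpolated_scores(followers, ["you", "like"], weights=[0.1, 0.3, 0.6])
best = [w for w, _ in scores.most_common(2)]
print(best)  # ['juice', 'tea']
```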

Modified Kneser-Ney

N-gram counts were discounted by an amount calculated by comparing them with the n-gram counts in a second sample. Unigram probabilities were calculated based on the number of different words a given word is found after.
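The unigram calculation can be illustrated with the usual Kneser-Ney continuation estimate: the number of distinct words that precede a word, divided by the total number of distinct bigrams. The Python sketch below assumes that formulation; `continuation_probabilities` and the toy corpus are hypothetical, and the discounting step is not shown because the source does not give its details.

```python
from collections import defaultdict


def continuation_probabilities(tokens):
    """P_cont(w) = (distinct words seen before w) / (distinct bigram types)."""
    preceding = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        preceding[word].add(prev)
    total_bigram_types = sum(len(p) for p in preceding.values())
    return {w: len(p) / total_bigram_types for w, p in preceding.items()}


corpus = "san francisco is big and san francisco is far and york is big".split()
probs = continuation_probabilities(corpus)

# "francisco" and "big" each occur twice, but "francisco" only ever follows "san",
# while "is" follows two different words and so scores higher.
print(probs["francisco"] < probs["is"])  # True
```

The effect is that a word which is frequent only inside one fixed phrase is ranked lower as a stand-alone suggestion than a word seen after many different words.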

Model Performance & Selection

It was found that, for the purpose of predicting the next word, there was very little difference in the performance of the various models.

The model finally chosen was the simple backoff model, using the Kneser-Ney method of calculating unigram probabilities. Including 5-grams (n = 5) gave only a marginal improvement (0.1%), so they were dropped to improve efficiency and reduce model size.

[Figure: model performance plot]

Shiny App Demonstration

The user interface for the Shiny app was written in HTML and JavaScript to allow full control over the design and server calls.

Typing (or pasting) a phrase results in a call to the server script (written in R) to predict the five most probable next words.

These are then used to update the word menu.