Introduction

A large amount of text was collected from three sources: Twitter, blogs, and news articles.

This text was analysed to determine the frequency of words and bigrams (pairs of words). The aim is to use this information to predict the next word as someone is writing.
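The counting step can be sketched as follows. This is an illustrative Python version (not the code used for the report), with a hypothetical two-line sample standing in for the real corpora:

```python
from collections import Counter

def word_and_bigram_counts(lines):
    """Count word and bigram frequencies over an iterable of text lines."""
    words = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        words.update(tokens)                     # single-word counts
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return words, bigrams

# Tiny illustrative sample, not from the real data
sample = ["the cat sat on the mat", "the cat ran"]
words, bigrams = word_and_bigram_counts(sample)
# words["the"] == 3, bigrams[("the", "cat")] == 2
```

A `Counter` is just a hash map from item to count, so both tables are built in a single pass over the text.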

Source Summary

Source     No. Lines   No. Words    Object size   No. times 'Perplexity' occurs
1 Twitter  2,360,148   30,657,929   301.4 MB      4
2 Blogs      899,288   30,657,929   248.5 MB      4
3 News        77,259   30,657,929    19.2 MB      4

Data Cleaning & Sampling

Before carrying out the word frequency analysis, the data was “cleaned” to remove emoticons, swear words, and any character that is not a letter (a–z).

As the data sources are very large, a 10% sample was taken from each source for the following analysis.
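The cleaning and sampling steps might look like the sketch below. This is an assumed implementation, not the report's actual code; in particular, the swear-word filter is omitted since it would require a profanity word list:

```python
import random
import re

def clean(line):
    """Lower-case and strip everything that is not a letter a-z or a space."""
    line = re.sub(r"[^a-z ]+", " ", line.lower())   # drops punctuation, digits, emoticons
    return re.sub(r" +", " ", line).strip()         # collapse repeated spaces

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, reproducibly via a fixed seed."""
    rng = random.Random(seed)
    return [clean(l) for l in lines if rng.random() < fraction]

# clean("Hello, World! :)") -> "hello world"
```

Sampling line-by-line with a fixed seed keeps the subset reproducible without loading a full source into memory.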

Coverage

Coverage refers to the percentage of words in a text that can be found in a specific list.

Using the Blog source, the number of words required vs coverage of words used is shown below.

Note how quickly the number of words required increases with coverage. Only about 250 words are required to cover 50% of the text, but almost 2,000 are required to cover 75%! The growth is roughly exponential.
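One way to compute such a coverage curve is to sort the vocabulary by frequency and walk down the cumulative totals. A minimal sketch, using a hypothetical frequency table rather than the Blog data:

```python
from collections import Counter

def words_needed(freqs, coverage):
    """Smallest number of most-frequent words whose counts reach `coverage`
    (a fraction between 0 and 1) of all tokens in the text."""
    total = sum(freqs.values())
    running = 0
    ranked = sorted(freqs.items(), key=lambda kv: -kv[1])  # most frequent first
    for i, (_, count) in enumerate(ranked, start=1):
        running += count
        if running / total >= coverage:
            return i
    return len(freqs)

# Hypothetical counts, chosen for easy arithmetic
freqs = Counter({"the": 50, "cat": 25, "sat": 15, "mat": 10})
# words_needed(freqs, 0.50) -> 1   ("the" alone covers 50 of 100 tokens)
# words_needed(freqs, 0.75) -> 2   ("the" + "cat" cover 75 of 100)
```

Evaluating this for a grid of coverage values gives exactly the "words required vs coverage" curve described above.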

Word Frequencies

Bigrams

There are a huge number of possible bigrams (word pairs). Some of the most common are shown below.

In order to predict the next word with accuracy, a large number of bigrams will have to be considered. The graph below shows the number of bigrams required vs coverage.

Approximately 25,000 bigrams will be needed to cover just 50% of those commonly used.

A major consideration will be storage, especially as three- and four-word combinations are considered.

Another consideration will be the speed of searching through large tables.
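One common way to sidestep the search-speed problem is to precompute, for each first word, its most frequent continuation, so each prediction is a single hash-table lookup rather than a scan of the bigram table. A sketch of that idea, with hypothetical bigram counts:

```python
from collections import Counter

def build_predictor(bigram_counts):
    """Map each first word to its most frequent continuation.
    Prediction then costs one O(1) dictionary lookup."""
    best = {}
    for (w1, w2), count in bigram_counts.items():
        if w1 not in best or count > best[w1][1]:
            best[w1] = (w2, count)
    return {w1: w2 for w1, (w2, _) in best.items()}

# Hypothetical counts for illustration
bigrams = Counter({("the", "cat"): 5, ("the", "dog"): 3, ("cat", "sat"): 2})
predictor = build_predictor(bigrams)
# predictor["the"] == "cat"
```

The trade-off is the one noted above: the precomputed table costs extra storage, and it only keeps the single best continuation, so offering several candidate words would require keeping the top few per first word instead.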