Introduction

A large amount of text was collected from three sources: Twitter, blogs, and news articles.

This text was analysed to determine the frequency of words and bigrams (pairs of words). The aim is to use this information to predict the next word as someone is writing.
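The counting step can be sketched as follows. This is an illustrative Python version (not the code used for the report), with a hypothetical two-line sample standing in for the real corpora:

```python
from collections import Counter

def word_and_bigram_counts(lines):
    """Count word and bigram frequencies over an iterable of text lines."""
    words = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        words.update(tokens)                     # single-word counts
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return words, bigrams

# Tiny illustrative sample, not from the real data
sample = ["the cat sat on the mat", "the cat ran"]
words, bigrams = word_and_bigram_counts(sample)
# words["the"] == 3, bigrams[("the", "cat")] == 2
```

A `Counter` is just a hash map from item to count, so both tables are built in a single pass over the text.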

Source Summary

Source     No. Lines   No. Words    Object size   No. times 'Perplexity' occurs
1 Twitter  2,360,148   30,657,929   301.4 MB      4
2 Blogs      899,288   30,657,929   248.5 MB      4
3 News        77,259   30,657,929    19.2 MB      4

Data Cleaning & Sampling

Before carrying out the word frequency analysis, the data was “cleaned” to remove emoticons, swear words, and any character that is not a letter (a–z).

As the data sources are very large, a 10% sample was taken from each source for the following analysis.
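The cleaning and sampling steps might look like the sketch below. This is an assumed implementation, not the report's actual code; in particular, the swear-word filter is omitted since it would require a profanity word list:

```python
import random
import re

def clean(line):
    """Lower-case and strip everything that is not a letter a-z or a space."""
    line = re.sub(r"[^a-z ]+", " ", line.lower())   # drops punctuation, digits, emoticons
    return re.sub(r" +", " ", line).strip()         # collapse repeated spaces

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, reproducibly via a fixed seed."""
    rng = random.Random(seed)
    return [clean(l) for l in lines if rng.random() < fraction]

# clean("Hello, World! :)") -> "hello world"
```

Sampling line-by-line with a fixed seed keeps the subset reproducible without loading a full source into memory.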

Coverage

Coverage refers to the percentage of words in a text that can be found in a specific list.

Using the Blog source, the number of words required vs coverage of words used is shown below.

Note how quickly the number of words required increases with coverage. Only about 250 words are required to cover 50% of the text, but almost 2,000 are required to cover 75%! The growth is roughly exponential.
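One way to compute such a coverage curve is to sort the vocabulary by frequency and walk down the cumulative totals. A minimal sketch, using a hypothetical frequency table rather than the Blog data:

```python
from collections import Counter

def words_needed(freqs, coverage):
    """Smallest number of most-frequent words whose counts reach `coverage`
    (a fraction between 0 and 1) of all tokens in the text."""
    total = sum(freqs.values())
    running = 0
    ranked = sorted(freqs.items(), key=lambda kv: -kv[1])  # most frequent first
    for i, (_, count) in enumerate(ranked, start=1):
        running += count
        if running / total >= coverage:
            return i
    return len(freqs)

# Hypothetical counts, chosen for easy arithmetic
freqs = Counter({"the": 50, "cat": 25, "sat": 15, "mat": 10})
# words_needed(freqs, 0.50) -> 1   ("the" alone covers 50 of 100 tokens)
# words_needed(freqs, 0.75) -> 2   ("the" + "cat" cover 75 of 100)
```

Evaluating this for a grid of coverage values gives exactly the "words required vs coverage" curve described above.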

Word Frequencies

Bigrams

There are a huge number of possible bigrams (word pairs). Some of the most common are shown below.

In order to predict the next word with accuracy, a large number of bigrams will have to be considered. The graph below shows the number of bigrams required vs coverage.

Approximately 25,000 bigrams will be needed to cover just 50% of those commonly used.

A major consideration will be storage, especially as three- and four-word combinations are considered.

Another consideration will be the speed of searching through large tables.
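One common way to sidestep the search-speed problem is to precompute, for each first word, its most frequent continuation, so each prediction is a single hash-table lookup rather than a scan of the bigram table. A sketch of that idea, with hypothetical bigram counts:

```python
from collections import Counter

def build_predictor(bigram_counts):
    """Map each first word to its most frequent continuation.
    Prediction then costs one O(1) dictionary lookup."""
    best = {}
    for (w1, w2), count in bigram_counts.items():
        if w1 not in best or count > best[w1][1]:
            best[w1] = (w2, count)
    return {w1: w2 for w1, (w2, _) in best.items()}

# Hypothetical counts for illustration
bigrams = Counter({("the", "cat"): 5, ("the", "dog"): 3, ("cat", "sat"): 2})
predictor = build_predictor(bigrams)
# predictor["the"] == "cat"
```

The trade-off is the one noted above: the precomputed table costs extra storage, and it only keeps the single best continuation, so offering several candidate words would require keeping the top few per first word instead.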