In the UK, cars more than three years old are required to have an annual test (the MOT test) covering vehicle safety, roadworthiness and emissions. This ongoing project is a study of the MOT test results for cars and similar small vehicles in 2016.
The intention is to carry out some exploratory data analysis before investigating models to predict whether or not a vehicle will pass its MOT given its age, mileage, make, model, location, etc.
The data for this project was obtained as follows:
All archived data was extracted and placed in a "data" folder.
The Shapefile data did not require any further pre-processing.
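For reference, the extraction step can be scripted. Below is a minimal sketch, assuming the downloads are ZIP archives sitting in a local "downloads" folder (the folder and archive locations here are illustrative, not part of the original workflow):
# Illustrative extraction script: assumes ZIP archives in a "downloads" folder
import zipfile
from pathlib import Path

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

for archive in Path("downloads").glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)  # unpack everything into the "data" folder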
As the file contains every MOT test result for every vehicle, it is large (3GB).
The first three lines of the file are shown below:
# Import required libraries
import pandas as pd

# Raw data for 2016 (3GB)
data_file = "data/test_result_2016.txt"

# Print the first three lines of the raw file
with open(data_file, 'r') as data_input:
    for _ in range(3):
        print(data_input.readline())
The MOT test data is stored in a pipe-delimited text file. For practical purposes, and to simulate a survey, a small sample (1.5%) of all car and small-vehicle (test class 4) tests was taken.
This data was then split into four parts:
# Initialize counts
sample_count = 0
testing_development_count = 0
ensemble_count = 0

# Output files
training_file = "data/MOT_result_2016_training.csv"
development_file = "data/MOT_result_2016_development.csv"
ensemble_file = "data/MOT_result_2016_ensemble.csv"
testing_file = "data/MOT_result_2016_testing.csv"

# Sample ratio (very large file, so sample just 1.5%)
sample = 0.015

# Split ratios (10% each for testing & development, 5% for ensemble, 75% left for training)
testing_development_split = 0.1
ensemble_split = 0.05

# Use write mode and headers for 1st dataframe only
mode = 'w'
header = True

# Process the file in chunks
chunksize = 10**5
for chunk in pd.read_csv(data_file, sep='|', chunksize=chunksize,
                         error_bad_lines=False, warn_bad_lines=False):
    if mode == 'w':  # 1st chunk: fix the random state for reproducibility
        chunk = chunk.query('test_class_id == 4').sample(frac=sample, random_state=21)
    else:
        chunk = chunk.query('test_class_id == 4').sample(frac=sample)
    # Split data into training, development, ensemble & testing
    chunk_len = len(chunk)
    testing_development_len = int(chunk_len * testing_development_split)
    ensemble_len = int(chunk_len * ensemble_split)
    # Testing data
    start = 0
    end = testing_development_len
    chunk[start:end].to_csv(testing_file, index=False, mode=mode, header=header)
    # Ensemble data (slices are half-open, so the next slice starts at the previous end)
    start = end
    end = testing_development_len + ensemble_len
    chunk[start:end].to_csv(ensemble_file, index=False, mode=mode, header=header)
    # Development (hyperparameter tuning) data
    start = end
    end = 2 * testing_development_len + ensemble_len
    chunk[start:end].to_csv(development_file, index=False, mode=mode, header=header)
    # Training data (everything that remains)
    start = end
    chunk[start:].to_csv(training_file, index=False, mode=mode, header=header)
    # Append subsequent samples to the CSV files & don't rewrite headers
    mode = 'a'
    header = False
    # Update counts
    sample_count += chunk_len
    testing_development_count += testing_development_len
    ensemble_count += ensemble_len

# Training count is what remains after testing, development & ensemble
training_count = sample_count - 2 * testing_development_count - ensemble_count
print('The number of samples taken was ' + "{:,}".format(sample_count))
print('This was split as follows:')
print(' Training - ' + "{:,}".format(training_count))
print(' Development - ' + "{:,}".format(testing_development_count))
print(' Ensemble - ' + "{:,}".format(ensemble_count))
print(' Testing - ' + "{:,}".format(testing_development_count))
As can be seen, a 1.5% sample still gives a large dataset, which is more than adequate for my purposes.
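As a quick sanity check, the split sizes can also be re-derived from the output files themselves, independently of the counters above. A minimal sketch:
# Re-count the rows written to each file (minus one header line per file)
for name, path in [('Training', training_file), ('Development', development_file),
                   ('Ensemble', ensemble_file), ('Testing', testing_file)]:
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1
    print('{:>12}: {:,}'.format(name, n_rows))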
The processed data contains 14 features (none were dropped) as summarised below.
# Read training data
training_df = pd.read_csv(training_file, parse_dates=['test_date', 'first_use_date'])

# Empty dataframe to store data description
meta_data = pd.DataFrame()

# Loop over each column obtaining data descriptors
for col in training_df.columns:
    col_data = {}
    col_data['Column'] = col
    col_data['Type'] = training_df[col].dtype
    no_unique = len(training_df[col].unique())
    col_data['No. Unique Values'] = no_unique
    no_null = training_df[col].isnull().sum()
    col_data['No. Null Values'] = no_null
    if no_unique < 11:
        col_data['Values'] = training_df[col].unique()
    elif col in ['test_date', 'first_use_date']:
        col_data['Values'] = [d.strftime('%d, %b %Y') for d in training_df.loc[:3, col]] + ['...']
    else:
        col_data['Values'] = training_df.loc[:3, col].tolist() + ['...']
    meta_data = meta_data.append(col_data, ignore_index=True)

meta_data = meta_data[['Column', 'Type', 'No. Null Values', 'No. Unique Values', 'Values']]
meta_data.style
The MOT test results only provide the postcode area (city or area level), e.g. CF = Cardiff.
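The postcode area is simply the leading letters of a full postcode. A minimal illustrative sketch (the example postcode is arbitrary):
import re

def postcode_area(postcode):
    """Return the leading letters of a UK postcode, e.g. 'CF10 1EP' -> 'CF'."""
    match = re.match(r'([A-Z]+)', postcode.strip().upper())
    return match.group(1) if match else None

print(postcode_area('CF10 1EP'))  # -> CF (Cardiff)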
The Postcode Lookup file is much more detailed than required, containing information about individual postcodes (street level).
postcode_file = "data/National_Statistics_Postcode_Lookup_UK.csv"
postcode_df = pd.read_csv(postcode_file)
postcode_df.head(3)
The file was processed to extract only the postcode areas and their basic geographic location.
Only postcodes in England are split into regions, so a Country_Region column was created, holding the region for English postcode areas and the country for Welsh and Scottish ones.
postcode_area_file = "data/Postcode Area.csv"

# Extract the leading letters of the postcode, which form the postcode area
postcode_df['Postcode Area'] = postcode_df['Postcode 3'].str.extract('([A-Z]+)', expand=False)

# Group by postcode area, keeping only the basic geographic location
postcode_df = postcode_df.groupby(['Postcode Area']).first()[['Local Authority Name', 'Country Name', 'Region Name']]

# Create Country_Region column: region for England, country otherwise
england = postcode_df['Country Name'] == 'England'
postcode_df.loc[england, 'Country_Region'] = postcode_df.loc[england, 'Region Name']
postcode_df.loc[~england, 'Country_Region'] = postcode_df.loc[~england, 'Country Name']

# Write to file
postcode_df.to_csv(postcode_area_file)
postcode_df.head(10)
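Downstream, this lookup can be joined onto the MOT sample to give each test a region. A minimal sketch, assuming the MOT data records location in a postcode_area column (that column name is an assumption here, not confirmed above):
# Illustrative join: attach Country_Region to each test via its postcode area
# ('postcode_area' as the MOT column name is an assumption)
area_df = pd.read_csv(postcode_area_file)
located_df = training_df.merge(area_df, how='left',
                               left_on='postcode_area', right_on='Postcode Area')
located_df[['postcode_area', 'Country_Region']].head()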