MOT Vehicle Testing

Introduction

In the UK, cars older than 3 years are required to have annual tests (MOT test) for vehicle safety, roadworthiness and emissions. This ongoing project is a study of the MOT test results for cars and similar small vehicles in 2016.

The intention is carry out some exploratory data analysis before investigating models to predict whether or not a vehicle will pass its MOT given its age, mileage, make, model, location etc..

Data Download

The data for this project was obtained as follows:

Data Pre-Processing

All data in archives was extracted and placed in a "data" folder.

The Shapefiles data did not require any further pre-processing.

MOT Results (2016)

As all MOT test results for all vehicles are provided, this is a large file (3GB).

The first three lines of the file are shown below:

In [2]:
# Import required libraries
import pandas as pd
import re

# Raw data for 2016 (3GB)
data_file = "data/test_result_2016.txt"

with open(data_file, 'r') as data_input:
    for lines in range(3):
        line = data_input.readline()
        print(line)
test_id|vehicle_id|test_date|test_class_id|test_type|test_result|test_mileage|postcode_area|make|model|colour|fuel_type|cylinder_capacity|first_use_date

1645480751|1374211238|2016-01-01|4|NT|P|117033|SM|VOLKSWAGEN|POLO|BLACK|PE|1600|2000-06-23

1393462389|1153769898|2016-01-01|4|NT|P|99292|NE|VOLKSWAGEN|PASSAT|BLUE|DI|1968|2006-11-30

Sampling & Splitting

The MOT test data is stored in a pipe delimited text file. For practical purposes and to simulate a survey, a small sample (1.5%) of all car and small vehicle (test class, 4) tests was taken.

This data was then split into four parts:

  • Training (75%)
  • Development for Hyperparameter tuning (10%)
  • Ensemble for stacking various models if required (5%)
  • Testing, which will put asise (in a vault!) to test the final model of each type. (10%)
In [2]:
# Initialize counts
sample_count = 0
testing_development_count = 0
ensemble_count = 0

# Output files
training_file = "data/MOT_result_2016_training.csv"
development_file = "data/MOT_result_2016_development.csv"
ensemble_file = "data/MOT_result_2016_ensemble.csv"
testing_file = "data/MOT_result_2016_testing.csv"

# Sample Ratio (very large file, so sample just 1.5%)
sample = 0.015

# Split Ratio (10% for development & testing, 5% for ensemble, 75% left for training)
testing_development_split = 0.1
ensemble_split = 0.05

# Use write mode and headers for 1st dataframe only
mode = 'w'
header = True

# Process the file in chunks
chunksize = 10**5

for chunk in pd.read_csv(data_file, sep='|', chunksize=chunksize, error_bad_lines=False, warn_bad_lines=False):
    if mode == 'w': # 1st dataframe sampled, set random state
        chunk = chunk.query('test_class_id == 4').sample(frac=sample, random_state = 21)
    else:
        chunk = chunk.query('test_class_id == 4').sample(frac=sample)
    # Split data into training, development, ensemble & testing
    chunk_len = len(chunk)
    testing_development_len = int(chunk_len * testing_development_split)
    ensemble_len = int(chunk_len * ensemble_split)
    # Testing Data
    start = 0
    end = testing_development_len
    chunk[start:end].to_csv(testing_file, index=False, mode=mode, header=header)
    # Ensemble Data
    start = end + 1
    end = testing_development_len + ensemble_len
    chunk[start:end].to_csv(ensemble_file, index=False, mode=mode, header=header)
    # Development (hyperparameter tuning) Data
    start = end + 1
    end = 2 * testing_development_len + ensemble_len
    chunk[start:end].to_csv(development_file, index=False, mode=mode, header=header)
    # Training Data
    start = end + 1
    chunk[start:].to_csv(training_file, index=False, mode=mode, header=header)
    # Set mode for writing CSV file to append for subsequent samples & don't rewrite headers
    mode = 'a'
    header = False
    # Update counts
    sample_count += chunk_len
    testing_development_count += testing_development_len
    ensemble_count += ensemble_len

# Subtracting testing and development counts to get training counts
training_count = sample_count - 2 * testing_development_count
print('The number of samples taken was ' + "{:,}".format(sample_count))
print('This was split as follows:')
print('  Training - ' + "{:,}".format(training_count))
print('  Development - ' + "{:,}".format(testing_development_count))
print('  Ensemble - ' + "{:,}".format(ensemble_count))
print('  Testing - ' + "{:,}".format(testing_development_count))
The number of samples taken was 534,557
This was split as follows:
  Training - 427,993
  Development - 53,282
  Ensemble - 26,548
  Testing - 53,282

As can be seen, a 1.5% sample still gives a large dataset, which is more than adequate for my purposes.

Data Summary

The processed data contains 14 features (none were dropped) as summarised below.

In [60]:
# Read training data
training_df = pd.read_csv(training_file, parse_dates=['test_date', 'first_use_date'])

# Empty dataframe to store data description
meta_data = pd.DataFrame()

# Loop over each column obtaining data descriptors
for col in training_df.columns:
    col_data = {}
    col_data['Column'] = col
    col_data['Type'] = training_df[col].dtype
    no_unique = len(training_df[col].unique())
    col_data['No. Unique Values'] = no_unique
    no_null = training_df[col].isnull().sum()
    col_data['No. Null Values'] = no_null
    if no_unique < 11:
        col_data['Values'] = training_df[col].unique()
    elif col in ['test_date', 'first_use_date']:
        col_data['Values'] = [d.strftime('%d, %b %Y') for d in training_df.loc[:3, col]] + ['...']
    else:
        col_data['Values'] = training_df.loc[:3, col].tolist() + ['...']
    meta_data = meta_data.append(col_data, ignore_index=True)

meta_data = meta_data[['Column','Type','No. Null Values', 'No. Unique Values', 'Values']]
meta_data.style
Out[60]:
Column Type No. Null Values No. Unique Values Values
0 test_id int64 0 401068 [635544879, 1135632171, 1821645967, 727411847, '...']
1 vehicle_id int64 0 399891 [1123718405, 259476332, 1196670034, 257196698, '...']
2 test_date datetime64[ns] 0 366 ['04, Jan 2016', '04, Jan 2016', '03, Jan 2016', '04, Jan 2016', '...']
3 test_class_id int64 0 1 [4]
4 test_type object 0 3 ['NT' 'RT' 'ES']
5 test_result object 0 5 ['F' 'P' 'PRS' 'ABR' 'ABA']
6 test_mileage float64 3248 149682 [22263.0, 72386.0, 58479.0, 54344.0, '...']
7 postcode_area object 0 119 ['BN', 'WS', 'ME', 'BA', '...']
8 make object 0 429 ['SUZUKI', 'PEUGEOT', 'NISSAN', 'VAUXHALL', '...']
9 model object 0 8511 ['SWIFT SZ3 DDIS', '3008', 'QASHQAI', 'CORSA', '...']
10 colour object 0 19 ['BLUE', 'RED', 'BLUE', 'SILVER', '...']
11 fuel_type object 0 10 ['DI' 'PE' 'HY' 'EL' 'LP' 'OT' 'ED' 'FC' 'GB' 'LN']
12 cylinder_capacity float64 414 1211 [1248.0, 1560.0, 1461.0, 1364.0, '...']
13 first_use_date datetime64[ns] 1 10442 ['26, Sep 2013', '02, Mar 2010', '28, Jun 2010', '16, May 2007', '...']

Postcode Lookup

The MOT test results only provide the postcode area (city or area level). e.g. CF = Cardiff.

The Postcode Lookup file is much more detailed than is required, containing information about individual Postcodes (street level).

In [4]:
postcode_file = "data/National_Statistics_Postcode_Lookup_UK.csv"

postcode_df = pd.read_csv(postcode_file)
postcode_df.head(3)
Out[4]:
Postcode 1 Postcode 2 Postcode 3 Date Introduced User Type Easting Northing Positional Quality County Code County Name ... Middle Super Output Area Code Middle Super Output Area Name Output Area Classification Code Output Area Classification Name Longitude Latitude Spatial Accuracy Last Uploaded Location Socrata ID
0 S20 6RU S20 6RU S20 6RU 06-1997 0 441432.0 382852.0 1 E99999999 (pseudo) England (UA/MD/LB) ... E02001671 NaN 6B3 Suburbanites;Semi-detached suburbia;Semi-detac... -1.379193 53.340953 Postcode Level 21/09/2017 (53.340953, -1.379193) 1311804
1 TW4 7BD TW4 7BD TW4 7BD 01-1980 0 512373.0 175453.0 1 E99999999 (pseudo) England (UA/MD/LB) ... E02000541 NaN 4B1 Multicultural metropolitans;Challenged Asian t... -0.383652 51.466899 Postcode Level 21/09/2017 (51.466899, -0.383652) 1630253
2 GU513ZQ GU51 3ZQ GU51 3ZQ 05-2003 1 481569.0 155037.0 1 E10000014 Hampshire ... E02004757 NaN 6B4 Suburbanites;Semi-detached suburbia;Older work... -0.831674 51.288637 Postcode Level 21/09/2017 (51.288637, -0.831674) 652126

3 rows × 36 columns

Postcode Area Extraction

The file was processed to extract the postcode areas and basic geographic location only.

Postcodes in England only are split into regions, so a Country_Region column was created with regions for England, and Country for Wales & Scotland.

In [5]:
postcode_area_file = "data/Postcode Area.csv"

# Extract 1st letters of postcode, which is the postcode area 
postcode_df['Postcode Area'] = (postcode_df['Postcode 3'].str.extract('([A-Z]+)', expand=True))

# Group by postcode area, keeping only basic geographic location
postcode_df = postcode_df.groupby(['Postcode Area']).first()[['Local Authority Name', 'Country Name', 'Region Name']]

# Create Country_Region column
postcode_df.loc[postcode_df['Country Name'] == 'England','Country_Region'] = postcode_df.loc[postcode_df['Country Name'] == 'England','Region Name']
postcode_df.loc[postcode_df['Country Name'] != 'England','Country_Region'] = postcode_df.loc[postcode_df['Country Name'] != 'England','Country Name']

# Write to file
postcode_df.to_csv(postcode_area_file)
postcode_df.head(10)
Out[5]:
Local Authority Name Country Name Region Name Country_Region
Postcode Area
AB Aberdeen City Scotland NaN Scotland
AL St Albans England East of England East of England
B Birmingham England West Midlands West Midlands
BA Bath and North East Somerset England South West South West
BB Hyndburn England North West North West
BD Bradford England Yorkshire and The Humber Yorkshire and The Humber
BH Poole England South West South West
BL Bolton England North West North West
BN Lewes England South East South East
BR Bromley England London London