In the UK, cars more than three years old are required to have an annual test (the MOT test) covering vehicle safety, roadworthiness and emissions. This ongoing project is a study of the MOT test results for cars and similar small vehicles in 2016.
The intention is to carry out some exploratory data analysis before investigating models to predict whether or not a vehicle will pass its MOT given its age, mileage, make, model, location, etc.
The data for this project was obtained as follows:
All archived data was extracted and placed in a "data" folder.
The Shapefile data did not require any further pre-processing.
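For reference, the extraction step can be scripted. Below is a minimal sketch, assuming the downloads are ZIP archives sitting in a local "downloads" folder (the folder and archive locations here are illustrative, not part of the original workflow):
# Illustrative extraction script: assumes ZIP archives in a "downloads" folder
import zipfile
from pathlib import Path

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

for archive in Path("downloads").glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)  # unpack everything into the "data" folder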
As the file contains every MOT test result for every vehicle, it is large (3GB).
The first three lines of the file are shown below:
# Import required libraries
import pandas as pd

# Raw data for 2016 (3GB)
data_file = "data/test_result_2016.txt"

# Print the first three lines of the raw file
with open(data_file, 'r') as data_input:
    for _ in range(3):
        print(data_input.readline())
The MOT test data is stored in a pipe-delimited text file. For practical purposes, and to simulate a survey, a small sample (1.5%) of all car and small-vehicle (test class 4) tests was taken.
This data was then split into four parts:
# Initialize counts
sample_count = 0
testing_development_count = 0
ensemble_count = 0

# Output files
training_file = "data/MOT_result_2016_training.csv"
development_file = "data/MOT_result_2016_development.csv"
ensemble_file = "data/MOT_result_2016_ensemble.csv"
testing_file = "data/MOT_result_2016_testing.csv"

# Sample ratio (very large file, so sample just 1.5%)
sample = 0.015

# Split ratios (10% each for testing & development, 5% for ensemble, 75% left for training)
testing_development_split = 0.1
ensemble_split = 0.05

# Use write mode and headers for 1st dataframe only
mode = 'w'
header = True

# Process the file in chunks
chunksize = 10**5
for chunk in pd.read_csv(data_file, sep='|', chunksize=chunksize,
                         error_bad_lines=False, warn_bad_lines=False):
    if mode == 'w':  # 1st chunk: fix the random state for reproducibility
        chunk = chunk.query('test_class_id == 4').sample(frac=sample, random_state=21)
    else:
        chunk = chunk.query('test_class_id == 4').sample(frac=sample)
    # Split data into training, development, ensemble & testing
    chunk_len = len(chunk)
    testing_development_len = int(chunk_len * testing_development_split)
    ensemble_len = int(chunk_len * ensemble_split)
    # Testing data
    start = 0
    end = testing_development_len
    chunk[start:end].to_csv(testing_file, index=False, mode=mode, header=header)
    # Ensemble data (slices are half-open, so the next slice starts at the previous end)
    start = end
    end = testing_development_len + ensemble_len
    chunk[start:end].to_csv(ensemble_file, index=False, mode=mode, header=header)
    # Development (hyperparameter tuning) data
    start = end
    end = 2 * testing_development_len + ensemble_len
    chunk[start:end].to_csv(development_file, index=False, mode=mode, header=header)
    # Training data (everything that remains)
    start = end
    chunk[start:].to_csv(training_file, index=False, mode=mode, header=header)
    # Append subsequent samples to the CSV files & don't rewrite headers
    mode = 'a'
    header = False
    # Update counts
    sample_count += chunk_len
    testing_development_count += testing_development_len
    ensemble_count += ensemble_len

# Training count is what remains after testing, development & ensemble
training_count = sample_count - 2 * testing_development_count - ensemble_count
print('The number of samples taken was ' + "{:,}".format(sample_count))
print('This was split as follows:')
print(' Training - ' + "{:,}".format(training_count))
print(' Development - ' + "{:,}".format(testing_development_count))
print(' Ensemble - ' + "{:,}".format(ensemble_count))
print(' Testing - ' + "{:,}".format(testing_development_count))
As can be seen, a 1.5% sample still gives a large dataset, which is more than adequate for my purposes.
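As a quick sanity check, the split sizes can also be re-derived from the output files themselves, independently of the counters above. A minimal sketch:
# Re-count the rows written to each file (minus one header line per file)
for name, path in [('Training', training_file), ('Development', development_file),
                   ('Ensemble', ensemble_file), ('Testing', testing_file)]:
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1
    print('{:>12}: {:,}'.format(name, n_rows))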
The processed data contains 14 features (none were dropped) as summarised below.
# Read training data
training_df = pd.read_csv(training_file, parse_dates=['test_date', 'first_use_date'])

# Empty dataframe to store data description
meta_data = pd.DataFrame()

# Loop over each column obtaining data descriptors
for col in training_df.columns:
    col_data = {}
    col_data['Column'] = col
    col_data['Type'] = training_df[col].dtype
    no_unique = len(training_df[col].unique())
    col_data['No. Unique Values'] = no_unique
    no_null = training_df[col].isnull().sum()
    col_data['No. Null Values'] = no_null
    if no_unique < 11:
        col_data['Values'] = training_df[col].unique()
    elif col in ['test_date', 'first_use_date']:
        col_data['Values'] = [d.strftime('%d, %b %Y') for d in training_df.loc[:3, col]] + ['...']
    else:
        col_data['Values'] = training_df.loc[:3, col].tolist() + ['...']
    meta_data = meta_data.append(col_data, ignore_index=True)

meta_data = meta_data[['Column', 'Type', 'No. Null Values', 'No. Unique Values', 'Values']]
meta_data.style
The MOT test results only provide the postcode area (city or area level), e.g. CF = Cardiff.
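The postcode area is simply the leading letters of a full postcode. A minimal illustrative sketch (the example postcode is arbitrary):
import re

def postcode_area(postcode):
    """Return the leading letters of a UK postcode, e.g. 'CF10 1EP' -> 'CF'."""
    match = re.match(r'([A-Z]+)', postcode.strip().upper())
    return match.group(1) if match else None

print(postcode_area('CF10 1EP'))  # -> CF (Cardiff)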
The Postcode Lookup file is much more detailed than required, containing information about individual postcodes (street level).
postcode_file = "data/National_Statistics_Postcode_Lookup_UK.csv"
postcode_df = pd.read_csv(postcode_file)
postcode_df.head(3)
The file was processed to extract only the postcode areas and their basic geographic location.
Only postcodes in England are split into regions, so a Country_Region column was created, holding the region for English postcode areas and the country for Welsh and Scottish ones.
postcode_area_file = "data/Postcode Area.csv"

# Extract the leading letters of the postcode, which form the postcode area
postcode_df['Postcode Area'] = postcode_df['Postcode 3'].str.extract('([A-Z]+)', expand=False)

# Group by postcode area, keeping only the basic geographic location
postcode_df = postcode_df.groupby(['Postcode Area']).first()[['Local Authority Name', 'Country Name', 'Region Name']]

# Create Country_Region column: region for England, country otherwise
england = postcode_df['Country Name'] == 'England'
postcode_df.loc[england, 'Country_Region'] = postcode_df.loc[england, 'Region Name']
postcode_df.loc[~england, 'Country_Region'] = postcode_df.loc[~england, 'Country Name']

# Write to file
postcode_df.to_csv(postcode_area_file)
postcode_df.head(10)
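Downstream, this lookup can be joined onto the MOT sample to give each test a region. A minimal sketch, assuming the MOT data records location in a postcode_area column (that column name is an assumption here, not confirmed above):
# Illustrative join: attach Country_Region to each test via its postcode area
# ('postcode_area' as the MOT column name is an assumption)
area_df = pd.read_csv(postcode_area_file)
located_df = training_df.merge(area_df, how='left',
                               left_on='postcode_area', right_on='Postcode Area')
located_df[['postcode_area', 'Country_Region']].head()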