Analysis

Results

1st Round

In the 1st round, the model only had two convolutional layers. This was extended to three in later rounds.

The model is currently being trained on a CPU, which is why the number of layers has been limited.

Table of results

In [2]:
import pandas as pd
hyperparam_df = pd.read_csv('./log/model_hyperparam_1.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[2]:
batch_size best_acc best_loss filters_1 filters_2 full_hidd_1 full_hidd_2 ksize_1 ksize_2 learning_rate no_epochs
6 256.0 0.6085 1.124841 64.0 96.0 125.0 125.0 5.0 5.0 0.001 18.0
5 128.0 0.6002 1.142996 32.0 64.0 100.0 125.0 3.0 5.0 0.001 20.0
0 32.0 0.5978 1.147990 32.0 96.0 100.0 80.0 3.0 4.0 0.001 19.0
25 64.0 0.5903 1.196584 64.0 96.0 100.0 125.0 3.0 3.0 0.001 23.0
3 256.0 0.5806 1.192041 32.0 48.0 60.0 125.0 3.0 3.0 0.001 24.0
2 32.0 0.5749 1.211263 16.0 32.0 60.0 100.0 5.0 4.0 0.003 35.0
10 64.0 0.5735 1.226173 32.0 96.0 60.0 100.0 3.0 3.0 0.001 20.0
18 64.0 0.5606 1.266603 32.0 64.0 60.0 100.0 5.0 4.0 0.003 21.0
12 256.0 0.5598 1.247521 8.0 32.0 125.0 80.0 5.0 3.0 0.003 30.0
26 64.0 0.5589 1.245890 32.0 48.0 125.0 80.0 4.0 5.0 0.003 18.0
4 128.0 0.5307 1.304451 8.0 96.0 125.0 100.0 4.0 4.0 0.003 23.0
24 128.0 0.5296 1.316891 8.0 64.0 100.0 125.0 3.0 4.0 0.003 18.0
16 32.0 0.5266 1.396713 16.0 64.0 60.0 80.0 5.0 4.0 0.003 28.0
21 256.0 0.5261 1.326381 8.0 48.0 125.0 125.0 4.0 4.0 0.001 20.0
8 128.0 0.4189 1.511615 16.0 96.0 125.0 80.0 4.0 4.0 0.010 20.0
28 64.0 0.4175 1.584211 16.0 32.0 125.0 80.0 3.0 3.0 0.010 32.0
22 64.0 0.1001 2.302906 8.0 64.0 100.0 125.0 3.0 5.0 0.010 14.0
15 32.0 0.1000 2.303990 16.0 32.0 125.0 80.0 4.0 3.0 0.030 12.0
17 256.0 0.1000 2.302994 8.0 64.0 125.0 80.0 3.0 4.0 0.030 12.0
1 32.0 0.1000 2.302861 32.0 64.0 60.0 125.0 4.0 3.0 0.010 23.0
19 256.0 0.1000 2.303015 32.0 48.0 60.0 125.0 4.0 4.0 0.030 13.0
20 128.0 0.1000 2.303385 64.0 96.0 60.0 100.0 3.0 4.0 0.030 12.0
13 128.0 0.1000 2.302733 8.0 48.0 100.0 80.0 4.0 4.0 0.010 12.0
23 32.0 0.1000 2.303946 8.0 32.0 60.0 100.0 3.0 4.0 0.030 12.0
11 32.0 0.1000 2.303967 16.0 96.0 125.0 80.0 3.0 3.0 0.030 12.0
9 32.0 0.1000 2.303927 8.0 32.0 125.0 125.0 3.0 3.0 0.030 12.0
7 32.0 0.1000 2.303434 32.0 96.0 60.0 80.0 5.0 5.0 0.030 13.0
27 128.0 0.1000 2.302751 8.0 96.0 60.0 125.0 4.0 5.0 0.010 12.0
14 32.0 0.1000 2.303959 8.0 96.0 100.0 125.0 3.0 3.0 0.030 12.0
In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

# Mean test accuracy for each learning rate
mean_acc = hyperparam_df.groupby("learning_rate")[["best_acc"]].mean()
sns.barplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999")
plt.title("Test Accuracy vs. Learning Rate")
plt.ylabel("Test Accuracy")
plt.xlabel("Learning Rate")
plt.show()

Learning rates of 0.01 and 0.03 are clearly not converging, so they were eliminated from further search rounds.

Accuracy only reached about 60% for this two-layer model. Training accuracy was not much higher, which suggests overfitting was not the problem.

Architectures with more filters in the convolutional layers and more nodes in the fully connected layers tend to perform better, so a 3rd convolutional layer was added in the next round. As the model is currently trained on a CPU, it is not practical to add more.
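One quick way to illustrate this trend, using the round-1 dataframe already loaded above, is to look at the correlation between each size-related hyperparameter and the best test accuracy. This is a minimal, illustrative check and was not part of the original analysis:

# Rough check of the trend: correlation of architecture size with test accuracy.
# Exclude learning rates of 0.01 and above, which did not converge reliably.
converged = hyperparam_df[hyperparam_df.learning_rate < 0.01]
size_cols = ["filters_1", "filters_2", "full_hidd_1", "full_hidd_2", "best_acc"]
print(converged[size_cols].corr()["best_acc"].sort_values(ascending=False))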

Alternative activation functions and optimization algorithms will also be included in future searches.

2nd Round

In addition to the number of filters, kernel sizes, etc., the following options were added as hyperparameters (a sampling sketch is given below):

  • Activation functions (relu, leaky relu or elu)
  • Optimization algorithms (Adam, RMSProp or Nesterov)
  • Using a reduced image patch by removing border pixels (0, 1 or 2)
  • Momentum for Batch Normalization

A 3rd convolutional layer was also added.
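The sketch below shows how one setting might be drawn at random for the 2nd-round search. The option lists are inferred from the results table that follows and are not necessarily the exact values used in the original search:

import random

# Illustrative sketch of one randomized-search draw for the 2nd round.
# Option lists are inferred from the results table, not the original configuration.
search_space = {
    "activation":      ["relu", "lrelu", "elu"],
    "optimizer":       ["adam", "rmsprop", "nesterov"],
    "patch_reduction": [0, 1, 2],
    "momentum":        [0.90, 0.95, 0.99],        # batch normalization momentum
    "batch_size":      [32, 64, 128, 256],
    "learning_rate":   [0.001, 0.0015, 0.002, 0.003],
    "filters1":        [16, 32, 64, 96],
    "filters2":        [48, 64, 96, 128],
    "filters3":        [64, 96, 128],
    "ksize1":          [3, 4, 5],
    "ksize2":          [3, 4, 5],
    "ksize3":          [3, 4, 5],
    "full_hidd1":      [60, 100, 125],
    "full_hidd2":      [80, 100, 125],
}

def sample_hyperparams(space):
    """Draw one random combination from the search space."""
    return {name: random.choice(options) for name, options in space.items()}

print(sample_hyperparams(search_space))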

Table of results

In [115]:
hyperparam_df = pd.read_csv('./log/model_hyperparam_2.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[115]:
activation batch_size best_acc best_loss best_train_acc best_train_loss filters1 filters2 filters3 full_hidd1 full_hidd2 ksize1 ksize2 ksize3 learning_rate logdir momentum no_epochs optimizer patch_reduction
36 relu 64 0.7117 0.919377 0.754637 0.711197 96 96 64 100 125 3 5 3 0.0020 ./log/run-20180128T123115/ 0.90 12 adam 0
35 lrelu 128 0.7109 0.906563 0.758348 0.701145 32 64 64 125 80 5 3 5 0.0020 ./log/run-20180128T120818/ 0.99 13 adam 0
37 relu 128 0.7093 0.897427 0.743490 0.734613 96 128 96 100 100 5 3 4 0.0020 ./log/run-20180128T133019/ 0.90 11 adam 0
45 relu 32 0.7093 0.873374 0.722251 0.805403 96 64 64 60 80 5 4 4 0.0030 ./log/run-20180128T174207/ 0.99 12 adam 2
43 elu 256 0.7033 0.910308 0.734158 0.771317 96 48 128 100 125 5 5 4 0.0015 ./log/run-20180128T162309/ 0.90 13 adam 0
24 relu 128 0.7008 1.032702 0.755814 0.708703 96 96 128 60 125 5 5 5 0.0030 ./log/run-20180128T080126/ 0.95 13 adam 2
3 relu 32 0.6966 0.940112 0.704069 0.840530 64 48 64 125 125 4 5 5 0.0030 ./log/run-20180127T213021/ 0.99 12 adam 1
7 lrelu 32 0.6947 0.906030 0.692000 0.867232 96 48 96 60 80 3 5 3 0.0020 ./log/run-20180127T225520/ 0.95 10 adam 0
16 relu 64 0.6923 0.946625 0.724659 0.792448 16 64 128 100 125 5 5 5 0.0030 ./log/run-20180128T035715/ 0.95 12 adam 1
19 lrelu 32 0.6911 0.905265 0.746788 0.762321 32 128 128 100 125 4 5 5 0.0015 ./log/run-20180128T043540/ 0.90 10 nesterov 0
48 relu 128 0.6898 0.896807 0.708308 0.837255 64 48 64 60 100 3 3 5 0.0010 ./log/run-20180128T190408/ 0.95 11 adam 0
41 elu 32 0.6891 0.934515 0.723434 0.806494 32 64 64 125 80 3 4 5 0.0030 ./log/run-20180128T155401/ 0.95 11 nesterov 1
17 elu 32 0.6870 0.904974 0.698303 0.855170 32 48 96 60 125 4 4 3 0.0010 ./log/run-20180128T041125/ 0.99 10 adam 0
8 lrelu 32 0.6831 0.923607 0.696061 0.872021 96 64 64 100 125 4 5 3 0.0010 ./log/run-20180127T233311/ 0.90 11 nesterov 0
20 lrelu 256 0.6810 0.904818 0.669038 0.952634 64 128 128 100 80 5 3 4 0.0020 ./log/run-20180128T052017/ 0.95 10 adam 0
46 relu 64 0.6806 0.998204 0.728308 0.783186 96 64 64 100 125 5 3 4 0.0020 ./log/run-20180128T181510/ 0.99 15 nesterov 1
26 elu 32 0.6806 0.928317 0.682667 0.900012 32 64 96 125 80 4 4 3 0.0030 ./log/run-20180128T091329/ 0.99 10 nesterov 1
21 relu 256 0.6802 1.084211 0.721635 0.786616 64 128 64 125 100 3 3 4 0.0020 ./log/run-20180128T060308/ 0.99 14 rmsprop 0
44 lrelu 64 0.6790 1.079847 0.710308 0.822083 32 96 96 60 100 5 5 5 0.0015 ./log/run-20180128T171420/ 0.99 13 rmsprop 1
34 relu 64 0.6787 1.073231 0.689846 0.880468 32 128 96 100 100 4 3 4 0.0010 ./log/run-20180128T114718/ 0.95 13 rmsprop 1
4 lrelu 64 0.6785 0.969094 0.667692 0.944729 64 96 64 125 80 4 4 4 0.0020 ./log/run-20180127T215818/ 0.90 11 rmsprop 1
13 elu 128 0.6784 0.929740 0.672779 0.942962 32 128 96 60 80 4 3 5 0.0030 ./log/run-20180128T024713/ 0.99 9 adam 1
11 relu 128 0.6781 0.965303 0.671795 0.937197 96 48 96 60 125 4 3 5 0.0020 ./log/run-20180128T021052/ 0.95 10 adam 2
6 relu 128 0.6740 0.989200 0.694608 0.867488 16 64 96 125 100 4 4 4 0.0030 ./log/run-20180127T224516/ 0.95 12 adam 2
31 lrelu 128 0.6733 0.974435 0.642516 1.008214 64 64 64 125 80 3 4 3 0.0030 ./log/run-20180128T105754/ 0.99 11 rmsprop 1
2 lrelu 128 0.6731 0.937059 0.665236 0.944229 16 96 96 125 100 5 3 5 0.0030 ./log/run-20180127T211708/ 0.99 10 adam 2
5 lrelu 32 0.6706 0.967186 0.673030 0.926152 16 128 128 60 80 4 4 4 0.0020 ./log/run-20180127T222852/ 0.90 9 adam 2
25 relu 64 0.6698 0.989890 0.663631 0.940841 16 96 64 60 100 4 4 4 0.0030 ./log/run-20180128T090333/ 0.99 10 adam 2
18 relu 256 0.6678 0.983491 0.673244 0.931913 32 64 96 100 100 3 3 3 0.0030 ./log/run-20180128T042502/ 0.95 11 adam 2
1 elu 64 0.6662 0.978574 0.685880 0.916311 32 96 64 100 80 4 3 3 0.0010 ./log/run-20180127T205428/ 0.90 14 nesterov 0
38 elu 64 0.6650 0.990182 0.666092 0.961927 96 48 64 60 125 5 4 3 0.0010 ./log/run-20180128T141741/ 0.95 10 rmsprop 1
14 lrelu 64 0.6647 0.969241 0.721600 0.819558 96 64 128 60 125 3 4 4 0.0030 ./log/run-20180128T030917/ 0.90 10 nesterov 0
47 relu 64 0.6629 0.958692 0.650000 0.990026 32 64 96 125 80 4 3 4 0.0020 ./log/run-20180128T185436/ 0.95 9 adam 2
10 elu 128 0.6620 0.993874 0.686842 0.904742 64 96 128 60 125 5 4 5 0.0015 ./log/run-20180128T013357/ 0.95 11 nesterov 1
39 elu 256 0.6608 1.003135 0.685734 0.912478 96 128 64 100 125 5 3 5 0.0015 ./log/run-20180128T144602/ 0.99 16 nesterov 1
22 elu 64 0.6566 1.000508 0.682462 0.924811 96 128 96 100 125 5 5 3 0.0030 ./log/run-20180128T064406/ 0.95 8 nesterov 0
9 relu 128 0.6559 1.076769 0.737892 0.768973 96 128 64 60 80 5 5 4 0.0015 ./log/run-20180128T001926/ 0.95 14 nesterov 1
33 lrelu 64 0.6515 1.080226 0.649231 0.997922 16 48 128 60 80 4 3 3 0.0015 ./log/run-20180128T114001/ 0.95 12 rmsprop 2
42 relu 128 0.6476 1.028002 0.641995 1.017022 32 48 96 60 125 4 5 3 0.0020 ./log/run-20180128T160710/ 0.90 13 nesterov 1
49 lrelu 128 0.6435 1.127514 0.728384 0.799779 32 128 128 100 80 4 5 3 0.0015 ./log/run-20180128T192810/ 0.90 15 nesterov 2
27 elu 128 0.6416 1.069904 0.586017 1.185793 96 48 128 125 125 4 5 3 0.0020 ./log/run-20180128T092435/ 0.90 9 rmsprop 1
12 elu 256 0.6406 1.066943 0.660266 0.985012 16 48 128 60 125 5 5 3 0.0030 ./log/run-20180128T023235/ 0.95 16 nesterov 1
23 elu 128 0.6356 1.032984 0.626516 1.079209 96 64 96 100 80 4 3 3 0.0020 ./log/run-20180128T073650/ 0.90 11 nesterov 2
40 relu 256 0.6258 1.061041 0.589693 1.173546 16 48 128 100 100 4 5 3 0.0020 ./log/run-20180128T153954/ 0.90 12 nesterov 0
28 elu 128 0.6224 1.100552 0.636136 1.058336 16 128 64 125 100 4 4 3 0.0030 ./log/run-20180128T095112/ 0.99 10 nesterov 2
32 lrelu 32 0.6164 1.188633 0.608620 1.110225 32 64 128 60 100 3 5 3 0.0030 ./log/run-20180128T112125/ 0.99 14 rmsprop 1
0 relu 128 0.6128 1.097343 0.588790 1.170430 64 96 128 125 80 3 3 3 0.0015 ./log/run-20180127T203523/ 0.90 10 nesterov 2
15 lrelu 32 0.5877 1.281553 0.545697 1.249027 16 64 128 100 80 3 5 3 0.0030 ./log/run-20180128T034821/ 0.99 10 rmsprop 1
29 relu 32 0.5828 1.182317 0.546320 1.258835 64 96 96 100 100 5 4 4 0.0030 ./log/run-20180128T100253/ 0.90 12 rmsprop 2
30 elu 32 0.4889 1.530312 0.479242 1.519294 32 128 128 100 100 5 4 5 0.0030 ./log/run-20180128T103147/ 0.99 9 rmsprop 2

Graphical Analysis

In [108]:
sns.distplot(hyperparam_df.best_acc, hist=False, rug=False)
plt.title("Test Accuracy")
plt.xlabel("Test Accuracy")
plt.show()

The distribution of test accuracy appears to be approximately Gaussian, with a negative skew caused by models that had difficulty converging.

In [109]:
sns.boxplot(y=hyperparam_df.activation, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Activation Function")
plt.xlabel("Test Accuracy")
plt.ylabel("Activation Function")
plt.show()

relu and leaky relu seem to offer better performance than elu, with leaky relu being more consistent.

In [110]:
sns.boxplot(y=hyperparam_df.optimizer, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Optimizer")
plt.xlabel("Test Accuracy")
plt.ylabel("Optimizer")
plt.show()

The Adam optimizer appears to be more successful than Nesterov Accelerated Gradient or RMSProp in this case.
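For reference, a minimal sketch of how the optimizer option might be mapped onto TensorFlow 1.x optimizers is shown below. This assumes the tf.train API and is not the original training code; the momentum argument here is the Nesterov momentum, which is separate from the batch normalization momentum in the results table:

import tensorflow as tf  # TF 1.x API assumed; not the original training code

def build_optimizer(name, learning_rate, nesterov_momentum=0.9):
    """Map the 'optimizer' hyperparameter to a tf.train optimizer (illustrative)."""
    if name == "adam":
        return tf.train.AdamOptimizer(learning_rate)
    if name == "rmsprop":
        return tf.train.RMSPropOptimizer(learning_rate)
    if name == "nesterov":
        # Nesterov Accelerated Gradient via the momentum optimizer
        return tf.train.MomentumOptimizer(learning_rate, nesterov_momentum,
                                          use_nesterov=True)
    raise ValueError("unknown optimizer: %s" % name)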

In [111]:
sns.boxplot(y=hyperparam_df.patch_reduction, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Patch Reduction")
plt.xlabel("Test Accuracy")
plt.ylabel("Patch Reduction")
plt.show()

A patch reduction of 1 means removing the border pixels, reducing the image size from 32x32 to 30x30.

While other CNNs have had success with this approach, it would appear that there is no benefit with the current architecture.
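As an illustrative sketch (not the original pre-processing code), removing r border pixels from each side of an image can be done with simple array slicing:

def reduce_patch(image, r):
    """Remove r border pixels from each side of an HxWxC image (illustrative).

    r=0 leaves a 32x32 image unchanged, r=1 gives 30x30, r=2 gives 28x28.
    """
    if r == 0:
        return image
    return image[r:-r, r:-r, :]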

In [112]:
sns.boxplot(y=hyperparam_df.batch_size, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Batch Size")
plt.xlabel("Test Accuracy")
plt.ylabel("Batch Size")
plt.show()

While a batch size of 32 may have higher median performance, the results are more erratic. A batch size of 64 is capable of achieving good results and is more reliable.

In [89]:
def subplot_performance(hyperparameter, title, ax):
    """Plot mean test accuracy against one hyperparameter on the given axes."""
    mean_acc = hyperparam_df.groupby(hyperparameter)[["best_acc"]].mean()
    sns.pointplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999", ax=ax)
    ax.set_xlabel(title)
    ax.set_ylabel("Mean Test Accuracy")
In [113]:
fig, axes = plt.subplots(2,3,figsize=(12,8))

subplot_performance("filters1","Layer 1 filters", axes[0,0])
subplot_performance("filters2","Layer 2 filters", axes[0,1])
subplot_performance("filters3","Layer 3 filters", axes[0,2])

subplot_performance("ksize1","Layer 1 kernel", axes[1,0])
subplot_performance("ksize2","Layer 2 kernel", axes[1,1])
subplot_performance("ksize3","Layer 3 kernel", axes[1,2])

plt.tight_layout()
plt.show() 

Increasing the number of filters in the first convolutional layer tends to improve performance, but, surprisingly, the next two layers perform more poorly with more filters, perhaps as a result of overfitting.

A 4x4 kernel has slightly better performance on average.

3rd Round

The stride of the 1st CNN layer was reduced from 2 to 1. The resulting increase in the number of parameters to learn greatly increased training time, so only 5 hyperparameter settings were sampled.

Hyperparameter values that tended to underperform in the 2nd round were also eliminated.
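A minimal sketch of a round-3 style convolutional block is given below, assuming the TensorFlow 1.x layers API. It shows the stride-1 convolution together with the dropout regularization noted in the conclusion; it is not the original model code, and the pooling and dropout rate are assumptions:

import tensorflow as tf  # TF 1.x layers API assumed; not the original model code

def conv_block(x, filters, ksize, training, drop_rate=0.5):
    """Stride-1 convolution followed by pooling and dropout (illustrative)."""
    x = tf.layers.conv2d(x, filters=filters, kernel_size=ksize,
                         strides=1, padding="same",          # stride reduced from 2 to 1
                         activation=tf.nn.leaky_relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)   # pooling assumed
    return tf.layers.dropout(x, rate=drop_rate, training=training)  # drop rate assumed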

Table of results

In [4]:
hyperparam_df = pd.read_csv('./log/model_hyperparam_3.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[4]:
activation batch_size best_acc best_loss best_train_acc best_train_loss filters1 filters2 filters3 full_hidd1 full_hidd2 ksize1 ksize2 ksize3 learning_rate logdir momentum no_epochs optimizer patch_reduction
3 lrelu 64 0.8062 0.579502 0.736910 0.751420 96 128 128 125 100 4 5 5 0.0030 ./log/run-20180130T233101/ 0.95 28 adam 0
2 lrelu 64 0.8048 0.587198 0.761165 0.673304 96 96 128 125 125 4 5 5 0.0010 ./log/run-20180130T093755/ 0.95 33 adam 0
4 lrelu 64 0.7894 0.622922 0.721077 0.797818 64 128 128 100 100 5 5 5 0.0030 ./log/run-20180131T132957/ 0.90 25 adam 0
0 lrelu 64 0.7799 0.632761 0.714930 0.794581 64 96 96 125 100 4 5 5 0.0010 ./log/run-20180129T200512/ 0.95 26 adam 0
1 lrelu 64 0.7750 0.649881 0.697231 0.850535 64 96 128 100 125 5 4 5 0.0015 ./log/run-20180130T034133/ 0.95 21 adam 0

Conclusion

With over 15 million possible hyperparameter combinations in the 2nd round, a grid search is clearly impractical. A randomized search allows hyperparameter values to be chosen through a process of elimination and selection.

The 1st round showed that learning rates of 0.01 and above were not converging. A full grid search would have wasted a lot of time attempting these learning rates with other hyperparameter settings.

In the 2nd round, the indication was that a batch size of 64 and the leaky relu activation function were more reliable, so both were fixed in the 3rd round. Having more filters in the 1st convolutional layer also seemed to help, so 16 and 32 were removed as possible values for this setting.

The stride of the 1st convolutional layer was reduced from 2 to 1 in the 3rd round. As there were already signs of possible overfitting in the 2nd round, dropout layers were introduced to regularize the network.

Best test set accuracy in the three rounds was as follows:

  • 1st Round - 60.85% (2 convolutional, 2 fully connected)
  • 2nd Round - 71.17% (3 convolutional, 2 fully connected)
  • 3rd Round - 80.62% (3 convolutional, 2 fully connected, Dropout & stride 1 on all layers)

This is creeping into the bottom of the leaderboard for CIFAR-10, which is satisfying given the simplicity of the model and that the purpose of this study was to investigate hyperparameter search (and learn TensorFlow).

Further work

Training on a GPU, e.g. using Google's cloud service, would allow more layers and a faster analysis of possible models.

Inception modules have been shown to be successful and could be an interesting addition.

Image pre-processing may also offer some advantages. For example, adding a greyscale channel or enhancing contrast.

Hyperparameter searches can also be carried out using a Gaussian process model, which aims to predict which set of hyperparameters has the highest expected improvement in performance over the current best.
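As a rough sketch of what this could look like, scikit-optimize provides a Gaussian-process search; this was not part of the study, the search space below is deliberately reduced, and the train_and_evaluate helper is a hypothetical placeholder:

from skopt import gp_minimize
from skopt.space import Categorical, Real

# Hypothetical, reduced search space for illustration only.
dimensions = [
    Real(1e-4, 3e-3, prior="log-uniform", name="learning_rate"),
    Categorical([32, 64, 96, 128], name="filters1"),
]

def objective(params):
    learning_rate, filters1 = params
    # train_and_evaluate is a hypothetical helper that trains the model with these
    # settings and returns the best test accuracy.
    return 1.0 - train_and_evaluate(learning_rate, filters1)  # minimise (1 - accuracy)

# acq_func="EI" selects the next setting by expected improvement over the current best.
result = gp_minimize(objective, dimensions, n_calls=20, acq_func="EI")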