Analysis

Results

1st Round

In the 1st round, the model only had two convolutional layers. This was extended to three in later rounds.

The model is currently being trained on a CPU, which is why the number of layers has been limited.

Table of results

In [2]:
import pandas as pd
hyperparam_df = pd.read_csv('./log/model_hyperparam_1.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[2]:
batch_size best_acc best_loss filters_1 filters_2 full_hidd_1 full_hidd_2 ksize_1 ksize_2 learning_rate no_epochs
6 256.0 0.6085 1.124841 64.0 96.0 125.0 125.0 5.0 5.0 0.001 18.0
5 128.0 0.6002 1.142996 32.0 64.0 100.0 125.0 3.0 5.0 0.001 20.0
0 32.0 0.5978 1.147990 32.0 96.0 100.0 80.0 3.0 4.0 0.001 19.0
25 64.0 0.5903 1.196584 64.0 96.0 100.0 125.0 3.0 3.0 0.001 23.0
3 256.0 0.5806 1.192041 32.0 48.0 60.0 125.0 3.0 3.0 0.001 24.0
2 32.0 0.5749 1.211263 16.0 32.0 60.0 100.0 5.0 4.0 0.003 35.0
10 64.0 0.5735 1.226173 32.0 96.0 60.0 100.0 3.0 3.0 0.001 20.0
18 64.0 0.5606 1.266603 32.0 64.0 60.0 100.0 5.0 4.0 0.003 21.0
12 256.0 0.5598 1.247521 8.0 32.0 125.0 80.0 5.0 3.0 0.003 30.0
26 64.0 0.5589 1.245890 32.0 48.0 125.0 80.0 4.0 5.0 0.003 18.0
4 128.0 0.5307 1.304451 8.0 96.0 125.0 100.0 4.0 4.0 0.003 23.0
24 128.0 0.5296 1.316891 8.0 64.0 100.0 125.0 3.0 4.0 0.003 18.0
16 32.0 0.5266 1.396713 16.0 64.0 60.0 80.0 5.0 4.0 0.003 28.0
21 256.0 0.5261 1.326381 8.0 48.0 125.0 125.0 4.0 4.0 0.001 20.0
8 128.0 0.4189 1.511615 16.0 96.0 125.0 80.0 4.0 4.0 0.010 20.0
28 64.0 0.4175 1.584211 16.0 32.0 125.0 80.0 3.0 3.0 0.010 32.0
22 64.0 0.1001 2.302906 8.0 64.0 100.0 125.0 3.0 5.0 0.010 14.0
15 32.0 0.1000 2.303990 16.0 32.0 125.0 80.0 4.0 3.0 0.030 12.0
17 256.0 0.1000 2.302994 8.0 64.0 125.0 80.0 3.0 4.0 0.030 12.0
1 32.0 0.1000 2.302861 32.0 64.0 60.0 125.0 4.0 3.0 0.010 23.0
19 256.0 0.1000 2.303015 32.0 48.0 60.0 125.0 4.0 4.0 0.030 13.0
20 128.0 0.1000 2.303385 64.0 96.0 60.0 100.0 3.0 4.0 0.030 12.0
13 128.0 0.1000 2.302733 8.0 48.0 100.0 80.0 4.0 4.0 0.010 12.0
23 32.0 0.1000 2.303946 8.0 32.0 60.0 100.0 3.0 4.0 0.030 12.0
11 32.0 0.1000 2.303967 16.0 96.0 125.0 80.0 3.0 3.0 0.030 12.0
9 32.0 0.1000 2.303927 8.0 32.0 125.0 125.0 3.0 3.0 0.030 12.0
7 32.0 0.1000 2.303434 32.0 96.0 60.0 80.0 5.0 5.0 0.030 13.0
27 128.0 0.1000 2.302751 8.0 96.0 60.0 125.0 4.0 5.0 0.010 12.0
14 32.0 0.1000 2.303959 8.0 96.0 100.0 125.0 3.0 3.0 0.030 12.0
In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

# Mean test accuracy for each learning rate
mean_acc = hyperparam_df.groupby("learning_rate")[["best_acc"]].mean()
sns.barplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999")
plt.title("Test Accuracy vs. Learning Rate")
plt.ylabel("Test Accuracy")
plt.xlabel("Learning Rate")
plt.show()

Learning rates of 0.01 and 0.03 are clearly not converging, so they were eliminated from further search rounds.

Accuracy only reached about 60% for this two-layer model. Training accuracy was not much higher, which suggests overfitting was not the problem.

Architectures with more filters in the convolutional layers and more nodes in the fully connected layers tend to perform better, so a 3rd convolutional layer was added in the next round. As the model is currently trained on a CPU, it is not practical to add more.
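One quick way to illustrate this trend, using the round-1 dataframe already loaded above, is to look at the correlation between each size-related hyperparameter and the best test accuracy. This is a minimal, illustrative check and was not part of the original analysis:

# Rough check of the trend: correlation of architecture size with test accuracy.
# Exclude learning rates of 0.01 and above, which did not converge reliably.
converged = hyperparam_df[hyperparam_df.learning_rate < 0.01]
size_cols = ["filters_1", "filters_2", "full_hidd_1", "full_hidd_2", "best_acc"]
print(converged[size_cols].corr()["best_acc"].sort_values(ascending=False))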

Alternative activation functions and optimization algorithms will also be included in future searches.

2nd Round

In addition to the number of filters, kernel sizes, etc., the following options were added as hyperparameters (a sampling sketch is given below):

  • Activation functions (relu, leaky relu or elu)
  • Optimization algorithms (Adam, RMSProp or Nesterov)
  • Using a reduced image patch by removing border pixels (0, 1 or 2)
  • Momentum for Batch Normalization

A 3rd convolutional layer was also added.
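The sketch below shows how one setting might be drawn at random for the 2nd-round search. The option lists are inferred from the results table that follows and are not necessarily the exact values used in the original search:

import random

# Illustrative sketch of one randomized-search draw for the 2nd round.
# Option lists are inferred from the results table, not the original configuration.
search_space = {
    "activation":      ["relu", "lrelu", "elu"],
    "optimizer":       ["adam", "rmsprop", "nesterov"],
    "patch_reduction": [0, 1, 2],
    "momentum":        [0.90, 0.95, 0.99],        # batch normalization momentum
    "batch_size":      [32, 64, 128, 256],
    "learning_rate":   [0.001, 0.0015, 0.002, 0.003],
    "filters1":        [16, 32, 64, 96],
    "filters2":        [48, 64, 96, 128],
    "filters3":        [64, 96, 128],
    "ksize1":          [3, 4, 5],
    "ksize2":          [3, 4, 5],
    "ksize3":          [3, 4, 5],
    "full_hidd1":      [60, 100, 125],
    "full_hidd2":      [80, 100, 125],
}

def sample_hyperparams(space):
    """Draw one random combination from the search space."""
    return {name: random.choice(options) for name, options in space.items()}

print(sample_hyperparams(search_space))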

Table of results

In [115]:
hyperparam_df = pd.read_csv('./log/model_hyperparam_2.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[115]:
activation batch_size best_acc best_loss best_train_acc best_train_loss filters1 filters2 filters3 full_hidd1 full_hidd2 ksize1 ksize2 ksize3 learning_rate logdir momentum no_epochs optimizer patch_reduction
36 relu 64 0.7117 0.919377 0.754637 0.711197 96 96 64 100 125 3 5 3 0.0020 ./log/run-20180128T123115/ 0.90 12 adam 0
35 lrelu 128 0.7109 0.906563 0.758348 0.701145 32 64 64 125 80 5 3 5 0.0020 ./log/run-20180128T120818/ 0.99 13 adam 0
37 relu 128 0.7093 0.897427 0.743490 0.734613 96 128 96 100 100 5 3 4 0.0020 ./log/run-20180128T133019/ 0.90 11 adam 0
45 relu 32 0.7093 0.873374 0.722251 0.805403 96 64 64 60 80 5 4 4 0.0030 ./log/run-20180128T174207/ 0.99 12 adam 2
43 elu 256 0.7033 0.910308 0.734158 0.771317 96 48 128 100 125 5 5 4 0.0015 ./log/run-20180128T162309/ 0.90 13 adam 0
24 relu 128 0.7008 1.032702 0.755814 0.708703 96 96 128 60 125 5 5 5 0.0030 ./log/run-20180128T080126/ 0.95 13 adam 2
3 relu 32 0.6966 0.940112 0.704069 0.840530 64 48 64 125 125 4 5 5 0.0030 ./log/run-20180127T213021/ 0.99 12 adam 1
7 lrelu 32 0.6947 0.906030 0.692000 0.867232 96 48 96 60 80 3 5 3 0.0020 ./log/run-20180127T225520/ 0.95 10 adam 0
16 relu 64 0.6923 0.946625 0.724659 0.792448 16 64 128 100 125 5 5 5 0.0030 ./log/run-20180128T035715/ 0.95 12 adam 1
19 lrelu 32 0.6911 0.905265 0.746788 0.762321 32 128 128 100 125 4 5 5 0.0015 ./log/run-20180128T043540/ 0.90 10 nesterov 0
48 relu 128 0.6898 0.896807 0.708308 0.837255 64 48 64 60 100 3 3 5 0.0010 ./log/run-20180128T190408/ 0.95 11 adam 0
41 elu 32 0.6891 0.934515 0.723434 0.806494 32 64 64 125 80 3 4 5 0.0030 ./log/run-20180128T155401/ 0.95 11 nesterov 1
17 elu 32 0.6870 0.904974 0.698303 0.855170 32 48 96 60 125 4 4 3 0.0010 ./log/run-20180128T041125/ 0.99 10 adam 0
8 lrelu 32 0.6831 0.923607 0.696061 0.872021 96 64 64 100 125 4 5 3 0.0010 ./log/run-20180127T233311/ 0.90 11 nesterov 0
20 lrelu 256 0.6810 0.904818 0.669038 0.952634 64 128 128 100 80 5 3 4 0.0020 ./log/run-20180128T052017/ 0.95 10 adam 0
46 relu 64 0.6806 0.998204 0.728308 0.783186 96 64 64 100 125 5 3 4 0.0020 ./log/run-20180128T181510/ 0.99 15 nesterov 1
26 elu 32 0.6806 0.928317 0.682667 0.900012 32 64 96 125 80 4 4 3 0.0030 ./log/run-20180128T091329/ 0.99 10 nesterov 1
21 relu 256 0.6802 1.084211 0.721635 0.786616 64 128 64 125 100 3 3 4 0.0020 ./log/run-20180128T060308/ 0.99 14 rmsprop 0
44 lrelu 64 0.6790 1.079847 0.710308 0.822083 32 96 96 60 100 5 5 5 0.0015 ./log/run-20180128T171420/ 0.99 13 rmsprop 1
34 relu 64 0.6787 1.073231 0.689846 0.880468 32 128 96 100 100 4 3 4 0.0010 ./log/run-20180128T114718/ 0.95 13 rmsprop 1
4 lrelu 64 0.6785 0.969094 0.667692 0.944729 64 96 64 125 80 4 4 4 0.0020 ./log/run-20180127T215818/ 0.90 11 rmsprop 1
13 elu 128 0.6784 0.929740 0.672779 0.942962 32 128 96 60 80 4 3 5 0.0030 ./log/run-20180128T024713/ 0.99 9 adam 1
11 relu 128 0.6781 0.965303 0.671795 0.937197 96 48 96 60 125 4 3 5 0.0020 ./log/run-20180128T021052/ 0.95 10 adam 2
6 relu 128 0.6740 0.989200 0.694608 0.867488 16 64 96 125 100 4 4 4 0.0030 ./log/run-20180127T224516/ 0.95 12 adam 2
31 lrelu 128 0.6733 0.974435 0.642516 1.008214 64 64 64 125 80 3 4 3 0.0030 ./log/run-20180128T105754/ 0.99 11 rmsprop 1
2 lrelu 128 0.6731 0.937059 0.665236 0.944229 16 96 96 125 100 5 3 5 0.0030 ./log/run-20180127T211708/ 0.99 10 adam 2
5 lrelu 32 0.6706 0.967186 0.673030 0.926152 16 128 128 60 80 4 4 4 0.0020 ./log/run-20180127T222852/ 0.90 9 adam 2
25 relu 64 0.6698 0.989890 0.663631 0.940841 16 96 64 60 100 4 4 4 0.0030 ./log/run-20180128T090333/ 0.99 10 adam 2
18 relu 256 0.6678 0.983491 0.673244 0.931913 32 64 96 100 100 3 3 3 0.0030 ./log/run-20180128T042502/ 0.95 11 adam 2
1 elu 64 0.6662 0.978574 0.685880 0.916311 32 96 64 100 80 4 3 3 0.0010 ./log/run-20180127T205428/ 0.90 14 nesterov 0
38 elu 64 0.6650 0.990182 0.666092 0.961927 96 48 64 60 125 5 4 3 0.0010 ./log/run-20180128T141741/ 0.95 10 rmsprop 1
14 lrelu 64 0.6647 0.969241 0.721600 0.819558 96 64 128 60 125 3 4 4 0.0030 ./log/run-20180128T030917/ 0.90 10 nesterov 0
47 relu 64 0.6629 0.958692 0.650000 0.990026 32 64 96 125 80 4 3 4 0.0020 ./log/run-20180128T185436/ 0.95 9 adam 2
10 elu 128 0.6620 0.993874 0.686842 0.904742 64 96 128 60 125 5 4 5 0.0015 ./log/run-20180128T013357/ 0.95 11 nesterov 1
39 elu 256 0.6608 1.003135 0.685734 0.912478 96 128 64 100 125 5 3 5 0.0015 ./log/run-20180128T144602/ 0.99 16 nesterov 1
22 elu 64 0.6566 1.000508 0.682462 0.924811 96 128 96 100 125 5 5 3 0.0030 ./log/run-20180128T064406/ 0.95 8 nesterov 0
9 relu 128 0.6559 1.076769 0.737892 0.768973 96 128 64 60 80 5 5 4 0.0015 ./log/run-20180128T001926/ 0.95 14 nesterov 1
33 lrelu 64 0.6515 1.080226 0.649231 0.997922 16 48 128 60 80 4 3 3 0.0015 ./log/run-20180128T114001/ 0.95 12 rmsprop 2
42 relu 128 0.6476 1.028002 0.641995 1.017022 32 48 96 60 125 4 5 3 0.0020 ./log/run-20180128T160710/ 0.90 13 nesterov 1
49 lrelu 128 0.6435 1.127514 0.728384 0.799779 32 128 128 100 80 4 5 3 0.0015 ./log/run-20180128T192810/ 0.90 15 nesterov 2
27 elu 128 0.6416 1.069904 0.586017 1.185793 96 48 128 125 125 4 5 3 0.0020 ./log/run-20180128T092435/ 0.90 9 rmsprop 1
12 elu 256 0.6406 1.066943 0.660266 0.985012 16 48 128 60 125 5 5 3 0.0030 ./log/run-20180128T023235/ 0.95 16 nesterov 1
23 elu 128 0.6356 1.032984 0.626516 1.079209 96 64 96 100 80 4 3 3 0.0020 ./log/run-20180128T073650/ 0.90 11 nesterov 2
40 relu 256 0.6258 1.061041 0.589693 1.173546 16 48 128 100 100 4 5 3 0.0020 ./log/run-20180128T153954/ 0.90 12 nesterov 0
28 elu 128 0.6224 1.100552 0.636136 1.058336 16 128 64 125 100 4 4 3 0.0030 ./log/run-20180128T095112/ 0.99 10 nesterov 2
32 lrelu 32 0.6164 1.188633 0.608620 1.110225 32 64 128 60 100 3 5 3 0.0030 ./log/run-20180128T112125/ 0.99 14 rmsprop 1
0 relu 128 0.6128 1.097343 0.588790 1.170430 64 96 128 125 80 3 3 3 0.0015 ./log/run-20180127T203523/ 0.90 10 nesterov 2
15 lrelu 32 0.5877 1.281553 0.545697 1.249027 16 64 128 100 80 3 5 3 0.0030 ./log/run-20180128T034821/ 0.99 10 rmsprop 1
29 relu 32 0.5828 1.182317 0.546320 1.258835 64 96 96 100 100 5 4 4 0.0030 ./log/run-20180128T100253/ 0.90 12 rmsprop 2
30 elu 32 0.4889 1.530312 0.479242 1.519294 32 128 128 100 100 5 4 5 0.0030 ./log/run-20180128T103147/ 0.99 9 rmsprop 2

Graphical Analysis

In [108]:
sns.distplot(hyperparam_df.best_acc, hist=False, rug=False)
plt.title("Test Accuracy")
plt.xlabel("Test Accuracy")
plt.show()

The distribution of test accuracy appears to be approximately Gaussian, with a negative skew caused by models that had difficulty converging.

In [109]:
sns.boxplot(y=hyperparam_df.activation, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Activation Function")
plt.xlabel("Test Accuracy")
plt.ylabel("Activation Function")
plt.show()

relu and leaky relu seem to offer better performance than elu, with leaky relu being more consistent.

In [110]:
sns.boxplot(y=hyperparam_df.optimizer, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Optimizer")
plt.xlabel("Test Accuracy")
plt.ylabel("Optimizer")
plt.show()

The Adam optimizer appears to be more successful than Nesterov Accelerated Gradient or RMSProp in this case.
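For reference, a minimal sketch of how the optimizer option might be mapped onto TensorFlow 1.x optimizers is shown below. This assumes the tf.train API and is not the original training code; the momentum argument here is the Nesterov momentum, which is separate from the batch normalization momentum in the results table:

import tensorflow as tf  # TF 1.x API assumed; not the original training code

def build_optimizer(name, learning_rate, nesterov_momentum=0.9):
    """Map the 'optimizer' hyperparameter to a tf.train optimizer (illustrative)."""
    if name == "adam":
        return tf.train.AdamOptimizer(learning_rate)
    if name == "rmsprop":
        return tf.train.RMSPropOptimizer(learning_rate)
    if name == "nesterov":
        # Nesterov Accelerated Gradient via the momentum optimizer
        return tf.train.MomentumOptimizer(learning_rate, nesterov_momentum,
                                          use_nesterov=True)
    raise ValueError("unknown optimizer: %s" % name)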

In [111]:
sns.boxplot(y=hyperparam_df.patch_reduction, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Patch Reduction")
plt.xlabel("Test Accuracy")
plt.ylabel("Patch Reduction")
plt.show()

A patch reduction of 1 means removing the border pixels, reducing the image size from 32x32 to 30x30.

While other CNNs have had success with this approach, it would appear that there is no benefit with the current architecture.
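As an illustrative sketch (not the original pre-processing code), removing r border pixels from each side of an image can be done with simple array slicing:

def reduce_patch(image, r):
    """Remove r border pixels from each side of an HxWxC image (illustrative).

    r=0 leaves a 32x32 image unchanged, r=1 gives 30x30, r=2 gives 28x28.
    """
    if r == 0:
        return image
    return image[r:-r, r:-r, :]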

In [112]:
sns.boxplot(y=hyperparam_df.batch_size, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Batch Size")
plt.xlabel("Test Accuracy")
plt.ylabel("Batch Size")
plt.show()

While a batch size of 32 may have higher median performance, the results are more erratic. A batch size of 64 is capable of achieving good results and is more reliable.

In [89]:
def subplot_performance(hyperparameter, title, ax):
    """Plot mean test accuracy against one hyperparameter on the given axes."""
    mean_acc = hyperparam_df.groupby(hyperparameter)[["best_acc"]].mean()
    sns.pointplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999", ax=ax)
    ax.set_xlabel(title)
    ax.set_ylabel("Mean Test Accuracy")
In [113]:
fig, axes = plt.subplots(2,3,figsize=(12,8))

subplot_performance("filters1","Layer 1 filters", axes[0,0])
subplot_performance("filters2","Layer 2 filters", axes[0,1])
subplot_performance("filters3","Layer 3 filters", axes[0,2])

subplot_performance("ksize1","Layer 1 kernel", axes[1,0])
subplot_performance("ksize2","Layer 2 kernel", axes[1,1])
subplot_performance("ksize3","Layer 3 kernel", axes[1,2])

plt.tight_layout()
plt.show() 

Increasing the number of filters in the first convolutional layer tends to improve performance, but, surprisingly, the next two layers perform more poorly with more filters, perhaps as a result of overfitting.

A 4x4 kernel has slightly better performance on average.

3rd Round

The stride of the 1st CNN layer was reduced from 2 to 1. The resulting increase in the number of parameters to learn greatly increased training time, so only 5 hyperparameter settings were sampled.

Hyperparameter values that tended to underperform in the 2nd round were also eliminated.
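A minimal sketch of a round-3 style convolutional block is given below, assuming the TensorFlow 1.x layers API. It shows the stride-1 convolution together with the dropout regularization noted in the conclusion; it is not the original model code, and the pooling and dropout rate are assumptions:

import tensorflow as tf  # TF 1.x layers API assumed; not the original model code

def conv_block(x, filters, ksize, training, drop_rate=0.5):
    """Stride-1 convolution followed by pooling and dropout (illustrative)."""
    x = tf.layers.conv2d(x, filters=filters, kernel_size=ksize,
                         strides=1, padding="same",          # stride reduced from 2 to 1
                         activation=tf.nn.leaky_relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)   # pooling assumed
    return tf.layers.dropout(x, rate=drop_rate, training=training)  # drop rate assumed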

Table of results

In [4]:
hyperparam_df = pd.read_csv('./log/model_hyperparam_3.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
Out[4]:
activation batch_size best_acc best_loss best_train_acc best_train_loss filters1 filters2 filters3 full_hidd1 full_hidd2 ksize1 ksize2 ksize3 learning_rate logdir momentum no_epochs optimizer patch_reduction
3 lrelu 64 0.8062 0.579502 0.736910 0.751420 96 128 128 125 100 4 5 5 0.0030 ./log/run-20180130T233101/ 0.95 28 adam 0
2 lrelu 64 0.8048 0.587198 0.761165 0.673304 96 96 128 125 125 4 5 5 0.0010 ./log/run-20180130T093755/ 0.95 33 adam 0
4 lrelu 64 0.7894 0.622922 0.721077 0.797818 64 128 128 100 100 5 5 5 0.0030 ./log/run-20180131T132957/ 0.90 25 adam 0
0 lrelu 64 0.7799 0.632761 0.714930 0.794581 64 96 96 125 100 4 5 5 0.0010 ./log/run-20180129T200512/ 0.95 26 adam 0
1 lrelu 64 0.7750 0.649881 0.697231 0.850535 64 96 128 100 125 5 4 5 0.0015 ./log/run-20180130T034133/ 0.95 21 adam 0

Conclusion

With over 15 million possible hyperparameter combinations in the 2nd round, a grid search is clearly impractical. A randomized search allows hyperparameter values to be chosen through a process of elimination and selection.

The 1st round showed that learning rates of 0.01 and above were not converging. A full grid search would have wasted a lot of time attempting these learning rates with other hyperparameter settings.

In the 2nd round, the indication was that a batch size of 64 and the leaky relu activation function were more reliable, so both were fixed in the 3rd round. Having more filters in the 1st convolutional layer also seemed to help, so 16 and 32 were removed as possible values for this setting.

The stride of the 1st convolutional layer was reduced from 2 to 1 in the 3rd round. As there were already signs of possible overfitting in the 2nd round, dropout layers were introduced to regularize the network.

Best test set accuracy in the three rounds was as follows:

  • 1st Round - 60.85% (2 convolutional, 2 fully connected)
  • 2nd Round - 71.17% (3 convolutional, 2 fully connected)
  • 3rd Round - 80.62% (3 convolutional, 2 fully connected, Dropout & stride 1 on all layers)

This is creeping into the bottom of the leaderboard for CIFAR-10, which is satisfying given the simplicity of the model and that the purpose of this study was to investigate hyperparameter search (and learn TensorFlow).

Further work

Training on a GPU, e.g. using Google's cloud service, would allow more layers and a faster analysis of possible models.

Inception modules have been shown to be successful and could be an interesting addition.

Image pre-processing may also offer some advantages. For example, adding a greyscale channel or enhancing contrast.

Hyperparameter searches can also be carried out using a Gaussian process model, which aims to predict which set of hyperparameters has the highest expected improvement in performance over the current best.
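As a rough sketch of what this could look like, scikit-optimize provides a Gaussian-process search; this was not part of the study, the search space below is deliberately reduced, and the train_and_evaluate helper is a hypothetical placeholder:

from skopt import gp_minimize
from skopt.space import Categorical, Real

# Hypothetical, reduced search space for illustration only.
dimensions = [
    Real(1e-4, 3e-3, prior="log-uniform", name="learning_rate"),
    Categorical([32, 64, 96, 128], name="filters1"),
]

def objective(params):
    learning_rate, filters1 = params
    # train_and_evaluate is a hypothetical helper that trains the model with these
    # settings and returns the best test accuracy.
    return 1.0 - train_and_evaluate(learning_rate, filters1)  # minimise (1 - accuracy)

# acq_func="EI" selects the next setting by expected improvement over the current best.
result = gp_minimize(objective, dimensions, n_calls=20, acq_func="EI")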