import pandas as pd
hyperparam_df = pd.read_csv('./log/model_hyperparam_1.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
import matplotlib.pyplot as plt
import seaborn as sns
mean_acc = hyperparam_df.groupby("learning_rate")[["best_acc"]].mean()
sns.barplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999")
plt.title("Test Accuracy vs. Learning Rate")
plt.ylabel("Test Accuracy")
plt.xlabel("Learning Rate")
plt.show()
Learning rates of 0.01 and 0.03 clearly failed to converge, so they were eliminated from further search rounds.
Accuracy only reached 60% for this two-layer model. Training accuracy was not much higher, which suggests overfitting was not the problem.
Architectures with more filters in the convolutional layers and more nodes in the fully connected layers tend to perform better, so a third convolutional layer was added in the next round. The model is currently being trained on a CPU, so adding further layers is not practical.
Alternative activation functions and optimization algorithms will also be included in future searches.
In addition to the number of filters, kernel sizes, etc., the following options were added as hyperparameters: activation function, optimizer, patch reduction, and batch size. A third convolutional layer was also added.
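As a rough illustration of how such a randomized search can be set up, the sketch below defines candidate option lists and draws one configuration at random. The specific option values and the search_space / sample_hyperparameters names are illustrative assumptions, not the exact lists used here.
import random

# Illustrative round-2 search space; the option values below are assumptions,
# not the exact lists used in this study.
search_space = {
    "learning_rate":   [0.0003, 0.001, 0.003],
    "filters1":        [16, 32, 64],
    "filters2":        [32, 64, 128],
    "filters3":        [64, 128, 256],
    "ksize1":          [3, 4, 5],
    "ksize2":          [3, 4, 5],
    "ksize3":          [3, 4, 5],
    "activation":      ["relu", "leaky_relu", "elu"],
    "optimizer":       ["adam", "nesterov", "rmsprop"],
    "patch_reduction": [0, 1],
    "batch_size":      [32, 64],
}

def sample_hyperparameters(space):
    # Draw one random configuration: an independent choice per hyperparameter.
    return {name: random.choice(values) for name, values in space.items()}

config = sample_hyperparameters(search_space)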
hyperparam_df = pd.read_csv('./log/model_hyperparam_2.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
sns.kdeplot(hyperparam_df.best_acc)
plt.title("Test Accuracy")
plt.xlabel("Test Accuracy")
plt.show()
The distribution of test accuracy appears approximately Gaussian, with a negative skew coming from models that had difficulty converging.
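To put a rough number on that skew, the sample skewness can be computed directly from the logged accuracies; a negative value corresponds to the long left tail.
from scipy.stats import skew

# Sample skewness of the round-2 test accuracies; a negative value reflects
# the left tail from runs that struggled to converge.
print(skew(hyperparam_df.best_acc.dropna()))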
sns.boxplot(y=hyperparam_df.activation, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Activation Function")
plt.xlabel("Test Accuracy")
plt.ylabel("Activation Function")
plt.show()
ReLU and leaky ReLU seem to offer better performance than ELU, with leaky ReLU being more consistent.
sns.boxplot(y=hyperparam_df.optimizer, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Optimizer")
plt.xlabel("Test Accuracy")
plt.ylabel("Optimizer")
plt.show()
The Adam optimizer algorithm appears to be more successful than Nesterov Accelerated Gradient or RMSProp in this case.
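For reference, here is a minimal sketch of how the sampled activation and optimizer names could be mapped onto TensorFlow, assuming the 1.x-style tf.nn / tf.train API; the actual wiring used for these runs may differ.
import tensorflow as tf

# Map the activation names used in the search log to TensorFlow ops.
ACTIVATIONS = {
    "relu":       tf.nn.relu,
    "leaky_relu": tf.nn.leaky_relu,
    "elu":        tf.nn.elu,
}

def make_optimizer(name, learning_rate):
    # Build the optimizer named in a sampled configuration.
    if name == "adam":
        return tf.train.AdamOptimizer(learning_rate)
    if name == "nesterov":
        return tf.train.MomentumOptimizer(learning_rate, momentum=0.9,
                                          use_nesterov=True)
    if name == "rmsprop":
        return tf.train.RMSPropOptimizer(learning_rate)
    raise ValueError("unknown optimizer: %s" % name)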
sns.boxplot(y=hyperparam_df.patch_reduction, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Patch Reduction")
plt.xlabel("Test Accuracy")
plt.ylabel("Patch Reduction")
plt.show()
A patch reduction of 1 removes the border pixels, reducing the image size from 32x32 to 30x30. While other CNNs have had success with this approach, there appears to be no benefit for the current architecture.
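The cropping itself is straightforward; below is a minimal sketch using NumPy slicing (the reduce_patch name is just for illustration).
import numpy as np

def reduce_patch(images, reduction):
    # Crop `reduction` pixels from every border of a batch of NHWC images.
    # With reduction=1, 32x32 CIFAR-10 images become 30x30.
    if reduction == 0:
        return images
    return images[:, reduction:-reduction, reduction:-reduction, :]

batch = np.zeros((64, 32, 32, 3), dtype=np.float32)
print(reduce_patch(batch, 1).shape)  # (64, 30, 30, 3)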
sns.boxplot(y=hyperparam_df.batch_size, x=hyperparam_df.best_acc, color="#555999", orient="h")
plt.title("Test Accuracy vs. Batch Size")
plt.xlabel("Test Accuracy")
plt.ylabel("Batch Size")
plt.show()
While a batch size of 32 may have a higher median accuracy, its results are more erratic. A batch size of 64 achieves good results more reliably.
def subplot_performance(hyperparameter, title, ax):
    # Plot the mean test accuracy for each value of the given hyperparameter.
    mean_acc = hyperparam_df.groupby(hyperparameter)[["best_acc"]].mean()
    sns.pointplot(x=mean_acc.index, y=mean_acc.best_acc, color="#555999", ax=ax)
    ax.set_xlabel(title)
    ax.set_ylabel("Mean Test Accuracy")
fig, axes = plt.subplots(2,3,figsize=(12,8))
subplot_performance("filters1","Layer 1 filters", axes[0,0])
subplot_performance("filters2","Layer 2 filters", axes[0,1])
subplot_performance("filters3","Layer 3 filters", axes[0,2])
subplot_performance("ksize1","Layer 1 kernel", axes[1,0])
subplot_performance("ksize2","Layer 2 kernel", axes[1,1])
subplot_performance("ksize3","Layer 3 kernel", axes[1,2])
plt.tight_layout()
plt.show()
Increasing the number of filters in the first convolutional layer tends to improve performance, but surprisingly the next two layers perform worse with more filters, perhaps as a result of overfitting.
A 4x4 kernel has slightly better performance on average.
The stride of the first convolutional layer was reduced from 2 to 1. The resulting increase in parameters greatly increased training time, so only 5 hyperparameter settings were sampled.
Hyperparameter values that tended to underperform in the 2nd round were also eliminated.
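The cost of the stride change is easy to see from the output size of the first layer: with SAME padding the feature map grows from 16x16 to 32x32, quadrupling the activations fed to the later layers (and the size of anything fully connected to them). A small sketch of the arithmetic, with an illustrative kernel size:
def conv_output_size(input_size, stride, kernel_size=4, padding="SAME"):
    # Spatial output size of a conv layer under TensorFlow-style padding.
    if padding == "SAME":
        return -(-input_size // stride)  # ceil(input_size / stride)
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(32, stride=2))  # 16
print(conv_output_size(32, stride=1))  # 32 -> 4x as many first-layer activations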
hyperparam_df = pd.read_csv('./log/model_hyperparam_3.csv')
hyperparam_df.sort_values(by = ['best_acc'], ascending = False)
With over 15 million possible hyperparameter combinations in the 2nd round, a grid search is clearly impractical. A randomized search allows some hyperparameter values to be chosen by a process of elimination and selection.
The 1st round showed that learning rates of 0.01 and above were not converging. A full grid search would have wasted a lot of time attempting these learning rates with other hyperparameter settings.
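The size of the full grid is simply the product of the option counts per hyperparameter, which is what makes exhaustive search infeasible. A toy calculation (the counts below are illustrative, not the exact round-2 lists):
from functools import reduce

# Option counts per hyperparameter (illustrative values only).
option_counts = {"learning_rate": 3, "filters1": 3, "filters2": 3, "filters3": 3,
                 "ksize1": 3, "ksize2": 3, "ksize3": 3, "activation": 3,
                 "optimizer": 3, "patch_reduction": 2, "batch_size": 2}

n_grid = reduce(lambda n, c: n * c, option_counts.values(), 1)
print(n_grid)  # 78732 even for this toy space; the real space ran into the millions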
The 2nd round indicated that a batch size of 64 and the leaky ReLU activation function were more reliable, so both were fixed in the 3rd round. Having more filters in the 1st convolutional layer also seemed to help, so 16 and 32 were removed as possible values for this setting.
The stride of the 1st convolutional layer was reduced from 2 to 1 in the 3rd round. As there were already signs of possible overfitting in the 2nd round, dropout layers were introduced to regularize the network.
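A minimal sketch of how a dropout layer can be added to the fully connected block, using the 1.x-style tf.layers API; the layer size and 0.5 rate are illustrative, not the values tuned here.
import tensorflow as tf

def dense_block(x, n_units, is_training):
    # Fully connected layer followed by dropout for regularization.
    x = tf.layers.dense(x, n_units, activation=tf.nn.leaky_relu)
    # Dropout is only active when is_training is True; at test time it is a no-op.
    return tf.layers.dropout(x, rate=0.5, training=is_training)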
Best test set accuracy in the three rounds was as follows:
This is creeping into the bottom of the leaderboard for CIFAR-10, which is satisfying given the simplicity of the model and that the purpose of this study was to investigate hyperparameter search (and to learn TensorFlow).
Training on a GPU, e.g. using Google's cloud service, would allow more layers and a faster analysis of possible models.
Inception modules have been shown to be successful and could be an interesting addition.
Image pre-processing may also offer some advantages, for example adding a greyscale channel or enhancing contrast.
Hyperparameter searches can also be carried out using a Gaussian process model, which aims to predict which set of hyperparameters has the highest expected improvement over the current best.
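As a sketch of that idea: fit a Gaussian process to the (encoded hyperparameters, accuracy) pairs tried so far and score candidate configurations by their expected improvement over the current best. The names X_tried, y_tried and X_candidates below are hypothetical placeholders.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_candidates, best_so_far):
    # Expected improvement of each candidate over the best accuracy seen so far.
    mu, sigma = gp.predict(X_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

# gp = GaussianProcessRegressor().fit(X_tried, y_tried)
# next_config = X_candidates[np.argmax(expected_improvement(gp, X_candidates, y_tried.max()))]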