beta v.0.4.0 (~7.68-7.71 seconds)
Welcome to the release notes! If you're looking for the code/README, check out https://github.com/tysam-code/hlb-CIFAR10
If you're just looking to run the code, use git clone https://github.com/tysam-code/hlb-CIFAR10 && cd hlb-CIFAR10 && python -m pip install -r requirements.txt && python main.py
Summary
Welcome to the v.0.4.0 release of hlb-CIFAR10! In this release, we (somehow) find another large timesave in the model architecture, round a number of hyperparameters, and even remove one or two for good measure. We also rework our existing codebase to introduce an optimization protocol similar to the Lookahead optimizer, and get even more aggressive with our learning schedules. Oh, and of course we clean up some of the formatting a bit, and update some annotations/comments so they're more informative or no longer incorrect. And, bizarrely enough, this update is very special because we do practically all of this by just reorganizing, rearranging, condensing, or removing lines of code. Only two purely novel lines this time*! Wow!
Now, unfortunately, for this release we had set a personal goal to include at least one technique published in the 1990s or early 2000s, but couldn't find anything well-suited to this particular release. That said, it was a productive exercise and should at least indirectly help us on our journey to under 2 seconds of training time.
One final critical note -- for final accuracies on short training runs (<15-20 epochs), please refer to the val_acc column and not the ema_val_acc column. This is due to a quirk that we describe later in the patch notes.
Now, on to the patch notes!
*two short new lines in the EMA, and three total if you consider the statistics calculation reorganization in the dataloader to be novel
Patch Notes
Architecture Changes
- While expanding our 'on-deck' research queue for current and future releases, we accidentally stumbled into a network architecture that converges to a higher accuracy much more quickly than before. What's more, it's a lot simpler, and is only 8 layers!
- As this is a novel architecture (to our knowledge), we are now officially giving it the name it's had as a class name since the start of the codebase -- namely, "SpeedyResNet". The core block is now very simple, with only a one-depth residual branch, and no 'short' version used at all. Check it out here (and see the sketch at the end of this section for the general shape).
- One downside is that it does seem to mildly overfit much more quickly, but I think there are some (possibly) very succinct solutions to this in the future. However, we'll need PyTorch 2.0 to be fully released (and Colab to be appropriately updated) for us to take full advantage of them, due to kernel launch times. All in all, this change provided the largest net speed gain overall. For the longer runs, we now need more regularization -- in this case, cutout -- to keep the network from overfitting. Thankfully, the overfitting does seem to be mild rather than catastrophic (~95.5% vs ~95.8% accuracy on the longer, 90 epoch runs).
- With this change in architecture, the Squeeze-and-Excite layers seem not to be as useful as before. This could be for a variety of reasons, but they were being applied to the second convolution in the residual block, which no longer exists. Whether it's gradient flow, the degrees of freedom offered by a 2-block residual branch, or some other kind of phenomenon -- we don't quite know. All we know at this point is that we can relatively safely remove them without too much of a hit in performance. This takes our speed even further beyond the ~9.92-ish-second starting point from the last patch!
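For the curious, here's a minimal sketch of what a one-depth residual block of this shape can look like in PyTorch. The class name, layer ordering, and activation choice below are our assumptions for the example, not code lifted from the repository -- see main.py for the real SpeedyResNet block:

```python
import torch
from torch import nn

class OneDepthResidualBlock(nn.Module):
    """Illustrative sketch of a one-depth residual block: a single
    conv -> norm -> activation branch added back onto the identity path.
    Layer ordering and activation choice are assumptions for the example."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.norm(self.conv(x)))

# quick shape check
block = OneDepthResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

The whole point is that the residual branch is only a single conv 'deep' -- there's no second conv in the branch, and hence nothing left for the old Squeeze-and-Excite layer to sit on.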
EMA
- We now run the EMA over the entire training run, every 5 steps, with a simple added feature: every time the EMA runs, all of the network's weights (ignoring the running batchnorm averages) are set to the EMA's updated weights. This is effectively analogous to the Lookahead optimizer, but it reuses our existing code, and the momentum is much higher (.98**5 ≈ .9). This seems to have some very interesting impacts on the learning procedure that might come in handy later, but for now, it provided a good accuracy boost for us. (Note: on some post-investigation, it seems that the EMA code is not working properly for the Lookahead-like usecases in this release -- though the performance numbers should still be accurate! Hopefully we can fix this in the future. The below two lines should still be accurate.) A rough sketch of the idea follows at the end of this section.
- In adding this, we discovered that for short-term training usecases, the 'final_lr_ratio' parameter is no longer very useful and can safely be replaced with a value that effectively makes the final lr 0. I believe this is also responsible for some of the overfitting in longer runs, but I think we can hope to address that some in the future.
- One side effect (!!!) of this is that the ema_val_acc is not able to catch up to the final steps of training as quickly in the short training runs, though it tends to do better in the longer training runs. To avoid complexity and any scheduling changes in the short term, we leave it as is and encourage the user to refer to the val_acc column when making their training accuracy judgements.
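To make the EMA/Lookahead-style idea above a bit more concrete, here's a rough sketch (and only a sketch -- as noted above, the actual EMA code has a bug for this usecase in this release, and the class name, argument names, and decay handling here are made up for the example):

```python
import torch

class LookaheadStyleEMA:
    """Sketch: keep an EMA of the model's weights and, every `update_every`
    steps, copy the EMA weights back into the live network, leaving the
    running batchnorm averages alone. Not the repository's implementation."""
    def __init__(self, model, decay=0.9, update_every=5):
        # a decay of ~.9 per update is roughly .98 per step over the
        # 5-step window (.98**5 ≈ .9), as mentioned above
        self.model = model
        self.decay = decay
        self.update_every = update_every
        self.shadow = {
            name: tensor.detach().clone()
            for name, tensor in model.state_dict().items()
            if tensor.dtype.is_floating_point and 'running_' not in name
        }

    @torch.no_grad()
    def maybe_update(self, step):
        if step % self.update_every != 0:
            return
        state = self.model.state_dict()
        for name, shadow_tensor in self.shadow.items():
            live_tensor = state[name]
            # standard EMA update...
            shadow_tensor.mul_(self.decay).add_(live_tensor, alpha=1.0 - self.decay)
            # ...followed by the Lookahead-like copy back into the live net
            live_tensor.copy_(shadow_tensor)
```

In the training loop you'd call something like ema.maybe_update(step) right after the optimizer step; the difference from a plain EMA is that the averaged weights get written straight back into the network being trained.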
Hyperparameters/Misc
- The loss_scale_scaler value has been useful enough to training to earn a move to the hyperparameters block. Increasing it seems to help the resulting network be more robust, though there do seem to be some diminishing returns. It's been in here for a little while -- play around with it and see what you think!
- Lots of hyperparameters are rounded to nice, 'clean' numbers. It's a great thing we currently have the headroom to do this!
- We increase the batchsize to double what it was! This helps us a lot, but did require some retuning to account for the changes in how the network learns at the new size. Testing this configuration, it appears to run well on 6.5 GB or more of GPU memory, keeping it still very suitable for home users and/or researchers (something we'll try to keep as long as possible!). That said, if you're running in constrained memory with a Jupyter notebook, you may need to restart the whole kernel at times, simply because some memory gets freed too late and things get clogged up. Hopefully we'll have a good solution for this in future releases, though no promises. If you find a good, clean one, feel free to open up a PR! :D
- Instead of hardcoding the CIFAR10 statistics up front, we now just use the torch.std_mean function to get them dynamically after the dataset is loaded onto the GPU. It's really fast, simple, and does it all in one line of code (see the sketch just below). What's not to love?
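If you want a feel for that one-liner, a minimal usage sketch (with our own stand-in variable names, not the repository's) looks something like this:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Stand-in for the CIFAR10 training images already loaded onto the GPU
# as a single float tensor of shape (N, C, H, W), scaled to [0, 1].
images = torch.rand(50000, 3, 32, 32, device=device)

# Per-channel std/mean in one call, reducing over the batch and spatial dims.
std, mean = torch.std_mean(images, dim=(0, 2, 3), keepdim=True)

# Normalize with the freshly computed statistics.
images = (images - mean) / std
```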
Scaling and Generalization to Other Similar Datasets
I don't want to spend too much time in this section as this work is very preliminary, but the performance on other datasets of the same size seems very promising, at least. For example, switching over to CIFAR100 takes less than half a minute, and running the same network with the same hyperparameters roughly matches the circa-2015 SOTA for that dataset in the same timeframe (this holds for both CIFAR10 and CIFAR100). If you squint your eyes, smooth the jumps in the progress charts a bit, increase the base_depth of the networks 64->128 and the num_epochs 10->90, and change the ema and regularization in the same way as well (num_epochs 9->78, and cutout 0->11), then on a smoothed version of the SOTA charts we're near early ~2016 for both of them. Effectively -- the hyperparameters and architecture should transfer reasonably well to problems of at least the same size, since identical changes to the network's hyperparameters seem to result in similar changes in performance for the respective datasets. (A rough side-by-side of the short-run vs. long-run settings follows below.)
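To unpack that dense sentence a bit, here's roughly how those numbers read side by side -- the key names and dict layout are illustrative only, not the actual hyperparameter structure in main.py:

```python
# Short runs vs. the longer, higher-accuracy runs described above.
# Key names are made up for the example; check main.py for the real ones.
short_run = {
    'base_depth': 64,
    'num_epochs': 10,
    'ema_epochs': 9,     # EMA schedule value referenced above
    'cutout_size': 0,    # no cutout for the short runs
}

long_run = {
    'base_depth': 128,   # 64 -> 128
    'num_epochs': 90,    # 10 -> 90
    'ema_epochs': 78,    # 9 -> 78
    'cutout_size': 11,   # 0 -> 11, extra regularization for the longer runs
}
```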
By the way, we also slightly upgrade our 'long'-running training results on a slightly larger network, from ~95.77% -> ~95.84% accuracy in 188 -> 172 seconds, though both sets of numbers are rounded a bit toward worse values to slightly underpromise, since the training process is noisy -- and hey, underpromising is not always a bad bet.
I will note that we still lose some performance here relative to the short run's gains over the previous patch, as you may have noticed, and you'll probably see it in the EMA of the network over the long runs. Sometimes it swings a (rather monstrous) .10-.20% up and then back down in the last epochs, from what is most likely overfitting. That said, we do have some improvements over the previous release, and the short runs are the main focus for now. We may have some update focusing on the long runs in the future, but when it's best to do that remains up in the air (since speeding up the network offers raw benefits each time, and that might outweigh the utility of a long-run-only kind of update).
Special Thanks
Special thanks to everyone who's contributed help, insight, thoughts, or who's helped promote this project to other people. Your support has truly helped!
I'd also like to especially extend my thanks to my current Patreon sponsors, 🎉Daniel G.🎉 and 🎉Carter B.🎉 These two helped support a fair bit of the overhead expense that went into securing GPU resources for this project. I used more GPU hours in this last cycle than I have for any release, and it really helped a lot. Many, many thanks to you two for your support!
If you'd like to help assist me in making future software releases, please consider supporting me on Patreon (https://www.patreon.com/tysam). Every bit goes a long way!
If you have any questions or problems, feel free to open up an issue here or reach out to me on the reddit thread. I'm also on Twitter: https://twitter.com/hi_tysam
Thanks again for reading, and I hope this software provides you with a lot of value! It's certainly been quite a bit of work to pull it all together, but I think it's been worth it. Have a good one!