beta v0.2.0 (~12.31-12.38s)
New speed: ~12.31-12.38 seconds on an A100 SXM4 (through Colab). Many, many thanks to @99991 (https://github.com/99991/cifar10-fast-simple) for their help with finding the issues that eventually led to some of these improvements, as well as detailed debugging and verification of results on their end. I encourage you to check out some of their work! <3 :)
Notes
After some NVIDIA kernel profiling, we changed the following:
-- Swap the memory format over to channels_last (currently in beta for PyTorch; see the sketch below this list)
-- Replace nn.AdaptiveMaxPool2d with a similarly-named class wrapping torch.amax (nearly a ~.5 second speedup in total; also sketched below)
-- Replace GhostNorm with a noisier BatchNorm to take advantage of the faster/simpler kernels. This resulted in roughly a 5.25 second (!!!!) speedup over the baseline, but required some parameter tuning to get back to a level of regularization similar to what GhostNorm provided.
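For the curious, the channels_last swap is mostly a matter of putting both the model weights and the input batch into that memory format. Here's a minimal sketch (the tiny model below is just a stand-in, not the actual network in this repo, and it assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

# Stand-in toy convnet -- only here to show the memory format calls.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.GELU(),
).cuda()

# Put both the model's weights and the inputs into channels_last so the
# faster NHWC-strided kernels get picked up.
model = model.to(memory_format=torch.channels_last)

inputs = torch.randn(512, 3, 32, 32, device='cuda')
inputs = inputs.to(memory_format=torch.channels_last)

with torch.no_grad():
    out = model(inputs)  # runs with channels_last-strided tensors under the hood
```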
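And here's a sketch of the pooling swap. The class name and the exact reduction below are assumptions on my part rather than the repo's literal code, but the core of the trick is just replacing the adaptive pooling op with a plain torch.amax reduction:

```python
import torch
import torch.nn as nn

class FastGlobalMaxPooling(nn.Module):
    """Sketch of a global max pool built on torch.amax -- the name and
    details here are illustrative, not the exact class from this repo."""
    def forward(self, x):
        # Roughly equivalent to nn.AdaptiveMaxPool2d(1) followed by a flatten,
        # but it hits a simpler/faster reduction kernel.
        return torch.amax(x, dim=(2, 3))

x = torch.randn(512, 64, 8, 8, device='cuda')
pooled = FastGlobalMaxPooling()(x)  # shape: (512, 64)
```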
In making that swap, we, along with Peter, Ray, Egon, and Winston, helped GhostNorm finally find its peaceful rest. That said, the idea of batch norm noise as a way to help regularize the network does live on, albeit in a strangely different form.
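As for what "batch norm noise" can look like in code: the exact scheme isn't spelled out here, so treat the below as one plausible sketch of the general idea (NoisyBatchNorm2d and noise_std are made-up names), not the parameterization actually used in this release:

```python
import torch
import torch.nn as nn

class NoisyBatchNorm2d(nn.BatchNorm2d):
    """One way to inject GhostNorm-like noise on top of a plain BatchNorm --
    a sketch of the concept, not the repo's actual implementation."""
    def __init__(self, num_features, noise_std=0.1, **kwargs):
        super().__init__(num_features, **kwargs)
        self.noise_std = noise_std  # hypothetical knob for the noise strength

    def forward(self, x):
        out = super().forward(x)  # uses the fast, standard BatchNorm kernel
        if self.training and self.noise_std > 0:
            # Multiplicative per-channel Gaussian noise, loosely mimicking the
            # statistics jitter that GhostNorm's small sub-batches would induce.
            noise = 1 + self.noise_std * torch.randn(
                1, out.shape[1], 1, 1, device=out.device, dtype=out.dtype)
            out = out * noise
        return out
```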
There are many other avenues we can go down in the future to continue speeding up this network and get it under the ~2 second training mark. I always burn myself out with these releases, but I already have a couple of fun things on deck that have been doing well/showing promise!
As always, feel free to support the project through my Patreon, or drop me a line at hire.tysam@gmail.com if you ever want to hire me for up to a part-time number of hours!