
beta v.0.5.0 (~6.98-6.99ish seconds)

Released by @tysam-code on 20 Feb, 13:51

Hello everyone,

One-shot patch notes today, since I'm pretty tired from getting this one to completion. But, in the name of scientific (and otherwise) integrity, let's get some of these changes logged! This is a rougher cut than usual, so apologies in advance for any spelling or other mistakes (I'll update as necessary!).

Changes Summary + Notes

  • While last patch we christened our SpeedyResNet 1.0 architecture, there was in fact at least one more network architecture speedup awaiting us. By halving the depth of the first layer group, we gain a massive speed boost with a fairly mild reduction in performance. Increasing the training time to 12.6 epochs lets us regain that performance while still finishing in just under 7 seconds.

  • Speaking of fractional epochs, we support them now! They added more complexity than was warranted before, but they seem to be very important now. I haven't precisely mapped the exchange rate between training epochs and accuracy, but it does seem to follow some consistent nonlinear law. It's now much easier to make small, incremental improvements without having to 'save up' to remove or add whole epochs in one big jump. I recommend taking it for a spin and playing around! (There's a rough sketch of the step-budget math after this list.)

  • We fixed a bug in which the Lookahead-inspired update strategy with the exponential moving average (EMA) was doing...nothing at all last patch! This was because the updates to the original network were not being applied in-place. That felt embarrassing to learn about! But thankfully we were able to get it working in this patch to good effect. Notably, we do change the paradigm a bit by adding an aggressive warmup schedule that ramps the EMA strength smoothly, primarily in the last few epochs of training, allowing the network to train helter-skelter until the last minute, at which point we strongly self-ensemble with a continuously-decreasing learning rate (see the sketch after this list).

  • We bring back the final_lr_ratio parameter, because unfortunately having the final learning rate go to 0 did not play nicely with an extremely strong EMA at the end. We still need some kind of learning rate to learn something interesting! (Especially as the effective learning rate inside the EMA drops very quickly towards the end of training.) However, we generally left this alone at around .05, as that seemed good enough for our purposes.

  • Cutout is out, and Cutmix has been added into the mix! Theoretically this should help us more, as we're no longer just destroying information to regularize our network (at least not as much as before). This means that our labels are now one-hot arrays that get mixed around too, which opens up some opportunities for fun label-based tricks in the future! We use Cutmix in the shorter runs now, since the accelerated Lookahead-EMA optimization otherwise tends to overfit more quickly at the end. On the whole, the combination seems to be a strong net positive. (A sketch of the Cutmix recipe is included after this list.)

  • In between and after implementing these things (most of which happened near the beginning of the week), we did about 25-30 hours of manual hyperparameter tuning. I do not know how many thousands of runs this was, but it was certainly a lot. I tried implementing my own short genetic algorithm to do live hyperparameter tuning, but unfortunately I am not that well-versed in that field, and the run results are extremely noisy at this point -- even a 25-run battery (~3-4 minutes or so) is only a moderately reliable measure of performance. I got lost in the valleys of trying to tell apart whether the hyperparameter surface was nonlinear, or incredibly noisy, or whether there was a stateful heisenbug of some kind, or all three... In any case, I 'gave up' by increasing the number of epochs at two different points and then tuning to allow for more generous performance headroom. Even if the average is bang-on at 94.00%, the visceral feeling of getting >94% on that first run is an important part of the UX of this repo.

  • An important point that I want to emphasize from the above is that the hyperparameter surface around these peaked areas seems to be very aggressively flat...for the most part. A lot of the hyperparameters really are fine being left where they are when translated into different contexts.

  • One final thing we added was weight normalization on the initial projection and the final linear layer. It seemed to perform well on both, but the p-norm values are empirically-derived magic numbers, so this is slightly messier than I'd want. However, it did let us remove the initial BatchNorm, saving ~.15 seconds or so and letting us break under 7 seconds. I have some very rough initial thoughts on this, but want to try to develop them more over time; hopefully there's a nuanced and predictable way to apply this throughout the network. It didn't work well on the other layers I tried (though it did make for very rapid overfitting when I applied it to certain layers and left the BatchNorms in instead). It seems like BatchNorm really is still doing something very protective for the network during training (we are throwing an awful lot at it, to be sure), and one of the main downsides of most BatchNorm replacements seems to be that they are still very noise-sensitive. That said, noise helped us out when we originally had GhostBatchNorm (it was just too slow for our purposes), so maybe there is something to be said there for the future. (A rough sketch of the weight-normalization idea is also after this list.)

  • Oh, and there is one more thing worth adding to this section: for a tiny (but still good!) speed boost, we set foreach=True in the SGD optimizers. Every bit counts! Thanks to @bonlime for the suggestion/reminder to revisit this -- I'd totally passed over it and written it off initially! (See the one-liner after this list.)
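
Below are a few rough sketches of the ideas above. They are illustrative only, with stand-in constants and simplified shapes, and are not the actual code from the release. First, fractional epochs: the simplest way to support them is to convert the (possibly fractional) epoch count into an integer step budget up front and loop over batches until the budget runs out.

```python
import math

# Fractional-epoch sketch: turn the epoch count into an integer step budget.
# Dataset/batch sizes below are just CIFAR-10-ish illustrative values.
num_epochs = 12.6
train_set_size = 50_000
batch_size = 512
batches_per_epoch = math.ceil(train_set_size / batch_size)
total_train_steps = math.ceil(num_epochs * batches_per_epoch)

step = 0
while step < total_train_steps:
    for _ in range(batches_per_epoch):  # one pass over the training data
        if step >= total_train_steps:
            break
        # ...forward/backward/optimizer update for this batch would go here...
        step += 1
```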
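Next, the Lookahead-flavored EMA with its late ramp, together with the final_lr_ratio floor. The schedule shapes and constants here are assumptions for illustration; the important detail is that the parameter writes happen in place via `.copy_()`.

```python
import copy
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in network
ema_net = copy.deepcopy(net)

max_lr = 10.0           # illustrative value
final_lr_ratio = 0.05   # lr floors at 5% of peak instead of decaying to 0

def lr_at(progress):    # progress runs from 0 to 1 over training
    return max_lr * (1.0 - (1.0 - final_lr_ratio) * progress)

def ema_decay_at(progress):
    # ~0 for most of training (the EMA just tracks the live net), then ramps
    # hard toward a strong value over the last stretch -- the shape is an assumption
    ramp = min(1.0, max(0.0, (progress - 0.85) / 0.15))
    return 0.98 * ramp ** 3

@torch.no_grad()
def lookahead_ema_step(net, ema_net, decay):
    for p, p_ema in zip(net.parameters(), ema_net.parameters()):
        p_ema.copy_(decay * p_ema + (1.0 - decay) * p)  # update the slow/EMA weights
        p.copy_(p_ema)  # write the blend back into the live net, in place --
                        # the kind of write that was silently missing last release
```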
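A Cutmix sketch with one-hot (soft) targets. The box sampling follows the standard CutMix recipe; the exact variant and hyperparameters used in the release may differ.

```python
import torch
import torch.nn.functional as F

def cutmix(images, labels, num_classes, beta=1.0):
    """Standard-recipe CutMix sketch: paste a random patch from a shuffled copy of
    the batch into each image, and mix the one-hot labels by the pasted area.
    `labels` is expected to be a LongTensor of class indices."""
    b, _, h, w = images.shape
    perm = torch.randperm(b, device=images.device)
    lam = torch.distributions.Beta(beta, beta).sample().item()

    # sample a box covering roughly (1 - lam) of the image area
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    # paste the patch in-place from the permuted batch
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]

    # labels become mixed one-hot targets, weighted by the surviving area
    lam_adjusted = 1.0 - ((y2 - y1) * (x2 - x1) / (h * w))
    one_hot = F.one_hot(labels, num_classes).float()
    targets = lam_adjusted * one_hot + (1.0 - lam_adjusted) * one_hot[perm]
    return images, targets
```

With soft targets like these, the loss also needs the soft-label form of cross-entropy (summing -targets * log_softmax(logits)) rather than the usual index-based one.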
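A guess at the general shape of the weight normalization mentioned above: rescaling a layer's weight to unit p-norm at forward time. The p value here stands in for the empirically-derived magic numbers, and the release's actual layers may be structured differently.

```python
import torch.nn as nn
import torch.nn.functional as F

class PNormWeightNormLinear(nn.Linear):
    """Linear layer whose weight rows get rescaled to unit p-norm at forward time.
    A sketch of the idea only, not the release's actual implementation."""
    def __init__(self, in_features, out_features, p=2.0, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.p = p

    def forward(self, x):
        w = F.normalize(self.weight, p=self.p, dim=1)  # normalize each output unit's weights
        return F.linear(x, w, self.bias)

# e.g. a weight-normalized final classifier head (shapes and p are illustrative):
head = PNormWeightNormLinear(512, 10, p=2.0)
```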
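And the foreach one-liner: it's just a keyword argument on the stock PyTorch SGD optimizer. The hyperparameter values below are placeholders.

```python
import torch

params = [torch.randn(10, 10, requires_grad=True)]  # stand-in parameters
# foreach=True routes the update math through fused multi-tensor ops,
# shaving a little optimizer overhead per step
opt = torch.optim.SGD(params, lr=0.1, momentum=0.85, nesterov=True,
                      weight_decay=5e-4, foreach=True)
```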

Moving Forward

I think it might be a longer time before the next direct speed/accuracy/etc update on this project. (Though admittedly, I think I've felt this every time. But then again, all of that hyperparameter tuning was a pretty big wall, and most of my initial and mid-term toolkit is exhausted at this point. It is a good toolkit, though, and I'm very glad to have it.) I wouldn't be surprised if development enters a slower, more analytical phase. I may also take a break just to cleanse my palate from this project, as I've been working quite a lot on it! Quite a lot! There are a few cool options on the table for investigation, but since they A. are things I haven't really done before and don't know the corners/guarantees of, B. could take a longer time to test/implement, and C. aren't guaranteed (or even especially likely) to pay off without sweat/blood/tears or significantly breaking the intentions of the repo, I'm really hesitant to go into them alone.

So, with that said, I think anything that the community contributes would be super helpful! There are a lot of avenues we can tackle to improve this thing. For example, did you know that MaxPooling now takes almost as much time as all of our convolutions added together? That's crazy! And in the backwards pass, the MaxPooling kernels are using 100% of the SM capacity they're allocated, yet it is still quite slow. That should just be a tile-and-multiply with a cached mask under ideal circumstances, right?
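
If you want to poke at this yourself, here's a quick-and-dirty way to time the pooling forward+backward in isolation. The shapes are illustrative rather than the repo's exact ones, and it assumes a CUDA device.

```python
import torch

# time MaxPool2d forward+backward in isolation with CUDA events
x = torch.randn(512, 64, 32, 32, device='cuda', requires_grad=True)
pool = torch.nn.MaxPool2d(2)

for _ in range(10):                      # warmup
    pool(x).sum().backward()
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    pool(x).sum().backward()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.3f} ms per forward+backward")
```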

There's also the option of analyzing the learning dynamics of this thing, which I would love to get some help/eyes on. I have a sneaking suspicion that how this network learns looks very different from how other, more traditional networks learn when viewed through a training-dynamics lens. Whether or not compressing the training process this tightly 'brings certain dynamics into focus', I can't quite say. But I think being able to lay out and start tracing what in the heck is happening, now that the empirical flurry of optimization has happened, can help us from a theoretical standpoint. There is of course the engineering standpoint too, but since we've focused on that a lot, maybe there are some unrealized gains to be had now from a deeper, more thorough analysis.

Many thanks for reading, hope you enjoyed or that this was helpful to you, and if you find anything, feel free to let me know! I'm on twitter, so feel free to reach out if you have any pressing questions (or find something interesting, etc!): https://twitter.com/hi_tysam. You can also open an issue as well, of course.

Look forward to seeing what we all make together! :D :) 🎇 🎉

Special Thanks

Special thanks to Carter B. and Daniel G. for supporting me on Patreon. Y'all rock! Thank you both so very much for your support! The GPU hours helped a ton.

I also want to extend special thanks to Carter B. again, this time for pointing me to a resource that was very helpful in developing this release.

Lastly, many thanks to the people who have been supportive of and helpful with this project in other ways. Your kind words and encouragement have meant a ton to me. Many thanks!

Support

If this software has helped you, please consider supporting me on my [Patreon](https://www.patreon.com/tysam). Every bit counts, and I certainly do use a number of GPU hours! Your support directly helps me create more open source software for the community.