Note that the .ipynb was a Google Colab working file, so all file paths point to Google Drive. The SudaBERT language model files are too large to upload to GitHub; they are available at this Drive link: https://drive.google.com/drive/folders/1GNcccdCfVRfAig8or9hj0eBKTznjDlut?usp=sharing. The Data Analysis file was also a working file on Google Colab, where we worked in order to use the High-RAM feature. Its code is in R, so "%%R" has to be pasted at the top of each cell before running it.
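
Pasting "%%R" into every cell by hand can be tedious, so here is a minimal sketch of one way to automate it. It is not part of the original project: the helper names `add_r_magic` and `patch_file` are hypothetical, and it assumes the notebook follows the usual .ipynb JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`). In Colab, the R magic usually also needs `%load_ext rpy2.ipython` to be run once in an ordinary Python cell first; add that cell after patching, since this sketch would otherwise prefix it too.

```python
import json


def add_r_magic(notebook: dict) -> dict:
    """Prefix every non-empty code cell with %%R unless it already has it."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave markdown and raw cells untouched
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or one string.
        text = "".join(source) if isinstance(source, list) else source
        if not text.strip() or text.lstrip().startswith("%%R"):
            continue  # skip empty cells and cells that are already marked
        # The cell magic must be the first line of the cell.
        cell["source"] = ("%%R\n" + text).splitlines(keepends=True)
    return notebook


def patch_file(path_in: str, path_out: str) -> None:
    """Read a notebook file, add the magic, and write a patched copy."""
    with open(path_in, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(add_r_magic(notebook), f, indent=1)


# Small in-memory demonstration (the cell contents are made up).
demo = {
    "cells": [
        {"cell_type": "code", "source": ["x <- 1\n", "print(x)"]},
        {"cell_type": "markdown", "source": ["# Notes"]},
    ]
}
print(add_r_magic(demo)["cells"][0]["source"])
# prints ['%%R\n', 'x <- 1\n', 'print(x)']
```

To apply it to a downloaded copy of the notebook, call `patch_file` with the input and output paths of your choice. Writing to a new file rather than overwriting keeps the original intact in case a cell was meant to stay in Python.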