A quick reference to access NYU's High Performance Computing Prince Cluster.
The official wiki is here. This is an unofficial document, written as a quick-start guide for first-time users, with a focus on Python and PyTorch.
You need to be affiliated with NYU and have a sponsor.
To get an account approved, follow these steps.
Once you have been approved, you can access HPC from:
- Within the NYU network (on campus):
ssh NYUNetID@prince.hpc.nyu.edu
Remember to replace NYUNetID with your own NetID.
Once logged in, you should land in your home directory, /home/NYUNetID, so running pwd should print:
[NYUNetID@log-0 ~]$ pwd
/home/NYUNetID
- From an off-campus location:
First, log in to the VPN, and then log in to the bastion host:
ssh NYUNetID@gw.hpc.nyu.edu
Then log in to the cluster:
ssh prince.hpc.nyu.edu
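The two-step login above can also be done in one command with OpenSSH's -J (ProxyJump) option, available in OpenSSH 7.3 and later. The following is a minimal sketch, not an official NYU recipe: it assumes your local ssh client supports -J and that your NetID is the same on both hosts, and prince_ssh is a hypothetical helper name.

```shell
# Hypothetical helper: reach Prince through the gateway in a single command.
# Assumes OpenSSH >= 7.3 on your local machine (needed for the -J option).
prince_ssh() {
    if [ -z "$1" ]; then
        echo "usage: prince_ssh NYUNetID" >&2
        return 1
    fi
    # -J jumps through the bastion host before connecting to the cluster.
    ssh -J "$1@gw.hpc.nyu.edu" "$1@prince.hpc.nyu.edu"
}
```

Run prince_ssh NYUNetID from your local machine once you are connected to the VPN.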
I use the MobaXterm ssh client with the following settings for the Prince Cluster:
Remote host: prince.hpc.nyu.edu
Username: NYUNetID
Port: 22
This makes it one click to open a terminal to Prince.
You can get access to three filesystems: /home, /scratch, and /archive.
/scratch is a filesystem mounted on Prince and connected to the compute nodes, where files can be uploaded faster. Note that its contents are flushed every 60 days and there is no backup!
[NYUNetID@log-0 ~]$ cd /scratch/NYUNetID
[NYUNetID@log-0 NYUNetID]$ pwd
/scratch/NYUNetID
/home and /scratch are separate filesystems in separate places.
Depending on how often you use your files, you might want to choose the appropriate filesystem. I use /home for the files I won't touch often.
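Because /scratch is flushed periodically, it can help to check which files are getting old before they disappear. A rough sketch, assuming the flush is based on how long ago a file was last modified (the policy details are not stated here, so check the official wiki) and that $USER matches your NetID on the cluster; stale_files is a hypothetical helper name.

```shell
# Hypothetical helper: list regular files under DIR whose last modification
# is more than DAYS days old (find rounds file age down to whole days).
# The defaults are assumptions: your scratch directory and a 50-day
# threshold, which leaves some margin before the 60-day flush.
stale_files() {
    find "${1:-/scratch/$USER}" -type f -mtime +"${2:-50}" -print
}

# Example: show what is at risk in your scratch space.
# stale_files "/scratch/$USER" 50
```

Anything it lists that you still need can be copied to /home or /archive.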
Environment modules (the module command, which is separate from the Slurm scheduler) allow you to load and manage multiple versions and configurations of software packages.
To see available package environments:
module avail
To load a module:
module load [package name]
For example, if you want to use TensorFlow with GPU support:
module load cudnn/8.0v6.0
module load cuda/8.0.44
module load tensorflow/python3.6/1.3.0
To check what is currently loaded:
module list
To remove all packages:
module purge
To get helpful information about the package:
module show torch/gnu/20170504
This will print something like:
--------------------------------------------------------------------------------------------------------------------------------------------------
/share/apps/modulefiles/torch/gnu/20170504.lua:
--------------------------------------------------------------------------------------------------------------------------------------------------
whatis("Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first")
whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
...
The load(...) lines are the dependencies that are also loaded when you load the package.
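To get the same environment every time, for example at the top of a batch script, the module commands above can be wrapped in a small function that starts from a clean slate. A sketch using the module versions quoted above, which may no longer exist on the cluster (check module avail first); load_tf_stack is a hypothetical name.

```shell
# Hypothetical helper: reset the environment, then load a TensorFlow GPU stack.
# Module names and versions are copied from the example above and may be outdated.
load_tf_stack() {
    module purge                                   # drop anything loaded earlier
    module load cudnn/8.0v6.0 || return 1
    module load cuda/8.0.44 || return 1
    module load tensorflow/python3.6/1.3.0 || return 1
    module list                                    # show what ended up loaded
}
```

Stopping at the first failed load makes a missing or renamed module obvious instead of letting the job continue with a partial environment.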
Prince uses Slurm to schedule jobs. You can submit batch jobs, which requires writing custom bash scripts and is great for longer jobs, or you can run in interactive mode, which is great for short jobs and troubleshooting.
To run in interactive mode:
[NYUNetID@log-0 ~]$ srun --pty /bin/bash
This will run the default mode: a single CPU core and 2GB memory for 1 hour.
To request more CPUs:
[NYUNetID@log-0 ~]$ srun -n4 -t2:00:00 --mem=4000 --pty /bin/bash
[NYUNetID@c26-16 ~]$
That requests 4 tasks (by default one CPU core each, not 4 nodes) for 2 hours with 4000 MB (about 4 GB) of memory.
To exit the interactive session and release the resources:
[NYUNetID@c26-16 ~]$ exit
[NYUNetID@log-0 ~]$
To request a GPU in interactive mode and inspect it with nvidia-smi:
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash
[NYUNetID@gpu-25 ~]$ nvidia-smi
Mon Oct 23 17:49:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:12:00.0     Off |                    0 |
| N/A   37C    P8    29W / 149W |      0MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
You can write a script that will be executed when the resources you requested become available.
A simple CPU demo:
#!/bin/bash
## 1) Job settings
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00
#SBATCH --mem=2GB
#SBATCH --job-name=CPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=itp@nyu.edu
#SBATCH --output=slurm_%j.out
## 2) Everything from here on is going to run:
cd /scratch/NYUNetID/demos
python demo.py
A GPU demo (this one requests 4 GPUs):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=10:00:00
#SBATCH --mem=3GB
#SBATCH --job-name=GPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=itp@nyu.edu
#SBATCH --output=slurm_%j.out
cd /scratch/NYUNetID/trainSomething
source activate ML
python train.py
Submit your job with:
sbatch myscript.s
Monitor the job:
squeue -u $USER
More info here
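Submitting and then polling by hand can be combined into one step. The sketch below uses two standard Slurm options, sbatch --parsable (print only the job ID) and squeue -h -j (no header, one job), and waits until the job leaves the queue. The helper name submit_and_wait, the POLL_SECONDS variable, and the 30-second default are illustrative assumptions, and the output file name assumes your script uses --output=slurm_%j.out as in the demos above.

```shell
# Hypothetical helper: submit a batch script and block until the job is done.
submit_and_wait() {
    # --parsable prints just the job ID, optionally followed by ";cluster".
    jobid=$(sbatch --parsable "$1") || return 1
    jobid="${jobid%%;*}"
    echo "Submitted job $jobid"
    # With -h there is no header, so empty output means the job left the queue.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "${POLL_SECONDS:-30}"
    done
    echo "Job $jobid finished; output in slurm_${jobid}.out"
}
```

For example, submit_and_wait myscript.s returns once the job has completed, failed, or been cancelled; check the output file to see which.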
I transfer files using MobaXterm. If you need to set up a tunnel, look here.
Once you are all set up with the above, getting PyTorch takes a couple of steps:
- Create a virtual environment
- Load the appropriate modules in the environment
mkdir -p /scratch/NYUNetID/tmp/pytorch-gpu
cd /scratch/NYUNetID/tmp/pytorch-gpu
module load python3/intel/3.6.3
virtualenv --system-site-packages py3.6.3
source py3.6.3/bin/activate
After the above, your virtual environment is set up. Now you need to install PyTorch.
Note (5/12/20): on Prince, the GPU driver does not support CUDA 10.2, so if you are running PyTorch, use a build made for CUDA 10.1. The default install is:
pip3 install torch torchvision
To pin the CUDA 10.1 build instead:
pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
Now, every time you want to use your PyTorch environment, all you need to do is:
[NYUNetID@log-0 ~]$ source /scratch/NYUNetID/tmp/pytorch-gpu/py3.6.3/bin/activate  # activate the Python environment
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash  # interactive GPU session on HPC
[NYUNetID@gpu-25 ~]$ cd /scratch/NYUNetID/trainSomething
[NYUNetID@gpu-25 ~]$ python train.py
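A quick way to confirm that the environment works is to ask PyTorch what it sees once you are on a GPU node. A small sketch, assuming the virtual environment is active so that python resolves to the interpreter where torch is installed; check_torch_gpu is a hypothetical name.

```shell
# Hypothetical helper: print the PyTorch version, the CUDA version it was
# built against, and whether a GPU is visible from the current node.
check_torch_gpu() {
    python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}
```

On a GPU node with a working install, the last value should be True. If it is False there, a mismatch between the CUDA build of PyTorch and the driver is one likely cause.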
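Instructions for running a Jupyter notebook are here.
- Once you have copied the run-jupyter.sbatch script (the commands below assume it is saved as run-jupyter-gpu.sbatch), submit it and read its output file:
[NYUNetID@log-0 ~]$ source /scratch/NYUNetID/tmp/pytorch-gpu/py3.6.3/bin/activate  # activate the Python environment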
[NYUNetID@log-0 ~]$ sbatch run-jupyter-gpu.sbatch
[NYUNetID@log-0 ~]$ cat slurm-xxxx.out
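In a separate window on your local machine (for example, an Ubuntu shell), type the following, replacing NNNN with the port shown in the output file:
ssh -L NNNN:localhost:NNNN netID@prince
Then open a browser at localhost:NNNN using the URL with the token from the output file, for example: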
http://localhost:8925/?token=76f100825af441457502d5d080c1776b987a2f76101460f4
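Rather than scanning the Slurm output by eye, the URL and port can be pulled out with grep and sed. This is only a sketch: it assumes the Jupyter server writes a URL of the same form as the example above (http://localhost:PORT/?token=HEX) into the output file, and jupyter_url and jupyter_port are hypothetical helper names.

```shell
# Hypothetical helper: print the first tokenized Jupyter URL found in a file.
jupyter_url() {
    grep -o 'http://localhost:[0-9]*/?token=[0-9a-f]*' "$1" | head -n 1
}

# Hypothetical helper: print only the port number from that URL.
jupyter_port() {
    jupyter_url "$1" | sed 's|^http://localhost:\([0-9]*\)/.*$|\1|'
}
```

For example, jupyter_port slurm-xxxx.out gives the value to use for NNNN in the ssh -L command, and jupyter_url slurm-xxxx.out gives the address to paste into the browser.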