Satori
Logging In
Before logging in for the first time, you'll need to activate your account by following these instructions.
Now you can ssh in to either of the login nodes like this.
ssh <your username>@satori-login-001.mit.edu
ssh <your username>@satori-login-002.mit.edu
According to this, the first login node should be used for submitting jobs, and the second for compiling code or transferring large files. But it also says that if one isn't available, just try the other. Both have 160 cores.
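If you connect often, an entry in your ~/.ssh/config saves some typing. A minimal sketch (the Host alias satori is my own choice, not an official name):

```
Host satori
    HostName satori-login-001.mit.edu
    User <your username>
```

After that, ssh satori is enough.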
Modules
Satori is set up to use Environment Modules to control which executables, libraries, etc. are on your path(s). So you'll want to become familiar with the module command.
- module avail: lists all available modules
- module spider <module name>: gives you info about a module, including which other modules have to be loaded first
- module load <module name>: loads a specific module
- module list: shows all the currently loaded modules
- module unload <module name>: unloads a specific module
- module purge: unloads all modules
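Once you've settled on a set of modules, it can be handy to wrap the loads in a small shell function in your ~/.bashrc on Satori. This is just a sketch; the function name is my own invention, and the module list is the one I use for CUDA builds below.

```shell
# Hypothetical convenience function for ~/.bashrc on Satori.
# Loads the module set used for the CUDA builds in this document.
satori_env() {
    module load spack git cuda gcc/7.3.0 cmake
}
```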
Satori also uses Spack to manage versions of many tools, so generally speaking you should always have this module loaded: module load spack. If you run module avail before and after loading spack, you'll see that a lot more modules become visible.
For compiling C/C++ and CUDA code, these are the modules I start with.
module load spack git cuda gcc/7.3.0 cmake
Note: I'd like to use gcc 8, but I get build errors when I use it.
Running Jobs
Let's start with these simple CUDA hello world programs.
With the modules above loaded, you should be able to clone the repo and build it. (The first time through, you probably want to do a little git setup.)
git clone ssh://git@gitlab.cba.mit.edu:846/pub/hello-world/cuda.git
cd cuda
make -j
Since all these programs are very lightweight, I admit I tested them all on the login node directly.
Running get_gpu_info in particular revealed that the login nodes each have two V100 GPUs. (The compute nodes have four.)
But let's do things the right way, using slurm. We'll start by making a submission script for saxpy. I called mine saxpy.slurm, and put it in its own directory outside the repo.
#!/bin/bash
#SBATCH -J saxpy # sets the job name
#SBATCH -o saxpy_%j.out # determines the main output file (%j will be replaced with the job number)
#SBATCH -e saxpy_%j.err # determines the error output file
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --gres=gpu:1 # requests one GPU per node...
#SBATCH --nodes=1 # and one node...
#SBATCH --ntasks-per-node=1 # running only one instance of our command.
#SBATCH --mem=256M # We ask for 256 megabytes of memory (plenty for our purposes)...
#SBATCH --time=00:01:00 # and one minute of time (again, more than we really need).
~/code/cuda/saxpy
echo "Run completed at:"
date
All the lines that start with #SBATCH are parsed by slurm to determine which resources you need. You can also pass these on the command line, but I like to put everything in a file so I don't forget what I asked for.
To submit the job, run sbatch saxpy.slurm. Slurm will then tell you the job id.
[strand@satori-login-002 saxpy]$ sbatch saxpy.slurm
Submitted batch job 61187
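If you want to script against slurm, you can capture that id. A small sketch, using the sample line above so it's self-contained (in a real script you'd capture sbatch's actual output):

```shell
# Parse the job id out of sbatch's "Submitted batch job ..." line.
# The sample output here is hard-coded from the run above.
out="Submitted batch job 61187"
jobid=${out##* }    # strip everything up to the last space
echo "$jobid"       # 61187
```

sbatch also has a --parsable flag that prints just the id, which avoids the parsing entirely.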
To query jobs in the queue, use squeue. If you run it with no arguments, you'll see all the queued jobs. To ask about a specific job, use -j. To ask about all jobs that you've submitted, use -u.
[strand@satori-login-002 saxpy]$ squeue -u strand
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61187 sched_sys saxpy strand R 0:00 1 node0023
Since we only asked for one minute of compute time, our job is started very quickly. So if you run squeue and don't see anything, it might just be because the job already finished.
You'll know the job is finished when its output files appear. They should show up in the directory where you queued the job with sbatch.
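Since the output filename pattern is fixed by the #SBATCH -o line, you can also reconstruct the name from the job id yourself. A tiny sketch using the id from the run above:

```shell
# %j in "#SBATCH -o saxpy_%j.out" is replaced with the job id,
# so the output file's name is predictable.
jobid=61187
outfile="saxpy_${jobid}.out"
echo "$outfile"     # saxpy_61187.out
```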
[strand@satori-login-002 saxpy]$ cat saxpy_61187.out
Performing SAXPY on vectors of dim 1048576
CPU time: 323 microseconds
GPU time: 59 microseconds
Max error: 0
Run completed at:
Mon Mar 1 19:40:43 EST 2021
Now let's try submitting saxpy_multi_gpu, and giving it multiple GPUs. We can use basically the same batch script, just with the new executable and GPU count (i.e. --gres=gpu:4). It doesn't matter for this program, but for real work you may also want to add #SBATCH --exclusive to make sure you're not competing with other jobs on the same GPU.
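For reference, my saxpy_multi_gpu.slurm differs from the single-GPU script only in the names, the GPU count, and the (optional) --exclusive flag. Everything else follows the earlier script, with the same caveat that the executable path is specific to my setup:

```
#!/bin/bash
#SBATCH -J saxpy_multi_gpu
#SBATCH -o saxpy_multi_gpu_%j.out
#SBATCH -e saxpy_multi_gpu_%j.err
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --gres=gpu:4            # four GPUs this time
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive             # optional: don't share the node with other jobs
#SBATCH --mem=256M
#SBATCH --time=00:01:00

~/code/cuda/saxpy_multi_gpu
echo "Run completed at:"
date
```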
We submit the job in the same way: sbatch saxpy_multi_gpu.slurm. Soon after I had this output file.
Performing SAXPY on vectors of dim 1048576.
Found 4 GPUs.
CPU time: 579 microseconds
GPU 0 time: 55 microseconds
GPU 1 time: 85 microseconds
GPU 2 time: 60 microseconds
GPU 3 time: 61 microseconds
GPU 0 max error: 0
GPU 1 max error: 0
GPU 2 max error: 0
GPU 3 max error: 0
Run completed at:
Mon Mar 1 20:27:16 EST 2021
TODO
- MPI hello world
- Interactive sessions
Questions
- How can I load CUDA 11?
- Why is gcc 8 broken?
- Is there a module for cmake 3.19? If not, can I make one?
- Is there a dedicated test queue?