Satori

Logging In

Before logging in for the first time, you'll need to activate your account by following these instructions.

Now you can ssh in to either of the login nodes like this.

ssh <your username>@satori-login-001.mit.edu
ssh <your username>@satori-login-002.mit.edu

According to this, the first login node should be used for submitting jobs, and the second for compiling code or transferring large files. But it also says that if one isn't available, just try the other. Both have 160 cores.
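As a small convenience (this is plain OpenSSH configuration on your own machine, nothing Satori-specific, and the host aliases here are just suggestions), you can add entries to ~/.ssh/config so you don't have to type the full hostnames:

Host satori1
    HostName satori-login-001.mit.edu
    User <your username>

Host satori2
    HostName satori-login-002.mit.edu
    User <your username>

After that, ssh satori2 drops you on the compile/transfer node.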

Modules

Satori is set up to use Environment Modules to control which executables, libraries, etc. are on your path(s). So you'll want to become familiar with the module command.

  • module avail lists all available modules
  • module spider <module name> gives you info about a module, including which other modules have to be loaded first
  • module load <module name> loads a specific module
  • module list shows all the currently loaded modules
  • module unload <module name> unloads a specific module
  • module purge unloads all modules

Satori also uses Spack to manage versions of many tools, so generally speaking you should always have this module loaded: module load spack. If you run module avail before and after loading spack, you'll see that a lot more modules become visible.
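For example, a typical first few commands after logging in might look like this (output omitted, since the exact list of modules changes over time):

module avail        # a fairly short list
module load spack
module avail        # now many more modules are visible
module list         # confirm what's loaded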

For compiling C/C++ and CUDA code, these are the modules I start with.

module load spack git cuda gcc/7.3.0 cmake

Note: I'd like to use gcc 8, but I get build errors when I use it.
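If you don't want to retype that line every session, one option (just a convenience, and the filename is made up) is to keep it in a small script that you source after logging in:

# setup_env.sh: load the build environment described above
module purge
module load spack git cuda gcc/7.3.0 cmake

Then source setup_env.sh restores the same environment each time.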

Running Jobs

Let's start with these simple CUDA hello world programs.

With the modules above loaded, you should be able to clone the repo and build it. (The first time through, you probably want to do a little git setup.)
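By "a little git setup" I mean roughly the usual identity configuration plus an ssh key registered with the GitLab server (a generic sketch; substitute your own name, email, and key type):

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
ssh-keygen -t ed25519   # then add ~/.ssh/id_ed25519.pub to your GitLab profile

With that in place, the clone and build: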

git clone ssh://git@gitlab.cba.mit.edu:846/pub/hello-world/cuda.git
cd cuda
make -j

Since all these programs are very lightweight, I admit I tested them all on the login node directly. Running get_gpu_info in particular revealed that the login nodes each have two V100 GPUs. (The compute nodes have four.)

But let's do things the right way, using slurm. We'll start by making a submission script for saxpy. I called mine saxpy.slurm, and put it in its own directory outside the repo.

#!/bin/bash

#SBATCH -J saxpy        # sets the job name
#SBATCH -o saxpy_%j.out # determines the main output file (%j will be replaced with the job number)
#SBATCH -e saxpy_%j.err # determines the error output file
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --gres=gpu:1    # requests one GPU per node...
#SBATCH --nodes=1       # and one node...
#SBATCH --ntasks-per-node=1 # running only one instance of our command.
#SBATCH --mem=256M      # We ask for 256 megabytes of memory (plenty for our purposes)...
#SBATCH --time=00:01:00 # and one minute of time (again, more than we really need).

~/code/cuda/saxpy

echo "Run completed at:"
date

All the lines that start with #SBATCH are parsed by slurm to determine which resources you need. You can also pass these on the command line, but I like to put everything in a file so I don't forget what I asked for.

To submit the job, run sbatch saxpy.slurm. Slurm will then tell you the job id.

[strand@satori-login-002 saxpy]$ sbatch saxpy.slurm
Submitted batch job 61187
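As mentioned above, you can also pass options on the command line, and these take precedence over the #SBATCH lines in the script. For example, something like this (not needed here, just for illustration) would request two minutes instead of one:

sbatch --time=00:02:00 saxpy.slurm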

To query jobs in the queue, use squeue. If you run it with no arguments, you'll see all the queued jobs. To ask about a specific job, use -j. To ask about all jobs that you've submitted, use -u.

[strand@satori-login-002 saxpy]$ squeue -u strand
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             61187 sched_sys    saxpy   strand  R       0:00      1 node0023
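The -j form works the same way, just filtered to a single job id:

squeue -j 61187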

Since we only asked for one minute of compute time, our job gets scheduled (and finishes) very quickly. So if you run squeue and don't see anything, it might just be because the job has already finished.

You'll know the job is finished when its output files appear. They should show up in the directory where you queued the job with sbatch.

[strand@satori-login-002 saxpy]$ cat saxpy_61187.out
Performing SAXPY on vectors of dim 1048576
CPU time: 323 microseconds
GPU time: 59 microseconds
Max error: 0
Run completed at:
Mon Mar  1 19:40:43 EST 2021

Now let's try submitting saxpy_multi_gpu, and giving it multiple GPUs. We can use basically the same batch script, just with the new executable and GPU count (i.e. --gres=gpu:4). It doesn't matter for this program, but for real work you may also want to add #SBATCH --exclusive to make sure you're not sharing the node with other jobs.
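For concreteness, here's a sketch of what saxpy_multi_gpu.slurm could look like, assuming the same layout as before (only the names, the --gres count, and the executable change):

#!/bin/bash

#SBATCH -J saxpy_multi_gpu
#SBATCH -o saxpy_multi_gpu_%j.out
#SBATCH -e saxpy_multi_gpu_%j.err
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
#SBATCH --gres=gpu:4    # four GPUs this time
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=256M
#SBATCH --time=00:01:00

~/code/cuda/saxpy_multi_gpu

echo "Run completed at:"
date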

We submit the job in the same way: sbatch saxpy_multi_gpu.slurm. Soon after, I had this output file:

Performing SAXPY on vectors of dim 1048576.
Found 4 GPUs.

CPU time: 579 microseconds
GPU 0 time: 55 microseconds
GPU 1 time: 85 microseconds
GPU 2 time: 60 microseconds
GPU 3 time: 61 microseconds

GPU 0 max error: 0
GPU 1 max error: 0
GPU 2 max error: 0
GPU 3 max error: 0

Run completed at:
Mon Mar  1 20:27:16 EST 2021

TODO

  • MPI hello world
  • Interactive sessions

Questions

  • How can I load CUDA 11?
  • Why is gcc 8 broken?
  • Is there a module for cmake 3.19? If not, can I make one?
  • Is there a dedicated test queue?