From d5e45a08a8ccd3664cf1af78b1eafa87dfb5e6ec Mon Sep 17 00:00:00 2001
From: Erik Strand <erik.strand@cba.mit.edu>
Date: Mon, 1 Mar 2021 21:26:45 -0500
Subject: [PATCH] Update README

---
 README.md | 166 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 165 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6b0bacc..caa4bfc 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,166 @@
-# satori
+# Satori
+
+## Logging In
+
+Before logging in for the first time, you'll need to activate your account by following these
+[instructions](https://mit-satori.github.io/satori-getting-started.html#logging-in-to-satori).
+
+Now you can `ssh` into either of the login nodes like this (replacing `strand` with your username).
+
+```
+ssh strand@satori-login-001.mit.edu
+ssh strand@satori-login-002.mit.edu
+```
+
+According to [this page](https://mit-satori.github.io/satori-ssh.html), the first login node should
+be used for submitting jobs, and the second for compiling code or transferring large files. It also
+says that if one isn't available, just try the other. Both have 160 cores.
+
+
+## Modules
+
+Satori is set up to use [Environment Modules](https://modules.readthedocs.io/en/latest/index.html)
+to control which executables, libraries, and so on are on your search paths. So you'll want to
+become familiar with the `module` command.
+
+- `module avail` lists all available modules
+- `module spider <module name>` gives you info about a module, including which other modules have
+  to be loaded first
+- `module load <module name>` loads a specific module
+- `module list` shows all the currently loaded modules
+- `module unload <module name>` unloads a specific module
+- `module purge` unloads all modules
+
+Satori also uses [Spack](https://spack.io/) to manage versions of many tools, so generally speaking
+you should always have this module loaded: `module load spack`. If you run `module avail` before
+and after loading Spack, you'll see that a lot more modules become visible.
+
+For compiling C/C++ and CUDA code, these are the modules I start with.
+
+```
+module load spack git cuda gcc/7.3.0 cmake
+```
+
+Note: I'd like to use gcc 8, but I get build errors when I use it.
+
+
+## Running Jobs
+
+Let's start with these simple CUDA [hello world](https://gitlab.cba.mit.edu/pub/hello-world/cuda)
+programs.
+
+With the modules above loaded, you should be able to clone the repo and build it. (The first time
+through, you'll probably want to do a little git
+[setup](https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup).)
+
+```
+git clone ssh://git@gitlab.cba.mit.edu:846/pub/hello-world/cuda.git
+cd cuda
+make -j
+```
+
+Since all these programs are very lightweight, I admit I tested them all on the login node directly.
+Running `get_gpu_info` in particular revealed that the login nodes each have two V100 GPUs. (The
+compute nodes have four.)
+
+But let's do things the right way, using [slurm](https://slurm.schedmd.com/overview.html). We'll
+start by making a submission script for `saxpy`. I called mine `saxpy.slurm`, and put it in its own
+directory outside the repo.
+
+```
+#!/bin/bash
+
+#SBATCH -J saxpy # sets the job name
+#SBATCH -o saxpy_%j.out # determines the main output file (%j will be replaced with the job number)
+#SBATCH -e saxpy_%j.err # determines the error output file
+#SBATCH --mail-user=erik.strand@cba.mit.edu
+#SBATCH --mail-type=ALL
+#SBATCH --gres=gpu:1 # requests one GPU per node...
+#SBATCH --nodes=1 # and one node...
+#SBATCH --ntasks-per-node=1 # running only one instance of our command.
+#SBATCH --mem=256M # We ask for 256 megabytes of memory (plenty for our purposes)...
+#SBATCH --time=00:01:00 # and one minute of time (again, more than we really need).
+
+~/code/cuda/saxpy
+
+echo "Run completed at:"
+date
+```
+
+All the lines that start with `#SBATCH` are parsed by slurm to determine which resources your job
+needs. You can also pass these options on the command line, but I like to put everything in a file
+so I don't forget what I asked for.
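+
+For the record, a roughly equivalent one-off submission straight from the command line would look
+something like the sketch below (untested, and it skips the email notifications; `--wrap` makes
+`sbatch` generate a trivial wrapper script around the given command, so no separate file is needed).
+
+```
+# Untested sketch: the same resource requests as saxpy.slurm, minus the mail options.
+sbatch -J saxpy -o saxpy_%j.out -e saxpy_%j.err --gres=gpu:1 --nodes=1 \
+    --ntasks-per-node=1 --mem=256M --time=00:01:00 --wrap="~/code/cuda/saxpy"
+```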
+
+To submit the job, run `sbatch saxpy.slurm`. Slurm will then tell you the job id.
+
+```
+[strand@satori-login-002 saxpy]$ sbatch saxpy.slurm
+Submitted batch job 61187
+```
+
+To query jobs in the queue, use `squeue`. If you run it with no arguments, you'll see all the
+queued jobs. To ask about a specific job, use `-j <job id>`. To ask about all the jobs you've
+submitted, use `-u <username>`.
+
+```
+[strand@satori-login-002 saxpy]$ squeue -u strand
+JOBID  PARTITION  NAME   USER    ST  TIME  NODES  NODELIST(REASON)
+61187  sched_sys  saxpy  strand  R   0:00  1      node0023
+```
+
+Since we only asked for one minute of compute time, our job gets scheduled (and finishes) very
+quickly. So if you run `squeue` and don't see anything, it might just be because the job has
+already finished.
+
+You'll know the job is finished when its output files appear. They should show up in the directory
+where you queued the job with `sbatch`.
+
+```
+[strand@satori-login-002 saxpy]$ cat saxpy_61187.out
+Performing SAXPY on vectors of dim 1048576
+CPU time: 323 microseconds
+GPU time: 59 microseconds
+Max error: 0
+Run completed at:
+Mon Mar 1 19:40:43 EST 2021
+```
+
+Now let's try submitting `saxpy_multi_gpu`, and giving it multiple GPUs. We can use basically the
+same batch script, just with the new executable and GPU count (i.e. `--gres=gpu:4`). It doesn't
+matter for this program, but for real work you may also want to add `#SBATCH --exclusive` to make
+sure you're not competing with other jobs on the same node.
+
+We submit the job in the same way: `sbatch saxpy_multi_gpu.slurm`. Soon after, I had this output
+file.
+
+```
+Performing SAXPY on vectors of dim 1048576.
+Found 4 GPUs.
+
+CPU time: 579 microseconds
+GPU 0 time: 55 microseconds
+GPU 1 time: 85 microseconds
+GPU 2 time: 60 microseconds
+GPU 3 time: 61 microseconds
+
+GPU 0 max error: 0
+GPU 1 max error: 0
+GPU 2 max error: 0
+GPU 3 max error: 0
+
+Run completed at:
+Mon Mar 1 20:27:16 EST 2021
+```
+
+
+## TODO
+
+- MPI hello world
+- Interactive sessions
+
+
+## Questions
+
+- How can I load CUDA 11?
+- Why is gcc 8 broken?
+- Is there a module for cmake 3.19? If not, can I make one?
+- Is there a dedicated test queue?
--
GitLab