Satori

    Logging In

    Before logging in for the first time, you'll need to activate your account by following these instructions.

    Now you can ssh in to either of the login nodes like this (replacing strand with your username).

    ssh strand@satori-login-001.mit.edu
    ssh strand@satori-login-002.mit.edu

    According to this, the first login node should be used for submitting jobs, and the second for compiling code or transferring large files. But it also says that if one isn't available, just try the other. Both have 160 cores.

    Modules

    Satori is set up to use Environment Modules to control which executables, libraries, etc. are on your path(s). So you'll want to become familiar with the module command.

    • module avail lists all available modules
    • module spider <module name> gives you info about a module, including which other modules have to be loaded first
    • module load <module name> loads a specific module
    • module list shows all the currently loaded modules
    • module unload <module name> unloads a specific module
    • module purge unloads all modules

    Satori also uses Spack to manage versions of many tools, so generally speaking you should always have this module loaded: module load spack. If you run module avail before and after loading spack, you'll see that a lot more modules become visible.
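    If you want to see this for yourself, something like the following works (note that module writes its listing to stderr, hence the redirect; the exact module names you see will differ):

    # Start from a clean slate, then load spack to expose the full module tree.
    module purge
    module load spack

    # The avail listing is now much longer; for example, search it for CUDA modules.
    module avail 2>&1 | grep -i cuda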

    For compiling C/C++ and CUDA code, these are the modules I start with.

    module load spack git cuda gcc/7.3.0 cmake

    Note: I'd like to use gcc 8, but I get build errors when I use it.
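    A quick sanity check after loading modules is to confirm which toolchain actually ended up on your PATH (the versions reported will depend on what you loaded):

    # Confirm the compiler and build tools come from the loaded modules.
    which nvcc gcc cmake
    gcc --version
    nvcc --version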

    Running Jobs

    Let's start with these simple CUDA hello world programs.

    With the modules above loaded, you should be able to clone the repo and build it. (The first time through, you probably want to do a little git setup.)

    git clone ssh://git@gitlab.cba.mit.edu:846/pub/hello-world/cuda.git
    cd cuda
    make -j
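    (The git setup mentioned above typically just means configuring your identity and having an SSH key registered with the GitLab server; the name and email below are placeholders.)

    # Tell git who you are (use your own name and email).
    git config --global user.name "Your Name"
    git config --global user.email "you@example.edu"

    # Generate an SSH key to add to your GitLab account, if you don't already have one.
    ssh-keygen -t ed25519
    cat ~/.ssh/id_ed25519.pub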

    Since all these programs are very lightweight, I admit I tested them all on the login node directly. Running get_gpu_info in particular revealed that the login nodes each have two V100 GPUs. (The compute nodes have four.)
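    (As an aside, on any node with the NVIDIA driver installed you can also get a quick GPU summary from nvidia-smi, without running any CUDA code.)

    # Summarize the GPUs visible on the current node.
    nvidia-smi --query-gpu=index,name,memory.total --format=csv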

    But let's do things the right way, using slurm. We'll start by making a submission script for saxpy. I called mine saxpy.slurm, and put it in its own directory outside the repo.

    #!/bin/bash
    
    #SBATCH -J saxpy        # sets the job name
    #SBATCH -o saxpy_%j.out # determines the main output file (%j will be replaced with the job number)
    #SBATCH -e saxpy_%j.err # determines the error output file
    #SBATCH --mail-user=erik.strand@cba.mit.edu
    #SBATCH --mail-type=ALL
    #SBATCH --gres=gpu:1    # requests one GPU per node...
    #SBATCH --nodes=1       # and one node...
    #SBATCH --ntasks-per-node=1 # running only one instance of our command.
    #SBATCH --mem=256M      # We ask for 256 megabytes of memory (plenty for our purposes)...
    #SBATCH --time=00:01:00 # and one minute of time (again, more than we really need).
    
    ~/code/cuda/saxpy
    
    echo "Run completed at:"
    date

    All the lines that start with #SBATCH are parsed by slurm to determine which resources you need. You can also pass these on the command line, but I like to put everything in a file so I don't forget what I asked for.
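    For reference, the command-line equivalent of the resource requests above would be roughly this (the flags mirror the #SBATCH lines; options passed on the command line override the directives in the script):

    # Same request, passed as flags instead of #SBATCH directives.
    sbatch -J saxpy --gres=gpu:1 --nodes=1 --ntasks-per-node=1 --mem=256M --time=00:01:00 saxpy.slurm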

    To submit the job, run sbatch saxpy.slurm. Slurm will then tell you the job id.

    [strand@satori-login-002 saxpy]$ sbatch saxpy.slurm
    Submitted batch job 61187

    To query jobs in the queue, use squeue. If you run it with no arguments, you'll see all the queued jobs. To ask about a specific job, use -j. To ask about all jobs that you've submitted, use -u.

    [strand@satori-login-002 saxpy]$ squeue -u strand
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 61187 sched_sys    saxpy   strand  R       0:00      1 node0023

    Since we only asked for one minute of compute time, our job gets scheduled very quickly. So if you run squeue and don't see anything, it might just be that the job has already finished.
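    Once a job has left the queue, sacct can still report on it (assuming job accounting is enabled on the cluster):

    # Show the state, runtime, and exit code of a job that already finished.
    sacct -j 61187 --format=JobID,JobName,State,Elapsed,ExitCode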

    You'll know the job is finished when its output files appear. They should show up in the directory where you queued the job with sbatch.

    [strand@satori-login-002 saxpy]$ cat saxpy_61187.out
    Performing SAXPY on vectors of dim 1048576
    CPU time: 323 microseconds
    GPU time: 59 microseconds
    Max error: 0
    Run completed at:
    Mon Mar  1 19:40:43 EST 2021

    Now let's try submitting saxpy_multi_gpu, and giving it multiple GPUs. We can use basically the same batch script, just with the new executable and GPU count (i.e. --gres=gpu:4). It doesn't matter for this program, but for real work you may also want to add #SBATCH --exclusive to make sure you're not competing with other jobs on the same node.
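    Concretely, the modified script might look roughly like this (the path to the executable is assumed to match the earlier example):

    #!/bin/bash

    #SBATCH -J saxpy_multi_gpu
    #SBATCH -o saxpy_multi_gpu_%j.out
    #SBATCH -e saxpy_multi_gpu_%j.err
    #SBATCH --gres=gpu:4        # four GPUs this time
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=256M
    #SBATCH --time=00:01:00
    #SBATCH --exclusive         # optional: don't share the node with other jobs

    ~/code/cuda/saxpy_multi_gpu

    echo "Run completed at:"
    date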

    We submit the job in the same way: sbatch saxpy_multi_gpu.slurm. Soon after, I had this output file.

    Performing SAXPY on vectors of dim 1048576.
    Found 4 GPUs.
    
    CPU time: 579 microseconds
    GPU 0 time: 55 microseconds
    GPU 1 time: 85 microseconds
    GPU 2 time: 60 microseconds
    GPU 3 time: 61 microseconds
    
    GPU 0 max error: 0
    GPU 1 max error: 0
    GPU 2 max error: 0
    GPU 3 max error: 0
    
    Run completed at:
    Mon Mar  1 20:27:16 EST 2021

    TODO

    • MPI hello world
    • Interactive sessions

    Questions

    • How can I load CUDA 11?
    • Why is gcc 8 broken?
    • Is there a module for cmake 3.19? If not, can I make one?
    • Is there a dedicated test queue?