---
title: "Working with Slurm"
author: "George G Vega Yon"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Slurm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

High-performance computing (HPC) clusters are now commonly available, both in
the cloud and on premises.
[Slurm Workload Manager](https://slurm.schedmd.com/) (formerly
*Simple Linux Utility for Resource Management*) is a program written in C that is
used to efficiently manage resources in HPC clusters. The slurmR R package
provides tools for using R in HPC settings that work with Slurm. It provides
wrappers and functions that allow users to seamlessly integrate
their analysis pipelines with HPC clusters, with an emphasis on providing a
family of functions similar to those in the parallel R package.

# Definitions

First, we cover some concepts within the context of Slurm+R that users
in general will find useful. Most of them have to do with options
available for Slurm and, in particular, with the `sbatch` command, which is used
to submit batch jobs to Slurm. Users who have used Slurm in the past may wish
to skip ahead to the next section.


*   **Node** A single computer in the HPC cluster. Jobs are often submitted
    to a single node. The simplest way of using R+Slurm is to submit a single
    job and request multiple CPUs to use with, for example, `parallel::parLapply`
    or `parallel::mclapply`. Usually, users do not need to request a specific
    number of nodes, as Slurm will allocate the resources as needed.
    
    A common mistake of R users is to specify the number of nodes and expect
    that their script will be parallelized. This won't happen unless the user
    explicitly writes a parallel computing script.
    
    The relevant flag for `sbatch` is `--nodes`.
    
*   **Partition** A group of nodes in the HPC cluster. Generally, large clusters
    may have multiple partitions, meaning that nodes may be grouped in various
    ways. For example, nodes belonging to a single group of users may
    be in one partition, while nodes dedicated to working with large data
    may be in another. Usually, partitions are associated with
    account privileges, so users may need to specify which account they
    are using when telling Slurm which partition they plan to use.
    
    The relevant flag for `sbatch` is `--partition`.

*   **Account** Accounts may be associated with partitions. Accounts can have
    privileges to use a partition or set of nodes. Often, users need to 
    specify the account when submitting jobs to a particular partition.
    
    The relevant flag for `sbatch` is `--account`.

*   **Task** A process launched within a job. A job can have multiple tasks,
    and Slurm may place different tasks on different nodes, so if the user
    wants to submit a multicore job that runs on a single node, this option
    may not be the right one.

    The relevant flag for `sbatch` is `--ntasks`.

*   **CPU** Generally, this refers to a core or a thread (which may differ
    in systems supporting multithreaded cores). Users may want to specify
    how many CPUs they want to use for a task. This is the relevant
    option when using things like OpenMP or functions that create
    cluster objects in R (e.g. `makePSOCKcluster`, `makeForkCluster`).
    
    The relevant option in `sbatch` is `--cpus-per-task`. More information
    regarding CPUs in Slurm can be found
    [here](https://slurm.schedmd.com/cpu_management.html). Information
    regarding how Slurm counts CPUs/cores/threads can be found
    [here](https://slurm.schedmd.com/faq.html#cpu_count).
    
*   **Job Array** Slurm supports job arrays. In simple terms, a job array is
    a job that Slurm repeats multiple times; that is, Slurm replicates a
    single job as many times as the user requests. In the case of R, when using
    this option, a single R script is run as multiple jobs, so the
    user can take advantage of this to parallelize work across multiple
    nodes. Besides the fact that jobs within a job array may be spread
    across multiple nodes, each job in the array has a unique ID that
    is available to the user via environment variables, in particular
    `SLURM_ARRAY_TASK_ID`.
    
    Within R, and hence the Rscript submitted to Slurm, users can access
    this environment variable with `Sys.getenv("SLURM_ARRAY_TASK_ID")`.
    Some of the functionalities of `slurmR` rely on Job Arrays.
    
    More information on Job Arrays can be found
    [here](https://slurm.schedmd.com/job_array.html).
    The relevant option for this in `sbatch` is `--array`.
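
To make the node, partition, account, task, and CPU flags above concrete, here
is a minimal sketch of a batch script for a single-node, multicore R job. The
partition and account names (`my-partition`, `my-account`) and the R script
name `multicore-job.R` are placeholders for illustration only; real values
depend on the cluster.

```bash
#!/bin/bash
#SBATCH --job-name=multicore-sketch
# Placeholder partition and account names; replace with real ones.
#SBATCH --partition=my-partition
#SBATCH --account=my-account
# One node, one task (a single R process), four CPUs for that task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Inside a job, Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is
# given. The fallback of 1 only applies when running outside Slurm.
NCPUS="${SLURM_CPUS_PER_TASK:-1}"
echo "R may use up to ${NCPUS} CPU(s)"

# Placeholder R call: the script could read the same variable with
# Sys.getenv("SLURM_CPUS_PER_TASK") and pass it on, e.g., as mc.cores.
# Rscript --vanilla multicore-job.R
```

Requesting CPUs through `--cpus-per-task` rather than `--ntasks` keeps all of
them on the same node, which is what functions such as `parallel::mclapply`
need.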
    
More information about Slurm can be found on its official website
[here](https://slurm.schedmd.com/). A tutorial about how to use Slurm with R
can be found [here](https://uscbiostats.github.io/slurmr-workshop/).
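
As an illustration of the job array option described above, the following
sketch shows how a batch script might use `SLURM_ARRAY_TASK_ID` to give each
array job its own input. The file naming scheme (`chunk-01.rds`, ...) and the
R script name `array-job.R` are assumptions made for this example.

```bash
#!/bin/bash
#SBATCH --job-name=array-sketch
#SBATCH --array=1-10
# %A is the array's master job ID and %a is the array index.
#SBATCH --output=array-sketch-%A_%a.out

# Each of the 10 array jobs runs this same script with a different
# SLURM_ARRAY_TASK_ID (1, 2, ..., 10). The fallback of 1 only applies when
# the script is run outside Slurm.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Placeholder naming scheme: one input file per array job.
INPUT_FILE="$(printf 'chunk-%02d.rds' "${TASK_ID}")"
echo "Array job ${TASK_ID} would process ${INPUT_FILE}"

# Placeholder R call; inside R the ID is Sys.getenv("SLURM_ARRAY_TASK_ID").
# Rscript --vanilla array-job.R "${INPUT_FILE}"
```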

# Submitting jobs via sbatch

In general, users submit jobs to Slurm using the `sbatch` command-line
tool. Its main argument is the path to a bash
script that holds the instructions (and sometimes options) associated with
the job. Here is an example of a bash file to be submitted to Slurm:

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name="A long job"
#SBATCH --mem=5GB
#SBATCH --output=long-job.out
cd /path/where/to/start/the/job

# This may vary per HPC system. At USC's hpc system
# we use: source /usr/usc/R/default/setup.sh
module load R

Rscript --vanilla long-job-rscript.R
```

This example bash file, which we name "long-job-rscript.slurm", has the following
components:

*  `#!/bin/bash` The interpreter directive that is common to bash scripts.[^shvsbash]
   
*  The `#SBATCH` lines specify options for scheduling the job. In order, these
   options are: Set a maximum time of 1 hour, name the job `A long job`, allocate
   5GB of memory to the job, write all the output (including `Rscript`'s) to
   `long-job.out`.

*  The `cd` line changes the directory to some other place where the Rscript
   needs to be executed.
   
*  The `module` line loads R. There are various ways to do this, but users
   commonly need to load R explicitly before they can use it on the cluster.
   
*  Finally, `Rscript` executes the R script named `long-job-rscript.R`.

This batch script can be submitted to Slurm using the `sbatch` command line tool:

```bash
$ sbatch long-job-rscript.slurm
```

Overall, this is what `slurmR` does under the hood.
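
When submitting by hand, it is often useful to keep the job ID for later
queries. The sketch below assumes the `long-job-rscript.slurm` file from above
sits in the current directory; the helper `parse_job_id` is a hypothetical
name introduced here, not part of Slurm or `slurmR`.

```bash
#!/bin/bash
# Extract the job ID from `sbatch --parsable` output, which has the form
# "<jobid>" or "<jobid>;<cluster>".
parse_job_id() {
  printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
  # Options given on the command line take precedence over the #SBATCH
  # lines in the script, so this run gets a 2-hour limit instead of 1 hour.
  JOB_ID="$(parse_job_id "$(sbatch --parsable --time=02:00:00 long-job-rscript.slurm)")"
  echo "Submitted job ${JOB_ID}"
  # Show the job's current state in the queue.
  squeue --jobs="${JOB_ID}"
else
  echo "sbatch not found; run this on a machine with Slurm installed" >&2
fi
```

The `command -v` guard only makes the sketch safe to run on a machine without
Slurm; on a cluster, the two Slurm commands are all that is needed.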

[^shvsbash]: For more on this see
[this thread on StackExchange](https://askubuntu.com/questions/141928/what-is-the-difference-between-bin-sh-and-bin-bash).