Slurm Cluster Resolver

The Slurm Cluster Resolver resolves the cluster specification for distributing TensorFlow work launched on HPC systems running Slurm. The implementation handles homogeneous as well as heterogeneous task distributions, as long as the number of GPUs per node and per task is the same. For example, on nodes with 4 GPUs each it is possible to allocate 4 processes on node A and only 2 on node B. The resolution is done by reading the job configuration from a number of Slurm environment variables and can be overridden by user input. By default everything is determined from the Slurm environment, so for most use cases no manual setting of parameters is required.

How it works

SlurmClusterResolver reads the environment variables that are set inside a job step launched by Slurm. This means it will only work correctly for applications launched via srun.

The process ID/rank is extracted from the environment variable SLURM_PROCID and the total number of tasks launched is extracted from SLURM_STEP_NUM_TASKS. The hostnames are resolved by inspecting SLURM_STEP_NODELIST. The number of tasks per node is extracted from SLURM_STEP_TASKS_PER_NODE unless a value is specified by the user; this variable is what makes heterogeneous task distributions possible. The user can set tasks_per_node to a single integer for homogeneous tasks, or to a dictionary mapping node names to numbers of tasks for heterogeneous distributions. However, setting this is NOT recommended, as the manually specified distribution may no longer match the rank reported by SLURM_PROCID.
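
If the task distribution does need to be overridden, a minimal sketch (the hostnames are placeholders, and as noted above this is normally better left to the Slurm environment):

```
import tensorflow as tf

# Homogeneous override: 2 tasks on every node.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    tasks_per_node=2)

# Heterogeneous override: explicit node-name -> task-count mapping.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    tasks_per_node={'t02n13': 4, 't02n41': 2})
```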

A base port can be specified by the user, and if more than one task is launched on a node the port number is incremented for each additional task on that node. A reasonable default is used, so specifying the port is usually not necessary.
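
For example, a sketch with a custom base port (the default shows up as ports 8888/8889 in the basic example below):

```
import tensorflow as tf

# With two tasks per node and port_base=5555, the first task on a node is
# addressed as <host>:5555 and the second as <host>:5556.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    port_base=5555)
```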

The number of GPUs present on each node and the number of GPUs for each task are automatically detected. This is done by first checking CUDA_VISIBLE_DEVICES (which Slurm sets to the list of GPUs for the current node), with a fallback to nvidia-smi. If neither works, or non-NVIDIA GPUs are used, these two values have to be specified by the user. By default the allocated GPUs are automatically exposed to the processes according to this specification by setting CUDA_VISIBLE_DEVICES.
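
In the case where automatic detection does not apply, a sketch of specifying the accelerator layout explicitly (the counts below are illustrative only):

```
import tensorflow as tf

# Explicitly describe the accelerator layout instead of relying on
# CUDA_VISIBLE_DEVICES / nvidia-smi detection.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    gpus_per_node=8,      # accelerators physically present on each node
    gpus_per_task=2,      # accelerators each task should use
    auto_set_gpu=False)   # do not rewrite CUDA_VISIBLE_DEVICES automatically
```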

Basic example

  • Slurm allocation in shell: salloc --nodes=2 -t 01:30:00 --ntasks-per-node=2 --gres=gpu:k80:4 --exclusive
  • Run the example: srun python tf_example.py
  • Creating the cluster in Python:

    ```
    import tensorflow as tf

    cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        cluster_resolver=cluster_resolver)
    with strategy.scope():
        # Load and compile model and data
        ...
    ```

The above example will allocate 4 jobs on 2 nodes, with each node having 2 jobs and 4 GPUs. cluster_resolver.cluster_spec() will return a cluster specification object in protobuf format with the following value (host names may vary):

```
job {
  name: "worker"
  tasks { key: 0 value: "t02n13:8888" }
  tasks { key: 1 value: "t02n13:8889" }
  tasks { key: 2 value: "t02n41:8888" }
  tasks { key: 3 value: "t02n41:8889" }
}
```

The job_name will be worker for all tasks and the task_index will range from 0 to 3. GPUs are also allocated automatically, so the first job on each node will see GPUs 0 and 1, and the second job GPUs 2 and 3.

Advanced example

  • Assuming the same job parameters (salloc & srun) as above
  • Creating the cluster in Python:

    ```
    cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
        {'ps': 1, 'worker': 3},
        port_base=1337,
        tasks_per_node=2,
        gpus_per_node=2,
        gpus_per_task=1,
        auto_set_gpu=False)

    cluster = cluster_resolver.cluster_spec()
    job_name, task_index = cluster_resolver.get_task_info()
    ```

In this case 1 parameter server job and 3 worker jobs are used. The resulting protobuf specification will look similar to this:

```
job {
  name: "ps"
  tasks { key: 0 value: "t02n13:1337" }
}
job {
  name: "worker"
  tasks { key: 0 value: "t02n13:1338" }
  tasks { key: 1 value: "t02n41:1337" }
  tasks { key: 2 value: "t02n41:1338" }
}
```

The value of job_name will be ps for t02n13:1337 and worker for all other tasks. No GPU allocation is done by the cluster resolver, so it has to be done manually. This is useful if, for example, GPU 0 should go to the first process and GPU 3 to the second process on each node. Also note that only 1 GPU will be used per task.
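
A possible sketch of such manual pinning; the GPU mapping is illustrative, SLURM_LOCALID is used as the local task index on the node, and none of this is part of the resolver itself:

```
import os
import tensorflow as tf

# Local task index on this node, set by Slurm for srun-launched tasks.
local_rank = int(os.environ['SLURM_LOCALID'])

# Illustrative mapping: first task per node gets GPU 0, second gets GPU 3.
gpu_index = {0: 0, 1: 3}[local_rank]

gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[gpu_index], 'GPU')
```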

Extension points

The class SlurmClusterResolver provides some methods that are meant to be overridden by derived classes:

  • _resolve_own_rank

  • _resolve_num_tasks

  • _resolve_hostlist

  • _resolve_task_configuration

These can be used to implement a cluster resolver that gets its information from a different source, e.g. via MPI, a file, or other environment variables. See the documentation of these methods for what to return.
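
As an illustration, a minimal sketch of a derived resolver that reads its configuration from custom environment variables. The MY_* variable names are hypothetical, and the exact return types expected from each method should be taken from their docstrings:

```
import os
import tensorflow as tf


class EnvVarClusterResolver(tf.distribute.cluster_resolver.SlurmClusterResolver):
  """Sketch of a resolver fed by custom environment variables."""

  def _resolve_own_rank(self):
    # Global rank of the current task, e.g. "3".
    return int(os.environ['MY_RANK'])

  def _resolve_num_tasks(self):
    # Total number of tasks in the job, e.g. "4".
    return int(os.environ['MY_NUM_TASKS'])

  def _resolve_hostlist(self):
    # Comma-separated list of node names, e.g. "node0,node1".
    return os.environ['MY_HOSTLIST'].split(',')

  def _resolve_task_configuration(self):
    # Node name to task count, e.g. "node0:2,node1:2".
    return {
        entry.split(':')[0]: int(entry.split(':')[1])
        for entry in os.environ['MY_TASK_CONFIG'].split(',')
    }
```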