
=Analysis of several resource managers and schedulers for SuperB=

==Introduction==
SuperB is a federation of clusters that was described in the last issue of the Flash Informatique Special HPC. A key component of SuperB is the scheduler, and users' needs have exposed the shortcomings of the Torque/Maui combination. This led us to start a study of alternative schedulers for use in SuperB and similar setups.

==Schedulers in the study==
The reference scheduler is Maui, paired with the Torque resource manager. The other schedulers considered are:


 * Slurm
 * PBSPro
 * Moab (in combination with Torque)
 * Platform LSF
 * Condor
 * Sun Grid Engine
 * IBM LoadLeveler
 * OAR

These schedulers were chosen because each has a fairly large user base (we want community support).

==Specifications for the perfect scheduler==
Our experience with Maui gave us a clear idea of what we want from a scheduler. Here is a short list of features:

Vital:


 * node-to-users mapping
 * preempting (of all types of jobs)

Desirable:


 * short pool

===Node to users mapping = node to queue mapping + nodes ACLs===
In SuperB, users submit jobs to a routing queue, which forwards each job to the queue of the user's lab. That queue has access to a specific set of nodes. The net result is that, by default, a user has access only to a defined set of nodes.

This is achieved in Torque + Maui in the following way:


 * create a queue for the lab
 * set the queue's acl_host_enable to False (the host list is then handled by Maui rather than by Torque)
 * set the queue's acl_hosts to the lab's nodes
 * set the queue's acl_user_enable to True
 * set the queue's acl_users to the lab's users
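
The five steps above map onto qmgr commands. The following Python sketch only builds the command strings. It assumes the standard Torque queue attributes acl_host_enable, acl_hosts, acl_user_enable and acl_users; the node and user names are hypothetical placeholders, and itp is the lab queue used as an example in the CLASSCFG line further down. It does not set queue_type, enabled or started, which a working queue also needs.

```python
def acl_lines(queue, attribute, values):
    """Build 'set queue' lines for a list attribute.

    The first value is assigned with '=', later values are appended with '+='.
    """
    lines = []
    for index, value in enumerate(values):
        operator = "=" if index == 0 else "+="
        lines.append(f"set queue {queue} {attribute} {operator} {value}")
    return lines


def lab_queue_commands(queue, hosts, users):
    """Return the qmgr commands for the five steps, in order."""
    commands = [
        f"create queue {queue}",
        # Torque stops enforcing the host list itself; Maui uses it instead.
        f"set queue {queue} acl_host_enable = False",
    ]
    commands += acl_lines(queue, "acl_hosts", hosts)
    commands.append(f"set queue {queue} acl_user_enable = True")
    commands += acl_lines(queue, "acl_users", users)
    return commands


if __name__ == "__main__":
    # Hypothetical node and user names, for illustration only.
    for command in lab_queue_commands("itp", ["node01", "node02"], ["alice", "bob"]):
        print(f'qmgr -c "{command}"')
```

Each printed line can be reviewed before it is run by hand on the Torque server, which is safer than letting a script modify the queues directly.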

===Preempting===
Preempting is the act of stopping a task with the intention of resuming it at a later time. In our case, a batch job is stopped (killed) and requeued in order to give priority to an owner's job.

This is achieved with Maui in the following way:

The jobs in the batch queue (guests' jobs) are declared as preemptees:

QOSCFG[batch] QFLAGS=PREEMPTEE

The owner jobs are declared as preemptors:

QOSCFG[owner]  QFLAGS=PREEMPTOR:IGNSYSTEM

A preemptor can preempt a preemptee. Queues are then classified as owner or batch, for example:

CLASSCFG[itp] QDEF=owner PRIORITY=10000
CLASSCFG[batch] QDEF=batch PRIORITY=0

According to the documentation, this should be all that is needed. In practice, however, Maui only preempts jobs that were started through backfill (with aggressive backfilling, the scheduler backfills jobs even when doing so is risky, on the understanding that backfilled jobs can be preempted). As a result, not all jobs in the batch queue can be preempted. We work around this with a dirty hack based on batch scripts, but it is not a good solution, and it is the number one reason we want to move away from Maui.
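
The difference between the behaviour we want and the behaviour we observe can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: nodes are counted rather than named, the node-to-queue mapping is ignored, and the Job class, the select_victims function and the backfilled_only switch are hypothetical names that do not correspond to anything in Maui.

```python
from dataclasses import dataclass

# QOS flags as in the configuration above (IGNSYSTEM is left out of the model).
QOS_FLAGS = {"owner": {"PREEMPTOR"}, "batch": {"PREEMPTEE"}}


@dataclass
class Job:
    name: str
    qos: str                  # "owner" or "batch"
    nodes: int                # number of nodes the job occupies
    backfilled: bool = False  # True if the job was started through backfill


def select_victims(running, needed_nodes, free_nodes, backfilled_only=False):
    """Pick preemptee jobs to requeue until needed_nodes are available.

    Returns the list of jobs to requeue, or None if the preemptor cannot start.
    With backfilled_only=True, only backfilled preemptees are eligible, which
    mimics the behaviour we observe.
    """
    victims = []
    available = free_nodes
    for job in running:
        if available >= needed_nodes:
            break
        if "PREEMPTEE" not in QOS_FLAGS[job.qos]:
            continue
        if backfilled_only and not job.backfilled:
            continue
        victims.append(job)
        available += job.nodes
    return victims if available >= needed_nodes else None


if __name__ == "__main__":
    running = [
        Job("b1", "batch", nodes=4),
        Job("b2", "batch", nodes=2, backfilled=True),
        Job("o1", "owner", nodes=2),
    ]
    # An owner job needs 4 nodes and none are free.
    print(select_victims(running, needed_nodes=4, free_nodes=0))
    print(select_victims(running, needed_nodes=4, free_nodes=0, backfilled_only=True))
```

In this example the first call requeues b1, while the second call finds only b2 eligible and cannot free enough nodes, so the owner job has to wait.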

===Short pool===
The short pool policy ensures that a given number (or percentage) of nodes will become available within a given period of time. This is useful when high-priority jobs arrive and you want to guarantee a queue time no longer than a chosen value.
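
One way to read this policy is as an admission check made before each job start. The Python sketch below is an assumption about how such a check could work, not a description of how any of the candidate schedulers implements it. Jobs are modelled as (nodes, remaining walltime in seconds) pairs, and the function names are hypothetical.

```python
def nodes_free_within(total_nodes, running, window):
    """Count nodes that are idle now or held by jobs ending within `window` seconds.

    `running` is a list of (nodes, remaining_walltime_seconds) pairs.
    """
    busy_beyond_window = sum(nodes for nodes, remaining in running if remaining > window)
    return total_nodes - busy_beyond_window


def can_start(total_nodes, running, job, window, pool_size):
    """Return True if `job` fits on idle nodes and keeps the short pool intact.

    The short pool is intact if at least `pool_size` nodes are idle or will be
    released within `window` seconds after the job has started.
    """
    nodes, _walltime = job
    idle = total_nodes - sum(n for n, _ in running)
    if nodes > idle:
        return False
    return nodes_free_within(total_nodes, running + [job], window) >= pool_size


if __name__ == "__main__":
    running = [(4, 7200), (2, 600)]  # 6 of 10 nodes busy
    # Policy: 4 nodes must be available within one hour.
    print(can_start(10, running, (3, 86400), window=3600, pool_size=4))
    print(can_start(10, running, (3, 1800), window=3600, pool_size=4))
```

Here the one-day job is refused because it would leave only 3 nodes available within the hour, while the 30-minute job is accepted because its nodes come back inside the window.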

==Brief description of the candidate schedulers==

 * Slurm
 * PBSPro
 * Moab (in combination with Torque)
 * Platform LSF
 * Condor
 * Sun Grid Engine
 * IBM LoadLeveler
 * OAR