
Fair-Share in PBSPro
This article aims to give an overview of how percentage resource assignment (a.k.a. fair-share) works in the PBSPro batch system. Please note that it contains certain approximations but is broadly accurate.

What is Fair Share Scheduling?
The idea behind fair-share scheduling is to allocate a percentage of a cluster to a user group in a statistically guaranteed manner. For example, a cluster of 100 nodes may have 20 "dedicated" to the physics department, 60 to the civil engineers and 20 to the students. Rather than only allowing users to access their "own" machines (i.e. private nodes), the scheduler ensures that, on average, each group gets the amount of resource allocated to it.

There are certain caveats and conditions though:


 * the users of the system must submit enough jobs
 * the job mix must be sufficiently diverse relative to the size of the cluster
 * the submitted jobs must not hit restrictions (e.g. run time or size) imposed by the scheduler

The first is easy to understand and is the same as if users had private nodes - if there is insufficient work submitted then the resources will sit idle. The advantage of fair-share is that others can use these idle resources.

The second can be illustrated by three groups each having an equal share in a 1024-way cluster and all submitting 512-way jobs. Because each job is very large compared to the cluster it is impossible to share the cluster equally three ways.

The final point is that there may be policies in place that, for example, restrict the number of long-running jobs in order to ensure that the system does not get blocked or monopolised. If a user only submits jobs of this kind they may be restricted by these policies before they are able to fill all of their allocated resource.

Advantages of sharing a cluster
There are a number of reasons for the overwhelming dominance of fair-share scheduling in mixed-usage HPC with the two principal ones being:


 * Because the allocation is a percentage and is implemented statistically, in the case of hardware failure or downtime the amount of resource available to each group only decreases proportionally. With private nodes something like a switch failure can result in a 100% loss of resource for the unlucky owners.
 * Users can benefit from periods when the cluster is under-utilised in order to run more of their jobs.

How it works
To give an example of what happens we can consider the case of the University of Elbonia, which has two departments, Maths and Biology. Because ownership of property in Elbonia is illegal, the administrator sets up a fair-share structure to allocate 2/3 of the cluster to Biology and the remaining 1/3 to Maths.

Biology 66%
 |- Bob
 `- Kate

Maths   33%
 |- Arash
 |- Claude
 `- Tim

If all the users are submitting enough jobs to fill the system then the scheduler ensures that, on average, 66% of the jobs run (or rather of the wall time used) are from the Biology department. Within a group there is no further ordering by PBSPro, so we have a FIFO situation: if Bob submits all his jobs on a Monday and Kate submits hers on a Tuesday, she will have to wait until all of Bob's jobs have finished.

If the cluster is not fully used then either group can take advantage of this to use a greater amount of resource than they have been allocated. In the extreme case during the Bioweek conference in March the Maths department use 100% of the cluster.

Upon returning from their conference the biologists immediately submit lots of tasks to the queue and the scheduler starts to re-establish the correct mix of jobs based on their allocated resource.

As resources are used the system records this as a running total that is compared to the allocated share in order to calculate the "fair-share factor". This can be thought of as the difference between the allocated share and the actual usage.

This can be expressed as

Effective Priority = Allocated Resource Percentage x F(Resources Allocated, Resources Used)

Different schedulers have slightly different ways of implementing this but with the same ultimate aim.

Internally PBSPro ranks users using P = usage / percentage allocated with a smaller number giving a higher priority.

Looking at an extract from a typical PBSPro installation we see:

maths   Shares: 148    Usage: 7414719838   Perc: 14.712%
astro   Shares:  19    Usage:  591040272   Perc:  1.889%
hr      Shares:  28    Usage: 2020109913   Perc:  2.783%
bio     Shares:  74    Usage: 3356608502   Perc:  7.356%
...     ...            ...                 ...
xxx     Shares: nnn    Usage: aaa          Perc:  a.b%

This gives the following ranking between these four groups (usage divided by the allocated fraction, smallest number first):

astro   Usage / Perc: 31288526840
bio     Usage / Perc: 45630893176
maths   Usage / Perc: 50399128861
hr      Usage / Perc: 72587492382

Therefore in this case the astro and bio groups are favoured by the scheduler, while hr has the lowest priority.
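
As a rough cross-check, this ranking can be reproduced with a few lines of Python. The snippet below simply applies the usage-over-allocation ratio described above to the numbers in the extract; it is a sketch for illustration, not PBSPro's actual internal code:

groups = {
    # group: (allocated percentage, recorded usage), from the extract above
    "maths": (14.712, 7414719838),
    "astro": (1.889, 591040272),
    "hr": (2.783, 2020109913),
    "bio": (7.356, 3356608502),
}

# Rank by usage divided by the allocated fraction; smaller means higher priority.
for name, (perc, usage) in sorted(groups.items(),
                                  key=lambda item: item[1][1] / (item[1][0] / 100)):
    print(f"{name:6s} usage/perc = {usage / (perc / 100):15,.0f}")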

Resource usage half-life
In order to facilitate the correct assignment of resources and to compensate for any excess or under-use of the system, schedulers have the concept of a "usage half-life", which works as follows: the historical usage is decayed (cut in half) every half-life, so that excess or under-usage is forgotten about with time. The following graphic shows the evolution of the fair-share factor for a user with an allocation of 20 arbitrary units (AU). Because the usage of 40 AU was two half-lives ago it is divided by 2 twice (so by 4), and so on.



We see that the greater-than-allocated usage from a few half-lives ago (possible because the cluster was empty at the time) is forgotten about with time.
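
A minimal Python sketch of this decay, assuming the simple model described above in which every recorded chunk of usage is halved once per elapsed half-life (the real scheduler decays its internal running totals; the function and data layout here are purely illustrative):

def decayed_usage(history, half_life):
    # history: list of (age, usage) pairs, where age is how long ago the
    # usage was recorded, in the same units as half_life.
    return sum(usage / 2 ** (age / half_life) for age, usage in history)

# The 40 AU used two half-lives ago now only counts as 10 AU (40 / 4):
print(decayed_usage([(2.0, 40.0)], half_life=1.0))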

The underlying reasons for this decay of historical usage are:

 * Not to "punish" those who have benefited from what would otherwise have been idle CPU cycles.
 * To allow users returning to the system to get their jobs running and thus re-establish an equilibrium with the allocated shares.
 * To give those who have not been using the system some compensation, especially for the fact that they have to wait for their jobs to be queued and dispatched.

In general the half-life should be more or less the same as the average run time of jobs on the system. If it is either too long or too short then the behaviour deviates from what is "fair".

Not being fair with fair-share
There are, nevertheless, certain circumstances in which one might want to use the fair-share scheduling method to achieve a different result. At the University of Elbonia each summer a number of students are given the chance to try their hand at research. In order to create a level playing field the project leader decides that they should all have the right to use a certain amount of computational power before the end of their projects. To do this the fair-share half-life is set to be infinite so usage is never forgotten about. The five students each have an equal share of a 100-node machine for two months (an Elbonian month is 4 weeks long). This equates to a CPU allocation of 40 node-months (NoMo) each:

Joe  40 NoMo
Sue  40 NoMo
Ben  40 NoMo
Pat  40 NoMo
Ken  40 NoMo
---
    200 NoMo

For the first two weeks nobody has any code ready, so the total available before the end of the project (150 NoMo) is now less than the total allocated (200 NoMo). Ken then starts to run jobs and finds that, as he is the only user of the system, he can use all the nodes for two weeks. By the end of the first month he has used up 50 NoMo but still seems to be getting some errors in his code that mean the results are not reliable. After the first month the other users are now ready and start to submit jobs alongside Ken, who has now found the source of his problems. Because Ken has already used up his share for the entire two months he is heavily penalised by the scheduler. His fair-share factor (here taken as allocated percentage divided by usage, so a larger number means a higher priority) becomes

P = ( 20 / 50 ) = 0.4

The others have an equal factor of

P = ( 20 / 0 ) = a very large number

The other four now share the cluster four ways, with Ken not being able to run any jobs. Two weeks later the total individual usage and fair-share factors are:

Joe  12.5 NoMo   P = 1.6
Sue  12.5 NoMo   P = 1.6
Ben  12.5 NoMo   P = 1.6
Pat  12.5 NoMo   P = 1.6
Ken  50   NoMo   P = 0.4

Ken has been unable to run any jobs for two weeks and has instead been forced to find an analytic solution to his problem, which wins him the Elbonian science federation medal. At the end of the two-month period the resource used and the proportion of the amount allocated is shown below, along with the percentage that would have resulted if fair-share scheduling with a very short half-life had been used:

Joe  25 NoMo  -> 62.5%  ( FF_short -> 50% )
Sue  25 NoMo  -> 62.5%  ( FF_short -> 50% )
Ben  25 NoMo  -> 62.5%  ( FF_short -> 50% )
Pat  25 NoMo  -> 62.5%  ( FF_short -> 50% )
Ken  50 NoMo  -> 125%   ( FF_short -> 175% )
---
    150 NoMo
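
The arithmetic in this example is simple enough to check with a short Python sketch. It applies the share-over-usage factor used above, with usage never decayed because the half-life is infinite; the function name is purely illustrative:

# Each student is allocated 40 NoMo out of 200, i.e. a 20% share of the machine.
SHARE_PCT = 20.0

def fair_share_factor(share_pct, usage):
    # As used in this example: share divided by usage, larger = higher priority.
    return float("inf") if usage == 0 else share_pct / usage

# Two weeks after the others start submitting jobs:
usage = {"Joe": 12.5, "Sue": 12.5, "Ben": 12.5, "Pat": 12.5, "Ken": 50.0}
for name, used in usage.items():
    print(f"{name}: {used:5.1f} NoMo  P = {fair_share_factor(SHARE_PCT, used):.1f}")
# -> 1.6 for Joe, Sue, Ben and Pat, and 0.4 for Ken, as in the table above.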

After this experience Ken has learned that, when using a fair-share system with a long half-life, it is better to let resources sit unused unless you are really sure that you are not going to need them later.

Fair-share within groups?
It is tempting to try to control things at a finer level - an example being that, within a group, people who have not used much resource might want to jump ahead of their colleagues' tasks. The issue here is that PBSPro needs to compare apples with apples, so if we do this then we have to allocate a share of the machine at the individual level. At the University of Elbonia the heads of department carve up the allocation as follows:

Biology 66%
 |- Bob     50%  -> 33%    machine share
 `- Kate    50%  -> 33%    machine share

Maths   33%
 |- Arash   80%  -> 26.4%  machine share
 |- Claude  10%  ->  3.3%  machine share
 `- Tim     10%  ->  3.3%  machine share

We see that overall the biologists still have 2/3 of the machine and the mathematicians 1/3 and if all users submit enough jobs to fill the system then everybody will be happy.

One week Bob is away and so does not submit jobs which means that the system has some free space. Because the allocation is per person and not per group this unused capacity does not go to Biology (i.e. Kate) but is rather shared equally between the two groups.

The reason for this is that Kate's share (33%) is the same as that of all the mathematicians combined (26.4 + 3.3 + 3.3).
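
That reasoning can be sketched in a few lines of Python, assuming (purely for illustration) that spare capacity is re-shared among the active users in proportion to their machine shares:

# Per-user machine shares from the tree above: group share x share within group.
machine_share = {
    "Bob": 0.66 * 0.50,      # 33%
    "Kate": 0.66 * 0.50,     # 33%
    "Arash": 0.33 * 0.80,    # 26.4%
    "Claude": 0.33 * 0.10,   # 3.3%
    "Tim": 0.33 * 0.10,      # 3.3%
}

# With Bob away, renormalise over the remaining active users.
active = {user: share for user, share in machine_share.items() if user != "Bob"}
total = sum(active.values())
for user, share in active.items():
    print(f"{user:7s} {share / total:6.1%}")
# Kate gets 50% and the three mathematicians 50% between them, because Kate's
# 33% equals the mathematicians' combined 33%.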

Summary
Fair-share ensures that, despite having no private nodes, users receive the amount of resource allocated to them. It is, nevertheless, only the final stage of job sorting and other factors such as the local queue structure, requested resources and other restrictions can influence the end result. This is why other users, even in the same group, may seem to "jump ahead" in the queue.

References and further reading
Section 4.8.18, "Using Fairshare", of the PBSPro Administration Guide contains the technical details of how to configure fair-share.

The Wikipedia article on the subject is sadly lacking in detail: http://en.wikipedia.org/wiki/Fair-share_scheduling