Bellatrix scheduler requirements

This page lists the requirements that we (the users) consider necessary to implement or activate on the scheduler of bellatrix (and possibly also aries) in order to work effectively with the cluster.

Let us try to keep the list to the really important requirements, maybe opening another page for less important requests.

For the first phase, I suggest adding comments to other users' content rather than simply deleting it, and using the discussion page for more detailed discussion of specific topics. Add your group name in brackets in front of any point you add or agree with, so that we have a rough idea of who needs what.

Requirements

 * [THEOS,LAMMM] Per-user fair-share on the private nodes: the scheduler must handle, at the same time, a user "A" who fills the queue with hundreds of small jobs and a user "B" who almost never uses the machine but wants their jobs to start running almost immediately after submission, even if user "A" has many jobs queued
 * [THEOS,LAMMM] Per-group and per-user fair-share queuing system on the shared nodes, similarly to the previous point
 * [THEOS,LAMMM] Documentation of the fair-share algorithm (how priorities are assigned, how the amount paid by each group is taken into account, which period of past usage is considered), and the possibility to discuss and tune the parameters for optimal usage
 * [THEOS,LAMMM] A utility (e.g. a command-line tool or a web page) to see in "real time" the priority that the system would assign to my job, and a way to see the priority assigned to jobs already sitting in the queue. Possibility to get from PBSPro an estimated start time for a job.
 * [THEOS,LAMMM] The length of the reference period used for the calculation of the priority should be shorter than 6 months (between 2 weeks and 2 months?) or even less in private queues, and tunable (for private queues) by the group owning the nodes (see also the discussion page)
 * [THEOS,LAMMM] For clarity, let us assume as an example that the length of the reference period is 1 month, and today is the 20th of June. Then the reference period should be the previous 30 days, i.e. 21st May - 19th June, rather than the previous calendar month (1st May - 31st May), or at least there should be an option to choose this behaviour (see also the discussion page)
 * [THEOS,LAMMM] The system should allow users to submit large jobs (i.e. jobs that use many nodes) and reserve nodes for them, so that such a job starts executing within a reasonable time (say, of the order of its walltime) even if the queue is full of very small 1-node jobs, on both private and shared nodes.
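
To make the reference-period example in the list above concrete, here is a minimal Python sketch that compares the two ways of choosing the period. The function names are hypothetical and the year is arbitrary; this only illustrates the date arithmetic and is not how PBSPro computes fair-share usage.

```python
from datetime import date, timedelta
from typing import Tuple


def rolling_window(today: date, days: int = 30) -> Tuple[date, date]:
    """Return (first_day, last_day) of the `days` days ending yesterday."""
    last_day = today - timedelta(days=1)
    first_day = today - timedelta(days=days)
    return first_day, last_day


def previous_calendar_month(today: date) -> Tuple[date, date]:
    """Return (first_day, last_day) of the calendar month before `today`."""
    # The day before the 1st of the current month is the last day
    # of the previous month.
    last_day = today.replace(day=1) - timedelta(days=1)
    first_day = last_day.replace(day=1)
    return first_day, last_day


today = date(2024, 6, 20)  # the year is an arbitrary assumption
print("rolling 30 days:", rolling_window(today))
print("previous month: ", previous_calendar_month(today))
```

With today set to the 20th of June, the rolling window covers 21st May - 19th June, whereas the calendar-month choice ignores all usage from June.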

Further requirements

 * [THEOS] It would be beneficial if the per-user limits on the number of queued and running jobs were increased on Q_free, with the possibility of also running on unused private nodes (at the moment, it seems that Q_free jobs can only access the shared nodes).