Magellan Linux

Annotation of /trunk/kernel26-magellan/patches-2.6.21-r9/0001-2.6.21-sd-0.48.patch



Revision 289
Wed Aug 15 20:37:46 2007 UTC (16 years, 9 months ago) by niro
File size: 89890 byte(s)
ver bump to 2.6.21-magellan-r9:
- updated to linux-2.6.21.7

1 niro 289 Staircase Deadline cpu scheduler policy
2     ================================================
3    
4     Design summary
5     ==============
6    
7     A novel design which incorporates a foreground-background descending priority
8     system (the staircase) via a bandwidth allocation matrix according to nice
9     level.
10    
11    
12     Features
13     ========
14    
15     A starvation free, strict fairness O(1) scalable design with interactivity
16     as good as the above restrictions can provide. There is no interactivity
17     estimator, no sleep/run measurements and only simple fixed accounting.
18     The design and accounting are strict enough that task behaviour
19     can be modelled and maximum scheduling latencies can be predicted by
20     the virtual deadline mechanism that manages runqueues. The prime concern
21     in this design is to maintain, at all costs, the fairness determined by nice
22     level, yet to provide as good interactivity as is possible within the
23     constraints of strict fairness.
24    
25    
26     Design description
27     ==================
28    
29     SD works on the principle of providing each task a quota of runtime that it is
30     allowed to run at each of a number of priority levels determined by its static priority
31     (ie. its nice level). If the task uses up its quota it has its priority
32     decremented to the next level determined by a priority matrix. Once the
33     runtime quota of every priority level has been consumed, the task is queued on the
34     "expired" array. When no other tasks exist with quota, the expired array is
35     activated and fresh quotas are handed out. This is all done in O(1).
36    
37     Design details
38     ==============
39    
40     Each task keeps a record of its own entitlement of cpu time. Most of the rest of
41     these details apply to non-realtime tasks, as rt task management is
42     straightforward.
43    
44     Each runqueue keeps a record of what major epoch it is up to in the
45     rq->prio_rotation field which is incremented on each major epoch. It also
46     keeps a record of the current prio_level for each static priority task.
47    
48     Each task keeps a record of what major runqueue epoch it was last running
49     on in p->rotation. It also keeps a record of what priority levels it has
50     already been allocated quota from during this epoch in a bitmap p->bitmap.
51    
52     The only tunable that determines all other details is the RR_INTERVAL. This
53     is set to 8ms, and is scaled gently upwards with more cpus. This value is
54     tunable via a /proc interface.
55    
56     All tasks are initially given a quota based on RR_INTERVAL. This is equal to
57     RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
58     progressively larger for nice values below -6, up to ten times RR_INTERVAL at
59     nice -20. This is assigned to p->quota and only changes with changes in nice level.
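
The quota sizing just described can be illustrated with a small userspace
sketch. It mirrors the rr_quota() helper in the patch body further down, but
the standalone form, the function name rr_quota_us and the example figures
are only illustrative:

        /* Quota handed out per priority slot, in microseconds.
         * rr_interval is in milliseconds (default 8). */
        static int rr_quota_us(int nice, int rr_interval)
        {
                int rr = rr_interval;

                if (nice < -6) {
                        /* progressively larger below nice -6:
                         * nice -10 -> 2.5 * rr, nice -20 -> 10 * rr */
                        rr *= nice * nice;
                        rr /= 40;
                } else if (nice > 0) {
                        /* positive nice values get half, but at least 1ms */
                        rr = rr / 2 ? rr / 2 : 1;
                }
                return rr * 1000;       /* MS_TO_US */
        }

With the default 8ms rr_interval this gives 8000us at nice 0, 4000us at
nice 5, 20000us at nice -10 and 80000us at nice -20.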
60    
61     When a task is first queued, recalc_task_prio checks whether it has run during
62     this runqueue's current priority rotation. If it has not, it will have its
63     p->prio level set according to the first slot in a "priority matrix" and will be
64     given a p->time_slice equal to p->quota, and will have its allocation bitmap bit
65     set in p->bitmap for this prio level. It is then queued on the current active
66     priority array.
67    
68     If a task has already been running during this major epoch, and it has
69     p->time_slice left and the rq->prio_quota for the task's p->prio still
70     has quota, it will be placed back on the active array, but no more quota
71     will be added.
72    
73     If a task has been running during this major epoch, but does not have
74     p->time_slice left, it will find the next lowest priority in its bitmap that it
75     has not been allocated quota from. It then gets a full quota in
76     p->time_slice. It is then queued on the current active priority array at the
77     newly determined lower priority.
78    
79     If a task has been running during this major epoch, and does not have
80     any entitlement left in p->bitmap and no time_slice left, it will have its
81     bitmap cleared, and be queued at its best prio again, but on the expired
82     priority array.
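
The four cases above can be condensed into one decision routine. The
following is only a hedged userspace model that loosely mirrors
recalc_task_prio() in the patch body further down; the struct layout and
helper names (sd_task, next_entitled, requeue) are invented for the example,
and runqueue-level bookkeeping such as the arrays themselves is left out:

        #include <stdbool.h>
        #include <string.h>

        #define PRIO_RANGE 40

        /* Simplified per-task state, following the fields described above. */
        struct sd_task {
                long long time_slice;   /* ns left at the current slot     */
                long long quota;        /* ns handed out per slot          */
                unsigned long rotation; /* last major epoch this task saw  */
                bool used[PRIO_RANGE];  /* slots already consumed (bitmap) */
                int prio;               /* current dynamic priority slot   */
                bool expired;           /* queued on the expired array     */
        };

        /* First slot at or after 'from' that the matrix row for this nice
         * level allows ('0') and that has not yet been used this epoch.   */
        static int next_entitled(const struct sd_task *p, const char *row,
                                 int from)
        {
                int s;

                for (s = from; s < PRIO_RANGE; s++)
                        if (row[s] == '0' && !p->used[s])
                                return s;
                return PRIO_RANGE;      /* nothing left: expire */
        }

        /* The four cases above, in order, applied when a task is queued. */
        static void requeue(struct sd_task *p, const char *row,
                            unsigned long rq_rotation)
        {
                if (p->rotation != rq_rotation) {
                        /* first run this major epoch: fresh bitmap, quota */
                        memset(p->used, 0, sizeof(p->used));
                        p->rotation = rq_rotation;
                        p->time_slice = p->quota;
                        p->prio = next_entitled(p, row, 0);
                } else if (p->time_slice > 0) {
                        return;         /* keep the slot, no extra quota */
                } else {
                        int s = next_entitled(p, row, p->prio + 1);

                        if (s == PRIO_RANGE) {
                                /* all entitlement used: clear the bitmap
                                 * and queue on the expired array          */
                                memset(p->used, 0, sizeof(p->used));
                                p->expired = true;
                                s = next_entitled(p, row, 0);
                        }
                        p->prio = s;
                        p->time_slice = p->quota;
                }
                p->used[p->prio] = true;
        }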
83    
84     When a task is queued, it has its relevant bit set in the array->prio_bitmap.
85    
86     p->time_slice is stored in nanoseconds and is updated via update_cpu_clock on
87     schedule() and scheduler_tick. If p->time_slice drops below zero then
88     recalc_task_prio readjusts the task's priority and the task is rescheduled.
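
A minimal sketch of that accounting, reusing the sd_task model from the
sketch above (charge_runtime_ns is a hypothetical helper, not the kernel's
update_cpu_clock()):

        /* Charge the nanoseconds just run against the current slot's
         * entitlement; a non-zero return means the task must be requeued
         * at its next entitled priority slot. */
        static int charge_runtime_ns(struct sd_task *p, long long ran_ns)
        {
                p->time_slice -= ran_ns;
                return p->time_slice < 0;
        }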
89    
90    
91     Priority Matrix
92     ===============
93    
94     In order to minimise the latencies between tasks of different nice levels
95     running concurrently, the dynamic priority slots where different nice levels
96     are queued are dithered instead of being sequential. What this means is that
97     there are 40 priority slots where a task may run during one major rotation,
98     and the allocation of slots is dependent on nice level. In the
99     following table, a zero represents a slot where the task may run.
100    
101     PRIORITY:0..................20.................39
102     nice -20 0000000000000000000000000000000000000000
103     nice -10 1000100010001000100010001000100010010000
104     nice 0 1010101010101010101010101010101010101010
105     nice 5 1011010110110101101101011011010110110110
106     nice 10 1110111011101110111011101110111011101110
107     nice 15 1111111011111110111111101111111011111110
108     nice 19 1111111111111111111111111111111111111110
109    
110     As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
111     task only runs one slot per major rotation. This dithered table allows for the
112     smallest possible maximum latencies between tasks of varying nice levels, thus
113     allowing vastly different nice levels to be used.
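
That claim is easy to verify by counting the zero slots in each row; a few
throwaway lines of C are enough (the rows are copied verbatim from the table
above and the program is purely illustrative):

        #include <stdio.h>

        /* Count the '0' (runnable) slots in one row of the table above. */
        static int slots(const char *row)
        {
                int n = 0;

                for (; *row; row++)
                        n += (*row == '0');
                return n;
        }

        int main(void)
        {
                printf("nice -20: %d\n",
                       slots("0000000000000000000000000000000000000000"));
                printf("nice  19: %d\n",
                       slots("1111111111111111111111111111111111111110"));
                return 0;       /* prints 40 and 1 respectively */
        }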
114    
115     SCHED_BATCH tasks are managed slightly differently, receiving only the top
116     slots from their priority bitmap, giving them the same cpu share as
117     SCHED_NORMAL tasks but with slightly higher latencies.
118    
119    
120     Modelling deadline behaviour
121     ============================
122    
123     As the accounting in this design is strict and not modified by sleep average
124     calculations or interactivity modifiers, it is possible to accurately
125     predict the maximum latency that a task may experience under different
126     conditions. This is a virtual deadline mechanism enforced by mandatory
127     timeslice expiration and not outside bandwidth measurement.
128    
129     The maximum duration a task can run during one major epoch is determined by its
130     nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
131     duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
132     epoch, and so on. The table in the priority matrix above demonstrates how this
133     is enforced.
134    
135     Therefore the maximum duration a runqueue epoch can take is determined by
136     the number of tasks running, and their nice level. Beyond that, the maximum
137     time a task can wait before it gets scheduled is determined by the position
138     of its first slot on the matrix.
139    
140     The following examples are _worst case scenarios_ that would rarely
141     occur, but they can be modelled nonetheless to determine the maximum possible
142     latency.
143    
144     So for example, if two nice 0 tasks are running, and one has just expired as
145     another is activated for the first time receiving a full quota for this
146     runqueue rotation, the first task will wait:
147    
148     nr_tasks * max_duration + nice_difference * rr_interval
149     1 * 19 * RR_INTERVAL + 0 = 152ms
150    
151     In the presence of a nice 10 task, a nice 0 task would wait a maximum of
152     1 * 10 * RR_INTERVAL + 0 = 80ms
153    
154     In the presence of a nice 0 task, a nice 10 task would wait a maximum of
155     1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms
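
All three figures follow from the same formula; a throwaway helper makes the
arithmetic explicit (RR_INTERVAL of 8ms assumed, as in the examples above;
the function is illustrative only):

        /* Worst-case wait, in ms, per the formula above:
         *   nr_tasks * slots_per_epoch * RR_INTERVAL
         *            + nice_difference * RR_INTERVAL
         * with RR_INTERVAL = 8ms as used in the examples. */
        static int worst_case_wait_ms(int nr_tasks, int slots_per_epoch,
                                      int nice_difference)
        {
                const int rr_interval = 8;

                return nr_tasks * slots_per_epoch * rr_interval +
                        nice_difference * rr_interval;
        }

        /* worst_case_wait_ms(1, 19, 0) == 152  (nice 0 vs nice 0)
         * worst_case_wait_ms(1, 10, 0) ==  80  (nice 0 behind nice 10)
         * worst_case_wait_ms(1, 19, 1) == 160  (nice 10 behind nice 0)   */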
156    
157     More useful than these values, though, are the average latencies, which are
158     found by determining the average distance between the priority slots of
159     different nice values and multiplying it by the tasks' quota. For example,
160     in the presence of a nice -10 task, a nice 0 task will wait either one or
161     two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
162     this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or
163     20 and 40ms respectively (on uniprocessor at 1000HZ).
164    
165    
166     Achieving interactivity
167     =======================
168    
169     A requirement of this scheduler design was to achieve good interactivity
170     despite being a completely fair deadline based design. The disadvantage of
171     designs that try to achieve interactivity is that they usually do so at
172     the expense of maintaining fairness. As cpu speeds increase, the requirement
173     for some sort of metered unfairness towards interactive tasks becomes a less
174     for some sort of metered unfairness towards interactive tasks becomes a less
175     desirable phenomenon, but low latency and fairness remain mandatory for
176     good interactive performance.
177     This design relies on the fact that interactive tasks, by their nature,
178     sleep often. Most fair scheduling designs end up indirectly penalising such
179     tasks, giving them less than their possible fair share because of that sleep,
180     and have to offset this by bonusing their priority based on how long they
181     sleep. This becomes increasingly inaccurate
182     as the number of running tasks rises and more tasks spend time waiting on
183     runqueues rather than sleeping, and it is impossible to tell whether the
184     task that's waiting on a runqueue only intends to run for a short period and
185     then sleep again after that runqueue wait. Furthermore, all such designs rely
186     on a period of time to pass to accumulate some form of statistic on the task
187     before deciding on how much to give them preference. The shorter this period,
188     the more rapidly bursts of cpu use ruin an interactive task's behaviour. The
189     longer this period, the longer it takes for interactive tasks to get low
190     scheduling latencies and fair cpu.
191    
192     This design does not measure sleep time at all. Interactive tasks that sleep
193     often will wake up having consumed very little if any of their quota for
194     the current major priority rotation. The longer they have slept, the less
195     likely they are to even be on the current major priority rotation. Once
196     woken up, though, they get to use up whatever quota they have for that epoch,
197     whether that is part of a quota or a full one. Overall, however, they
198     can still only run as much cpu time for that epoch as any other task of the
199     same nice level. This means that two tasks behaving completely differently
200     from fully cpu bound to waking/sleeping extremely frequently will still
201     get the same quota of cpu, but the latter will be using its quota for that
202     epoch in bursts rather than continuously. This guarantees that interactive
203     tasks get the same amount of cpu as cpu bound ones.
204    
205     The other requirement of interactive tasks is to obtain low latencies
206     when they are scheduled. Unlike fully cpu bound tasks and the maximum
207     latencies possible described in the modelling deadline behaviour section
208     above, tasks that sleep will wake up with quota available usually at the
209     current runqueue's priority_level or better. This means that the most latency
210     they are likely to see is one RR_INTERVAL, and often they will preempt the
211     current task if it is not of a sleeping nature. This then guarantees very
212     low latency for interactive tasks, and the lowest latencies for the least
213     cpu bound tasks.
214    
215    
216     Fri, 4 May 2007
217    
218     Signed-off-by: Con Kolivas <kernel@kolivas.org>
219    
220     ---
221     Documentation/sched-design.txt | 234 +++++++
222     Documentation/sysctl/kernel.txt | 14
223     fs/pipe.c | 7
224     fs/proc/array.c | 2
225     include/linux/init_task.h | 4
226     include/linux/sched.h | 32 -
227     kernel/sched.c | 1277 +++++++++++++++++++---------------------
228     kernel/softirq.c | 2
229     kernel/sysctl.c | 26
230     kernel/workqueue.c | 2
231     10 files changed, 908 insertions(+), 692 deletions(-)
232    
233     Index: linux-2.6.21-ck2/kernel/workqueue.c
234     ===================================================================
235     --- linux-2.6.21-ck2.orig/kernel/workqueue.c 2007-05-03 22:20:57.000000000 +1000
236     +++ linux-2.6.21-ck2/kernel/workqueue.c 2007-05-14 19:30:30.000000000 +1000
237     @@ -355,8 +355,6 @@ static int worker_thread(void *__cwq)
238     if (!cwq->freezeable)
239     current->flags |= PF_NOFREEZE;
240    
241     - set_user_nice(current, -5);
242     -
243     /* Block and flush all signals */
244     sigfillset(&blocked);
245     sigprocmask(SIG_BLOCK, &blocked, NULL);
246     Index: linux-2.6.21-ck2/fs/proc/array.c
247     ===================================================================
248     --- linux-2.6.21-ck2.orig/fs/proc/array.c 2007-05-03 22:20:56.000000000 +1000
249     +++ linux-2.6.21-ck2/fs/proc/array.c 2007-05-14 19:30:30.000000000 +1000
250     @@ -165,7 +165,6 @@ static inline char * task_state(struct t
251     rcu_read_lock();
252     buffer += sprintf(buffer,
253     "State:\t%s\n"
254     - "SleepAVG:\t%lu%%\n"
255     "Tgid:\t%d\n"
256     "Pid:\t%d\n"
257     "PPid:\t%d\n"
258     @@ -173,7 +172,6 @@ static inline char * task_state(struct t
259     "Uid:\t%d\t%d\t%d\t%d\n"
260     "Gid:\t%d\t%d\t%d\t%d\n",
261     get_task_state(p),
262     - (p->sleep_avg/1024)*100/(1020000000/1024),
263     p->tgid, p->pid,
264     pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
265     pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,
266     Index: linux-2.6.21-ck2/include/linux/init_task.h
267     ===================================================================
268     --- linux-2.6.21-ck2.orig/include/linux/init_task.h 2007-05-03 22:20:57.000000000 +1000
269     +++ linux-2.6.21-ck2/include/linux/init_task.h 2007-05-14 19:30:30.000000000 +1000
270     @@ -102,13 +102,15 @@ extern struct group_info init_groups;
271     .prio = MAX_PRIO-20, \
272     .static_prio = MAX_PRIO-20, \
273     .normal_prio = MAX_PRIO-20, \
274     + .rotation = 0, \
275     .policy = SCHED_NORMAL, \
276     .cpus_allowed = CPU_MASK_ALL, \
277     .mm = NULL, \
278     .active_mm = &init_mm, \
279     .run_list = LIST_HEAD_INIT(tsk.run_list), \
280     .ioprio = 0, \
281     - .time_slice = HZ, \
282     + .time_slice = 1000000000, \
283     + .quota = 1000000000, \
284     .tasks = LIST_HEAD_INIT(tsk.tasks), \
285     .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
286     .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
287     Index: linux-2.6.21-ck2/include/linux/sched.h
288     ===================================================================
289     --- linux-2.6.21-ck2.orig/include/linux/sched.h 2007-05-03 22:20:57.000000000 +1000
290     +++ linux-2.6.21-ck2/include/linux/sched.h 2007-05-14 19:30:30.000000000 +1000
291     @@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co
292     #define EXIT_ZOMBIE 16
293     #define EXIT_DEAD 32
294     /* in tsk->state again */
295     -#define TASK_NONINTERACTIVE 64
296     -#define TASK_DEAD 128
297     +#define TASK_DEAD 64
298    
299     #define __set_task_state(tsk, state_value) \
300     do { (tsk)->state = (state_value); } while (0)
301     @@ -522,8 +521,9 @@ struct signal_struct {
302    
303     #define MAX_USER_RT_PRIO 100
304     #define MAX_RT_PRIO MAX_USER_RT_PRIO
305     +#define PRIO_RANGE (40)
306    
307     -#define MAX_PRIO (MAX_RT_PRIO + 40)
308     +#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)
309    
310     #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO)
311     #define rt_task(p) rt_prio((p)->prio)
312     @@ -788,13 +788,6 @@ struct mempolicy;
313     struct pipe_inode_info;
314     struct uts_namespace;
315    
316     -enum sleep_type {
317     - SLEEP_NORMAL,
318     - SLEEP_NONINTERACTIVE,
319     - SLEEP_INTERACTIVE,
320     - SLEEP_INTERRUPTED,
321     -};
322     -
323     struct prio_array;
324    
325     struct task_struct {
326     @@ -814,20 +807,33 @@ struct task_struct {
327     int load_weight; /* for niceness load balancing purposes */
328     int prio, static_prio, normal_prio;
329     struct list_head run_list;
330     + /*
331     + * This bitmap shows what priorities this task has received quota
332     + * from for this major priority rotation on its current runqueue.
333     + */
334     + DECLARE_BITMAP(bitmap, PRIO_RANGE + 1);
335     struct prio_array *array;
336     + /* Which major runqueue rotation did this task run */
337     + unsigned long rotation;
338    
339     unsigned short ioprio;
340     #ifdef CONFIG_BLK_DEV_IO_TRACE
341     unsigned int btrace_seq;
342     #endif
343     - unsigned long sleep_avg;
344     unsigned long long timestamp, last_ran;
345     unsigned long long sched_time; /* sched_clock time spent running */
346     - enum sleep_type sleep_type;
347    
348     unsigned long policy;
349     cpumask_t cpus_allowed;
350     - unsigned int time_slice, first_time_slice;
351     + /*
352     + * How much this task is entitled to run at the current priority
353     + * before being requeued at a lower priority.
354     + */
355     + int time_slice;
356     + /* Is this the very first time_slice this task has ever run. */
357     + unsigned int first_time_slice;
358     + /* How much this task receives at each priority level */
359     + int quota;
360    
361     #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
362     struct sched_info sched_info;
363     Index: linux-2.6.21-ck2/kernel/sched.c
364     ===================================================================
365     --- linux-2.6.21-ck2.orig/kernel/sched.c 2007-05-03 22:20:57.000000000 +1000
366     +++ linux-2.6.21-ck2/kernel/sched.c 2007-05-14 19:30:30.000000000 +1000
367     @@ -16,6 +16,7 @@
368     * by Davide Libenzi, preemptible kernel bits by Robert Love.
369     * 2003-09-03 Interactivity tuning by Con Kolivas.
370     * 2004-04-02 Scheduler domains code by Nick Piggin
371     + * 2007-03-02 Staircase deadline scheduling policy by Con Kolivas
372     */
373    
374     #include <linux/mm.h>
375     @@ -52,6 +53,7 @@
376     #include <linux/tsacct_kern.h>
377     #include <linux/kprobes.h>
378     #include <linux/delayacct.h>
379     +#include <linux/log2.h>
380     #include <asm/tlb.h>
381    
382     #include <asm/unistd.h>
383     @@ -83,126 +85,72 @@ unsigned long long __attribute__((weak))
384     #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
385     #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
386     #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
387     +#define SCHED_PRIO(p) ((p)+MAX_RT_PRIO)
388    
389     -/*
390     - * Some helpers for converting nanosecond timing to jiffy resolution
391     - */
392     -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
393     +/* Some helpers for converting to/from various scales.*/
394     #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
395     -
396     -/*
397     - * These are the 'tuning knobs' of the scheduler:
398     - *
399     - * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
400     - * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
401     - * Timeslices get refilled after they expire.
402     - */
403     -#define MIN_TIMESLICE max(5 * HZ / 1000, 1)
404     -#define DEF_TIMESLICE (100 * HZ / 1000)
405     -#define ON_RUNQUEUE_WEIGHT 30
406     -#define CHILD_PENALTY 95
407     -#define PARENT_PENALTY 100
408     -#define EXIT_WEIGHT 3
409     -#define PRIO_BONUS_RATIO 25
410     -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
411     -#define INTERACTIVE_DELTA 2
412     -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
413     -#define STARVATION_LIMIT (MAX_SLEEP_AVG)
414     -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
415     -
416     -/*
417     - * If a task is 'interactive' then we reinsert it in the active
418     - * array after it has expired its current timeslice. (it will not
419     - * continue to run immediately, it will still roundrobin with
420     - * other interactive tasks.)
421     - *
422     - * This part scales the interactivity limit depending on niceness.
423     - *
424     - * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
425     - * Here are a few examples of different nice levels:
426     - *
427     - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
428     - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
429     - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
430     - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
431     - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
432     - *
433     - * (the X axis represents the possible -5 ... 0 ... +5 dynamic
434     - * priority range a task can explore, a value of '1' means the
435     - * task is rated interactive.)
436     - *
437     - * Ie. nice +19 tasks can never get 'interactive' enough to be
438     - * reinserted into the active array. And only heavily CPU-hog nice -20
439     - * tasks will be expired. Default nice 0 tasks are somewhere between,
440     - * it takes some effort for them to get interactive, but it's not
441     - * too hard.
442     - */
443     -
444     -#define CURRENT_BONUS(p) \
445     - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
446     - MAX_SLEEP_AVG)
447     -
448     -#define GRANULARITY (10 * HZ / 1000 ? : 1)
449     -
450     -#ifdef CONFIG_SMP
451     -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
452     - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
453     - num_online_cpus())
454     -#else
455     -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
456     - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
457     -#endif
458     -
459     -#define SCALE(v1,v1_max,v2_max) \
460     - (v1) * (v2_max) / (v1_max)
461     -
462     -#define DELTA(p) \
463     - (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \
464     - INTERACTIVE_DELTA)
465     -
466     -#define TASK_INTERACTIVE(p) \
467     - ((p)->prio <= (p)->static_prio - DELTA(p))
468     -
469     -#define INTERACTIVE_SLEEP(p) \
470     - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
471     - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
472     -
473     -#define TASK_PREEMPTS_CURR(p, rq) \
474     - ((p)->prio < (rq)->curr->prio)
475     -
476     -#define SCALE_PRIO(x, prio) \
477     - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
478     -
479     -static unsigned int static_prio_timeslice(int static_prio)
480     -{
481     - if (static_prio < NICE_TO_PRIO(0))
482     - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
483     - else
484     - return SCALE_PRIO(DEF_TIMESLICE, static_prio);
485     -}
486     -
487     -/*
488     - * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
489     - * to time slice values: [800ms ... 100ms ... 5ms]
490     - *
491     - * The higher a thread's priority, the bigger timeslices
492     - * it gets during one round of execution. But even the lowest
493     - * priority thread gets MIN_TIMESLICE worth of execution time.
494     +#define MS_TO_NS(TIME) ((TIME) * 1000000)
495     +#define MS_TO_US(TIME) ((TIME) * 1000)
496     +#define US_TO_MS(TIME) ((TIME) / 1000)
497     +
498     +#define TASK_PREEMPTS_CURR(p, curr) ((p)->prio < (curr)->prio)
499     +
500     +/*
501     + * This is the time all tasks within the same priority round robin.
502     + * Value is in ms and set to a minimum of 8ms. Scales with number of cpus.
503     + * Tunable via /proc interface.
504     + */
505     +int rr_interval __read_mostly = 8;
506     +
507     +/*
508     + * This contains a bitmap for each dynamic priority level with empty slots
509     + * for the valid priorities each different nice level can have. It allows
510     + * us to stagger the slots where differing priorities run in a way that
511     + * keeps latency differences between different nice levels at a minimum.
512     + * The purpose of a pre-generated matrix is for rapid lookup of next slot in
513     + * O(1) time without having to recalculate every time priority gets demoted.
514     + * All nice levels use priority slot 39 as this allows less niced tasks to
515     + * get all priority slots better than that before expiration is forced.
516     + * ie, where 0 means a slot for that priority, priority running from left to
517     + * right is from prio 0 to prio 39:
518     + * nice -20 0000000000000000000000000000000000000000
519     + * nice -10 1000100010001000100010001000100010010000
520     + * nice 0 1010101010101010101010101010101010101010
521     + * nice 5 1011010110110101101101011011010110110110
522     + * nice 10 1110111011101110111011101110111011101110
523     + * nice 15 1111111011111110111111101111111011111110
524     + * nice 19 1111111111111111111111111111111111111110
525     */
526     +static unsigned long prio_matrix[PRIO_RANGE][BITS_TO_LONGS(PRIO_RANGE)]
527     + __read_mostly;
528    
529     -static inline unsigned int task_timeslice(struct task_struct *p)
530     -{
531     - return static_prio_timeslice(p->static_prio);
532     -}
533     +struct rq;
534    
535     /*
536     * These are the runqueue data structures:
537     */
538     -
539     struct prio_array {
540     - unsigned int nr_active;
541     - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
542     + /* Tasks queued at each priority */
543     struct list_head queue[MAX_PRIO];
544     +
545     + /*
546     + * The bitmap of priorities queued for this array. While the expired
547     + * array will never have realtime tasks on it, it is simpler to have
548     + * equal sized bitmaps for a cheap array swap. Include 1 bit for
549     + * delimiter.
550     + */
551     + DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1);
552     +
553     + /*
554     + * The best static priority (of the dynamic priority tasks) queued
555     + * this array.
556     + */
557     + int best_static_prio;
558     +
559     +#ifdef CONFIG_SMP
560     + /* For convenience looks back at rq */
561     + struct rq *rq;
562     +#endif
563     };
564    
565     /*
566     @@ -234,14 +182,24 @@ struct rq {
567     */
568     unsigned long nr_uninterruptible;
569    
570     - unsigned long expired_timestamp;
571     /* Cached timestamp set by update_cpu_clock() */
572     unsigned long long most_recent_timestamp;
573     struct task_struct *curr, *idle;
574     unsigned long next_balance;
575     struct mm_struct *prev_mm;
576     +
577     struct prio_array *active, *expired, arrays[2];
578     - int best_expired_prio;
579     + unsigned long *dyn_bitmap, *exp_bitmap;
580     +
581     + /*
582     + * The current dynamic priority level this runqueue is at per static
583     + * priority level.
584     + */
585     + int prio_level[PRIO_RANGE];
586     +
587     + /* How many times we have rotated the priority queue */
588     + unsigned long prio_rotation;
589     +
590     atomic_t nr_iowait;
591    
592     #ifdef CONFIG_SMP
593     @@ -579,12 +537,9 @@ static inline struct rq *this_rq_lock(vo
594     #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
595     /*
596     * Called when a process is dequeued from the active array and given
597     - * the cpu. We should note that with the exception of interactive
598     - * tasks, the expired queue will become the active queue after the active
599     - * queue is empty, without explicitly dequeuing and requeuing tasks in the
600     - * expired queue. (Interactive tasks may be requeued directly to the
601     - * active queue, thus delaying tasks in the expired queue from running;
602     - * see scheduler_tick()).
603     + * the cpu. We should note that the expired queue will become the active
604     + * queue after the active queue is empty, without explicitly dequeuing and
605     + * requeuing tasks in the expired queue.
606     *
607     * This function is only called from sched_info_arrive(), rather than
608     * dequeue_task(). Even though a task may be queued and dequeued multiple
609     @@ -682,71 +637,227 @@ sched_info_switch(struct task_struct *pr
610     #define sched_info_switch(t, next) do { } while (0)
611     #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
612    
613     +static inline int task_queued(struct task_struct *task)
614     +{
615     + return !list_empty(&task->run_list);
616     +}
617     +
618     +static inline void set_dynamic_bit(struct task_struct *p, struct rq *rq)
619     +{
620     + __set_bit(p->prio, p->array->prio_bitmap);
621     +}
622     +
623     /*
624     - * Adding/removing a task to/from a priority array:
625     + * Removing from a runqueue.
626     */
627     -static void dequeue_task(struct task_struct *p, struct prio_array *array)
628     +static void dequeue_task(struct task_struct *p, struct rq *rq)
629     {
630     - array->nr_active--;
631     - list_del(&p->run_list);
632     - if (list_empty(array->queue + p->prio))
633     - __clear_bit(p->prio, array->bitmap);
634     + list_del_init(&p->run_list);
635     + if (list_empty(p->array->queue + p->prio))
636     + __clear_bit(p->prio, p->array->prio_bitmap);
637     }
638    
639     -static void enqueue_task(struct task_struct *p, struct prio_array *array)
640     +static void reset_first_time_slice(struct task_struct *p)
641     {
642     - sched_info_queued(p);
643     - list_add_tail(&p->run_list, array->queue + p->prio);
644     - __set_bit(p->prio, array->bitmap);
645     - array->nr_active++;
646     + if (unlikely(p->first_time_slice))
647     + p->first_time_slice = 0;
648     +}
649     +
650     +/*
651     + * The task is being queued on a fresh array so it has its entitlement
652     + * bitmap cleared.
653     + */
654     +static void task_new_array(struct task_struct *p, struct rq *rq,
655     + struct prio_array *array)
656     +{
657     + bitmap_zero(p->bitmap, PRIO_RANGE);
658     + p->rotation = rq->prio_rotation;
659     + p->time_slice = p->quota;
660     p->array = array;
661     + reset_first_time_slice(p);
662     +}
663     +
664     +/* Find the first slot from the relevant prio_matrix entry */
665     +static int first_prio_slot(struct task_struct *p)
666     +{
667     + if (unlikely(p->policy == SCHED_BATCH))
668     + return p->static_prio;
669     + return SCHED_PRIO(find_first_zero_bit(
670     + prio_matrix[USER_PRIO(p->static_prio)], PRIO_RANGE));
671     }
672    
673     /*
674     - * Put task to the end of the run list without the overhead of dequeue
675     - * followed by enqueue.
676     + * Find the first unused slot by this task that is also in its prio_matrix
677     + * level. SCHED_BATCH tasks do not use the priority matrix. They only take
678     + * priority slots from their static_prio and above.
679     */
680     -static void requeue_task(struct task_struct *p, struct prio_array *array)
681     +static int next_entitled_slot(struct task_struct *p, struct rq *rq)
682     {
683     - list_move_tail(&p->run_list, array->queue + p->prio);
684     + int search_prio = MAX_RT_PRIO, uprio = USER_PRIO(p->static_prio);
685     + struct prio_array *array = rq->active;
686     + DECLARE_BITMAP(tmp, PRIO_RANGE);
687     +
688     + /*
689     + * Go straight to expiration if there are higher priority tasks
690     + * already expired.
691     + */
692     + if (p->static_prio > rq->expired->best_static_prio)
693     + return MAX_PRIO;
694     + if (!rq->prio_level[uprio])
695     + rq->prio_level[uprio] = MAX_RT_PRIO;
696     + /*
697     + * Only priorities equal to the prio_level and above for their
698     + * static_prio are acceptable, and only if it's not better than
699     + * a queued better static_prio's prio_level.
700     + */
701     + if (p->static_prio < array->best_static_prio) {
702     + if (likely(p->policy != SCHED_BATCH))
703     + array->best_static_prio = p->static_prio;
704     + } else if (p->static_prio == array->best_static_prio) {
705     + search_prio = rq->prio_level[uprio];
706     + } else {
707     + int i;
708     +
709     + search_prio = rq->prio_level[uprio];
710     + /* A bound O(n) function, worst case n is 40 */
711     + for (i = array->best_static_prio; i <= p->static_prio ; i++) {
712     + if (!rq->prio_level[USER_PRIO(i)])
713     + rq->prio_level[USER_PRIO(i)] = MAX_RT_PRIO;
714     + search_prio = max(search_prio,
715     + rq->prio_level[USER_PRIO(i)]);
716     + }
717     + }
718     + if (unlikely(p->policy == SCHED_BATCH)) {
719     + search_prio = max(search_prio, p->static_prio);
720     + return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE,
721     + USER_PRIO(search_prio)));
722     + }
723     + bitmap_or(tmp, p->bitmap, prio_matrix[uprio], PRIO_RANGE);
724     + return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
725     + USER_PRIO(search_prio)));
726     +}
727     +
728     +static void queue_expired(struct task_struct *p, struct rq *rq)
729     +{
730     + task_new_array(p, rq, rq->expired);
731     + p->prio = p->normal_prio = first_prio_slot(p);
732     + if (p->static_prio < rq->expired->best_static_prio)
733     + rq->expired->best_static_prio = p->static_prio;
734     + reset_first_time_slice(p);
735     }
736    
737     -static inline void
738     -enqueue_task_head(struct task_struct *p, struct prio_array *array)
739     +#ifdef CONFIG_SMP
740     +/*
741     + * If we're waking up a task that was previously on a different runqueue,
742     + * update its data appropriately. Note we may be reading data from src_rq->
743     + * outside of lock, but the occasional inaccurate result should be harmless.
744     + */
745     + static void update_if_moved(struct task_struct *p, struct rq *rq)
746     +{
747     + struct rq *src_rq = p->array->rq;
748     +
749     + if (src_rq == rq)
750     + return;
751     + /*
752     + * Only need to set p->array when p->rotation == rq->prio_rotation as
753     + * they will be set in recalc_task_prio when != rq->prio_rotation.
754     + */
755     + if (p->rotation == src_rq->prio_rotation) {
756     + p->rotation = rq->prio_rotation;
757     + if (p->array == src_rq->expired)
758     + p->array = rq->expired;
759     + else
760     + p->array = rq->active;
761     + } else
762     + p->rotation = 0;
763     +}
764     +#else
765     +static inline void update_if_moved(struct task_struct *p, struct rq *rq)
766     {
767     - list_add(&p->run_list, array->queue + p->prio);
768     - __set_bit(p->prio, array->bitmap);
769     - array->nr_active++;
770     - p->array = array;
771     }
772     +#endif
773    
774     /*
775     - * __normal_prio - return the priority that is based on the static
776     - * priority but is modified by bonuses/penalties.
777     - *
778     - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
779     - * into the -5 ... 0 ... +5 bonus/penalty range.
780     - *
781     - * We use 25% of the full 0...39 priority range so that:
782     - *
783     - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
784     - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
785     - *
786     - * Both properties are important to certain workloads.
787     + * recalc_task_prio determines what priority a non rt_task will be
788     + * queued at. If the task has already been running during this runqueue's
789     + * major rotation (rq->prio_rotation) then it continues at the same
790     + * priority if it has tick entitlement left. If it does not have entitlement
791     + * left, it finds the next priority slot according to its nice value that it
792     + * has not extracted quota from. If it has not run during this major
793     + * rotation, it starts at the next_entitled_slot and has its bitmap quota
794     + * cleared. If it does not have any slots left it has all its slots reset and
795     + * is queued on the expired at its first_prio_slot.
796     */
797     +static void recalc_task_prio(struct task_struct *p, struct rq *rq)
798     +{
799     + struct prio_array *array = rq->active;
800     + int queue_prio;
801    
802     -static inline int __normal_prio(struct task_struct *p)
803     + update_if_moved(p, rq);
804     + if (p->rotation == rq->prio_rotation) {
805     + if (p->array == array) {
806     + if (p->time_slice > 0)
807     + return;
808     + p->time_slice = p->quota;
809     + } else if (p->array == rq->expired) {
810     + queue_expired(p, rq);
811     + return;
812     + } else
813     + task_new_array(p, rq, array);
814     + } else
815     + task_new_array(p, rq, array);
816     +
817     + queue_prio = next_entitled_slot(p, rq);
818     + if (queue_prio >= MAX_PRIO) {
819     + queue_expired(p, rq);
820     + return;
821     + }
822     + p->prio = p->normal_prio = queue_prio;
823     + __set_bit(USER_PRIO(p->prio), p->bitmap);
824     +}
825     +
826     +/*
827     + * Adding to a runqueue. The dynamic priority queue that it is added to is
828     + * determined by recalc_task_prio() above.
829     + */
830     +static inline void __enqueue_task(struct task_struct *p, struct rq *rq)
831     +{
832     + if (rt_task(p))
833     + p->array = rq->active;
834     + else
835     + recalc_task_prio(p, rq);
836     +
837     + sched_info_queued(p);
838     + set_dynamic_bit(p, rq);
839     +}
840     +
841     +static void enqueue_task(struct task_struct *p, struct rq *rq)
842     {
843     - int bonus, prio;
844     + __enqueue_task(p, rq);
845     + list_add_tail(&p->run_list, p->array->queue + p->prio);
846     +}
847    
848     - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
849     +static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
850     +{
851     + __enqueue_task(p, rq);
852     + list_add(&p->run_list, p->array->queue + p->prio);
853     +}
854    
855     - prio = p->static_prio - bonus;
856     - if (prio < MAX_RT_PRIO)
857     - prio = MAX_RT_PRIO;
858     - if (prio > MAX_PRIO-1)
859     - prio = MAX_PRIO-1;
860     - return prio;
861     +/*
862     + * requeue_task is only called when p->static_prio does not change. p->prio
863     + * can change with dynamic tasks.
864     + */
865     +static void requeue_task(struct task_struct *p, struct rq *rq,
866     + struct prio_array *old_array, int old_prio)
867     +{
868     + if (p->array == rq->expired)
869     + queue_expired(p, rq);
870     + list_move_tail(&p->run_list, p->array->queue + p->prio);
871     + if (!rt_task(p)) {
872     + if (list_empty(old_array->queue + old_prio))
873     + __clear_bit(old_prio, old_array->prio_bitmap);
874     + set_dynamic_bit(p, rq);
875     + }
876     }
877    
878     /*
879     @@ -759,17 +870,24 @@ static inline int __normal_prio(struct t
880     */
881    
882     /*
883     - * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE
884     - * If static_prio_timeslice() is ever changed to break this assumption then
885     - * this code will need modification
886     - */
887     -#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE
888     -#define LOAD_WEIGHT(lp) \
889     - (((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
890     -#define PRIO_TO_LOAD_WEIGHT(prio) \
891     - LOAD_WEIGHT(static_prio_timeslice(prio))
892     -#define RTPRIO_TO_LOAD_WEIGHT(rp) \
893     - (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))
894     + * task_timeslice - the total duration a task can run during one major
895     + * rotation. Returns value in milliseconds as the smallest value can be 1.
896     + */
897     +static int task_timeslice(struct task_struct *p)
898     +{
899     + int slice = p->quota; /* quota is in us */
900     +
901     + if (!rt_task(p))
902     + slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice;
903     + return US_TO_MS(slice);
904     +}
905     +
906     +/*
907     + * The load weight is basically the task_timeslice in ms. Realtime tasks are
908     + * special cased to be proportionately larger than nice -20 by their
909     + * rt_priority. The weight for rt tasks can only be arbitrary at best.
910     + */
911     +#define RTPRIO_TO_LOAD_WEIGHT(rp) (rr_interval * 20 * (40 + rp))
912    
913     static void set_load_weight(struct task_struct *p)
914     {
915     @@ -786,7 +904,7 @@ static void set_load_weight(struct task_
916     #endif
917     p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
918     } else
919     - p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
920     + p->load_weight = task_timeslice(p);
921     }
922    
923     static inline void
924     @@ -814,28 +932,38 @@ static inline void dec_nr_running(struct
925     }
926    
927     /*
928     - * Calculate the expected normal priority: i.e. priority
929     - * without taking RT-inheritance into account. Might be
930     - * boosted by interactivity modifiers. Changes upon fork,
931     - * setprio syscalls, and whenever the interactivity
932     - * estimator recalculates.
933     + * __activate_task - move a task to the runqueue.
934     */
935     -static inline int normal_prio(struct task_struct *p)
936     +static inline void __activate_task(struct task_struct *p, struct rq *rq)
937     +{
938     + enqueue_task(p, rq);
939     + inc_nr_running(p, rq);
940     +}
941     +
942     +/*
943     + * __activate_idle_task - move idle task to the _front_ of runqueue.
944     + */
945     +static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
946     {
947     - int prio;
948     + enqueue_task_head(p, rq);
949     + inc_nr_running(p, rq);
950     +}
951    
952     +static inline int normal_prio(struct task_struct *p)
953     +{
954     if (has_rt_policy(p))
955     - prio = MAX_RT_PRIO-1 - p->rt_priority;
956     + return MAX_RT_PRIO-1 - p->rt_priority;
957     + /* Other tasks all have normal_prio set in recalc_task_prio */
958     + if (likely(p->prio >= MAX_RT_PRIO && p->prio < MAX_PRIO))
959     + return p->prio;
960     else
961     - prio = __normal_prio(p);
962     - return prio;
963     + return p->static_prio;
964     }
965    
966     /*
967     * Calculate the current priority, i.e. the priority
968     * taken into account by the scheduler. This value might
969     - * be boosted by RT tasks, or might be boosted by
970     - * interactivity modifiers. Will be RT if the task got
971     + * be boosted by RT tasks as it will be RT if the task got
972     * RT-boosted. If not then it returns p->normal_prio.
973     */
974     static int effective_prio(struct task_struct *p)
975     @@ -852,111 +980,41 @@ static int effective_prio(struct task_st
976     }
977    
978     /*
979     - * __activate_task - move a task to the runqueue.
980     + * All tasks have quotas based on rr_interval. RT tasks all get rr_interval.
981     + * From nice 1 to 19 they are smaller than it only if they are at least one
982     + * tick still. Below nice 0 they get progressively larger.
983     + * ie nice -6..0 = rr_interval. nice -10 = 2.5 * rr_interval
984     + * nice -20 = 10 * rr_interval. nice 1-19 = rr_interval / 2.
985     + * Value returned is in microseconds.
986     */
987     -static void __activate_task(struct task_struct *p, struct rq *rq)
988     +static inline unsigned int rr_quota(struct task_struct *p)
989     {
990     - struct prio_array *target = rq->active;
991     -
992     - if (batch_task(p))
993     - target = rq->expired;
994     - enqueue_task(p, target);
995     - inc_nr_running(p, rq);
996     -}
997     + int nice = TASK_NICE(p), rr = rr_interval;
998    
999     -/*
1000     - * __activate_idle_task - move idle task to the _front_ of runqueue.
1001     - */
1002     -static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
1003     -{
1004     - enqueue_task_head(p, rq->active);
1005     - inc_nr_running(p, rq);
1006     + if (!rt_task(p)) {
1007     + if (nice < -6) {
1008     + rr *= nice * nice;
1009     + rr /= 40;
1010     + } else if (nice > 0)
1011     + rr = rr / 2 ? : 1;
1012     + }
1013     + return MS_TO_US(rr);
1014     }
1015    
1016     -/*
1017     - * Recalculate p->normal_prio and p->prio after having slept,
1018     - * updating the sleep-average too:
1019     - */
1020     -static int recalc_task_prio(struct task_struct *p, unsigned long long now)
1021     +/* Every time we set the quota we need to set the load weight */
1022     +static void set_quota(struct task_struct *p)
1023     {
1024     - /* Caller must always ensure 'now >= p->timestamp' */
1025     - unsigned long sleep_time = now - p->timestamp;
1026     -
1027     - if (batch_task(p))
1028     - sleep_time = 0;
1029     -
1030     - if (likely(sleep_time > 0)) {
1031     - /*
1032     - * This ceiling is set to the lowest priority that would allow
1033     - * a task to be reinserted into the active array on timeslice
1034     - * completion.
1035     - */
1036     - unsigned long ceiling = INTERACTIVE_SLEEP(p);
1037     -
1038     - if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
1039     - /*
1040     - * Prevents user tasks from achieving best priority
1041     - * with one single large enough sleep.
1042     - */
1043     - p->sleep_avg = ceiling;
1044     - /*
1045     - * Using INTERACTIVE_SLEEP() as a ceiling places a
1046     - * nice(0) task 1ms sleep away from promotion, and
1047     - * gives it 700ms to round-robin with no chance of
1048     - * being demoted. This is more than generous, so
1049     - * mark this sleep as non-interactive to prevent the
1050     - * on-runqueue bonus logic from intervening should
1051     - * this task not receive cpu immediately.
1052     - */
1053     - p->sleep_type = SLEEP_NONINTERACTIVE;
1054     - } else {
1055     - /*
1056     - * Tasks waking from uninterruptible sleep are
1057     - * limited in their sleep_avg rise as they
1058     - * are likely to be waiting on I/O
1059     - */
1060     - if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
1061     - if (p->sleep_avg >= ceiling)
1062     - sleep_time = 0;
1063     - else if (p->sleep_avg + sleep_time >=
1064     - ceiling) {
1065     - p->sleep_avg = ceiling;
1066     - sleep_time = 0;
1067     - }
1068     - }
1069     -
1070     - /*
1071     - * This code gives a bonus to interactive tasks.
1072     - *
1073     - * The boost works by updating the 'average sleep time'
1074     - * value here, based on ->timestamp. The more time a
1075     - * task spends sleeping, the higher the average gets -
1076     - * and the higher the priority boost gets as well.
1077     - */
1078     - p->sleep_avg += sleep_time;
1079     -
1080     - }
1081     - if (p->sleep_avg > NS_MAX_SLEEP_AVG)
1082     - p->sleep_avg = NS_MAX_SLEEP_AVG;
1083     - }
1084     -
1085     - return effective_prio(p);
1086     + p->quota = rr_quota(p);
1087     + set_load_weight(p);
1088     }
1089    
1090     /*
1091     * activate_task - move a task to the runqueue and do priority recalculation
1092     - *
1093     - * Update all the scheduling statistics stuff. (sleep average
1094     - * calculation, priority modifiers, etc.)
1095     */
1096     static void activate_task(struct task_struct *p, struct rq *rq, int local)
1097     {
1098     - unsigned long long now;
1099     -
1100     - if (rt_task(p))
1101     - goto out;
1102     + unsigned long long now = sched_clock();
1103    
1104     - now = sched_clock();
1105     #ifdef CONFIG_SMP
1106     if (!local) {
1107     /* Compensate for drifting sched_clock */
1108     @@ -977,32 +1035,9 @@ static void activate_task(struct task_st
1109     (now - p->timestamp) >> 20);
1110     }
1111    
1112     - p->prio = recalc_task_prio(p, now);
1113     -
1114     - /*
1115     - * This checks to make sure it's not an uninterruptible task
1116     - * that is now waking up.
1117     - */
1118     - if (p->sleep_type == SLEEP_NORMAL) {
1119     - /*
1120     - * Tasks which were woken up by interrupts (ie. hw events)
1121     - * are most likely of interactive nature. So we give them
1122     - * the credit of extending their sleep time to the period
1123     - * of time they spend on the runqueue, waiting for execution
1124     - * on a CPU, first time around:
1125     - */
1126     - if (in_interrupt())
1127     - p->sleep_type = SLEEP_INTERRUPTED;
1128     - else {
1129     - /*
1130     - * Normal first-time wakeups get a credit too for
1131     - * on-runqueue time, but it will be weighted down:
1132     - */
1133     - p->sleep_type = SLEEP_INTERACTIVE;
1134     - }
1135     - }
1136     + set_quota(p);
1137     + p->prio = effective_prio(p);
1138     p->timestamp = now;
1139     -out:
1140     __activate_task(p, rq);
1141     }
1142    
1143     @@ -1012,8 +1047,7 @@ out:
1144     static void deactivate_task(struct task_struct *p, struct rq *rq)
1145     {
1146     dec_nr_running(p, rq);
1147     - dequeue_task(p, p->array);
1148     - p->array = NULL;
1149     + dequeue_task(p, rq);
1150     }
1151    
1152     /*
1153     @@ -1095,7 +1129,7 @@ migrate_task(struct task_struct *p, int
1154     * If the task is not on a runqueue (and not running), then
1155     * it is sufficient to simply update the task's cpu field.
1156     */
1157     - if (!p->array && !task_running(rq, p)) {
1158     + if (!task_queued(p) && !task_running(rq, p)) {
1159     set_task_cpu(p, dest_cpu);
1160     return 0;
1161     }
1162     @@ -1126,7 +1160,7 @@ void wait_task_inactive(struct task_stru
1163     repeat:
1164     rq = task_rq_lock(p, &flags);
1165     /* Must be off runqueue entirely, not preempted. */
1166     - if (unlikely(p->array || task_running(rq, p))) {
1167     + if (unlikely(task_queued(p) || task_running(rq, p))) {
1168     /* If it's preempted, we yield. It could be a while. */
1169     preempted = !task_running(rq, p);
1170     task_rq_unlock(rq, &flags);
1171     @@ -1391,6 +1425,31 @@ static inline int wake_idle(int cpu, str
1172     }
1173     #endif
1174    
1175     +/*
1176     + * We need to have a special definition for an idle runqueue when testing
1177     + * for preemption on CONFIG_HOTPLUG_CPU as the idle task may be scheduled as
1178     + * a realtime task in sched_idle_next.
1179     + */
1180     +#ifdef CONFIG_HOTPLUG_CPU
1181     +#define rq_idle(rq) ((rq)->curr == (rq)->idle && !rt_task((rq)->curr))
1182     +#else
1183     +#define rq_idle(rq) ((rq)->curr == (rq)->idle)
1184     +#endif
1185     +
1186     +static inline int task_preempts_curr(struct task_struct *p, struct rq *rq)
1187     +{
1188     + struct task_struct *curr = rq->curr;
1189     +
1190     + return ((p->array == task_rq(p)->active &&
1191     + TASK_PREEMPTS_CURR(p, curr)) || rq_idle(rq));
1192     +}
1193     +
1194     +static inline void try_preempt(struct task_struct *p, struct rq *rq)
1195     +{
1196     + if (task_preempts_curr(p, rq))
1197     + resched_task(rq->curr);
1198     +}
1199     +
1200     /***
1201     * try_to_wake_up - wake up a thread
1202     * @p: the to-be-woken-up thread
1203     @@ -1422,7 +1481,7 @@ static int try_to_wake_up(struct task_st
1204     if (!(old_state & state))
1205     goto out;
1206    
1207     - if (p->array)
1208     + if (task_queued(p))
1209     goto out_running;
1210    
1211     cpu = task_cpu(p);
1212     @@ -1515,7 +1574,7 @@ out_set_cpu:
1213     old_state = p->state;
1214     if (!(old_state & state))
1215     goto out;
1216     - if (p->array)
1217     + if (task_queued(p))
1218     goto out_running;
1219    
1220     this_cpu = smp_processor_id();
1221     @@ -1524,25 +1583,9 @@ out_set_cpu:
1222    
1223     out_activate:
1224     #endif /* CONFIG_SMP */
1225     - if (old_state == TASK_UNINTERRUPTIBLE) {
1226     + if (old_state == TASK_UNINTERRUPTIBLE)
1227     rq->nr_uninterruptible--;
1228     - /*
1229     - * Tasks on involuntary sleep don't earn
1230     - * sleep_avg beyond just interactive state.
1231     - */
1232     - p->sleep_type = SLEEP_NONINTERACTIVE;
1233     - } else
1234     -
1235     - /*
1236     - * Tasks that have marked their sleep as noninteractive get
1237     - * woken up with their sleep average not weighted in an
1238     - * interactive way.
1239     - */
1240     - if (old_state & TASK_NONINTERACTIVE)
1241     - p->sleep_type = SLEEP_NONINTERACTIVE;
1242     -
1243    
1244     - activate_task(p, rq, cpu == this_cpu);
1245     /*
1246     * Sync wakeups (i.e. those types of wakeups where the waker
1247     * has indicated that it will leave the CPU in short order)
1248     @@ -1551,10 +1594,9 @@ out_activate:
1249     * the waker guarantees that the freshly woken up task is going
1250     * to be considered on this CPU.)
1251     */
1252     - if (!sync || cpu != this_cpu) {
1253     - if (TASK_PREEMPTS_CURR(p, rq))
1254     - resched_task(rq->curr);
1255     - }
1256     + activate_task(p, rq, cpu == this_cpu);
1257     + if (!sync || cpu != this_cpu)
1258     + try_preempt(p, rq);
1259     success = 1;
1260    
1261     out_running:
1262     @@ -1577,7 +1619,6 @@ int fastcall wake_up_state(struct task_s
1263     return try_to_wake_up(p, state, 0);
1264     }
1265    
1266     -static void task_running_tick(struct rq *rq, struct task_struct *p);
1267     /*
1268     * Perform scheduler related setup for a newly forked process p.
1269     * p is forked by current.
1270     @@ -1605,7 +1646,6 @@ void fastcall sched_fork(struct task_str
1271     p->prio = current->normal_prio;
1272    
1273     INIT_LIST_HEAD(&p->run_list);
1274     - p->array = NULL;
1275     #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
1276     if (unlikely(sched_info_on()))
1277     memset(&p->sched_info, 0, sizeof(p->sched_info));
1278     @@ -1617,30 +1657,31 @@ void fastcall sched_fork(struct task_str
1279     /* Want to start with kernel preemption disabled. */
1280     task_thread_info(p)->preempt_count = 1;
1281     #endif
1282     + if (unlikely(p->policy == SCHED_FIFO))
1283     + goto out;
1284     /*
1285     * Share the timeslice between parent and child, thus the
1286     * total amount of pending timeslices in the system doesn't change,
1287     * resulting in more scheduling fairness.
1288     */
1289     local_irq_disable();
1290     - p->time_slice = (current->time_slice + 1) >> 1;
1291     - /*
1292     - * The remainder of the first timeslice might be recovered by
1293     - * the parent if the child exits early enough.
1294     - */
1295     - p->first_time_slice = 1;
1296     - current->time_slice >>= 1;
1297     - p->timestamp = sched_clock();
1298     - if (unlikely(!current->time_slice)) {
1299     + if (current->time_slice > 0) {
1300     + current->time_slice /= 2;
1301     + if (current->time_slice)
1302     + p->time_slice = current->time_slice;
1303     + else
1304     + p->time_slice = 1;
1305     /*
1306     - * This case is rare, it happens when the parent has only
1307     - * a single jiffy left from its timeslice. Taking the
1308     - * runqueue lock is not a problem.
1309     + * The remainder of the first timeslice might be recovered by
1310     + * the parent if the child exits early enough.
1311     */
1312     - current->time_slice = 1;
1313     - task_running_tick(cpu_rq(cpu), current);
1314     - }
1315     + p->first_time_slice = 1;
1316     + } else
1317     + p->time_slice = 0;
1318     +
1319     + p->timestamp = sched_clock();
1320     local_irq_enable();
1321     +out:
1322     put_cpu();
1323     }
1324    
1325     @@ -1662,38 +1703,16 @@ void fastcall wake_up_new_task(struct ta
1326     this_cpu = smp_processor_id();
1327     cpu = task_cpu(p);
1328    
1329     - /*
1330     - * We decrease the sleep average of forking parents
1331     - * and children as well, to keep max-interactive tasks
1332     - * from forking tasks that are max-interactive. The parent
1333     - * (current) is done further down, under its lock.
1334     - */
1335     - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
1336     - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
1337     -
1338     - p->prio = effective_prio(p);
1339     -
1340     if (likely(cpu == this_cpu)) {
1341     + activate_task(p, rq, 1);
1342     if (!(clone_flags & CLONE_VM)) {
1343     /*
1344     * The VM isn't cloned, so we're in a good position to
1345     * do child-runs-first in anticipation of an exec. This
1346     * usually avoids a lot of COW overhead.
1347     */
1348     - if (unlikely(!current->array))
1349     - __activate_task(p, rq);
1350     - else {
1351     - p->prio = current->prio;
1352     - p->normal_prio = current->normal_prio;
1353     - list_add_tail(&p->run_list, &current->run_list);
1354     - p->array = current->array;
1355     - p->array->nr_active++;
1356     - inc_nr_running(p, rq);
1357     - }
1358     set_need_resched();
1359     - } else
1360     - /* Run child last */
1361     - __activate_task(p, rq);
1362     + }
1363     /*
1364     * We skip the following code due to cpu == this_cpu
1365     *
1366     @@ -1710,19 +1729,16 @@ void fastcall wake_up_new_task(struct ta
1367     */
1368     p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
1369     + rq->most_recent_timestamp;
1370     - __activate_task(p, rq);
1371     - if (TASK_PREEMPTS_CURR(p, rq))
1372     - resched_task(rq->curr);
1373     + activate_task(p, rq, 0);
1374     + try_preempt(p, rq);
1375    
1376     /*
1377     * Parent and child are on different CPUs, now get the
1378     - * parent runqueue to update the parent's ->sleep_avg:
1379     + * parent runqueue to update the parent's ->flags:
1380     */
1381     task_rq_unlock(rq, &flags);
1382     this_rq = task_rq_lock(current, &flags);
1383     }
1384     - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
1385     - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
1386     task_rq_unlock(this_rq, &flags);
1387     }
1388    
1389     @@ -1737,23 +1753,17 @@ void fastcall wake_up_new_task(struct ta
1390     */
1391     void fastcall sched_exit(struct task_struct *p)
1392     {
1393     + struct task_struct *parent;
1394     unsigned long flags;
1395     struct rq *rq;
1396    
1397     - /*
1398     - * If the child was a (relative-) CPU hog then decrease
1399     - * the sleep_avg of the parent as well.
1400     - */
1401     - rq = task_rq_lock(p->parent, &flags);
1402     - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
1403     - p->parent->time_slice += p->time_slice;
1404     - if (unlikely(p->parent->time_slice > task_timeslice(p)))
1405     - p->parent->time_slice = task_timeslice(p);
1406     - }
1407     - if (p->sleep_avg < p->parent->sleep_avg)
1408     - p->parent->sleep_avg = p->parent->sleep_avg /
1409     - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
1410     - (EXIT_WEIGHT + 1);
1411     + parent = p->parent;
1412     + rq = task_rq_lock(parent, &flags);
1413     + if (p->first_time_slice > 0 && task_cpu(p) == task_cpu(parent)) {
1414     + parent->time_slice += p->time_slice;
1415     + if (unlikely(parent->time_slice > parent->quota))
1416     + parent->time_slice = parent->quota;
1417     + }
1418     task_rq_unlock(rq, &flags);
1419     }
1420    
1421     @@ -2085,23 +2095,17 @@ void sched_exec(void)
1422     * pull_task - move a task from a remote runqueue to the local runqueue.
1423     * Both runqueues must be locked.
1424     */
1425     -static void pull_task(struct rq *src_rq, struct prio_array *src_array,
1426     - struct task_struct *p, struct rq *this_rq,
1427     - struct prio_array *this_array, int this_cpu)
1428     +static void pull_task(struct rq *src_rq, struct task_struct *p,
1429     + struct rq *this_rq, int this_cpu)
1430     {
1431     - dequeue_task(p, src_array);
1432     + dequeue_task(p, src_rq);
1433     dec_nr_running(p, src_rq);
1434     set_task_cpu(p, this_cpu);
1435     inc_nr_running(p, this_rq);
1436     - enqueue_task(p, this_array);
1437     + enqueue_task(p, this_rq);
1438     p->timestamp = (p->timestamp - src_rq->most_recent_timestamp)
1439     + this_rq->most_recent_timestamp;
1440     - /*
1441     - * Note that idle threads have a prio of MAX_PRIO, for this test
1442     - * to be always true for them.
1443     - */
1444     - if (TASK_PREEMPTS_CURR(p, this_rq))
1445     - resched_task(this_rq->curr);
1446     + try_preempt(p, this_rq);
1447     }
1448    
1449     /*
1450     @@ -2144,7 +2148,16 @@ int can_migrate_task(struct task_struct
1451     return 1;
1452     }
1453    
1454     -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
1455     +static inline int rq_best_prio(struct rq *rq)
1456     +{
1457     + int best_prio, exp_prio;
1458     +
1459     + best_prio = sched_find_first_bit(rq->dyn_bitmap);
1460     + exp_prio = find_next_bit(rq->exp_bitmap, MAX_PRIO, MAX_RT_PRIO);
1461     + if (unlikely(best_prio > exp_prio))
1462     + best_prio = exp_prio;
1463     + return best_prio;
1464     +}
1465    
1466     /*
1467     * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
1468     @@ -2160,7 +2173,7 @@ static int move_tasks(struct rq *this_rq
1469     {
1470     int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
1471     best_prio_seen, skip_for_load;
1472     - struct prio_array *array, *dst_array;
1473     + struct prio_array *array;
1474     struct list_head *head, *curr;
1475     struct task_struct *tmp;
1476     long rem_load_move;
1477     @@ -2187,26 +2200,21 @@ static int move_tasks(struct rq *this_rq
1478     * be cache-cold, thus switching CPUs has the least effect
1479     * on them.
1480     */
1481     - if (busiest->expired->nr_active) {
1482     - array = busiest->expired;
1483     - dst_array = this_rq->expired;
1484     - } else {
1485     - array = busiest->active;
1486     - dst_array = this_rq->active;
1487     - }
1488     -
1489     + array = busiest->expired;
1490     new_array:
1491     - /* Start searching at priority 0: */
1492     - idx = 0;
1493     + /* Expired arrays don't have RT tasks so they're always MAX_RT_PRIO+ */
1494     + if (array == busiest->expired)
1495     + idx = MAX_RT_PRIO;
1496     + else
1497     + idx = 0;
1498     skip_bitmap:
1499     if (!idx)
1500     - idx = sched_find_first_bit(array->bitmap);
1501     + idx = sched_find_first_bit(array->prio_bitmap);
1502     else
1503     - idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
1504     + idx = find_next_bit(array->prio_bitmap, MAX_PRIO, idx);
1505     if (idx >= MAX_PRIO) {
1506     - if (array == busiest->expired && busiest->active->nr_active) {
1507     + if (array == busiest->expired) {
1508     array = busiest->active;
1509     - dst_array = this_rq->active;
1510     goto new_array;
1511     }
1512     goto out;
1513     @@ -2237,7 +2245,7 @@ skip_queue:
1514     goto skip_bitmap;
1515     }
1516    
1517     - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
1518     + pull_task(busiest, tmp, this_rq, this_cpu);
1519     pulled++;
1520     rem_load_move -= tmp->load_weight;
1521    
1522     @@ -3013,11 +3021,36 @@ EXPORT_PER_CPU_SYMBOL(kstat);
1523     /*
1524     * This is called on clock ticks and on context switches.
1525     * Bank in p->sched_time the ns elapsed since the last tick or switch.
1526     + * CPU scheduler quota accounting is also performed here in microseconds.
1527     + * The value returned from sched_clock() occasionally gives bogus values so
1528     + * some sanity checking is required.
1529     */
1530     -static inline void
1531     -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
1532     +static void
1533     +update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now,
1534     + int tick)
1535     {
1536     - p->sched_time += now - p->last_ran;
1537     + long time_diff = now - p->last_ran;
1538     +
1539     + if (tick) {
1540     + /*
1541     + * Called from scheduler_tick() there should be less than two
1542     + * jiffies worth, and not negative/overflow.
1543     + */
1544     + if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0)
1545     + time_diff = JIFFIES_TO_NS(1);
1546     + } else {
1547     + /*
1548     + * Called from context_switch there should be less than one
1549     + * jiffy worth, and not negative/overflow. There should be
1550     + * some time banked here so use a nominal 1us.
1551     + */
1552     + if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1)
1553     + time_diff = 1000;
1554     + }
1555     + /* time_slice accounting is done in usecs to avoid overflow on 32bit */
1556     + if (p != rq->idle && p->policy != SCHED_FIFO)
1557     + p->time_slice -= time_diff / 1000;
1558     + p->sched_time += time_diff;
1559     p->last_ran = rq->most_recent_timestamp = now;
1560     }
1561    
1562     @@ -3038,27 +3071,6 @@ unsigned long long current_sched_time(co
1563     }
1564    
1565     /*
1566     - * We place interactive tasks back into the active array, if possible.
1567     - *
1568     - * To guarantee that this does not starve expired tasks we ignore the
1569     - * interactivity of a task if the first expired task had to wait more
1570     - * than a 'reasonable' amount of time. This deadline timeout is
1571     - * load-dependent, as the frequency of array switched decreases with
1572     - * increasing number of running tasks. We also ignore the interactivity
1573     - * if a better static_prio task has expired:
1574     - */
1575     -static inline int expired_starving(struct rq *rq)
1576     -{
1577     - if (rq->curr->static_prio > rq->best_expired_prio)
1578     - return 1;
1579     - if (!STARVATION_LIMIT || !rq->expired_timestamp)
1580     - return 0;
1581     - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
1582     - return 1;
1583     - return 0;
1584     -}
1585     -
1586     -/*
1587     * Account user cpu time to a process.
1588     * @p: the process that the cpu time gets accounted to
1589     * @hardirq_offset: the offset to subtract from hardirq_count()
1590     @@ -3131,87 +3143,47 @@ void account_steal_time(struct task_stru
1591     cpustat->steal = cputime64_add(cpustat->steal, tmp);
1592     }
1593    
1594     -static void task_running_tick(struct rq *rq, struct task_struct *p)
1595     +/*
1596     + * The task has used up its quota of running in this prio_level so it must be
1597     + * dropped a priority level, all managed by recalc_task_prio().
1598     + */
1599     +static void task_expired_entitlement(struct rq *rq, struct task_struct *p)
1600     {
1601     - if (p->array != rq->active) {
1602     - /* Task has expired but was not scheduled yet */
1603     - set_tsk_need_resched(p);
1604     + int overrun;
1605     +
1606     + reset_first_time_slice(p);
1607     + if (rt_task(p)) {
1608     + p->time_slice += p->quota;
1609     + list_move_tail(&p->run_list, p->array->queue + p->prio);
1610     return;
1611     }
1612     - spin_lock(&rq->lock);
1613     + overrun = p->time_slice;
1614     + dequeue_task(p, rq);
1615     + enqueue_task(p, rq);
1616     /*
1617     - * The task was running during this tick - update the
1618     - * time slice counter. Note: we do not update a thread's
1619     - * priority until it either goes to sleep or uses up its
1620     - * timeslice. This makes it possible for interactive tasks
1621     - * to use up their timeslices at their highest priority levels.
1622     + * Subtract any extra time this task ran over its time_slice; ie
1623     + * overrun will either be 0 or negative.
1624     */
1625     - if (rt_task(p)) {
1626     - /*
1627     - * RR tasks need a special form of timeslice management.
1628     - * FIFO tasks have no timeslices.
1629     - */
1630     - if ((p->policy == SCHED_RR) && !--p->time_slice) {
1631     - p->time_slice = task_timeslice(p);
1632     - p->first_time_slice = 0;
1633     - set_tsk_need_resched(p);
1634     -
1635     - /* put it at the end of the queue: */
1636     - requeue_task(p, rq->active);
1637     - }
1638     - goto out_unlock;
1639     - }
1640     - if (!--p->time_slice) {
1641     - dequeue_task(p, rq->active);
1642     - set_tsk_need_resched(p);
1643     - p->prio = effective_prio(p);
1644     - p->time_slice = task_timeslice(p);
1645     - p->first_time_slice = 0;
1646     -
1647     - if (!rq->expired_timestamp)
1648     - rq->expired_timestamp = jiffies;
1649     - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
1650     - enqueue_task(p, rq->expired);
1651     - if (p->static_prio < rq->best_expired_prio)
1652     - rq->best_expired_prio = p->static_prio;
1653     - } else
1654     - enqueue_task(p, rq->active);
1655     - } else {
1656     - /*
1657     - * Prevent a too long timeslice allowing a task to monopolize
1658     - * the CPU. We do this by splitting up the timeslice into
1659     - * smaller pieces.
1660     - *
1661     - * Note: this does not mean the task's timeslices expire or
1662     - * get lost in any way, they just might be preempted by
1663     - * another task of equal priority. (one with higher
1664     - * priority would have preempted this task already.) We
1665     - * requeue this task to the end of the list on this priority
1666     - * level, which is in essence a round-robin of tasks with
1667     - * equal priority.
1668     - *
1669     - * This only applies to tasks in the interactive
1670     - * delta range with at least TIMESLICE_GRANULARITY to requeue.
1671     - */
1672     - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
1673     - p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
1674     - (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
1675     - (p->array == rq->active)) {
1676     + p->time_slice += overrun;
1677     +}
1678    
1679     - requeue_task(p, rq->active);
1680     - set_tsk_need_resched(p);
1681     - }
1682     - }
1683     -out_unlock:
1684     +/* This manages tasks that have run out of timeslice during a scheduler_tick */
1685     +static void task_running_tick(struct rq *rq, struct task_struct *p)
1686     +{
1687     + /* SCHED_FIFO tasks never run out of timeslice. */
1688     + if (p->time_slice > 0 || p->policy == SCHED_FIFO)
1689     + return;
1690     + /* p->time_slice <= 0 */
1691     + spin_lock(&rq->lock);
1692     + if (likely(task_queued(p)))
1693     + task_expired_entitlement(rq, p);
1694     + set_tsk_need_resched(p);
1695     spin_unlock(&rq->lock);
1696     }
1697    
1698     /*
1699     * This function gets called by the timer code, with HZ frequency.
1700     * We call it with interrupts disabled.
1701     - *
1702     - * It also gets called by the fork code, when changing the parent's
1703     - * timeslices.
1704     */
1705     void scheduler_tick(void)
1706     {
1707     @@ -3220,7 +3192,7 @@ void scheduler_tick(void)
1708     int cpu = smp_processor_id();
1709     struct rq *rq = cpu_rq(cpu);
1710    
1711     - update_cpu_clock(p, rq, now);
1712     + update_cpu_clock(p, rq, now, 1);
1713    
1714     if (p != rq->idle)
1715     task_running_tick(rq, p);
1716     @@ -3269,10 +3241,55 @@ EXPORT_SYMBOL(sub_preempt_count);
1717    
1718     #endif
1719    
1720     -static inline int interactive_sleep(enum sleep_type sleep_type)
1721     +static void reset_prio_levels(struct rq *rq)
1722     {
1723     - return (sleep_type == SLEEP_INTERACTIVE ||
1724     - sleep_type == SLEEP_INTERRUPTED);
1725     + rq->active->best_static_prio = MAX_PRIO - 1;
1726     + rq->expired->best_static_prio = MAX_PRIO - 1;
1727     + memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE);
1728     +}
1729     +
1730     +/*
1731     + * next_dynamic_task finds the next suitable dynamic task.
1732     + */
1733     +static inline struct task_struct *next_dynamic_task(struct rq *rq, int idx)
1734     +{
1735     + struct prio_array *array = rq->active;
1736     + struct task_struct *next;
1737     + struct list_head *queue;
1738     + int nstatic;
1739     +
1740     +retry:
1741     + if (idx >= MAX_PRIO) {
1742     + /* There are no more tasks in the active array. Swap arrays */
1743     + array = rq->expired;
1744     + rq->expired = rq->active;
1745     + rq->active = array;
1746     + rq->exp_bitmap = rq->expired->prio_bitmap;
1747     + rq->dyn_bitmap = rq->active->prio_bitmap;
1748     + rq->prio_rotation++;
1749     + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO);
1750     + reset_prio_levels(rq);
1751     + }
1752     + queue = array->queue + idx;
1753     + next = list_entry(queue->next, struct task_struct, run_list);
1754     + if (unlikely(next->time_slice <= 0)) {
1755     + /*
1756     + * Unlucky enough that this task ran out of time_slice
1757     + * before it hit a scheduler_tick so it should have its
1758     + * priority reassessed and choose another task (possibly
1759     + * the same one)
1760     + */
1761     + task_expired_entitlement(rq, next);
1762     + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO);
1763     + goto retry;
1764     + }
1765     + next->rotation = rq->prio_rotation;
1766     + nstatic = next->static_prio;
1767     + if (nstatic < array->best_static_prio)
1768     + array->best_static_prio = nstatic;
1769     + if (idx > rq->prio_level[USER_PRIO(nstatic)])
1770     + rq->prio_level[USER_PRIO(nstatic)] = idx;
1771     + return next;
1772     }
1773    
1774     /*
1775     @@ -3281,13 +3298,11 @@ static inline int interactive_sleep(enum
1776     asmlinkage void __sched schedule(void)
1777     {
1778     struct task_struct *prev, *next;
1779     - struct prio_array *array;
1780     struct list_head *queue;
1781     unsigned long long now;
1782     - unsigned long run_time;
1783     - int cpu, idx, new_prio;
1784     long *switch_count;
1785     struct rq *rq;
1786     + int cpu, idx;
1787    
1788     /*
1789     * Test if we are atomic. Since do_exit() needs to call into
1790     @@ -3323,18 +3338,6 @@ need_resched_nonpreemptible:
1791    
1792     schedstat_inc(rq, sched_cnt);
1793     now = sched_clock();
1794     - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
1795     - run_time = now - prev->timestamp;
1796     - if (unlikely((long long)(now - prev->timestamp) < 0))
1797     - run_time = 0;
1798     - } else
1799     - run_time = NS_MAX_SLEEP_AVG;
1800     -
1801     - /*
1802     - * Tasks charged proportionately less run_time at high sleep_avg to
1803     - * delay them losing their interactive status
1804     - */
1805     - run_time /= (CURRENT_BONUS(prev) ? : 1);
1806    
1807     spin_lock_irq(&rq->lock);
1808    
1809     @@ -3356,59 +3359,29 @@ need_resched_nonpreemptible:
1810     idle_balance(cpu, rq);
1811     if (!rq->nr_running) {
1812     next = rq->idle;
1813     - rq->expired_timestamp = 0;
1814     goto switch_tasks;
1815     }
1816     }
1817    
1818     - array = rq->active;
1819     - if (unlikely(!array->nr_active)) {
1820     - /*
1821     - * Switch the active and expired arrays.
1822     - */
1823     - schedstat_inc(rq, sched_switch);
1824     - rq->active = rq->expired;
1825     - rq->expired = array;
1826     - array = rq->active;
1827     - rq->expired_timestamp = 0;
1828     - rq->best_expired_prio = MAX_PRIO;
1829     - }
1830     -
1831     - idx = sched_find_first_bit(array->bitmap);
1832     - queue = array->queue + idx;
1833     - next = list_entry(queue->next, struct task_struct, run_list);
1834     -
1835     - if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
1836     - unsigned long long delta = now - next->timestamp;
1837     - if (unlikely((long long)(now - next->timestamp) < 0))
1838     - delta = 0;
1839     -
1840     - if (next->sleep_type == SLEEP_INTERACTIVE)
1841     - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
1842     -
1843     - array = next->array;
1844     - new_prio = recalc_task_prio(next, next->timestamp + delta);
1845     -
1846     - if (unlikely(next->prio != new_prio)) {
1847     - dequeue_task(next, array);
1848     - next->prio = new_prio;
1849     - enqueue_task(next, array);
1850     - }
1851     + idx = sched_find_first_bit(rq->dyn_bitmap);
1852     + if (!rt_prio(idx))
1853     + next = next_dynamic_task(rq, idx);
1854     + else {
1855     + queue = rq->active->queue + idx;
1856     + next = list_entry(queue->next, struct task_struct, run_list);
1857     }
1858     - next->sleep_type = SLEEP_NORMAL;
1859     switch_tasks:
1860     - if (next == rq->idle)
1861     + if (next == rq->idle) {
1862     + reset_prio_levels(rq);
1863     + rq->prio_rotation++;
1864     schedstat_inc(rq, sched_goidle);
1865     + }
1866     prefetch(next);
1867     prefetch_stack(next);
1868     clear_tsk_need_resched(prev);
1869     rcu_qsctr_inc(task_cpu(prev));
1870    
1871     - update_cpu_clock(prev, rq, now);
1872     -
1873     - prev->sleep_avg -= run_time;
1874     - if ((long)prev->sleep_avg <= 0)
1875     - prev->sleep_avg = 0;
1876     + update_cpu_clock(prev, rq, now, 0);
1877     prev->timestamp = prev->last_ran = now;
1878    
1879     sched_info_switch(prev, next);
1880     @@ -3844,29 +3817,22 @@ EXPORT_SYMBOL(sleep_on_timeout);
1881     */
1882     void rt_mutex_setprio(struct task_struct *p, int prio)
1883     {
1884     - struct prio_array *array;
1885     unsigned long flags;
1886     + int queued, oldprio;
1887     struct rq *rq;
1888     - int oldprio;
1889    
1890     BUG_ON(prio < 0 || prio > MAX_PRIO);
1891    
1892     rq = task_rq_lock(p, &flags);
1893    
1894     oldprio = p->prio;
1895     - array = p->array;
1896     - if (array)
1897     - dequeue_task(p, array);
1898     + queued = task_queued(p);
1899     + if (queued)
1900     + dequeue_task(p, rq);
1901     p->prio = prio;
1902    
1903     - if (array) {
1904     - /*
1905     - * If changing to an RT priority then queue it
1906     - * in the active array!
1907     - */
1908     - if (rt_task(p))
1909     - array = rq->active;
1910     - enqueue_task(p, array);
1911     + if (queued) {
1912     + enqueue_task(p, rq);
1913     /*
1914     * Reschedule if we are currently running on this runqueue and
1915     * our priority decreased, or if we are not currently running on
1916     @@ -3875,8 +3841,8 @@ void rt_mutex_setprio(struct task_struct
1917     if (task_running(rq, p)) {
1918     if (p->prio > oldprio)
1919     resched_task(rq->curr);
1920     - } else if (TASK_PREEMPTS_CURR(p, rq))
1921     - resched_task(rq->curr);
1922     + } else
1923     + try_preempt(p, rq);
1924     }
1925     task_rq_unlock(rq, &flags);
1926     }
1927     @@ -3885,8 +3851,7 @@ void rt_mutex_setprio(struct task_struct
1928    
1929     void set_user_nice(struct task_struct *p, long nice)
1930     {
1931     - struct prio_array *array;
1932     - int old_prio, delta;
1933     + int queued, old_prio, delta;
1934     unsigned long flags;
1935     struct rq *rq;
1936    
1937     @@ -3907,20 +3872,20 @@ void set_user_nice(struct task_struct *p
1938     p->static_prio = NICE_TO_PRIO(nice);
1939     goto out_unlock;
1940     }
1941     - array = p->array;
1942     - if (array) {
1943     - dequeue_task(p, array);
1944     + queued = task_queued(p);
1945     + if (queued) {
1946     + dequeue_task(p, rq);
1947     dec_raw_weighted_load(rq, p);
1948     }
1949    
1950     p->static_prio = NICE_TO_PRIO(nice);
1951     - set_load_weight(p);
1952     old_prio = p->prio;
1953     p->prio = effective_prio(p);
1954     + set_quota(p);
1955     delta = p->prio - old_prio;
1956    
1957     - if (array) {
1958     - enqueue_task(p, array);
1959     + if (queued) {
1960     + enqueue_task(p, rq);
1961     inc_raw_weighted_load(rq, p);
1962     /*
1963     * If the task increased its priority or is running and
1964     @@ -3996,7 +3961,7 @@ asmlinkage long sys_nice(int increment)
1965     *
1966     * This is the priority value as seen by users in /proc.
1967     * RT tasks are offset by -200. Normal tasks are centered
1968     - * around 0, value goes from -16 to +15.
1969     + * around 0, value goes from 0 to +39.
1970     */
1971     int task_prio(const struct task_struct *p)
1972     {
1973     @@ -4043,19 +4008,14 @@ static inline struct task_struct *find_p
1974     /* Actually do priority change: must hold rq lock. */
1975     static void __setscheduler(struct task_struct *p, int policy, int prio)
1976     {
1977     - BUG_ON(p->array);
1978     + BUG_ON(task_queued(p));
1979    
1980     p->policy = policy;
1981     p->rt_priority = prio;
1982     p->normal_prio = normal_prio(p);
1983     /* we are holding p->pi_lock already */
1984     p->prio = rt_mutex_getprio(p);
1985     - /*
1986     - * SCHED_BATCH tasks are treated as perpetual CPU hogs:
1987     - */
1988     - if (policy == SCHED_BATCH)
1989     - p->sleep_avg = 0;
1990     - set_load_weight(p);
1991     + set_quota(p);
1992     }
1993    
1994     /**
1995     @@ -4069,8 +4029,7 @@ static void __setscheduler(struct task_s
1996     int sched_setscheduler(struct task_struct *p, int policy,
1997     struct sched_param *param)
1998     {
1999     - int retval, oldprio, oldpolicy = -1;
2000     - struct prio_array *array;
2001     + int queued, retval, oldprio, oldpolicy = -1;
2002     unsigned long flags;
2003     struct rq *rq;
2004    
2005     @@ -4144,12 +4103,12 @@ recheck:
2006     spin_unlock_irqrestore(&p->pi_lock, flags);
2007     goto recheck;
2008     }
2009     - array = p->array;
2010     - if (array)
2011     + queued = task_queued(p);
2012     + if (queued)
2013     deactivate_task(p, rq);
2014     oldprio = p->prio;
2015     __setscheduler(p, policy, param->sched_priority);
2016     - if (array) {
2017     + if (queued) {
2018     __activate_task(p, rq);
2019     /*
2020     * Reschedule if we are currently running on this runqueue and
2021     @@ -4159,8 +4118,8 @@ recheck:
2022     if (task_running(rq, p)) {
2023     if (p->prio > oldprio)
2024     resched_task(rq->curr);
2025     - } else if (TASK_PREEMPTS_CURR(p, rq))
2026     - resched_task(rq->curr);
2027     + } else
2028     + try_preempt(p, rq);
2029     }
2030     __task_rq_unlock(rq);
2031     spin_unlock_irqrestore(&p->pi_lock, flags);
2032     @@ -4433,40 +4392,27 @@ asmlinkage long sys_sched_getaffinity(pi
2033     * sys_sched_yield - yield the current processor to other threads.
2034     *
2035     * This function yields the current CPU by moving the calling thread
2036     - * to the expired array. If there are no other threads running on this
2037     - * CPU then this function will return.
2038     + * to the expired array if SCHED_NORMAL or the end of its current priority
2039     + * queue if a realtime task. If there are no other threads running on this
2040     + * cpu this function will return.
2041     */
2042     asmlinkage long sys_sched_yield(void)
2043     {
2044     struct rq *rq = this_rq_lock();
2045     - struct prio_array *array = current->array, *target = rq->expired;
2046     + struct task_struct *p = current;
2047    
2048     schedstat_inc(rq, yld_cnt);
2049     - /*
2050     - * We implement yielding by moving the task into the expired
2051     - * queue.
2052     - *
2053     - * (special rule: RT tasks will just roundrobin in the active
2054     - * array.)
2055     - */
2056     - if (rt_task(current))
2057     - target = rq->active;
2058     + if (rq->nr_running == 1)
2059     + schedstat_inc(rq, yld_both_empty);
2060     + else {
2061     + struct prio_array *old_array = p->array;
2062     + int old_prio = p->prio;
2063    
2064     - if (array->nr_active == 1) {
2065     - schedstat_inc(rq, yld_act_empty);
2066     - if (!rq->expired->nr_active)
2067     - schedstat_inc(rq, yld_both_empty);
2068     - } else if (!rq->expired->nr_active)
2069     - schedstat_inc(rq, yld_exp_empty);
2070     -
2071     - if (array != target) {
2072     - dequeue_task(current, array);
2073     - enqueue_task(current, target);
2074     - } else
2075     - /*
2076     - * requeue_task is cheaper so perform that if possible.
2077     - */
2078     - requeue_task(current, array);
2079     + /* p->prio will be updated in requeue_task via queue_expired */
2080     + if (!rt_task(p))
2081     + p->array = rq->expired;
2082     + requeue_task(p, rq, old_array, old_prio);
2083     + }
2084    
2085     /*
2086     * Since we are going to call schedule() anyway, there's
2087     @@ -4676,8 +4622,8 @@ long sys_sched_rr_get_interval(pid_t pid
2088     if (retval)
2089     goto out_unlock;
2090    
2091     - jiffies_to_timespec(p->policy == SCHED_FIFO ?
2092     - 0 : task_timeslice(p), &t);
2093     + t = ns_to_timespec(p->policy == SCHED_FIFO ? 0 :
2094     + MS_TO_NS(task_timeslice(p)));
2095     read_unlock(&tasklist_lock);
2096     retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
2097     out_nounlock:
2098     @@ -4771,10 +4717,10 @@ void __cpuinit init_idle(struct task_str
2099     struct rq *rq = cpu_rq(cpu);
2100     unsigned long flags;
2101    
2102     - idle->timestamp = sched_clock();
2103     - idle->sleep_avg = 0;
2104     - idle->array = NULL;
2105     - idle->prio = idle->normal_prio = MAX_PRIO;
2106     + bitmap_zero(idle->bitmap, PRIO_RANGE);
2107     + idle->timestamp = idle->last_ran = sched_clock();
2108     + idle->array = rq->active;
2109     + idle->prio = idle->normal_prio = NICE_TO_PRIO(0);
2110     idle->state = TASK_RUNNING;
2111     idle->cpus_allowed = cpumask_of_cpu(cpu);
2112     set_task_cpu(idle, cpu);
2113     @@ -4893,7 +4839,7 @@ static int __migrate_task(struct task_st
2114     goto out;
2115    
2116     set_task_cpu(p, dest_cpu);
2117     - if (p->array) {
2118     + if (task_queued(p)) {
2119     /*
2120     * Sync timestamp with rq_dest's before activating.
2121     * The same thing could be achieved by doing this step
2122     @@ -4904,8 +4850,7 @@ static int __migrate_task(struct task_st
2123     + rq_dest->most_recent_timestamp;
2124     deactivate_task(p, rq_src);
2125     __activate_task(p, rq_dest);
2126     - if (TASK_PREEMPTS_CURR(p, rq_dest))
2127     - resched_task(rq_dest->curr);
2128     + try_preempt(p, rq_dest);
2129     }
2130     ret = 1;
2131     out:
2132     @@ -5194,7 +5139,7 @@ migration_call(struct notifier_block *nf
2133     /* Idle task back to normal (off runqueue, low prio) */
2134     rq = task_rq_lock(rq->idle, &flags);
2135     deactivate_task(rq->idle, rq);
2136     - rq->idle->static_prio = MAX_PRIO;
2137     + rq->idle->static_prio = NICE_TO_PRIO(0);
2138     __setscheduler(rq->idle, SCHED_NORMAL, 0);
2139     migrate_dead_tasks(cpu);
2140     task_rq_unlock(rq, &flags);
2141     @@ -6706,6 +6651,13 @@ void __init sched_init_smp(void)
2142     /* Move init over to a non-isolated CPU */
2143     if (set_cpus_allowed(current, non_isolated_cpus) < 0)
2144     BUG();
2145     +
2146     + /*
2147     + * Assume that every added cpu gives us slightly less overall latency
2148     + * allowing us to increase the base rr_interval, but in a non linear
2149     + * fashion.
2150     + */
2151     + rr_interval *= 1 + ilog2(num_online_cpus());
2152     }
2153     #else
2154     void __init sched_init_smp(void)
2155     @@ -6727,6 +6679,16 @@ void __init sched_init(void)
2156     {
2157     int i, j, k;
2158    
2159     + /* Generate the priority matrix */
2160     + for (i = 0; i < PRIO_RANGE; i++) {
2161     + bitmap_fill(prio_matrix[i], PRIO_RANGE);
2162     + j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
2163     + for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j) {
2164     + __clear_bit(PRIO_RANGE - 1 - (k / PRIO_RANGE),
2165     + prio_matrix[i]);
2166     + }
2167     + }
2168     +
2169     for_each_possible_cpu(i) {
2170     struct prio_array *array;
2171     struct rq *rq;
2172     @@ -6735,11 +6697,16 @@ void __init sched_init(void)
2173     spin_lock_init(&rq->lock);
2174     lockdep_set_class(&rq->lock, &rq->rq_lock_key);
2175     rq->nr_running = 0;
2176     + rq->prio_rotation = 0;
2177     rq->active = rq->arrays;
2178     rq->expired = rq->arrays + 1;
2179     - rq->best_expired_prio = MAX_PRIO;
2180     + reset_prio_levels(rq);
2181     + rq->dyn_bitmap = rq->active->prio_bitmap;
2182     + rq->exp_bitmap = rq->expired->prio_bitmap;
2183    
2184     #ifdef CONFIG_SMP
2185     + rq->active->rq = rq;
2186     + rq->expired->rq = rq;
2187     rq->sd = NULL;
2188     for (j = 1; j < 3; j++)
2189     rq->cpu_load[j] = 0;
2190     @@ -6752,16 +6719,16 @@ void __init sched_init(void)
2191     atomic_set(&rq->nr_iowait, 0);
2192    
2193     for (j = 0; j < 2; j++) {
2194     +
2195     array = rq->arrays + j;
2196     - for (k = 0; k < MAX_PRIO; k++) {
2197     + for (k = 0; k < MAX_PRIO; k++)
2198     INIT_LIST_HEAD(array->queue + k);
2199     - __clear_bit(k, array->bitmap);
2200     - }
2201     - // delimiter for bitsearch
2202     - __set_bit(MAX_PRIO, array->bitmap);
2203     + bitmap_zero(array->prio_bitmap, MAX_PRIO);
2204     + /* delimiter for bitsearch */
2205     + __set_bit(MAX_PRIO, array->prio_bitmap);
2206     }
2207     - }
2208    
2209     + }
2210     set_load_weight(&init_task);
2211    
2212     #ifdef CONFIG_SMP
2213     @@ -6815,10 +6782,10 @@ EXPORT_SYMBOL(__might_sleep);
2214     #ifdef CONFIG_MAGIC_SYSRQ
2215     void normalize_rt_tasks(void)
2216     {
2217     - struct prio_array *array;
2218     struct task_struct *p;
2219     unsigned long flags;
2220     struct rq *rq;
2221     + int queued;
2222    
2223     read_lock_irq(&tasklist_lock);
2224     for_each_process(p) {
2225     @@ -6828,11 +6795,11 @@ void normalize_rt_tasks(void)
2226     spin_lock_irqsave(&p->pi_lock, flags);
2227     rq = __task_rq_lock(p);
2228    
2229     - array = p->array;
2230     - if (array)
2231     + queued = task_queued(p);
2232     + if (queued)
2233     deactivate_task(p, task_rq(p));
2234     __setscheduler(p, SCHED_NORMAL, 0);
2235     - if (array) {
2236     + if (queued) {
2237     __activate_task(p, task_rq(p));
2238     resched_task(rq->curr);
2239     }
2240     Index: linux-2.6.21-ck2/Documentation/sysctl/kernel.txt
2241     ===================================================================
2242     --- linux-2.6.21-ck2.orig/Documentation/sysctl/kernel.txt 2007-02-05 22:51:59.000000000 +1100
2243     +++ linux-2.6.21-ck2/Documentation/sysctl/kernel.txt 2007-05-14 19:30:30.000000000 +1000
2244     @@ -43,6 +43,7 @@ show up in /proc/sys/kernel:
2245     - printk
2246     - real-root-dev ==> Documentation/initrd.txt
2247     - reboot-cmd [ SPARC only ]
2248     +- rr_interval
2249     - rtsig-max
2250     - rtsig-nr
2251     - sem
2252     @@ -288,6 +289,19 @@ rebooting. ???
2253    
2254     ==============================================================
2255    
2256     +rr_interval:
2257     +
2258     +This is the smallest duration that any cpu process scheduling unit
2259     +will run for. Increasing this value can increase throughput of cpu
2260     +bound tasks substantially but at the expense of increased latencies
2261     +overall. This value is in milliseconds and the default value chosen
2262     +depends on the number of cpus available at scheduler initialisation
2263     +with a minimum of 8.
2264     +
2265     +Valid values are from 1-5000.
2266     +
2267     +==============================================================
2268     +
2269     rtsig-max & rtsig-nr:
2270    
2271     The file rtsig-max can be used to tune the maximum number
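A minimal userspace sketch of the rr_interval tunable documented in the hunk
above (illustrative only, not part of the patch; the value 16 below is an
arbitrary example and writing requires root):

    #include <stdio.h>

    int main(void)
    {
            const char *path = "/proc/sys/kernel/rr_interval";
            FILE *f = fopen(path, "r");
            int val = 0;

            if (!f)
                    return 1;
            if (fscanf(f, "%d", &val) == 1)
                    printf("current rr_interval: %d ms\n", val);
            fclose(f);

            /* Valid values are 1-5000 ms; 16 is just an example. */
            f = fopen(path, "w");
            if (!f)
                    return 1;
            fprintf(f, "%d\n", 16);
            fclose(f);
            return 0;
    }
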
2272     Index: linux-2.6.21-ck2/kernel/sysctl.c
2273     ===================================================================
2274     --- linux-2.6.21-ck2.orig/kernel/sysctl.c 2007-05-03 22:20:57.000000000 +1000
2275     +++ linux-2.6.21-ck2/kernel/sysctl.c 2007-05-14 19:30:30.000000000 +1000
2276     @@ -76,6 +76,7 @@ extern int pid_max_min, pid_max_max;
2277     extern int sysctl_drop_caches;
2278     extern int percpu_pagelist_fraction;
2279     extern int compat_log;
2280     +extern int rr_interval;
2281    
2282     /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
2283     static int maxolduid = 65535;
2284     @@ -159,6 +160,14 @@ int sysctl_legacy_va_layout;
2285     #endif
2286    
2287    
2288     +/* Constants for minimum and maximum testing.
2289     + We use these as one-element integer vectors. */
2290     +static int __read_mostly zero;
2291     +static int __read_mostly one = 1;
2292     +static int __read_mostly one_hundred = 100;
2293     +static int __read_mostly five_thousand = 5000;
2294     +
2295     +
2296     /* The default sysctl tables: */
2297    
2298     static ctl_table root_table[] = {
2299     @@ -499,6 +508,17 @@ static ctl_table kern_table[] = {
2300     .mode = 0444,
2301     .proc_handler = &proc_dointvec,
2302     },
2303     + {
2304     + .ctl_name = CTL_UNNUMBERED,
2305     + .procname = "rr_interval",
2306     + .data = &rr_interval,
2307     + .maxlen = sizeof (int),
2308     + .mode = 0644,
2309     + .proc_handler = &proc_dointvec_minmax,
2310     + .strategy = &sysctl_intvec,
2311     + .extra1 = &one,
2312     + .extra2 = &five_thousand,
2313     + },
2314     #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
2315     {
2316     .ctl_name = KERN_UNKNOWN_NMI_PANIC,
2317     @@ -607,12 +627,6 @@ static ctl_table kern_table[] = {
2318     { .ctl_name = 0 }
2319     };
2320    
2321     -/* Constants for minimum and maximum testing in vm_table.
2322     - We use these as one-element integer vectors. */
2323     -static int zero;
2324     -static int one_hundred = 100;
2325     -
2326     -
2327     static ctl_table vm_table[] = {
2328     {
2329     .ctl_name = VM_OVERCOMMIT_MEMORY,
2330     Index: linux-2.6.21-ck2/fs/pipe.c
2331     ===================================================================
2332     --- linux-2.6.21-ck2.orig/fs/pipe.c 2007-05-03 22:20:56.000000000 +1000
2333     +++ linux-2.6.21-ck2/fs/pipe.c 2007-05-14 19:30:30.000000000 +1000
2334     @@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p
2335     {
2336     DEFINE_WAIT(wait);
2337    
2338     - /*
2339     - * Pipes are system-local resources, so sleeping on them
2340     - * is considered a noninteractive wait:
2341     - */
2342     - prepare_to_wait(&pipe->wait, &wait,
2343     - TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
2344     + prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
2345     if (pipe->inode)
2346     mutex_unlock(&pipe->inode->i_mutex);
2347     schedule();
2348     Index: linux-2.6.21-ck2/Documentation/sched-design.txt
2349     ===================================================================
2350     --- linux-2.6.21-ck2.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.000000000 +1100
2351     +++ linux-2.6.21-ck2/Documentation/sched-design.txt 2007-05-14 19:30:30.000000000 +1000
2352     @@ -1,11 +1,14 @@
2353     - Goals, Design and Implementation of the
2354     - new ultra-scalable O(1) scheduler
2355     + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
2356     + Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by
2357     + Con Kolivas.
2358    
2359    
2360     - This is an edited version of an email Ingo Molnar sent to
2361     - lkml on 4 Jan 2002. It describes the goals, design, and
2362     - implementation of Ingo's new ultra-scalable O(1) scheduler.
2363     - Last Updated: 18 April 2002.
2364     + This was originally an edited version of an email Ingo Molnar sent to
2365     + lkml on 4 Jan 2002. It describes the goals, design, and implementation
2366     + of Ingo's ultra-scalable O(1) scheduler. It now contains a description
2367     + of the Staircase Deadline priority scheduler that was built on this
2368     + design.
2369     + Last Updated: Fri, 4 May 2007
2370    
2371    
2372     Goal
2373     @@ -163,3 +166,222 @@ certain code paths and data constructs.
2374     code is smaller than the old one.
2375    
2376     Ingo
2377     +
2378     +
2379     +Staircase Deadline cpu scheduler policy
2380     +================================================
2381     +
2382     +Design summary
2383     +==============
2384     +
2385     +A novel design which incorporates a foreground-background descending priority
2386     +system (the staircase) via a bandwidth allocation matrix according to nice
2387     +level.
2388     +
2389     +
2390     +Features
2391     +========
2392     +
2393     +A starvation free, strict fairness O(1) scalable design with interactivity
2394     +as good as the above restrictions can provide. There is no interactivity
2395     +estimator, no sleep/run measurements and only simple fixed accounting.
2396     +The design has strict enough a design and accounting that task behaviour
2397     +can be modelled and maximum scheduling latencies can be predicted by
2398     +the virtual deadline mechanism that manages runqueues. The prime concern
2399     +in this design is to maintain fairness at all costs determined by nice level,
2400     +yet to maintain as good interactivity as can be allowed within the
2401     +constraints of strict fairness.
2402     +
2403     +
2404     +Design description
2405     +==================
2406     +
2407     +SD works off the principle of providing each task a quota of runtime that it is
2408     +allowed to run at a number of priority levels determined by its static priority
2409     +(ie. its nice level). If the task uses up its quota it has its priority
2410     +decremented to the next level determined by a priority matrix. Once every
2411     +runtime quota has been consumed of every priority level, a task is queued on the
2412     +"expired" array. When no other tasks exist with quota, the expired array is
2413     +activated and fresh quotas are handed out. This is all done in O(1).
2414     +
2415     +Design details
2416     +==============
2417     +
2418     +Each task keeps a record of its own entitlement of cpu time. Most of the rest of
2419     +these details apply to non-realtime tasks as rt task management is straight
2420     +forward.
2421     +
2422     +Each runqueue keeps a record of what major epoch it is up to in the
2423     +rq->prio_rotation field which is incremented on each major epoch. It also
2424     +keeps a record of the current prio_level for each static priority task.
2425     +
2426     +Each task keeps a record of what major runqueue epoch it was last running
2427     +on in p->rotation. It also keeps a record of what priority levels it has
2428     +already been allocated quota from during this epoch in a bitmap p->bitmap.
2429     +
2430     +The only tunable that determines all other details is the RR_INTERVAL. This
2431     +is set to 8ms, and is scaled gently upwards with more cpus. This value is
2432     +tunable via a /proc interface.
2433     +
2434     +All tasks are initially given a quota based on RR_INTERVAL. This is equal to
2435     +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
2436     +progressively larger for nice values from -1 to -20. This is assigned to
2437     +p->quota and only changes with changes in nice level.
2438     +
2439     +As a task is first queued, it checks in recalc_task_prio to see if it has run at
2440     +this runqueue's current priority rotation. If it has not, it will have its
2441     +p->prio level set according to the first slot in a "priority matrix" and will be
2442     +given a p->time_slice equal to the p->quota, and has its allocation bitmap bit
2443     +set in p->bitmap for this prio level. It is then queued on the current active
2444     +priority array.
2445     +
2446     +If a task has already been running during this major epoch, and it has
2447     +p->time_slice left and the rq->prio_quota for the task's p->prio still
2448     +has quota, it will be placed back on the active array, but no more quota
2449     +will be added.
2450     +
2451     +If a task has been running during this major epoch, but does not have
2452     +p->time_slice left, it will find the next lowest priority in its bitmap that it
2453     +has not been allocated quota from. It then gets a full quota in
2454     +p->time_slice. It is then queued on the current active priority array at the
2455     +newly determined lower priority.
2456     +
2457     +If a task has been running during this major epoch, and does not have
2458     +any entitlement left in p->bitmap and no time_slice left, it will have its
2459     +bitmap cleared, and be queued at its best prio again, but on the expired
2460     +priority array.
2461     +
2462     +When a task is queued, it has its relevant bit set in the array->prio_bitmap.
2463     +
2464     +p->time_slice is accounted in microseconds and is updated via update_cpu_clock
2465     +on schedule() and scheduler_tick. If p->time_slice drops below zero then the
2466     +task's priority is readjusted via recalc_task_prio and it is rescheduled.
2467     +
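The per-task bookkeeping described above can be sketched in plain C. This is an
illustrative userspace model only, not the kernel code: the names sd_task,
next_prio_slot and refill are invented for the example, quota is in
milliseconds, and the nice-level restrictions of the priority matrix (next
section) are omitted for brevity.

    #include <stdio.h>
    #include <string.h>

    #define PRIO_RANGE 40

    struct sd_task {
            long quota;                     /* per-slot entitlement (ms) */
            long time_slice;                /* remaining entitlement (ms) */
            unsigned char used[PRIO_RANGE]; /* stands in for p->bitmap */
    };

    /* Find the first priority slot this task has not drawn quota from. */
    static int next_prio_slot(struct sd_task *t)
    {
            int i;

            for (i = 0; i < PRIO_RANGE; i++)
                    if (!t->used[i])
                            return i;
            return -1;      /* every slot used this epoch */
    }

    /* Refill time_slice from the next unused slot, or signal "expired". */
    static int refill(struct sd_task *t)
    {
            int slot = next_prio_slot(t);

            if (slot < 0) {
                    memset(t->used, 0, sizeof(t->used));    /* fresh epoch */
                    return -1;  /* task would now move to the expired array */
            }
            t->used[slot] = 1;
            t->time_slice += t->quota;
            return slot;
    }

    int main(void)
    {
            struct sd_task t = { .quota = 8, .time_slice = 0 };
            int slot;

            while ((slot = refill(&t)) >= 0) {
                    printf("running at slot %d with %ld ms entitlement\n",
                           slot, t.time_slice);
                    t.time_slice = 0;   /* pretend the slice was consumed */
            }
            printf("quota exhausted for this epoch -> expired array\n");
            return 0;
    }
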
2468     +
2469     +Priority Matrix
2470     +===============
2471     +
2472     +In order to minimise the latencies between tasks of different nice levels
2473     +running concurrently, the dynamic priority slots where different nice levels
2474     +are queued are dithered instead of being sequential. What this means is that
2475     +there are 40 priority slots where a task may run during one major rotation,
2476     +and the allocation of slots is dependent on nice level. In the
2477     +following table, a zero represents a slot where the task may run.
2478     +
2479     +PRIORITY:0..................20.................39
2480     +nice -20 0000000000000000000000000000000000000000
2481     +nice -10 1000100010001000100010001000100010010000
2482     +nice 0 1010101010101010101010101010101010101010
2483     +nice 5 1011010110110101101101011011010110110110
2484     +nice 10 1110111011101110111011101110111011101110
2485     +nice 15 1111111011111110111111101111111011111110
2486     +nice 19 1111111111111111111111111111111111111110
2487     +
2488     +As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
2489     +task only runs one slot per major rotation. This dithered table allows for the
2490     +smallest possible maximum latencies between tasks of varying nice levels, thus
2491     +allowing vastly different nice levels to be used.
2492     +
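The dithered table above can be reproduced with essentially the same loop this
patch uses in sched_init() to generate prio_matrix[]. The sketch below is a
standalone userspace version, using a plain char array in place of the kernel
bitmap operations:

    #include <stdio.h>

    #define PRIO_RANGE 40

    static char prio_matrix[PRIO_RANGE][PRIO_RANGE];

    int main(void)
    {
            int i, j, k;

            for (i = 0; i < PRIO_RANGE; i++) {
                    /* Start with every slot disallowed ('1')... */
                    for (k = 0; k < PRIO_RANGE; k++)
                            prio_matrix[i][k] = '1';
                    /* ...then clear ('0') evenly spaced slots; better nice
                     * levels (smaller i) get more slots, spread across the
                     * range rather than packed together. */
                    j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
                    for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j)
                            prio_matrix[i][PRIO_RANGE - 1 - (k / PRIO_RANGE)] = '0';
            }

            for (i = 0; i < PRIO_RANGE; i++)
                    printf("nice %3d %.*s\n", i - 20, PRIO_RANGE, prio_matrix[i]);
            return 0;
    }

For nice -20 (i == 0) every slot is cleared, while nice 0 (i == 20) produces
the alternating 1010... row shown in the table above.
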
2493     +SCHED_BATCH tasks are managed slightly differently, receiving only the top
2494     +slots from their priority bitmap, giving them the same cpu share as
2495     +SCHED_NORMAL but slightly higher latencies.
2496     +
2497     +
2498     +Modelling deadline behaviour
2499     +============================
2500     +
2501     +As the accounting in this design is hard and not modified by sleep average
2502     +calculations or interactivity modifiers, it is possible to accurately
2503     +predict the maximum latency that a task may experience under different
2504     +conditions. This is a virtual deadline mechanism enforced by mandatory
2505     +timeslice expiration and not outside bandwidth measurement.
2506     +
2507     +The maximum duration a task can run during one major epoch is determined by its
2508     +nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
2509     +duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
2510     +epoch, and so on. The table in the priority matrix above demonstrates how this
2511     +is enforced.
2512     +
2513     +Therefore the maximum duration a runqueue epoch can take is determined by
2514     +the number of tasks running, and their nice level. Beyond that, the maximum
2515     +duration a task can wait before it gets scheduled is determined by the
2516     +position of its first slot on the matrix.
2517     +
2518     +In the following examples, these are _worst case scenarios_ and would rarely
2519     +occur, but can be modelled nonetheless to determine the maximum possible
2520     +latency.
2521     +
2522     +So for example, if two nice 0 tasks are running, and one has just expired as
2523     +another is activated for the first time receiving a full quota for this
2524     +runqueue rotation, the first task will wait:
2525     +
2526     +nr_tasks * max_duration + nice_difference * rr_interval
2527     +1 * 19 * RR_INTERVAL + 0 = 152ms
2528     +
2529     +In the presence of a nice 10 task, a nice 0 task would wait a maximum of
2530     +1 * 10 * RR_INTERVAL + 0 = 80ms
2531     +
2532     +In the presence of a nice 0 task, a nice 10 task would wait a maximum of
2533     +1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms
2534     +
2535     +More useful than these values, though, are the average latencies which are
2536     +a matter of determining the average distance between priority slots of
2537     +different nice values and multiplying them by the tasks' quota. For example
2538     +in the presence of a nice -10 task, a nice 0 task will wait either one or
2539     +two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
2540     +this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or
2541     +20 and 40ms respectively (on uniprocessor at 1000HZ).
2542     +
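The worst-case figures above follow directly from the stated formula
nr_tasks * max_duration + nice_difference * rr_interval. The small
self-contained helper below makes this concrete; the function name and
parameters are invented for the example, and RR_INTERVAL is the default 8ms
uniprocessor value:

    #include <stdio.h>

    #define RR_INTERVAL 8   /* ms, default uniprocessor value */

    /*
     * nr_tasks:  number of competing tasks ahead of the waiting task
     * slots:     priority slots each of those tasks may run in per epoch
     * nice_diff: extra slots conceded to better-niced tasks
     */
    static int worst_case_wait_ms(int nr_tasks, int slots, int nice_diff)
    {
            return nr_tasks * slots * RR_INTERVAL + nice_diff * RR_INTERVAL;
    }

    int main(void)
    {
            /* the three scenarios worked through above */
            printf("%dms\n", worst_case_wait_ms(1, 19, 0)); /* 152ms */
            printf("%dms\n", worst_case_wait_ms(1, 10, 0)); /*  80ms */
            printf("%dms\n", worst_case_wait_ms(1, 19, 1)); /* 160ms */
            return 0;
    }
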
2543     +
2544     +Achieving interactivity
2545     +=======================
2546     +
2547     +A requirement of this scheduler design was to achieve good interactivity
2548     +despite being a completely fair deadline based design. The disadvantage of
2549     +designs that try to achieve interactivity is that they usually do so at
2550     +the expense of maintaining fairness. As cpu speeds increase, the requirement
2551     +for some sort of metered unfairness towards interactive tasks becomes a less
2552     +desirable phenomenon, but low latency and fairness remains mandatory to
2553     +good interactive performance.
2554     +
2555     +This design relies on the fact that interactive tasks, by their nature,
2556     +sleep often. Most fair scheduling designs end up penalising such tasks
2557     +indirectly giving them less than their fair possible share because of the
2558     +sleep, and have to use a mechanism of bonusing their priority to offset
2559     +this based on the duration they sleep. This becomes increasingly inaccurate
2560     +as the number of running tasks rises and more tasks spend time waiting on
2561     +runqueues rather than sleeping, and it is impossible to tell whether the
2562     +task that's waiting on a runqueue only intends to run for a short period and
2563     +then sleep again after than runqueue wait. Furthermore, all such designs rely
2564     +on a period of time to pass to accumulate some form of statistic on the task
2565     +before deciding on how much to give them preference. The shorter this period,
2566     +the more rapidly bursts of cpu ruin the interactive tasks behaviour. The
2567     +longer this period, the longer it takes for interactive tasks to get low
2568     +scheduling latencies and fair cpu.
2569     +
2570     +This design does not measure sleep time at all. Interactive tasks that sleep
2571     +often will wake up having consumed very little if any of their quota for
2572     +the current major priority rotation. The longer they have slept, the less
2573     +likely they are to even be on the current major priority rotation. Once
2574     +woken up, though, they get to use up their full quota for that epoch,
2575     +whether part of a quota remains or a full one. Overall, however, they
2576     +can still only run as much cpu time for that epoch as any other task of the
2577     +same nice level. This means that two tasks behaving completely differently
2578     +from fully cpu bound to waking/sleeping extremely frequently will still
2579     +get the same quota of cpu, but the latter will be using its quota for that
2580     +epoch in bursts rather than continuously. This guarantees that interactive
2581     +tasks get the same amount of cpu as cpu bound ones.
2582     +
2583     +The other requirement of interactive tasks is also to obtain low latencies
2584     +for when they are scheduled. Unlike fully cpu bound tasks and the maximum
2585     +latencies possible described in the modelling deadline behaviour section
2586     +above, tasks that sleep will wake up with quota available usually at the
2587     +current runqueue's priority_level or better. This means that the most latency
2588     +they are likely to see is one RR_INTERVAL, and often they will preempt the
2589     +current task if it is not of a sleeping nature. This then guarantees very
2590     +low latency for interactive tasks, and the lowest latencies for the least
2591     +cpu bound tasks.
2592     +
2593     +
2594     +Fri, 4 May 2007
2595     +Con Kolivas <kernel@kolivas.org>
2596     Index: linux-2.6.21-ck2/kernel/softirq.c
2597     ===================================================================
2598     --- linux-2.6.21-ck2.orig/kernel/softirq.c 2007-05-03 22:20:57.000000000 +1000
2599     +++ linux-2.6.21-ck2/kernel/softirq.c 2007-05-14 19:30:30.000000000 +1000
2600     @@ -488,7 +488,7 @@ void __init softirq_init(void)
2601    
2602     static int ksoftirqd(void * __bind_cpu)
2603     {
2604     - set_user_nice(current, 19);
2605     + set_user_nice(current, 15);
2606     current->flags |= PF_NOFREEZE;
2607    
2608     set_current_state(TASK_INTERRUPTIBLE);