Magellan Linux

Contents of /trunk/kernel26-alx/patches-2.6.21-r14/0001-2.6.21-sd-0.48.patch



Revision 447
Tue Jan 22 17:55:52 2008 UTC by niro
File size: 89890 byte(s)
-2.6.21-alx-r14 - fixed some natsemi errors on wys terminals

1 Staircase Deadline cpu scheduler policy
2 ================================================
3
4 Design summary
5 ==============
6
7 A novel design which incorporates a foreground-background descending priority
8 system (the staircase) via a bandwidth allocation matrix according to nice
9 level.
10
11
12 Features
13 ========
14
15 A starvation-free, strictly fair, O(1) scalable design with interactivity
16 as good as those constraints can provide. There is no interactivity
17 estimator, no sleep/run measurements and only simple fixed accounting.
18 The accounting is strict enough that task behaviour can be modelled and
19 maximum scheduling latencies can be predicted by the virtual deadline
20 mechanism that manages runqueues. The prime concern in this design is to
21 maintain fairness at all costs, as determined by nice level, yet to provide
22 as good interactivity as can be allowed within the constraints of strict
23 fairness.
24
25
26 Design description
27 ==================
28
29 SD works off the principle of providing each task a quota of runtime that it
30 may consume at each of a number of priority levels determined by its static
31 priority (ie. its nice level). If the task uses up its quota, its priority is
32 decremented to the next level determined by a priority matrix. Once the
33 runtime quota of every priority level has been consumed, the task is queued
34 on the "expired" array. When no other tasks exist with quota, the expired
35 array is activated and fresh quotas are handed out. This is all done in O(1).
36
37 Design details
38 ==============
39
40 Each task keeps a record of its own entitlement of cpu time. Most of the rest
41 of these details apply to non-realtime tasks, as rt task management is
42 straightforward.
43
44 Each runqueue keeps a record of what major epoch it is up to in the
45 rq->prio_rotation field, which is incremented on each major epoch. It also
46 keeps a record of the current prio_level for each static priority level.
47
48 Each task keeps a record of what major runqueue epoch it was last running
49 on in p->rotation. It also keeps a record of what priority levels it has
50 already been allocated quota from during this epoch in a bitmap p->bitmap.
51
52 The only tunable that determines all other details is the RR_INTERVAL. This
53 is set to 8ms, and is scaled gently upwards with more cpus. This value is
54 tunable via a /proc interface.
55
56 All tasks are initially given a quota based on RR_INTERVAL. This is equal to
57 RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
58 progressively larger for nice values below -6 (ten times RR_INTERVAL at nice
59 -20). This is assigned to p->quota and only changes with changes in nice level.
60
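As a rough illustration of the quota rule just described, here is a standalone
userspace sketch (not kernel code; the kernel's rr_quota() in this patch applies
the same scaling, but this program and its helper name are made up for
illustration). It prints the per-priority-level quota for a range of nice values
with the default 8ms RR_INTERVAL:

    #include <stdio.h>

    static int rr_interval = 8;    /* ms, the scheduler's only tunable */

    /*
     * Per-priority-level quota in microseconds, following the rule above:
     * nice -6..0 get rr_interval, nice 1..19 get half of it, and nice
     * values below -6 scale up as nice^2/40 times rr_interval.
     */
    static unsigned int quota_us(int nice)
    {
        int rr = rr_interval;

        if (nice < -6) {
            rr *= nice * nice;
            rr /= 40;
        } else if (nice > 0)
            rr = rr / 2 ? rr / 2 : 1;
        return rr * 1000;
    }

    int main(void)
    {
        int nice;

        for (nice = -20; nice <= 19; nice += 5)
            printf("nice %3d -> quota %6u us\n", nice, quota_us(nice));
        return 0;
    }
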
61 When a task is first queued, it checks in recalc_task_prio to see if it has
62 run during this runqueue's current priority rotation. If it has not, it will
63 have its p->prio level set according to the first slot in a "priority matrix",
64 will be given a p->time_slice equal to the p->quota, and will have its
65 allocation bitmap bit set in p->bitmap for this prio level. It is then queued
66 on the current active priority array.
67
68 If a task has already been running during this major epoch, and it has
69 p->time_slice left and the rq->prio_quota for the task's p->prio still
70 has quota, it will be placed back on the active array, but no more quota
71 will be added.
72
73 If a task has been running during this major epoch, but does not have
74 p->time_slice left, it will find the next lowest priority in its bitmap that it
75 has not been allocated quota from. It then gets a full quota in
76 p->time_slice. It is then queued on the current active priority array at the
77 newly determined lower priority.
78
79 If a task has been running during this major epoch, and does not have
80 any entitlement left in p->bitmap and no time_slice left, it will have its
81 bitmap cleared, and be queued at its best prio again, but on the expired
82 priority array.
83
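The requeueing rules above can be condensed into a small decision routine. The
following is a toy, single-task userspace model with simplified stand-ins for
the kernel's recalc_task_prio(), first_prio_slot() and next_entitled_slot()
(a 40-bit mask stands in for one prio_matrix row); it walks a nice 0 task
through one major epoch:

    #include <stdio.h>
    #include <stdint.h>

    #define PRIO_RANGE 40

    struct toy_task {
        uint64_t used;          /* priority slots already drawn from this epoch */
        unsigned long rotation; /* last major epoch this task ran in */
        int time_slice, quota;  /* microseconds */
        int prio;               /* current priority slot, 0..39 */
        int expired;            /* queued on the expired array */
    };

    /* First matrix slot (lowest-numbered 0 bit) for this nice level. */
    static int first_slot(uint64_t matrix)
    {
        int i;

        for (i = 0; i < PRIO_RANGE; i++)
            if (!(matrix & (1ULL << i)))
                return i;
        return -1;
    }

    /* Next slot in the matrix that this task has not drawn quota from yet. */
    static int next_slot(uint64_t matrix, uint64_t used, int from)
    {
        int i;

        for (i = from; i < PRIO_RANGE; i++)
            if (!((matrix | used) & (1ULL << i)))
                return i;
        return -1;
    }

    static void requeue(struct toy_task *p, uint64_t matrix,
                        unsigned long rq_rotation)
    {
        if (p->rotation != rq_rotation) {
            /* First run this major epoch: fresh bitmap, first matrix slot. */
            p->used = 0;
            p->expired = 0;
            p->rotation = rq_rotation;
            p->time_slice = p->quota;
            p->prio = first_slot(matrix);
        } else if (p->time_slice > 0) {
            /* Entitlement left at this priority: requeue unchanged. */
        } else {
            int next = next_slot(matrix, p->used, p->prio + 1);

            if (next < 0) {
                /* No entitlement left: reset bitmap, go to the expired array. */
                p->used = 0;
                p->time_slice = p->quota;
                p->prio = first_slot(matrix);
                p->expired = 1;
                return;
            }
            /* Quota used at this level: drop to the next entitled slot. */
            p->prio = next;
            p->time_slice = p->quota;
        }
        p->used |= 1ULL << p->prio;
    }

    int main(void)
    {
        uint64_t nice0 = 0;     /* bit set = slot unusable (the '1's in the table) */
        struct toy_task t = { .quota = 8000 };
        int i;

        for (i = 0; i < PRIO_RANGE; i += 2)     /* nice 0 row: 101010...10 */
            nice0 |= 1ULL << i;

        requeue(&t, nice0, 1);
        while (!t.expired) {
            printf("runs at slot %d\n", t.prio);
            t.time_slice = 0;                   /* pretend the quota was consumed */
            requeue(&t, nice0, 1);
        }
        printf("expired; requeued at slot %d on the expired array\n", t.prio);
        return 0;
    }
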
84 When a task is queued, it has its relevant bit set in the array->prio_bitmap.
85
86 p->time_slice is accounted in microseconds and is updated via update_cpu_clock
87 on schedule() and scheduler_tick. If p->time_slice falls below zero, the task's
88 priority is recalculated via recalc_task_prio and the task is rescheduled.
89
90
91 Priority Matrix
92 ===============
93
94 In order to minimise the latencies between tasks of different nice levels
95 running concurrently, the dynamic priority slots where different nice levels
96 are queued are dithered instead of being sequential. What this means is that
97 there are 40 priority slots where a task may run during one major rotation,
98 and the allocation of slots is dependent on nice level. In the
99 following table, a zero represents a slot where the task may run.
100
101 PRIORITY:0..................20.................39
102 nice -20 0000000000000000000000000000000000000000
103 nice -10 1000100010001000100010001000100010010000
104 nice 0 1010101010101010101010101010101010101010
105 nice 5 1011010110110101101101011011010110110110
106 nice 10 1110111011101110111011101110111011101110
107 nice 15 1111111011111110111111101111111011111110
108 nice 19 1111111111111111111111111111111111111110
109
110 As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
111 task runs in only one slot per major rotation. This dithered table allows for
112 the smallest possible maximum latencies between tasks of varying nice levels,
113 thus allowing vastly different nice levels to be used.
114
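A small standalone sketch (illustrative only, not part of the patch) that
decodes the rows of the table above and reports, for each nice level shown,
its first run slot and the number of run slots it receives per major rotation:

    #include <stdio.h>
    #include <string.h>

    static const char *rows[] = {
        "0000000000000000000000000000000000000000",    /* nice -20 */
        "1000100010001000100010001000100010010000",    /* nice -10 */
        "1010101010101010101010101010101010101010",    /* nice   0 */
        "1110111011101110111011101110111011101110",    /* nice  10 */
        "1111111111111111111111111111111111111110",    /* nice  19 */
    };
    static const int nice_of[] = { -20, -10, 0, 10, 19 };

    int main(void)
    {
        unsigned int i;
        size_t slot;

        for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
            int first = -1, nr = 0;

            for (slot = 0; slot < strlen(rows[i]); slot++) {
                if (rows[i][slot] == '0') {    /* '0' = may run in this slot */
                    if (first < 0)
                        first = (int)slot;
                    nr++;
                }
            }
            printf("nice %3d: first slot %2d, %2d run slots per rotation\n",
                   nice_of[i], first, nr);
        }
        return 0;
    }
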
115 SCHED_BATCH tasks are managed slightly differently, receiving only the top
116 slots from their priority bitmap, giving them the same cpu share as
117 SCHED_NORMAL tasks but slightly higher latencies.
118
119
120 Modelling deadline behaviour
121 ============================
122
123 As the accounting in this design is strict and not modified by sleep average
124 calculations or interactivity modifiers, it is possible to accurately
125 predict the maximum latency that a task may experience under different
126 conditions. This is a virtual deadline mechanism enforced by mandatory
127 timeslice expiration rather than by external bandwidth measurement.
128
129 The maximum duration a task can run during one major epoch is determined by its
130 nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
131 duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
132 epoch, and so on. The table in the priority matrix above demonstrates how this
133 is enforced.
134
135 Therefore the maximum duration a runqueue epoch can take is determined by
136 the number of tasks running and their nice levels. Beyond that, the maximum
137 time a task can wait before it gets scheduled is determined by the position
138 of its first slot on the matrix.
139
140 The following examples are _worst case scenarios_ that would rarely
141 occur, but they can be modelled nonetheless to determine the maximum
142 possible latency.
143
144 So for example, if two nice 0 tasks are running, and one has just expired as
145 another is activated for the first time receiving a full quota for this
146 runqueue rotation, the first task will wait:
147
148 nr_tasks * max_duration + nice_difference * rr_interval
149 1 * 19 * RR_INTERVAL + 0 = 152ms
150
151 In the presence of a nice 10 task, a nice 0 task would wait a maximum of
152 1 * 10 * RR_INTERVAL + 0 = 80ms
153
154 In the presence of a nice 0 task, a nice 10 task would wait a maximum of
155 1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms
156
157 More useful than these values, though, are the average latencies, which are
158 a matter of determining the average distance between priority slots of
159 different nice values and multiplying it by the tasks' quota. For example,
160 in the presence of a nice -10 task, a nice 0 task will wait either one or
161 two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
162 this means the latencies will alternate between 2.5 and 5 RR_INTERVALs, or
163 20 and 40ms respectively (on a uniprocessor at 1000 Hz).
164
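The worst-case arithmetic above reduces to a one-line helper. This standalone
sketch (illustrative only) reproduces the three figures from the examples,
assuming the default 8ms rr_interval:

    #include <stdio.h>

    static int rr_interval = 8;    /* ms */

    /* wait = nr_tasks * max_duration + nice_difference * rr_interval,
     * with max_duration expressed as slots * rr_interval as above. */
    static int worst_case_wait(int nr_tasks, int slots, int nice_difference)
    {
        return nr_tasks * slots * rr_interval + nice_difference * rr_interval;
    }

    int main(void)
    {
        /* two nice 0 tasks: 1 * 19 * 8 + 0 = 152 ms */
        printf("nice 0 behind nice 0:  %d ms\n", worst_case_wait(1, 19, 0));
        /* nice 0 behind a nice 10 task: 1 * 10 * 8 + 0 = 80 ms */
        printf("nice 0 behind nice 10: %d ms\n", worst_case_wait(1, 10, 0));
        /* nice 10 behind a nice 0 task: 1 * 19 * 8 + 1 * 8 = 160 ms */
        printf("nice 10 behind nice 0: %d ms\n", worst_case_wait(1, 19, 1));
        return 0;
    }
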
165
166 Achieving interactivity
167 =======================
168
169 A requirement of this scheduler design was to achieve good interactivity
170 despite being a completely fair deadline-based design. The disadvantage of
171 designs that try to achieve interactivity is that they usually do so at
172 the expense of maintaining fairness. As cpu speeds increase, the requirement
173 for some sort of metered unfairness towards interactive tasks becomes a less
174 desirable phenomenon, but low latency and fairness remain mandatory for
175 good interactive performance.
176
177 This design relies on the fact that interactive tasks, by their nature,
178 sleep often. Most fair scheduling designs end up penalising such tasks
179 indirectly, giving them less than their possible fair share because of the
180 sleep, and have to use a mechanism of bonusing their priority to offset
181 this based on the duration they sleep. This becomes increasingly inaccurate
182 as the number of running tasks rises and more tasks spend time waiting on
183 runqueues rather than sleeping, and it is impossible to tell whether a
184 task that's waiting on a runqueue only intends to run for a short period and
185 then sleep again after that runqueue wait. Furthermore, all such designs rely
186 on a period of time passing to accumulate some form of statistic on the task
187 before deciding how much preference to give it. The shorter this period,
188 the more rapidly bursts of cpu use ruin the interactive task's behaviour. The
189 longer this period, the longer it takes for interactive tasks to get low
190 scheduling latencies and a fair share of cpu.
191
192 This design does not measure sleep time at all. Interactive tasks that sleep
193 often will wake up having consumed very little if any of their quota for
194 the current major priority rotation. The longer they have slept, the less
195 likely they are to even be on the current major priority rotation. Once
196 woken up, though, they get to use whatever quota they have for that epoch,
197 whether part of a quota remains or a full quota is available. Overall, they
198 can still only run as much cpu time for that epoch as any other task of the
199 same nice level. This means that two tasks behaving completely differently,
200 from fully cpu bound to waking/sleeping extremely frequently, will still
201 get the same quota of cpu, but the latter will be using its quota for that
202 epoch in bursts rather than continuously. This guarantees that interactive
203 tasks get the same amount of cpu as cpu bound ones.
204
205 The other requirement of interactive tasks is to obtain low latencies when
206 they are scheduled. Unlike fully cpu bound tasks, which can see the maximum
207 latencies described in the modelling deadline behaviour section above,
208 tasks that sleep will usually wake up with quota available at the
209 current runqueue's prio_level or better. This means that the most latency
210 they are likely to see is one RR_INTERVAL, and often they will preempt the
211 current task if it is not of a sleeping nature. This then guarantees very
212 low latency for interactive tasks, and the lowest latencies for the least
213 cpu bound tasks.
214
215
216 Fri, 4 May 2007
217
218 Signed-off-by: Con Kolivas <kernel@kolivas.org>
219
220 ---
221 Documentation/sched-design.txt | 234 +++++++
222 Documentation/sysctl/kernel.txt | 14
223 fs/pipe.c | 7
224 fs/proc/array.c | 2
225 include/linux/init_task.h | 4
226 include/linux/sched.h | 32 -
227 kernel/sched.c | 1277 +++++++++++++++++++---------------------
228 kernel/softirq.c | 2
229 kernel/sysctl.c | 26
230 kernel/workqueue.c | 2
231 10 files changed, 908 insertions(+), 692 deletions(-)
232
233 Index: linux-2.6.21-ck2/kernel/workqueue.c
234 ===================================================================
235 --- linux-2.6.21-ck2.orig/kernel/workqueue.c 2007-05-03 22:20:57.000000000 +1000
236 +++ linux-2.6.21-ck2/kernel/workqueue.c 2007-05-14 19:30:30.000000000 +1000
237 @@ -355,8 +355,6 @@ static int worker_thread(void *__cwq)
238 if (!cwq->freezeable)
239 current->flags |= PF_NOFREEZE;
240
241 - set_user_nice(current, -5);
242 -
243 /* Block and flush all signals */
244 sigfillset(&blocked);
245 sigprocmask(SIG_BLOCK, &blocked, NULL);
246 Index: linux-2.6.21-ck2/fs/proc/array.c
247 ===================================================================
248 --- linux-2.6.21-ck2.orig/fs/proc/array.c 2007-05-03 22:20:56.000000000 +1000
249 +++ linux-2.6.21-ck2/fs/proc/array.c 2007-05-14 19:30:30.000000000 +1000
250 @@ -165,7 +165,6 @@ static inline char * task_state(struct t
251 rcu_read_lock();
252 buffer += sprintf(buffer,
253 "State:\t%s\n"
254 - "SleepAVG:\t%lu%%\n"
255 "Tgid:\t%d\n"
256 "Pid:\t%d\n"
257 "PPid:\t%d\n"
258 @@ -173,7 +172,6 @@ static inline char * task_state(struct t
259 "Uid:\t%d\t%d\t%d\t%d\n"
260 "Gid:\t%d\t%d\t%d\t%d\n",
261 get_task_state(p),
262 - (p->sleep_avg/1024)*100/(1020000000/1024),
263 p->tgid, p->pid,
264 pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
265 pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,
266 Index: linux-2.6.21-ck2/include/linux/init_task.h
267 ===================================================================
268 --- linux-2.6.21-ck2.orig/include/linux/init_task.h 2007-05-03 22:20:57.000000000 +1000
269 +++ linux-2.6.21-ck2/include/linux/init_task.h 2007-05-14 19:30:30.000000000 +1000
270 @@ -102,13 +102,15 @@ extern struct group_info init_groups;
271 .prio = MAX_PRIO-20, \
272 .static_prio = MAX_PRIO-20, \
273 .normal_prio = MAX_PRIO-20, \
274 + .rotation = 0, \
275 .policy = SCHED_NORMAL, \
276 .cpus_allowed = CPU_MASK_ALL, \
277 .mm = NULL, \
278 .active_mm = &init_mm, \
279 .run_list = LIST_HEAD_INIT(tsk.run_list), \
280 .ioprio = 0, \
281 - .time_slice = HZ, \
282 + .time_slice = 1000000000, \
283 + .quota = 1000000000, \
284 .tasks = LIST_HEAD_INIT(tsk.tasks), \
285 .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
286 .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
287 Index: linux-2.6.21-ck2/include/linux/sched.h
288 ===================================================================
289 --- linux-2.6.21-ck2.orig/include/linux/sched.h 2007-05-03 22:20:57.000000000 +1000
290 +++ linux-2.6.21-ck2/include/linux/sched.h 2007-05-14 19:30:30.000000000 +1000
291 @@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co
292 #define EXIT_ZOMBIE 16
293 #define EXIT_DEAD 32
294 /* in tsk->state again */
295 -#define TASK_NONINTERACTIVE 64
296 -#define TASK_DEAD 128
297 +#define TASK_DEAD 64
298
299 #define __set_task_state(tsk, state_value) \
300 do { (tsk)->state = (state_value); } while (0)
301 @@ -522,8 +521,9 @@ struct signal_struct {
302
303 #define MAX_USER_RT_PRIO 100
304 #define MAX_RT_PRIO MAX_USER_RT_PRIO
305 +#define PRIO_RANGE (40)
306
307 -#define MAX_PRIO (MAX_RT_PRIO + 40)
308 +#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)
309
310 #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO)
311 #define rt_task(p) rt_prio((p)->prio)
312 @@ -788,13 +788,6 @@ struct mempolicy;
313 struct pipe_inode_info;
314 struct uts_namespace;
315
316 -enum sleep_type {
317 - SLEEP_NORMAL,
318 - SLEEP_NONINTERACTIVE,
319 - SLEEP_INTERACTIVE,
320 - SLEEP_INTERRUPTED,
321 -};
322 -
323 struct prio_array;
324
325 struct task_struct {
326 @@ -814,20 +807,33 @@ struct task_struct {
327 int load_weight; /* for niceness load balancing purposes */
328 int prio, static_prio, normal_prio;
329 struct list_head run_list;
330 + /*
331 + * This bitmap shows what priorities this task has received quota
332 + * from for this major priority rotation on its current runqueue.
333 + */
334 + DECLARE_BITMAP(bitmap, PRIO_RANGE + 1);
335 struct prio_array *array;
336 + /* Which major runqueue rotation did this task run */
337 + unsigned long rotation;
338
339 unsigned short ioprio;
340 #ifdef CONFIG_BLK_DEV_IO_TRACE
341 unsigned int btrace_seq;
342 #endif
343 - unsigned long sleep_avg;
344 unsigned long long timestamp, last_ran;
345 unsigned long long sched_time; /* sched_clock time spent running */
346 - enum sleep_type sleep_type;
347
348 unsigned long policy;
349 cpumask_t cpus_allowed;
350 - unsigned int time_slice, first_time_slice;
351 + /*
352 + * How much this task is entitled to run at the current priority
353 + * before being requeued at a lower priority.
354 + */
355 + int time_slice;
356 + /* Is this the very first time_slice this task has ever run. */
357 + unsigned int first_time_slice;
358 + /* How much this task receives at each priority level */
359 + int quota;
360
361 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
362 struct sched_info sched_info;
363 Index: linux-2.6.21-ck2/kernel/sched.c
364 ===================================================================
365 --- linux-2.6.21-ck2.orig/kernel/sched.c 2007-05-03 22:20:57.000000000 +1000
366 +++ linux-2.6.21-ck2/kernel/sched.c 2007-05-14 19:30:30.000000000 +1000
367 @@ -16,6 +16,7 @@
368 * by Davide Libenzi, preemptible kernel bits by Robert Love.
369 * 2003-09-03 Interactivity tuning by Con Kolivas.
370 * 2004-04-02 Scheduler domains code by Nick Piggin
371 + * 2007-03-02 Staircase deadline scheduling policy by Con Kolivas
372 */
373
374 #include <linux/mm.h>
375 @@ -52,6 +53,7 @@
376 #include <linux/tsacct_kern.h>
377 #include <linux/kprobes.h>
378 #include <linux/delayacct.h>
379 +#include <linux/log2.h>
380 #include <asm/tlb.h>
381
382 #include <asm/unistd.h>
383 @@ -83,126 +85,72 @@ unsigned long long __attribute__((weak))
384 #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
385 #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
386 #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
387 +#define SCHED_PRIO(p) ((p)+MAX_RT_PRIO)
388
389 -/*
390 - * Some helpers for converting nanosecond timing to jiffy resolution
391 - */
392 -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
393 +/* Some helpers for converting to/from various scales.*/
394 #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
395 -
396 -/*
397 - * These are the 'tuning knobs' of the scheduler:
398 - *
399 - * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
400 - * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
401 - * Timeslices get refilled after they expire.
402 - */
403 -#define MIN_TIMESLICE max(5 * HZ / 1000, 1)
404 -#define DEF_TIMESLICE (100 * HZ / 1000)
405 -#define ON_RUNQUEUE_WEIGHT 30
406 -#define CHILD_PENALTY 95
407 -#define PARENT_PENALTY 100
408 -#define EXIT_WEIGHT 3
409 -#define PRIO_BONUS_RATIO 25
410 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
411 -#define INTERACTIVE_DELTA 2
412 -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
413 -#define STARVATION_LIMIT (MAX_SLEEP_AVG)
414 -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
415 -
416 -/*
417 - * If a task is 'interactive' then we reinsert it in the active
418 - * array after it has expired its current timeslice. (it will not
419 - * continue to run immediately, it will still roundrobin with
420 - * other interactive tasks.)
421 - *
422 - * This part scales the interactivity limit depending on niceness.
423 - *
424 - * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
425 - * Here are a few examples of different nice levels:
426 - *
427 - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
428 - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
429 - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
430 - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
431 - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
432 - *
433 - * (the X axis represents the possible -5 ... 0 ... +5 dynamic
434 - * priority range a task can explore, a value of '1' means the
435 - * task is rated interactive.)
436 - *
437 - * Ie. nice +19 tasks can never get 'interactive' enough to be
438 - * reinserted into the active array. And only heavily CPU-hog nice -20
439 - * tasks will be expired. Default nice 0 tasks are somewhere between,
440 - * it takes some effort for them to get interactive, but it's not
441 - * too hard.
442 - */
443 -
444 -#define CURRENT_BONUS(p) \
445 - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
446 - MAX_SLEEP_AVG)
447 -
448 -#define GRANULARITY (10 * HZ / 1000 ? : 1)
449 -
450 -#ifdef CONFIG_SMP
451 -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
452 - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
453 - num_online_cpus())
454 -#else
455 -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
456 - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
457 -#endif
458 -
459 -#define SCALE(v1,v1_max,v2_max) \
460 - (v1) * (v2_max) / (v1_max)
461 -
462 -#define DELTA(p) \
463 - (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \
464 - INTERACTIVE_DELTA)
465 -
466 -#define TASK_INTERACTIVE(p) \
467 - ((p)->prio <= (p)->static_prio - DELTA(p))
468 -
469 -#define INTERACTIVE_SLEEP(p) \
470 - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
471 - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
472 -
473 -#define TASK_PREEMPTS_CURR(p, rq) \
474 - ((p)->prio < (rq)->curr->prio)
475 -
476 -#define SCALE_PRIO(x, prio) \
477 - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
478 -
479 -static unsigned int static_prio_timeslice(int static_prio)
480 -{
481 - if (static_prio < NICE_TO_PRIO(0))
482 - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
483 - else
484 - return SCALE_PRIO(DEF_TIMESLICE, static_prio);
485 -}
486 -
487 -/*
488 - * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
489 - * to time slice values: [800ms ... 100ms ... 5ms]
490 - *
491 - * The higher a thread's priority, the bigger timeslices
492 - * it gets during one round of execution. But even the lowest
493 - * priority thread gets MIN_TIMESLICE worth of execution time.
494 +#define MS_TO_NS(TIME) ((TIME) * 1000000)
495 +#define MS_TO_US(TIME) ((TIME) * 1000)
496 +#define US_TO_MS(TIME) ((TIME) / 1000)
497 +
498 +#define TASK_PREEMPTS_CURR(p, curr) ((p)->prio < (curr)->prio)
499 +
500 +/*
501 + * This is the time all tasks within the same priority round robin.
502 + * Value is in ms and set to a minimum of 8ms. Scales with number of cpus.
503 + * Tunable via /proc interface.
504 + */
505 +int rr_interval __read_mostly = 8;
506 +
507 +/*
508 + * This contains a bitmap for each dynamic priority level with empty slots
509 + * for the valid priorities each different nice level can have. It allows
510 + * us to stagger the slots where differing priorities run in a way that
511 + * keeps latency differences between different nice levels at a minimum.
512 + * The purpose of a pre-generated matrix is for rapid lookup of next slot in
513 + * O(1) time without having to recalculate every time priority gets demoted.
514 + * All nice levels use priority slot 39 as this allows less niced tasks to
515 + * get all priority slots better than that before expiration is forced.
516 + * ie, where 0 means a slot for that priority, priority running from left to
517 + * right is from prio 0 to prio 39:
518 + * nice -20 0000000000000000000000000000000000000000
519 + * nice -10 1000100010001000100010001000100010010000
520 + * nice 0 1010101010101010101010101010101010101010
521 + * nice 5 1011010110110101101101011011010110110110
522 + * nice 10 1110111011101110111011101110111011101110
523 + * nice 15 1111111011111110111111101111111011111110
524 + * nice 19 1111111111111111111111111111111111111110
525 */
526 +static unsigned long prio_matrix[PRIO_RANGE][BITS_TO_LONGS(PRIO_RANGE)]
527 + __read_mostly;
528
529 -static inline unsigned int task_timeslice(struct task_struct *p)
530 -{
531 - return static_prio_timeslice(p->static_prio);
532 -}
533 +struct rq;
534
535 /*
536 * These are the runqueue data structures:
537 */
538 -
539 struct prio_array {
540 - unsigned int nr_active;
541 - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
542 + /* Tasks queued at each priority */
543 struct list_head queue[MAX_PRIO];
544 +
545 + /*
546 + * The bitmap of priorities queued for this array. While the expired
547 + * array will never have realtime tasks on it, it is simpler to have
548 + * equal sized bitmaps for a cheap array swap. Include 1 bit for
549 + * delimiter.
550 + */
551 + DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1);
552 +
553 + /*
554 + * The best static priority (of the dynamic priority tasks) queued
555 + * this array.
556 + */
557 + int best_static_prio;
558 +
559 +#ifdef CONFIG_SMP
560 + /* For convenience looks back at rq */
561 + struct rq *rq;
562 +#endif
563 };
564
565 /*
566 @@ -234,14 +182,24 @@ struct rq {
567 */
568 unsigned long nr_uninterruptible;
569
570 - unsigned long expired_timestamp;
571 /* Cached timestamp set by update_cpu_clock() */
572 unsigned long long most_recent_timestamp;
573 struct task_struct *curr, *idle;
574 unsigned long next_balance;
575 struct mm_struct *prev_mm;
576 +
577 struct prio_array *active, *expired, arrays[2];
578 - int best_expired_prio;
579 + unsigned long *dyn_bitmap, *exp_bitmap;
580 +
581 + /*
582 + * The current dynamic priority level this runqueue is at per static
583 + * priority level.
584 + */
585 + int prio_level[PRIO_RANGE];
586 +
587 + /* How many times we have rotated the priority queue */
588 + unsigned long prio_rotation;
589 +
590 atomic_t nr_iowait;
591
592 #ifdef CONFIG_SMP
593 @@ -579,12 +537,9 @@ static inline struct rq *this_rq_lock(vo
594 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
595 /*
596 * Called when a process is dequeued from the active array and given
597 - * the cpu. We should note that with the exception of interactive
598 - * tasks, the expired queue will become the active queue after the active
599 - * queue is empty, without explicitly dequeuing and requeuing tasks in the
600 - * expired queue. (Interactive tasks may be requeued directly to the
601 - * active queue, thus delaying tasks in the expired queue from running;
602 - * see scheduler_tick()).
603 + * the cpu. We should note that the expired queue will become the active
604 + * queue after the active queue is empty, without explicitly dequeuing and
605 + * requeuing tasks in the expired queue.
606 *
607 * This function is only called from sched_info_arrive(), rather than
608 * dequeue_task(). Even though a task may be queued and dequeued multiple
609 @@ -682,71 +637,227 @@ sched_info_switch(struct task_struct *pr
610 #define sched_info_switch(t, next) do { } while (0)
611 #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
612
613 +static inline int task_queued(struct task_struct *task)
614 +{
615 + return !list_empty(&task->run_list);
616 +}
617 +
618 +static inline void set_dynamic_bit(struct task_struct *p, struct rq *rq)
619 +{
620 + __set_bit(p->prio, p->array->prio_bitmap);
621 +}
622 +
623 /*
624 - * Adding/removing a task to/from a priority array:
625 + * Removing from a runqueue.
626 */
627 -static void dequeue_task(struct task_struct *p, struct prio_array *array)
628 +static void dequeue_task(struct task_struct *p, struct rq *rq)
629 {
630 - array->nr_active--;
631 - list_del(&p->run_list);
632 - if (list_empty(array->queue + p->prio))
633 - __clear_bit(p->prio, array->bitmap);
634 + list_del_init(&p->run_list);
635 + if (list_empty(p->array->queue + p->prio))
636 + __clear_bit(p->prio, p->array->prio_bitmap);
637 }
638
639 -static void enqueue_task(struct task_struct *p, struct prio_array *array)
640 +static void reset_first_time_slice(struct task_struct *p)
641 {
642 - sched_info_queued(p);
643 - list_add_tail(&p->run_list, array->queue + p->prio);
644 - __set_bit(p->prio, array->bitmap);
645 - array->nr_active++;
646 + if (unlikely(p->first_time_slice))
647 + p->first_time_slice = 0;
648 +}
649 +
650 +/*
651 + * The task is being queued on a fresh array so it has its entitlement
652 + * bitmap cleared.
653 + */
654 +static void task_new_array(struct task_struct *p, struct rq *rq,
655 + struct prio_array *array)
656 +{
657 + bitmap_zero(p->bitmap, PRIO_RANGE);
658 + p->rotation = rq->prio_rotation;
659 + p->time_slice = p->quota;
660 p->array = array;
661 + reset_first_time_slice(p);
662 +}
663 +
664 +/* Find the first slot from the relevant prio_matrix entry */
665 +static int first_prio_slot(struct task_struct *p)
666 +{
667 + if (unlikely(p->policy == SCHED_BATCH))
668 + return p->static_prio;
669 + return SCHED_PRIO(find_first_zero_bit(
670 + prio_matrix[USER_PRIO(p->static_prio)], PRIO_RANGE));
671 }
672
673 /*
674 - * Put task to the end of the run list without the overhead of dequeue
675 - * followed by enqueue.
676 + * Find the first unused slot by this task that is also in its prio_matrix
677 + * level. SCHED_BATCH tasks do not use the priority matrix. They only take
678 + * priority slots from their static_prio and above.
679 */
680 -static void requeue_task(struct task_struct *p, struct prio_array *array)
681 +static int next_entitled_slot(struct task_struct *p, struct rq *rq)
682 {
683 - list_move_tail(&p->run_list, array->queue + p->prio);
684 + int search_prio = MAX_RT_PRIO, uprio = USER_PRIO(p->static_prio);
685 + struct prio_array *array = rq->active;
686 + DECLARE_BITMAP(tmp, PRIO_RANGE);
687 +
688 + /*
689 + * Go straight to expiration if there are higher priority tasks
690 + * already expired.
691 + */
692 + if (p->static_prio > rq->expired->best_static_prio)
693 + return MAX_PRIO;
694 + if (!rq->prio_level[uprio])
695 + rq->prio_level[uprio] = MAX_RT_PRIO;
696 + /*
697 + * Only priorities equal to the prio_level and above for their
698 + * static_prio are acceptable, and only if it's not better than
699 + * a queued better static_prio's prio_level.
700 + */
701 + if (p->static_prio < array->best_static_prio) {
702 + if (likely(p->policy != SCHED_BATCH))
703 + array->best_static_prio = p->static_prio;
704 + } else if (p->static_prio == array->best_static_prio) {
705 + search_prio = rq->prio_level[uprio];
706 + } else {
707 + int i;
708 +
709 + search_prio = rq->prio_level[uprio];
710 + /* A bound O(n) function, worst case n is 40 */
711 + for (i = array->best_static_prio; i <= p->static_prio ; i++) {
712 + if (!rq->prio_level[USER_PRIO(i)])
713 + rq->prio_level[USER_PRIO(i)] = MAX_RT_PRIO;
714 + search_prio = max(search_prio,
715 + rq->prio_level[USER_PRIO(i)]);
716 + }
717 + }
718 + if (unlikely(p->policy == SCHED_BATCH)) {
719 + search_prio = max(search_prio, p->static_prio);
720 + return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE,
721 + USER_PRIO(search_prio)));
722 + }
723 + bitmap_or(tmp, p->bitmap, prio_matrix[uprio], PRIO_RANGE);
724 + return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
725 + USER_PRIO(search_prio)));
726 +}
727 +
728 +static void queue_expired(struct task_struct *p, struct rq *rq)
729 +{
730 + task_new_array(p, rq, rq->expired);
731 + p->prio = p->normal_prio = first_prio_slot(p);
732 + if (p->static_prio < rq->expired->best_static_prio)
733 + rq->expired->best_static_prio = p->static_prio;
734 + reset_first_time_slice(p);
735 }
736
737 -static inline void
738 -enqueue_task_head(struct task_struct *p, struct prio_array *array)
739 +#ifdef CONFIG_SMP
740 +/*
741 + * If we're waking up a task that was previously on a different runqueue,
742 + * update its data appropriately. Note we may be reading data from src_rq->
743 + * outside of lock, but the occasional inaccurate result should be harmless.
744 + */
745 + static void update_if_moved(struct task_struct *p, struct rq *rq)
746 +{
747 + struct rq *src_rq = p->array->rq;
748 +
749 + if (src_rq == rq)
750 + return;
751 + /*
752 + * Only need to set p->array when p->rotation == rq->prio_rotation as
753 + * they will be set in recalc_task_prio when != rq->prio_rotation.
754 + */
755 + if (p->rotation == src_rq->prio_rotation) {
756 + p->rotation = rq->prio_rotation;
757 + if (p->array == src_rq->expired)
758 + p->array = rq->expired;
759 + else
760 + p->array = rq->active;
761 + } else
762 + p->rotation = 0;
763 +}
764 +#else
765 +static inline void update_if_moved(struct task_struct *p, struct rq *rq)
766 {
767 - list_add(&p->run_list, array->queue + p->prio);
768 - __set_bit(p->prio, array->bitmap);
769 - array->nr_active++;
770 - p->array = array;
771 }
772 +#endif
773
774 /*
775 - * __normal_prio - return the priority that is based on the static
776 - * priority but is modified by bonuses/penalties.
777 - *
778 - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
779 - * into the -5 ... 0 ... +5 bonus/penalty range.
780 - *
781 - * We use 25% of the full 0...39 priority range so that:
782 - *
783 - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
784 - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
785 - *
786 - * Both properties are important to certain workloads.
787 + * recalc_task_prio determines what priority a non rt_task will be
788 + * queued at. If the task has already been running during this runqueue's
789 + * major rotation (rq->prio_rotation) then it continues at the same
790 + * priority if it has tick entitlement left. If it does not have entitlement
791 + * left, it finds the next priority slot according to its nice value that it
792 + * has not extracted quota from. If it has not run during this major
793 + * rotation, it starts at the next_entitled_slot and has its bitmap quota
794 + * cleared. If it does not have any slots left it has all its slots reset and
795 + * is queued on the expired at its first_prio_slot.
796 */
797 +static void recalc_task_prio(struct task_struct *p, struct rq *rq)
798 +{
799 + struct prio_array *array = rq->active;
800 + int queue_prio;
801
802 -static inline int __normal_prio(struct task_struct *p)
803 + update_if_moved(p, rq);
804 + if (p->rotation == rq->prio_rotation) {
805 + if (p->array == array) {
806 + if (p->time_slice > 0)
807 + return;
808 + p->time_slice = p->quota;
809 + } else if (p->array == rq->expired) {
810 + queue_expired(p, rq);
811 + return;
812 + } else
813 + task_new_array(p, rq, array);
814 + } else
815 + task_new_array(p, rq, array);
816 +
817 + queue_prio = next_entitled_slot(p, rq);
818 + if (queue_prio >= MAX_PRIO) {
819 + queue_expired(p, rq);
820 + return;
821 + }
822 + p->prio = p->normal_prio = queue_prio;
823 + __set_bit(USER_PRIO(p->prio), p->bitmap);
824 +}
825 +
826 +/*
827 + * Adding to a runqueue. The dynamic priority queue that it is added to is
828 + * determined by recalc_task_prio() above.
829 + */
830 +static inline void __enqueue_task(struct task_struct *p, struct rq *rq)
831 +{
832 + if (rt_task(p))
833 + p->array = rq->active;
834 + else
835 + recalc_task_prio(p, rq);
836 +
837 + sched_info_queued(p);
838 + set_dynamic_bit(p, rq);
839 +}
840 +
841 +static void enqueue_task(struct task_struct *p, struct rq *rq)
842 {
843 - int bonus, prio;
844 + __enqueue_task(p, rq);
845 + list_add_tail(&p->run_list, p->array->queue + p->prio);
846 +}
847
848 - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
849 +static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
850 +{
851 + __enqueue_task(p, rq);
852 + list_add(&p->run_list, p->array->queue + p->prio);
853 +}
854
855 - prio = p->static_prio - bonus;
856 - if (prio < MAX_RT_PRIO)
857 - prio = MAX_RT_PRIO;
858 - if (prio > MAX_PRIO-1)
859 - prio = MAX_PRIO-1;
860 - return prio;
861 +/*
862 + * requeue_task is only called when p->static_prio does not change. p->prio
863 + * can change with dynamic tasks.
864 + */
865 +static void requeue_task(struct task_struct *p, struct rq *rq,
866 + struct prio_array *old_array, int old_prio)
867 +{
868 + if (p->array == rq->expired)
869 + queue_expired(p, rq);
870 + list_move_tail(&p->run_list, p->array->queue + p->prio);
871 + if (!rt_task(p)) {
872 + if (list_empty(old_array->queue + old_prio))
873 + __clear_bit(old_prio, old_array->prio_bitmap);
874 + set_dynamic_bit(p, rq);
875 + }
876 }
877
878 /*
879 @@ -759,17 +870,24 @@ static inline int __normal_prio(struct t
880 */
881
882 /*
883 - * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE
884 - * If static_prio_timeslice() is ever changed to break this assumption then
885 - * this code will need modification
886 - */
887 -#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE
888 -#define LOAD_WEIGHT(lp) \
889 - (((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
890 -#define PRIO_TO_LOAD_WEIGHT(prio) \
891 - LOAD_WEIGHT(static_prio_timeslice(prio))
892 -#define RTPRIO_TO_LOAD_WEIGHT(rp) \
893 - (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))
894 + * task_timeslice - the total duration a task can run during one major
895 + * rotation. Returns value in milliseconds as the smallest value can be 1.
896 + */
897 +static int task_timeslice(struct task_struct *p)
898 +{
899 + int slice = p->quota; /* quota is in us */
900 +
901 + if (!rt_task(p))
902 + slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice;
903 + return US_TO_MS(slice);
904 +}
905 +
906 +/*
907 + * The load weight is basically the task_timeslice in ms. Realtime tasks are
908 + * special cased to be proportionately larger than nice -20 by their
909 + * rt_priority. The weight for rt tasks can only be arbitrary at best.
910 + */
911 +#define RTPRIO_TO_LOAD_WEIGHT(rp) (rr_interval * 20 * (40 + rp))
912
913 static void set_load_weight(struct task_struct *p)
914 {
915 @@ -786,7 +904,7 @@ static void set_load_weight(struct task_
916 #endif
917 p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
918 } else
919 - p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
920 + p->load_weight = task_timeslice(p);
921 }
922
923 static inline void
924 @@ -814,28 +932,38 @@ static inline void dec_nr_running(struct
925 }
926
927 /*
928 - * Calculate the expected normal priority: i.e. priority
929 - * without taking RT-inheritance into account. Might be
930 - * boosted by interactivity modifiers. Changes upon fork,
931 - * setprio syscalls, and whenever the interactivity
932 - * estimator recalculates.
933 + * __activate_task - move a task to the runqueue.
934 */
935 -static inline int normal_prio(struct task_struct *p)
936 +static inline void __activate_task(struct task_struct *p, struct rq *rq)
937 +{
938 + enqueue_task(p, rq);
939 + inc_nr_running(p, rq);
940 +}
941 +
942 +/*
943 + * __activate_idle_task - move idle task to the _front_ of runqueue.
944 + */
945 +static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
946 {
947 - int prio;
948 + enqueue_task_head(p, rq);
949 + inc_nr_running(p, rq);
950 +}
951
952 +static inline int normal_prio(struct task_struct *p)
953 +{
954 if (has_rt_policy(p))
955 - prio = MAX_RT_PRIO-1 - p->rt_priority;
956 + return MAX_RT_PRIO-1 - p->rt_priority;
957 + /* Other tasks all have normal_prio set in recalc_task_prio */
958 + if (likely(p->prio >= MAX_RT_PRIO && p->prio < MAX_PRIO))
959 + return p->prio;
960 else
961 - prio = __normal_prio(p);
962 - return prio;
963 + return p->static_prio;
964 }
965
966 /*
967 * Calculate the current priority, i.e. the priority
968 * taken into account by the scheduler. This value might
969 - * be boosted by RT tasks, or might be boosted by
970 - * interactivity modifiers. Will be RT if the task got
971 + * be boosted by RT tasks as it will be RT if the task got
972 * RT-boosted. If not then it returns p->normal_prio.
973 */
974 static int effective_prio(struct task_struct *p)
975 @@ -852,111 +980,41 @@ static int effective_prio(struct task_st
976 }
977
978 /*
979 - * __activate_task - move a task to the runqueue.
980 + * All tasks have quotas based on rr_interval. RT tasks all get rr_interval.
981 + * From nice 1 to 19 they are smaller than it only if they are at least one
982 + * tick still. Below nice 0 they get progressively larger.
983 + * ie nice -6..0 = rr_interval. nice -10 = 2.5 * rr_interval
984 + * nice -20 = 10 * rr_interval. nice 1-19 = rr_interval / 2.
985 + * Value returned is in microseconds.
986 */
987 -static void __activate_task(struct task_struct *p, struct rq *rq)
988 +static inline unsigned int rr_quota(struct task_struct *p)
989 {
990 - struct prio_array *target = rq->active;
991 -
992 - if (batch_task(p))
993 - target = rq->expired;
994 - enqueue_task(p, target);
995 - inc_nr_running(p, rq);
996 -}
997 + int nice = TASK_NICE(p), rr = rr_interval;
998
999 -/*
1000 - * __activate_idle_task - move idle task to the _front_ of runqueue.
1001 - */
1002 -static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
1003 -{
1004 - enqueue_task_head(p, rq->active);
1005 - inc_nr_running(p, rq);
1006 + if (!rt_task(p)) {
1007 + if (nice < -6) {
1008 + rr *= nice * nice;
1009 + rr /= 40;
1010 + } else if (nice > 0)
1011 + rr = rr / 2 ? : 1;
1012 + }
1013 + return MS_TO_US(rr);
1014 }
1015
1016 -/*
1017 - * Recalculate p->normal_prio and p->prio after having slept,
1018 - * updating the sleep-average too:
1019 - */
1020 -static int recalc_task_prio(struct task_struct *p, unsigned long long now)
1021 +/* Every time we set the quota we need to set the load weight */
1022 +static void set_quota(struct task_struct *p)
1023 {
1024 - /* Caller must always ensure 'now >= p->timestamp' */
1025 - unsigned long sleep_time = now - p->timestamp;
1026 -
1027 - if (batch_task(p))
1028 - sleep_time = 0;
1029 -
1030 - if (likely(sleep_time > 0)) {
1031 - /*
1032 - * This ceiling is set to the lowest priority that would allow
1033 - * a task to be reinserted into the active array on timeslice
1034 - * completion.
1035 - */
1036 - unsigned long ceiling = INTERACTIVE_SLEEP(p);
1037 -
1038 - if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
1039 - /*
1040 - * Prevents user tasks from achieving best priority
1041 - * with one single large enough sleep.
1042 - */
1043 - p->sleep_avg = ceiling;
1044 - /*
1045 - * Using INTERACTIVE_SLEEP() as a ceiling places a
1046 - * nice(0) task 1ms sleep away from promotion, and
1047 - * gives it 700ms to round-robin with no chance of
1048 - * being demoted. This is more than generous, so
1049 - * mark this sleep as non-interactive to prevent the
1050 - * on-runqueue bonus logic from intervening should
1051 - * this task not receive cpu immediately.
1052 - */
1053 - p->sleep_type = SLEEP_NONINTERACTIVE;
1054 - } else {
1055 - /*
1056 - * Tasks waking from uninterruptible sleep are
1057 - * limited in their sleep_avg rise as they
1058 - * are likely to be waiting on I/O
1059 - */
1060 - if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
1061 - if (p->sleep_avg >= ceiling)
1062 - sleep_time = 0;
1063 - else if (p->sleep_avg + sleep_time >=
1064 - ceiling) {
1065 - p->sleep_avg = ceiling;
1066 - sleep_time = 0;
1067 - }
1068 - }
1069 -
1070 - /*
1071 - * This code gives a bonus to interactive tasks.
1072 - *
1073 - * The boost works by updating the 'average sleep time'
1074 - * value here, based on ->timestamp. The more time a
1075 - * task spends sleeping, the higher the average gets -
1076 - * and the higher the priority boost gets as well.
1077 - */
1078 - p->sleep_avg += sleep_time;
1079 -
1080 - }
1081 - if (p->sleep_avg > NS_MAX_SLEEP_AVG)
1082 - p->sleep_avg = NS_MAX_SLEEP_AVG;
1083 - }
1084 -
1085 - return effective_prio(p);
1086 + p->quota = rr_quota(p);
1087 + set_load_weight(p);
1088 }
1089
1090 /*
1091 * activate_task - move a task to the runqueue and do priority recalculation
1092 - *
1093 - * Update all the scheduling statistics stuff. (sleep average
1094 - * calculation, priority modifiers, etc.)
1095 */
1096 static void activate_task(struct task_struct *p, struct rq *rq, int local)
1097 {
1098 - unsigned long long now;
1099 -
1100 - if (rt_task(p))
1101 - goto out;
1102 + unsigned long long now = sched_clock();
1103
1104 - now = sched_clock();
1105 #ifdef CONFIG_SMP
1106 if (!local) {
1107 /* Compensate for drifting sched_clock */
1108 @@ -977,32 +1035,9 @@ static void activate_task(struct task_st
1109 (now - p->timestamp) >> 20);
1110 }
1111
1112 - p->prio = recalc_task_prio(p, now);
1113 -
1114 - /*
1115 - * This checks to make sure it's not an uninterruptible task
1116 - * that is now waking up.
1117 - */
1118 - if (p->sleep_type == SLEEP_NORMAL) {
1119 - /*
1120 - * Tasks which were woken up by interrupts (ie. hw events)
1121 - * are most likely of interactive nature. So we give them
1122 - * the credit of extending their sleep time to the period
1123 - * of time they spend on the runqueue, waiting for execution
1124 - * on a CPU, first time around:
1125 - */
1126 - if (in_interrupt())
1127 - p->sleep_type = SLEEP_INTERRUPTED;
1128 - else {
1129 - /*
1130 - * Normal first-time wakeups get a credit too for
1131 - * on-runqueue time, but it will be weighted down:
1132 - */
1133 - p->sleep_type = SLEEP_INTERACTIVE;
1134 - }
1135 - }
1136 + set_quota(p);
1137 + p->prio = effective_prio(p);
1138 p->timestamp = now;
1139 -out:
1140 __activate_task(p, rq);
1141 }
1142
1143 @@ -1012,8 +1047,7 @@ out:
1144 static void deactivate_task(struct task_struct *p, struct rq *rq)
1145 {
1146 dec_nr_running(p, rq);
1147 - dequeue_task(p, p->array);
1148 - p->array = NULL;
1149 + dequeue_task(p, rq);
1150 }
1151
1152 /*
1153 @@ -1095,7 +1129,7 @@ migrate_task(struct task_struct *p, int
1154 * If the task is not on a runqueue (and not running), then
1155 * it is sufficient to simply update the task's cpu field.
1156 */
1157 - if (!p->array && !task_running(rq, p)) {
1158 + if (!task_queued(p) && !task_running(rq, p)) {
1159 set_task_cpu(p, dest_cpu);
1160 return 0;
1161 }
1162 @@ -1126,7 +1160,7 @@ void wait_task_inactive(struct task_stru
1163 repeat:
1164 rq = task_rq_lock(p, &flags);
1165 /* Must be off runqueue entirely, not preempted. */
1166 - if (unlikely(p->array || task_running(rq, p))) {
1167 + if (unlikely(task_queued(p) || task_running(rq, p))) {
1168 /* If it's preempted, we yield. It could be a while. */
1169 preempted = !task_running(rq, p);
1170 task_rq_unlock(rq, &flags);
1171 @@ -1391,6 +1425,31 @@ static inline int wake_idle(int cpu, str
1172 }
1173 #endif
1174
1175 +/*
1176 + * We need to have a special definition for an idle runqueue when testing
1177 + * for preemption on CONFIG_HOTPLUG_CPU as the idle task may be scheduled as
1178 + * a realtime task in sched_idle_next.
1179 + */
1180 +#ifdef CONFIG_HOTPLUG_CPU
1181 +#define rq_idle(rq) ((rq)->curr == (rq)->idle && !rt_task((rq)->curr))
1182 +#else
1183 +#define rq_idle(rq) ((rq)->curr == (rq)->idle)
1184 +#endif
1185 +
1186 +static inline int task_preempts_curr(struct task_struct *p, struct rq *rq)
1187 +{
1188 + struct task_struct *curr = rq->curr;
1189 +
1190 + return ((p->array == task_rq(p)->active &&
1191 + TASK_PREEMPTS_CURR(p, curr)) || rq_idle(rq));
1192 +}
1193 +
1194 +static inline void try_preempt(struct task_struct *p, struct rq *rq)
1195 +{
1196 + if (task_preempts_curr(p, rq))
1197 + resched_task(rq->curr);
1198 +}
1199 +
1200 /***
1201 * try_to_wake_up - wake up a thread
1202 * @p: the to-be-woken-up thread
1203 @@ -1422,7 +1481,7 @@ static int try_to_wake_up(struct task_st
1204 if (!(old_state & state))
1205 goto out;
1206
1207 - if (p->array)
1208 + if (task_queued(p))
1209 goto out_running;
1210
1211 cpu = task_cpu(p);
1212 @@ -1515,7 +1574,7 @@ out_set_cpu:
1213 old_state = p->state;
1214 if (!(old_state & state))
1215 goto out;
1216 - if (p->array)
1217 + if (task_queued(p))
1218 goto out_running;
1219
1220 this_cpu = smp_processor_id();
1221 @@ -1524,25 +1583,9 @@ out_set_cpu:
1222
1223 out_activate:
1224 #endif /* CONFIG_SMP */
1225 - if (old_state == TASK_UNINTERRUPTIBLE) {
1226 + if (old_state == TASK_UNINTERRUPTIBLE)
1227 rq->nr_uninterruptible--;
1228 - /*
1229 - * Tasks on involuntary sleep don't earn
1230 - * sleep_avg beyond just interactive state.
1231 - */
1232 - p->sleep_type = SLEEP_NONINTERACTIVE;
1233 - } else
1234 -
1235 - /*
1236 - * Tasks that have marked their sleep as noninteractive get
1237 - * woken up with their sleep average not weighted in an
1238 - * interactive way.
1239 - */
1240 - if (old_state & TASK_NONINTERACTIVE)
1241 - p->sleep_type = SLEEP_NONINTERACTIVE;
1242 -
1243
1244 - activate_task(p, rq, cpu == this_cpu);
1245 /*
1246 * Sync wakeups (i.e. those types of wakeups where the waker
1247 * has indicated that it will leave the CPU in short order)
1248 @@ -1551,10 +1594,9 @@ out_activate:
1249 * the waker guarantees that the freshly woken up task is going
1250 * to be considered on this CPU.)
1251 */
1252 - if (!sync || cpu != this_cpu) {
1253 - if (TASK_PREEMPTS_CURR(p, rq))
1254 - resched_task(rq->curr);
1255 - }
1256 + activate_task(p, rq, cpu == this_cpu);
1257 + if (!sync || cpu != this_cpu)
1258 + try_preempt(p, rq);
1259 success = 1;
1260
1261 out_running:
1262 @@ -1577,7 +1619,6 @@ int fastcall wake_up_state(struct task_s
1263 return try_to_wake_up(p, state, 0);
1264 }
1265
1266 -static void task_running_tick(struct rq *rq, struct task_struct *p);
1267 /*
1268 * Perform scheduler related setup for a newly forked process p.
1269 * p is forked by current.
1270 @@ -1605,7 +1646,6 @@ void fastcall sched_fork(struct task_str
1271 p->prio = current->normal_prio;
1272
1273 INIT_LIST_HEAD(&p->run_list);
1274 - p->array = NULL;
1275 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
1276 if (unlikely(sched_info_on()))
1277 memset(&p->sched_info, 0, sizeof(p->sched_info));
1278 @@ -1617,30 +1657,31 @@ void fastcall sched_fork(struct task_str
1279 /* Want to start with kernel preemption disabled. */
1280 task_thread_info(p)->preempt_count = 1;
1281 #endif
1282 + if (unlikely(p->policy == SCHED_FIFO))
1283 + goto out;
1284 /*
1285 * Share the timeslice between parent and child, thus the
1286 * total amount of pending timeslices in the system doesn't change,
1287 * resulting in more scheduling fairness.
1288 */
1289 local_irq_disable();
1290 - p->time_slice = (current->time_slice + 1) >> 1;
1291 - /*
1292 - * The remainder of the first timeslice might be recovered by
1293 - * the parent if the child exits early enough.
1294 - */
1295 - p->first_time_slice = 1;
1296 - current->time_slice >>= 1;
1297 - p->timestamp = sched_clock();
1298 - if (unlikely(!current->time_slice)) {
1299 + if (current->time_slice > 0) {
1300 + current->time_slice /= 2;
1301 + if (current->time_slice)
1302 + p->time_slice = current->time_slice;
1303 + else
1304 + p->time_slice = 1;
1305 /*
1306 - * This case is rare, it happens when the parent has only
1307 - * a single jiffy left from its timeslice. Taking the
1308 - * runqueue lock is not a problem.
1309 + * The remainder of the first timeslice might be recovered by
1310 + * the parent if the child exits early enough.
1311 */
1312 - current->time_slice = 1;
1313 - task_running_tick(cpu_rq(cpu), current);
1314 - }
1315 + p->first_time_slice = 1;
1316 + } else
1317 + p->time_slice = 0;
1318 +
1319 + p->timestamp = sched_clock();
1320 local_irq_enable();
1321 +out:
1322 put_cpu();
1323 }
1324
1325 @@ -1662,38 +1703,16 @@ void fastcall wake_up_new_task(struct ta
1326 this_cpu = smp_processor_id();
1327 cpu = task_cpu(p);
1328
1329 - /*
1330 - * We decrease the sleep average of forking parents
1331 - * and children as well, to keep max-interactive tasks
1332 - * from forking tasks that are max-interactive. The parent
1333 - * (current) is done further down, under its lock.
1334 - */
1335 - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
1336 - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
1337 -
1338 - p->prio = effective_prio(p);
1339 -
1340 if (likely(cpu == this_cpu)) {
1341 + activate_task(p, rq, 1);
1342 if (!(clone_flags & CLONE_VM)) {
1343 /*
1344 * The VM isn't cloned, so we're in a good position to
1345 * do child-runs-first in anticipation of an exec. This
1346 * usually avoids a lot of COW overhead.
1347 */
1348 - if (unlikely(!current->array))
1349 - __activate_task(p, rq);
1350 - else {
1351 - p->prio = current->prio;
1352 - p->normal_prio = current->normal_prio;
1353 - list_add_tail(&p->run_list, &current->run_list);
1354 - p->array = current->array;
1355 - p->array->nr_active++;
1356 - inc_nr_running(p, rq);
1357 - }
1358 set_need_resched();
1359 - } else
1360 - /* Run child last */
1361 - __activate_task(p, rq);
1362 + }
1363 /*
1364 * We skip the following code due to cpu == this_cpu
1365 *
1366 @@ -1710,19 +1729,16 @@ void fastcall wake_up_new_task(struct ta
1367 */
1368 p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
1369 + rq->most_recent_timestamp;
1370 - __activate_task(p, rq);
1371 - if (TASK_PREEMPTS_CURR(p, rq))
1372 - resched_task(rq->curr);
1373 + activate_task(p, rq, 0);
1374 + try_preempt(p, rq);
1375
1376 /*
1377 * Parent and child are on different CPUs, now get the
1378 - * parent runqueue to update the parent's ->sleep_avg:
1379 + * parent runqueue to update the parent's ->flags:
1380 */
1381 task_rq_unlock(rq, &flags);
1382 this_rq = task_rq_lock(current, &flags);
1383 }
1384 - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
1385 - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
1386 task_rq_unlock(this_rq, &flags);
1387 }
1388
1389 @@ -1737,23 +1753,17 @@ void fastcall wake_up_new_task(struct ta
1390 */
1391 void fastcall sched_exit(struct task_struct *p)
1392 {
1393 + struct task_struct *parent;
1394 unsigned long flags;
1395 struct rq *rq;
1396
1397 - /*
1398 - * If the child was a (relative-) CPU hog then decrease
1399 - * the sleep_avg of the parent as well.
1400 - */
1401 - rq = task_rq_lock(p->parent, &flags);
1402 - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
1403 - p->parent->time_slice += p->time_slice;
1404 - if (unlikely(p->parent->time_slice > task_timeslice(p)))
1405 - p->parent->time_slice = task_timeslice(p);
1406 - }
1407 - if (p->sleep_avg < p->parent->sleep_avg)
1408 - p->parent->sleep_avg = p->parent->sleep_avg /
1409 - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
1410 - (EXIT_WEIGHT + 1);
1411 + parent = p->parent;
1412 + rq = task_rq_lock(parent, &flags);
1413 + if (p->first_time_slice > 0 && task_cpu(p) == task_cpu(parent)) {
1414 + parent->time_slice += p->time_slice;
1415 + if (unlikely(parent->time_slice > parent->quota))
1416 + parent->time_slice = parent->quota;
1417 + }
1418 task_rq_unlock(rq, &flags);
1419 }
1420
1421 @@ -2085,23 +2095,17 @@ void sched_exec(void)
1422 * pull_task - move a task from a remote runqueue to the local runqueue.
1423 * Both runqueues must be locked.
1424 */
1425 -static void pull_task(struct rq *src_rq, struct prio_array *src_array,
1426 - struct task_struct *p, struct rq *this_rq,
1427 - struct prio_array *this_array, int this_cpu)
1428 +static void pull_task(struct rq *src_rq, struct task_struct *p,
1429 + struct rq *this_rq, int this_cpu)
1430 {
1431 - dequeue_task(p, src_array);
1432 + dequeue_task(p, src_rq);
1433 dec_nr_running(p, src_rq);
1434 set_task_cpu(p, this_cpu);
1435 inc_nr_running(p, this_rq);
1436 - enqueue_task(p, this_array);
1437 + enqueue_task(p, this_rq);
1438 p->timestamp = (p->timestamp - src_rq->most_recent_timestamp)
1439 + this_rq->most_recent_timestamp;
1440 - /*
1441 - * Note that idle threads have a prio of MAX_PRIO, for this test
1442 - * to be always true for them.
1443 - */
1444 - if (TASK_PREEMPTS_CURR(p, this_rq))
1445 - resched_task(this_rq->curr);
1446 + try_preempt(p, this_rq);
1447 }
1448
1449 /*
1450 @@ -2144,7 +2148,16 @@ int can_migrate_task(struct task_struct
1451 return 1;
1452 }
1453
1454 -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
1455 +static inline int rq_best_prio(struct rq *rq)
1456 +{
1457 + int best_prio, exp_prio;
1458 +
1459 + best_prio = sched_find_first_bit(rq->dyn_bitmap);
1460 + exp_prio = find_next_bit(rq->exp_bitmap, MAX_PRIO, MAX_RT_PRIO);
1461 + if (unlikely(best_prio > exp_prio))
1462 + best_prio = exp_prio;
1463 + return best_prio;
1464 +}
1465
1466 /*
1467 * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
1468 @@ -2160,7 +2173,7 @@ static int move_tasks(struct rq *this_rq
1469 {
1470 int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
1471 best_prio_seen, skip_for_load;
1472 - struct prio_array *array, *dst_array;
1473 + struct prio_array *array;
1474 struct list_head *head, *curr;
1475 struct task_struct *tmp;
1476 long rem_load_move;
1477 @@ -2187,26 +2200,21 @@ static int move_tasks(struct rq *this_rq
1478 * be cache-cold, thus switching CPUs has the least effect
1479 * on them.
1480 */
1481 - if (busiest->expired->nr_active) {
1482 - array = busiest->expired;
1483 - dst_array = this_rq->expired;
1484 - } else {
1485 - array = busiest->active;
1486 - dst_array = this_rq->active;
1487 - }
1488 -
1489 + array = busiest->expired;
1490 new_array:
1491 - /* Start searching at priority 0: */
1492 - idx = 0;
1493 + /* Expired arrays don't have RT tasks so they're always MAX_RT_PRIO+ */
1494 + if (array == busiest->expired)
1495 + idx = MAX_RT_PRIO;
1496 + else
1497 + idx = 0;
1498 skip_bitmap:
1499 if (!idx)
1500 - idx = sched_find_first_bit(array->bitmap);
1501 + idx = sched_find_first_bit(array->prio_bitmap);
1502 else
1503 - idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
1504 + idx = find_next_bit(array->prio_bitmap, MAX_PRIO, idx);
1505 if (idx >= MAX_PRIO) {
1506 - if (array == busiest->expired && busiest->active->nr_active) {
1507 + if (array == busiest->expired) {
1508 array = busiest->active;
1509 - dst_array = this_rq->active;
1510 goto new_array;
1511 }
1512 goto out;
1513 @@ -2237,7 +2245,7 @@ skip_queue:
1514 goto skip_bitmap;
1515 }
1516
1517 - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
1518 + pull_task(busiest, tmp, this_rq, this_cpu);
1519 pulled++;
1520 rem_load_move -= tmp->load_weight;
1521
1522 @@ -3013,11 +3021,36 @@ EXPORT_PER_CPU_SYMBOL(kstat);
1523 /*
1524 * This is called on clock ticks and on context switches.
1525 * Bank in p->sched_time the ns elapsed since the last tick or switch.
1526 + * CPU scheduler quota accounting is also performed here in microseconds.
1527 + * The value returned from sched_clock() occasionally gives bogus values so
1528 + * some sanity checking is required.
1529 */
1530 -static inline void
1531 -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
1532 +static void
1533 +update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now,
1534 + int tick)
1535 {
1536 - p->sched_time += now - p->last_ran;
1537 + long time_diff = now - p->last_ran;
1538 +
1539 + if (tick) {
1540 + /*
1541 + * Called from scheduler_tick() there should be less than two
1542 + * jiffies worth, and not negative/overflow.
1543 + */
1544 + if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0)
1545 + time_diff = JIFFIES_TO_NS(1);
1546 + } else {
1547 + /*
1548 + * Called from context_switch there should be less than one
1549 + * jiffy worth, and not negative/overflow. There should be
1550 + * some time banked here so use a nominal 1us.
1551 + */
1552 + if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1)
1553 + time_diff = 1000;
1554 + }
1555 + /* time_slice accounting is done in usecs to avoid overflow on 32bit */
1556 + if (p != rq->idle && p->policy != SCHED_FIFO)
1557 + p->time_slice -= time_diff / 1000;
1558 + p->sched_time += time_diff;
1559 p->last_ran = rq->most_recent_timestamp = now;
1560 }
1561
1562 @@ -3038,27 +3071,6 @@ unsigned long long current_sched_time(co
1563 }
1564
1565 /*
1566 - * We place interactive tasks back into the active array, if possible.
1567 - *
1568 - * To guarantee that this does not starve expired tasks we ignore the
1569 - * interactivity of a task if the first expired task had to wait more
1570 - * than a 'reasonable' amount of time. This deadline timeout is
1571 - * load-dependent, as the frequency of array switched decreases with
1572 - * increasing number of running tasks. We also ignore the interactivity
1573 - * if a better static_prio task has expired:
1574 - */
1575 -static inline int expired_starving(struct rq *rq)
1576 -{
1577 - if (rq->curr->static_prio > rq->best_expired_prio)
1578 - return 1;
1579 - if (!STARVATION_LIMIT || !rq->expired_timestamp)
1580 - return 0;
1581 - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
1582 - return 1;
1583 - return 0;
1584 -}
1585 -
1586 -/*
1587 * Account user cpu time to a process.
1588 * @p: the process that the cpu time gets accounted to
1589 * @hardirq_offset: the offset to subtract from hardirq_count()
1590 @@ -3131,87 +3143,47 @@ void account_steal_time(struct task_stru
1591 cpustat->steal = cputime64_add(cpustat->steal, tmp);
1592 }
1593
1594 -static void task_running_tick(struct rq *rq, struct task_struct *p)
1595 +/*
1596 + * The task has used up its quota of running in this prio_level so it must be
1597 + * dropped a priority level, all managed by recalc_task_prio().
1598 + */
1599 +static void task_expired_entitlement(struct rq *rq, struct task_struct *p)
1600 {
1601 - if (p->array != rq->active) {
1602 - /* Task has expired but was not scheduled yet */
1603 - set_tsk_need_resched(p);
1604 + int overrun;
1605 +
1606 + reset_first_time_slice(p);
1607 + if (rt_task(p)) {
1608 + p->time_slice += p->quota;
1609 + list_move_tail(&p->run_list, p->array->queue + p->prio);
1610 return;
1611 }
1612 - spin_lock(&rq->lock);
1613 + overrun = p->time_slice;
1614 + dequeue_task(p, rq);
1615 + enqueue_task(p, rq);
1616 /*
1617 - * The task was running during this tick - update the
1618 - * time slice counter. Note: we do not update a thread's
1619 - * priority until it either goes to sleep or uses up its
1620 - * timeslice. This makes it possible for interactive tasks
1621 - * to use up their timeslices at their highest priority levels.
1622 + * Subtract any extra time this task ran over its time_slice; ie
1623 + * overrun will either be 0 or negative.
1624 */
1625 - if (rt_task(p)) {
1626 - /*
1627 - * RR tasks need a special form of timeslice management.
1628 - * FIFO tasks have no timeslices.
1629 - */
1630 - if ((p->policy == SCHED_RR) && !--p->time_slice) {
1631 - p->time_slice = task_timeslice(p);
1632 - p->first_time_slice = 0;
1633 - set_tsk_need_resched(p);
1634 -
1635 - /* put it at the end of the queue: */
1636 - requeue_task(p, rq->active);
1637 - }
1638 - goto out_unlock;
1639 - }
1640 - if (!--p->time_slice) {
1641 - dequeue_task(p, rq->active);
1642 - set_tsk_need_resched(p);
1643 - p->prio = effective_prio(p);
1644 - p->time_slice = task_timeslice(p);
1645 - p->first_time_slice = 0;
1646 -
1647 - if (!rq->expired_timestamp)
1648 - rq->expired_timestamp = jiffies;
1649 - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
1650 - enqueue_task(p, rq->expired);
1651 - if (p->static_prio < rq->best_expired_prio)
1652 - rq->best_expired_prio = p->static_prio;
1653 - } else
1654 - enqueue_task(p, rq->active);
1655 - } else {
1656 - /*
1657 - * Prevent a too long timeslice allowing a task to monopolize
1658 - * the CPU. We do this by splitting up the timeslice into
1659 - * smaller pieces.
1660 - *
1661 - * Note: this does not mean the task's timeslices expire or
1662 - * get lost in any way, they just might be preempted by
1663 - * another task of equal priority. (one with higher
1664 - * priority would have preempted this task already.) We
1665 - * requeue this task to the end of the list on this priority
1666 - * level, which is in essence a round-robin of tasks with
1667 - * equal priority.
1668 - *
1669 - * This only applies to tasks in the interactive
1670 - * delta range with at least TIMESLICE_GRANULARITY to requeue.
1671 - */
1672 - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
1673 - p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
1674 - (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
1675 - (p->array == rq->active)) {
1676 + p->time_slice += overrun;
1677 +}
1678
1679 - requeue_task(p, rq->active);
1680 - set_tsk_need_resched(p);
1681 - }
1682 - }
1683 -out_unlock:
1684 +/* This manages tasks that have run out of timeslice during a scheduler_tick */
1685 +static void task_running_tick(struct rq *rq, struct task_struct *p)
1686 +{
1687 + /* SCHED_FIFO tasks never run out of timeslice. */
1688 + if (p->time_slice > 0 || p->policy == SCHED_FIFO)
1689 + return;
1690 + /* p->time_slice <= 0 */
1691 + spin_lock(&rq->lock);
1692 + if (likely(task_queued(p)))
1693 + task_expired_entitlement(rq, p);
1694 + set_tsk_need_resched(p);
1695 spin_unlock(&rq->lock);
1696 }
1697
1698 /*
1699 * This function gets called by the timer code, with HZ frequency.
1700 * We call it with interrupts disabled.
1701 - *
1702 - * It also gets called by the fork code, when changing the parent's
1703 - * timeslices.
1704 */
1705 void scheduler_tick(void)
1706 {
1707 @@ -3220,7 +3192,7 @@ void scheduler_tick(void)
1708 int cpu = smp_processor_id();
1709 struct rq *rq = cpu_rq(cpu);
1710
1711 - update_cpu_clock(p, rq, now);
1712 + update_cpu_clock(p, rq, now, 1);
1713
1714 if (p != rq->idle)
1715 task_running_tick(rq, p);
1716 @@ -3269,10 +3241,55 @@ EXPORT_SYMBOL(sub_preempt_count);
1717
1718 #endif
1719
1720 -static inline int interactive_sleep(enum sleep_type sleep_type)
1721 +static void reset_prio_levels(struct rq *rq)
1722 {
1723 - return (sleep_type == SLEEP_INTERACTIVE ||
1724 - sleep_type == SLEEP_INTERRUPTED);
1725 + rq->active->best_static_prio = MAX_PRIO - 1;
1726 + rq->expired->best_static_prio = MAX_PRIO - 1;
1727 + memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE);
1728 +}
1729 +
1730 +/*
1731 + * next_dynamic_task finds the next suitable dynamic task.
1732 + */
1733 +static inline struct task_struct *next_dynamic_task(struct rq *rq, int idx)
1734 +{
1735 + struct prio_array *array = rq->active;
1736 + struct task_struct *next;
1737 + struct list_head *queue;
1738 + int nstatic;
1739 +
1740 +retry:
1741 + if (idx >= MAX_PRIO) {
1742 + /* There are no more tasks in the active array. Swap arrays */
1743 + array = rq->expired;
1744 + rq->expired = rq->active;
1745 + rq->active = array;
1746 + rq->exp_bitmap = rq->expired->prio_bitmap;
1747 + rq->dyn_bitmap = rq->active->prio_bitmap;
1748 + rq->prio_rotation++;
1749 + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO);
1750 + reset_prio_levels(rq);
1751 + }
1752 + queue = array->queue + idx;
1753 + next = list_entry(queue->next, struct task_struct, run_list);
1754 + if (unlikely(next->time_slice <= 0)) {
1755 + /*
1756 + * Unlucky enough that this task ran out of time_slice
1757 + * before it hit a scheduler_tick so it should have its
1758 + * priority reassessed and choose another task (possibly
1759 + * the same one)
1760 + */
1761 + task_expired_entitlement(rq, next);
1762 + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO);
1763 + goto retry;
1764 + }
1765 + next->rotation = rq->prio_rotation;
1766 + nstatic = next->static_prio;
1767 + if (nstatic < array->best_static_prio)
1768 + array->best_static_prio = nstatic;
1769 + if (idx > rq->prio_level[USER_PRIO(nstatic)])
1770 + rq->prio_level[USER_PRIO(nstatic)] = idx;
1771 + return next;
1772 }
1773
1774 /*
1775 @@ -3281,13 +3298,11 @@ static inline int interactive_sleep(enum
1776 asmlinkage void __sched schedule(void)
1777 {
1778 struct task_struct *prev, *next;
1779 - struct prio_array *array;
1780 struct list_head *queue;
1781 unsigned long long now;
1782 - unsigned long run_time;
1783 - int cpu, idx, new_prio;
1784 long *switch_count;
1785 struct rq *rq;
1786 + int cpu, idx;
1787
1788 /*
1789 * Test if we are atomic. Since do_exit() needs to call into
1790 @@ -3323,18 +3338,6 @@ need_resched_nonpreemptible:
1791
1792 schedstat_inc(rq, sched_cnt);
1793 now = sched_clock();
1794 - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
1795 - run_time = now - prev->timestamp;
1796 - if (unlikely((long long)(now - prev->timestamp) < 0))
1797 - run_time = 0;
1798 - } else
1799 - run_time = NS_MAX_SLEEP_AVG;
1800 -
1801 - /*
1802 - * Tasks charged proportionately less run_time at high sleep_avg to
1803 - * delay them losing their interactive status
1804 - */
1805 - run_time /= (CURRENT_BONUS(prev) ? : 1);
1806
1807 spin_lock_irq(&rq->lock);
1808
1809 @@ -3356,59 +3359,29 @@ need_resched_nonpreemptible:
1810 idle_balance(cpu, rq);
1811 if (!rq->nr_running) {
1812 next = rq->idle;
1813 - rq->expired_timestamp = 0;
1814 goto switch_tasks;
1815 }
1816 }
1817
1818 - array = rq->active;
1819 - if (unlikely(!array->nr_active)) {
1820 - /*
1821 - * Switch the active and expired arrays.
1822 - */
1823 - schedstat_inc(rq, sched_switch);
1824 - rq->active = rq->expired;
1825 - rq->expired = array;
1826 - array = rq->active;
1827 - rq->expired_timestamp = 0;
1828 - rq->best_expired_prio = MAX_PRIO;
1829 - }
1830 -
1831 - idx = sched_find_first_bit(array->bitmap);
1832 - queue = array->queue + idx;
1833 - next = list_entry(queue->next, struct task_struct, run_list);
1834 -
1835 - if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
1836 - unsigned long long delta = now - next->timestamp;
1837 - if (unlikely((long long)(now - next->timestamp) < 0))
1838 - delta = 0;
1839 -
1840 - if (next->sleep_type == SLEEP_INTERACTIVE)
1841 - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
1842 -
1843 - array = next->array;
1844 - new_prio = recalc_task_prio(next, next->timestamp + delta);
1845 -
1846 - if (unlikely(next->prio != new_prio)) {
1847 - dequeue_task(next, array);
1848 - next->prio = new_prio;
1849 - enqueue_task(next, array);
1850 - }
1851 + idx = sched_find_first_bit(rq->dyn_bitmap);
1852 + if (!rt_prio(idx))
1853 + next = next_dynamic_task(rq, idx);
1854 + else {
1855 + queue = rq->active->queue + idx;
1856 + next = list_entry(queue->next, struct task_struct, run_list);
1857 }
1858 - next->sleep_type = SLEEP_NORMAL;
1859 switch_tasks:
1860 - if (next == rq->idle)
1861 + if (next == rq->idle) {
1862 + reset_prio_levels(rq);
1863 + rq->prio_rotation++;
1864 schedstat_inc(rq, sched_goidle);
1865 + }
1866 prefetch(next);
1867 prefetch_stack(next);
1868 clear_tsk_need_resched(prev);
1869 rcu_qsctr_inc(task_cpu(prev));
1870
1871 - update_cpu_clock(prev, rq, now);
1872 -
1873 - prev->sleep_avg -= run_time;
1874 - if ((long)prev->sleep_avg <= 0)
1875 - prev->sleep_avg = 0;
1876 + update_cpu_clock(prev, rq, now, 0);
1877 prev->timestamp = prev->last_ran = now;
1878
1879 sched_info_switch(prev, next);
1880 @@ -3844,29 +3817,22 @@ EXPORT_SYMBOL(sleep_on_timeout);
1881 */
1882 void rt_mutex_setprio(struct task_struct *p, int prio)
1883 {
1884 - struct prio_array *array;
1885 unsigned long flags;
1886 + int queued, oldprio;
1887 struct rq *rq;
1888 - int oldprio;
1889
1890 BUG_ON(prio < 0 || prio > MAX_PRIO);
1891
1892 rq = task_rq_lock(p, &flags);
1893
1894 oldprio = p->prio;
1895 - array = p->array;
1896 - if (array)
1897 - dequeue_task(p, array);
1898 + queued = task_queued(p);
1899 + if (queued)
1900 + dequeue_task(p, rq);
1901 p->prio = prio;
1902
1903 - if (array) {
1904 - /*
1905 - * If changing to an RT priority then queue it
1906 - * in the active array!
1907 - */
1908 - if (rt_task(p))
1909 - array = rq->active;
1910 - enqueue_task(p, array);
1911 + if (queued) {
1912 + enqueue_task(p, rq);
1913 /*
1914 * Reschedule if we are currently running on this runqueue and
1915 * our priority decreased, or if we are not currently running on
1916 @@ -3875,8 +3841,8 @@ void rt_mutex_setprio(struct task_struct
1917 if (task_running(rq, p)) {
1918 if (p->prio > oldprio)
1919 resched_task(rq->curr);
1920 - } else if (TASK_PREEMPTS_CURR(p, rq))
1921 - resched_task(rq->curr);
1922 + } else
1923 + try_preempt(p, rq);
1924 }
1925 task_rq_unlock(rq, &flags);
1926 }
1927 @@ -3885,8 +3851,7 @@ void rt_mutex_setprio(struct task_struct
1928
1929 void set_user_nice(struct task_struct *p, long nice)
1930 {
1931 - struct prio_array *array;
1932 - int old_prio, delta;
1933 + int queued, old_prio, delta;
1934 unsigned long flags;
1935 struct rq *rq;
1936
1937 @@ -3907,20 +3872,20 @@ void set_user_nice(struct task_struct *p
1938 p->static_prio = NICE_TO_PRIO(nice);
1939 goto out_unlock;
1940 }
1941 - array = p->array;
1942 - if (array) {
1943 - dequeue_task(p, array);
1944 + queued = task_queued(p);
1945 + if (queued) {
1946 + dequeue_task(p, rq);
1947 dec_raw_weighted_load(rq, p);
1948 }
1949
1950 p->static_prio = NICE_TO_PRIO(nice);
1951 - set_load_weight(p);
1952 old_prio = p->prio;
1953 p->prio = effective_prio(p);
1954 + set_quota(p);
1955 delta = p->prio - old_prio;
1956
1957 - if (array) {
1958 - enqueue_task(p, array);
1959 + if (queued) {
1960 + enqueue_task(p, rq);
1961 inc_raw_weighted_load(rq, p);
1962 /*
1963 * If the task increased its priority or is running and
1964 @@ -3996,7 +3961,7 @@ asmlinkage long sys_nice(int increment)
1965 *
1966 * This is the priority value as seen by users in /proc.
1967 * RT tasks are offset by -200. Normal tasks are centered
1968 - * around 0, value goes from -16 to +15.
1969 + * around 0, value goes from 0 to +39.
1970 */
1971 int task_prio(const struct task_struct *p)
1972 {
1973 @@ -4043,19 +4008,14 @@ static inline struct task_struct *find_p
1974 /* Actually do priority change: must hold rq lock. */
1975 static void __setscheduler(struct task_struct *p, int policy, int prio)
1976 {
1977 - BUG_ON(p->array);
1978 + BUG_ON(task_queued(p));
1979
1980 p->policy = policy;
1981 p->rt_priority = prio;
1982 p->normal_prio = normal_prio(p);
1983 /* we are holding p->pi_lock already */
1984 p->prio = rt_mutex_getprio(p);
1985 - /*
1986 - * SCHED_BATCH tasks are treated as perpetual CPU hogs:
1987 - */
1988 - if (policy == SCHED_BATCH)
1989 - p->sleep_avg = 0;
1990 - set_load_weight(p);
1991 + set_quota(p);
1992 }
1993
1994 /**
1995 @@ -4069,8 +4029,7 @@ static void __setscheduler(struct task_s
1996 int sched_setscheduler(struct task_struct *p, int policy,
1997 struct sched_param *param)
1998 {
1999 - int retval, oldprio, oldpolicy = -1;
2000 - struct prio_array *array;
2001 + int queued, retval, oldprio, oldpolicy = -1;
2002 unsigned long flags;
2003 struct rq *rq;
2004
2005 @@ -4144,12 +4103,12 @@ recheck:
2006 spin_unlock_irqrestore(&p->pi_lock, flags);
2007 goto recheck;
2008 }
2009 - array = p->array;
2010 - if (array)
2011 + queued = task_queued(p);
2012 + if (queued)
2013 deactivate_task(p, rq);
2014 oldprio = p->prio;
2015 __setscheduler(p, policy, param->sched_priority);
2016 - if (array) {
2017 + if (queued) {
2018 __activate_task(p, rq);
2019 /*
2020 * Reschedule if we are currently running on this runqueue and
2021 @@ -4159,8 +4118,8 @@ recheck:
2022 if (task_running(rq, p)) {
2023 if (p->prio > oldprio)
2024 resched_task(rq->curr);
2025 - } else if (TASK_PREEMPTS_CURR(p, rq))
2026 - resched_task(rq->curr);
2027 + } else
2028 + try_preempt(p, rq);
2029 }
2030 __task_rq_unlock(rq);
2031 spin_unlock_irqrestore(&p->pi_lock, flags);
2032 @@ -4433,40 +4392,27 @@ asmlinkage long sys_sched_getaffinity(pi
2033 * sys_sched_yield - yield the current processor to other threads.
2034 *
2035 * This function yields the current CPU by moving the calling thread
2036 - * to the expired array. If there are no other threads running on this
2037 - * CPU then this function will return.
2038 + * to the expired array if SCHED_NORMAL or the end of its current priority
2039 + * queue if a realtime task. If there are no other threads running on this
2040 + * cpu this function will return.
2041 */
2042 asmlinkage long sys_sched_yield(void)
2043 {
2044 struct rq *rq = this_rq_lock();
2045 - struct prio_array *array = current->array, *target = rq->expired;
2046 + struct task_struct *p = current;
2047
2048 schedstat_inc(rq, yld_cnt);
2049 - /*
2050 - * We implement yielding by moving the task into the expired
2051 - * queue.
2052 - *
2053 - * (special rule: RT tasks will just roundrobin in the active
2054 - * array.)
2055 - */
2056 - if (rt_task(current))
2057 - target = rq->active;
2058 + if (rq->nr_running == 1)
2059 + schedstat_inc(rq, yld_both_empty);
2060 + else {
2061 + struct prio_array *old_array = p->array;
2062 + int old_prio = p->prio;
2063
2064 - if (array->nr_active == 1) {
2065 - schedstat_inc(rq, yld_act_empty);
2066 - if (!rq->expired->nr_active)
2067 - schedstat_inc(rq, yld_both_empty);
2068 - } else if (!rq->expired->nr_active)
2069 - schedstat_inc(rq, yld_exp_empty);
2070 -
2071 - if (array != target) {
2072 - dequeue_task(current, array);
2073 - enqueue_task(current, target);
2074 - } else
2075 - /*
2076 - * requeue_task is cheaper so perform that if possible.
2077 - */
2078 - requeue_task(current, array);
2079 + /* p->prio will be updated in requeue_task via queue_expired */
2080 + if (!rt_task(p))
2081 + p->array = rq->expired;
2082 + requeue_task(p, rq, old_array, old_prio);
2083 + }
2084
2085 /*
2086 * Since we are going to call schedule() anyway, there's
2087 @@ -4676,8 +4622,8 @@ long sys_sched_rr_get_interval(pid_t pid
2088 if (retval)
2089 goto out_unlock;
2090
2091 - jiffies_to_timespec(p->policy == SCHED_FIFO ?
2092 - 0 : task_timeslice(p), &t);
2093 + t = ns_to_timespec(p->policy == SCHED_FIFO ? 0 :
2094 + MS_TO_NS(task_timeslice(p)));
2095 read_unlock(&tasklist_lock);
2096 retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
2097 out_nounlock:
2098 @@ -4771,10 +4717,10 @@ void __cpuinit init_idle(struct task_str
2099 struct rq *rq = cpu_rq(cpu);
2100 unsigned long flags;
2101
2102 - idle->timestamp = sched_clock();
2103 - idle->sleep_avg = 0;
2104 - idle->array = NULL;
2105 - idle->prio = idle->normal_prio = MAX_PRIO;
2106 + bitmap_zero(idle->bitmap, PRIO_RANGE);
2107 + idle->timestamp = idle->last_ran = sched_clock();
2108 + idle->array = rq->active;
2109 + idle->prio = idle->normal_prio = NICE_TO_PRIO(0);
2110 idle->state = TASK_RUNNING;
2111 idle->cpus_allowed = cpumask_of_cpu(cpu);
2112 set_task_cpu(idle, cpu);
2113 @@ -4893,7 +4839,7 @@ static int __migrate_task(struct task_st
2114 goto out;
2115
2116 set_task_cpu(p, dest_cpu);
2117 - if (p->array) {
2118 + if (task_queued(p)) {
2119 /*
2120 * Sync timestamp with rq_dest's before activating.
2121 * The same thing could be achieved by doing this step
2122 @@ -4904,8 +4850,7 @@ static int __migrate_task(struct task_st
2123 + rq_dest->most_recent_timestamp;
2124 deactivate_task(p, rq_src);
2125 __activate_task(p, rq_dest);
2126 - if (TASK_PREEMPTS_CURR(p, rq_dest))
2127 - resched_task(rq_dest->curr);
2128 + try_preempt(p, rq_dest);
2129 }
2130 ret = 1;
2131 out:
2132 @@ -5194,7 +5139,7 @@ migration_call(struct notifier_block *nf
2133 /* Idle task back to normal (off runqueue, low prio) */
2134 rq = task_rq_lock(rq->idle, &flags);
2135 deactivate_task(rq->idle, rq);
2136 - rq->idle->static_prio = MAX_PRIO;
2137 + rq->idle->static_prio = NICE_TO_PRIO(0);
2138 __setscheduler(rq->idle, SCHED_NORMAL, 0);
2139 migrate_dead_tasks(cpu);
2140 task_rq_unlock(rq, &flags);
2141 @@ -6706,6 +6651,13 @@ void __init sched_init_smp(void)
2142 /* Move init over to a non-isolated CPU */
2143 if (set_cpus_allowed(current, non_isolated_cpus) < 0)
2144 BUG();
2145 +
2146 + /*
2147 + * Assume that every added cpu gives us slightly less overall latency
2148 + * allowing us to increase the base rr_interval, but in a non linear
2149 + * fashion.
2150 + */
2151 + rr_interval *= 1 + ilog2(num_online_cpus());
2152 }
2153 #else
2154 void __init sched_init_smp(void)
2155 @@ -6727,6 +6679,16 @@ void __init sched_init(void)
2156 {
2157 int i, j, k;
2158
2159 + /* Generate the priority matrix */
2160 + for (i = 0; i < PRIO_RANGE; i++) {
2161 + bitmap_fill(prio_matrix[i], PRIO_RANGE);
2162 + j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
2163 + for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j) {
2164 + __clear_bit(PRIO_RANGE - 1 - (k / PRIO_RANGE),
2165 + prio_matrix[i]);
2166 + }
2167 + }
2168 +
2169 for_each_possible_cpu(i) {
2170 struct prio_array *array;
2171 struct rq *rq;
2172 @@ -6735,11 +6697,16 @@ void __init sched_init(void)
2173 spin_lock_init(&rq->lock);
2174 lockdep_set_class(&rq->lock, &rq->rq_lock_key);
2175 rq->nr_running = 0;
2176 + rq->prio_rotation = 0;
2177 rq->active = rq->arrays;
2178 rq->expired = rq->arrays + 1;
2179 - rq->best_expired_prio = MAX_PRIO;
2180 + reset_prio_levels(rq);
2181 + rq->dyn_bitmap = rq->active->prio_bitmap;
2182 + rq->exp_bitmap = rq->expired->prio_bitmap;
2183
2184 #ifdef CONFIG_SMP
2185 + rq->active->rq = rq;
2186 + rq->expired->rq = rq;
2187 rq->sd = NULL;
2188 for (j = 1; j < 3; j++)
2189 rq->cpu_load[j] = 0;
2190 @@ -6752,16 +6719,16 @@ void __init sched_init(void)
2191 atomic_set(&rq->nr_iowait, 0);
2192
2193 for (j = 0; j < 2; j++) {
2194 +
2195 array = rq->arrays + j;
2196 - for (k = 0; k < MAX_PRIO; k++) {
2197 + for (k = 0; k < MAX_PRIO; k++)
2198 INIT_LIST_HEAD(array->queue + k);
2199 - __clear_bit(k, array->bitmap);
2200 - }
2201 - // delimiter for bitsearch
2202 - __set_bit(MAX_PRIO, array->bitmap);
2203 + bitmap_zero(array->prio_bitmap, MAX_PRIO);
2204 + /* delimiter for bitsearch */
2205 + __set_bit(MAX_PRIO, array->prio_bitmap);
2206 }
2207 - }
2208
2209 + }
2210 set_load_weight(&init_task);
2211
2212 #ifdef CONFIG_SMP
2213 @@ -6815,10 +6782,10 @@ EXPORT_SYMBOL(__might_sleep);
2214 #ifdef CONFIG_MAGIC_SYSRQ
2215 void normalize_rt_tasks(void)
2216 {
2217 - struct prio_array *array;
2218 struct task_struct *p;
2219 unsigned long flags;
2220 struct rq *rq;
2221 + int queued;
2222
2223 read_lock_irq(&tasklist_lock);
2224 for_each_process(p) {
2225 @@ -6828,11 +6795,11 @@ void normalize_rt_tasks(void)
2226 spin_lock_irqsave(&p->pi_lock, flags);
2227 rq = __task_rq_lock(p);
2228
2229 - array = p->array;
2230 - if (array)
2231 + queued = task_queued(p);
2232 + if (queued)
2233 deactivate_task(p, task_rq(p));
2234 __setscheduler(p, SCHED_NORMAL, 0);
2235 - if (array) {
2236 + if (queued) {
2237 __activate_task(p, task_rq(p));
2238 resched_task(rq->curr);
2239 }
2240 Index: linux-2.6.21-ck2/Documentation/sysctl/kernel.txt
2241 ===================================================================
2242 --- linux-2.6.21-ck2.orig/Documentation/sysctl/kernel.txt 2007-02-05 22:51:59.000000000 +1100
2243 +++ linux-2.6.21-ck2/Documentation/sysctl/kernel.txt 2007-05-14 19:30:30.000000000 +1000
2244 @@ -43,6 +43,7 @@ show up in /proc/sys/kernel:
2245 - printk
2246 - real-root-dev ==> Documentation/initrd.txt
2247 - reboot-cmd [ SPARC only ]
2248 +- rr_interval
2249 - rtsig-max
2250 - rtsig-nr
2251 - sem
2252 @@ -288,6 +289,19 @@ rebooting. ???
2253
2254 ==============================================================
2255
2256 +rr_interval:
2257 +
2258 +This is the smallest duration that any cpu process scheduling unit
2259 +will run for. Increasing this value can increase throughput of cpu
2260 +bound tasks substantially but at the expense of increased latencies
2261 +overall. This value is in milliseconds and the default value chosen
2262 +depends on the number of cpus available at scheduler initialisation
2263 +with a minimum of 8.
2264 +
2265 +Valid values are from 1-5000.
2266 +
2267 +==============================================================
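
As a rough illustration of how this tunable might be inspected from userspace (not part of the patch; it only assumes the /proc/sys/kernel/rr_interval entry described above exists and that /proc is mounted), a minimal C sketch:

	#include <stdio.h>

	/*
	 * Minimal sketch, not part of the patch: read the current
	 * rr_interval in milliseconds from the sysctl entry added above.
	 */
	int main(void)
	{
		FILE *f = fopen("/proc/sys/kernel/rr_interval", "r");
		int interval;

		if (!f) {
			perror("rr_interval");
			return 1;
		}
		if (fscanf(f, "%d", &interval) == 1)
			printf("rr_interval = %d ms\n", interval);
		fclose(f);
		return 0;
	}

Writing a value in the 1-5000 range back to the same file (as root) adjusts the interval. With the scaling applied in sched_init_smp() earlier in this patch (rr_interval *= 1 + ilog2(num_online_cpus())), the default works out to roughly 8ms on one cpu, 16ms on two and 24ms on four.
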
2268 +
2269 rtsig-max & rtsig-nr:
2270
2271 The file rtsig-max can be used to tune the maximum number
2272 Index: linux-2.6.21-ck2/kernel/sysctl.c
2273 ===================================================================
2274 --- linux-2.6.21-ck2.orig/kernel/sysctl.c 2007-05-03 22:20:57.000000000 +1000
2275 +++ linux-2.6.21-ck2/kernel/sysctl.c 2007-05-14 19:30:30.000000000 +1000
2276 @@ -76,6 +76,7 @@ extern int pid_max_min, pid_max_max;
2277 extern int sysctl_drop_caches;
2278 extern int percpu_pagelist_fraction;
2279 extern int compat_log;
2280 +extern int rr_interval;
2281
2282 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
2283 static int maxolduid = 65535;
2284 @@ -159,6 +160,14 @@ int sysctl_legacy_va_layout;
2285 #endif
2286
2287
2288 +/* Constants for minimum and maximum testing.
2289 + We use these as one-element integer vectors. */
2290 +static int __read_mostly zero;
2291 +static int __read_mostly one = 1;
2292 +static int __read_mostly one_hundred = 100;
2293 +static int __read_mostly five_thousand = 5000;
2294 +
2295 +
2296 /* The default sysctl tables: */
2297
2298 static ctl_table root_table[] = {
2299 @@ -499,6 +508,17 @@ static ctl_table kern_table[] = {
2300 .mode = 0444,
2301 .proc_handler = &proc_dointvec,
2302 },
2303 + {
2304 + .ctl_name = CTL_UNNUMBERED,
2305 + .procname = "rr_interval",
2306 + .data = &rr_interval,
2307 + .maxlen = sizeof (int),
2308 + .mode = 0644,
2309 + .proc_handler = &proc_dointvec_minmax,
2310 + .strategy = &sysctl_intvec,
2311 + .extra1 = &one,
2312 + .extra2 = &five_thousand,
2313 + },
2314 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
2315 {
2316 .ctl_name = KERN_UNKNOWN_NMI_PANIC,
2317 @@ -607,12 +627,6 @@ static ctl_table kern_table[] = {
2318 { .ctl_name = 0 }
2319 };
2320
2321 -/* Constants for minimum and maximum testing in vm_table.
2322 - We use these as one-element integer vectors. */
2323 -static int zero;
2324 -static int one_hundred = 100;
2325 -
2326 -
2327 static ctl_table vm_table[] = {
2328 {
2329 .ctl_name = VM_OVERCOMMIT_MEMORY,
2330 Index: linux-2.6.21-ck2/fs/pipe.c
2331 ===================================================================
2332 --- linux-2.6.21-ck2.orig/fs/pipe.c 2007-05-03 22:20:56.000000000 +1000
2333 +++ linux-2.6.21-ck2/fs/pipe.c 2007-05-14 19:30:30.000000000 +1000
2334 @@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p
2335 {
2336 DEFINE_WAIT(wait);
2337
2338 - /*
2339 - * Pipes are system-local resources, so sleeping on them
2340 - * is considered a noninteractive wait:
2341 - */
2342 - prepare_to_wait(&pipe->wait, &wait,
2343 - TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
2344 + prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
2345 if (pipe->inode)
2346 mutex_unlock(&pipe->inode->i_mutex);
2347 schedule();
2348 Index: linux-2.6.21-ck2/Documentation/sched-design.txt
2349 ===================================================================
2350 --- linux-2.6.21-ck2.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.000000000 +1100
2351 +++ linux-2.6.21-ck2/Documentation/sched-design.txt 2007-05-14 19:30:30.000000000 +1000
2352 @@ -1,11 +1,14 @@
2353 - Goals, Design and Implementation of the
2354 - new ultra-scalable O(1) scheduler
2355 + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
2356 + Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by
2357 + Con Kolivas.
2358
2359
2360 - This is an edited version of an email Ingo Molnar sent to
2361 - lkml on 4 Jan 2002. It describes the goals, design, and
2362 - implementation of Ingo's new ultra-scalable O(1) scheduler.
2363 - Last Updated: 18 April 2002.
2364 + This was originally an edited version of an email Ingo Molnar sent to
2365 + lkml on 4 Jan 2002. It describes the goals, design, and implementation
2366 + of Ingo's ultra-scalable O(1) scheduler. It now contains a description
2367 + of the Staircase Deadline priority scheduler that was built on this
2368 + design.
2369 + Last Updated: Fri, 4 May 2007
2370
2371
2372 Goal
2373 @@ -163,3 +166,222 @@ certain code paths and data constructs.
2374 code is smaller than the old one.
2375
2376 Ingo
2377 +
2378 +
2379 +Staircase Deadline cpu scheduler policy
2380 +================================================
2381 +
2382 +Design summary
2383 +==============
2384 +
2385 +A novel design which incorporates a foreground-background descending priority
2386 +system (the staircase) via a bandwidth allocation matrix according to nice
2387 +level.
2388 +
2389 +
2390 +Features
2391 +========
2392 +
2393 +A starvation free, strict fairness O(1) scalable design with interactivity
2394 +as good as the above restrictions can provide. There is no interactivity
2395 +estimator, no sleep/run measurements and only simple fixed accounting.
2396 +The design has strict enough a design and accounting that task behaviour
2397 +can be modelled and maximum scheduling latencies can be predicted by
2398 +the virtual deadline mechanism that manages runqueues. The prime concern
2399 +in this design is to maintain fairness at all costs determined by nice level,
2400 +yet to maintain as good interactivity as can be allowed within the
2401 +constraints of strict fairness.
2402 +
2403 +
2404 +Design description
2405 +==================
2406 +
2407 +SD works off the principle of providing each task a quota of runtime that it is
2408 +allowed to run at a number of priority levels determined by its static priority
2409 +(ie. its nice level). If the task uses up its quota it has its priority
2410 +decremented to the next level determined by a priority matrix. Once every
2411 +runtime quota has been consumed of every priority level, a task is queued on the
2412 +"expired" array. When no other tasks exist with quota, the expired array is
2413 +activated and fresh quotas are handed out. This is all done in O(1).
2414 +
2415 +Design details
2416 +==============
2417 +
2418 +Each task keeps a record of its own entitlement of cpu time. Most of the rest of
2419 +these details apply to non-realtime tasks as rt task management is straight
2420 +forward.
2421 +
2422 +Each runqueue keeps a record of what major epoch it is up to in the
2423 +rq->prio_rotation field which is incremented on each major epoch. It also
2424 +keeps a record of the current prio_level for each static priority task.
2425 +
2426 +Each task keeps a record of what major runqueue epoch it was last running
2427 +on in p->rotation. It also keeps a record of what priority levels it has
2428 +already been allocated quota from during this epoch in a bitmap p->bitmap.
2429 +
2430 +The only tunable that determines all other details is the RR_INTERVAL. This
2431 +is set to 8ms, and is scaled gently upwards with more cpus. This value is
2432 +tunable via a /proc interface.
2433 +
2434 +All tasks are initially given a quota based on RR_INTERVAL. This is equal to
2435 +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
2436 +progressively larger for nice values from -1 to -20. This is assigned to
2437 +p->quota and only changes with changes in nice level.
2438 +
2439 +As a task is first queued, it checks in recalc_task_prio to see if it has run at
2440 +this runqueue's current priority rotation. If it has not, it will have its
2441 +p->prio level set according to the first slot in a "priority matrix" and will be
2442 +given a p->time_slice equal to the p->quota, and has its allocation bitmap bit
2443 +set in p->bitmap for this prio level. It is then queued on the current active
2444 +priority array.
2445 +
2446 +If a task has already been running during this major epoch, and it has
2447 +p->time_slice left and the rq->prio_quota for the task's p->prio still
2448 +has quota, it will be placed back on the active array, but no more quota
2449 +will be added.
2450 +
2451 +If a task has been running during this major epoch, but does not have
2452 +p->time_slice left, it will find the next lowest priority in its bitmap that it
2453 +has not been allocated quota from. It then gets a full quota in
2454 +p->time_slice. It is then queued on the current active priority array at the
2455 +newly determined lower priority.
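
The slot search described here can be pictured with a small standalone sketch; the helper name and plain arrays below are illustrative only and are not the patch's actual bitmap code:

	#define PRIO_RANGE 40

	/*
	 * Illustrative only: starting after the task's current slot, find the
	 * next priority slot that the nice-level row of the priority matrix
	 * allows (a 0 in the table further below) and that the task has not
	 * yet drawn quota from during this epoch.  A negative return
	 * corresponds to the expired-array case in the next paragraph.
	 */
	static int next_entitled_slot(const int matrix_row[PRIO_RANGE],
				      const int quota_used[PRIO_RANGE], int cur)
	{
		int slot;

		for (slot = cur + 1; slot < PRIO_RANGE; slot++)
			if (matrix_row[slot] == 0 && !quota_used[slot])
				return slot;
		return -1;
	}
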
2456 +
2457 +If a task has been running during this major epoch, and does not have
2458 +any entitlement left in p->bitmap and no time_slice left, it will have its
2459 +bitmap cleared, and be queued at its best prio again, but on the expired
2460 +priority array.
2461 +
2462 +When a task is queued, it has its relevant bit set in the array->prio_bitmap.
2463 +
2464 +p->time_slice is accounted in microseconds and is updated via update_cpu_clock
2465 +on schedule() and scheduler_tick. If p->time_slice falls below zero then the
2466 +task's priority is reassessed via recalc_task_prio and it is rescheduled.
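
Condensed into a small sketch (simplified userspace-style types, not the kernel code; the actual logic is in the update_cpu_clock() hunk earlier in this patch):

	struct task_sketch {
		long time_slice;	/* remaining entitlement, in microseconds */
		int needs_requeue;
	};

	/*
	 * Simplified illustration of the accounting described above:
	 * time_diff_ns is the sanity-checked nanosecond delta from
	 * update_cpu_clock(), charged against time_slice in microseconds to
	 * avoid 32bit overflow.
	 */
	static void charge_time_slice(struct task_sketch *p, long long time_diff_ns)
	{
		p->time_slice -= time_diff_ns / 1000;
		if (p->time_slice <= 0)
			p->needs_requeue = 1;	/* reassess priority, requeue */
	}
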
2467 +
2468 +
2469 +Priority Matrix
2470 +===============
2471 +
2472 +In order to minimise the latencies between tasks of different nice levels
2473 +running concurrently, the dynamic priority slots where different nice levels
2474 +are queued are dithered instead of being sequential. What this means is that
2475 +there are 40 priority slots where a task may run during one major rotation,
2476 +and the allocation of slots is dependent on nice level. In the
2477 +following table, a zero represents a slot where the task may run.
2478 +
2479 +PRIORITY:0..................20.................39
2480 +nice -20 0000000000000000000000000000000000000000
2481 +nice -10 1000100010001000100010001000100010010000
2482 +nice 0 1010101010101010101010101010101010101010
2483 +nice 5 1011010110110101101101011011010110110110
2484 +nice 10 1110111011101110111011101110111011101110
2485 +nice 15 1111111011111110111111101111111011111110
2486 +nice 19 1111111111111111111111111111111111111110
2487 +
2488 +As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
2489 +task only runs one slot per major rotation. This dithered table allows for the
2490 +smallest possible maximum latencies between tasks of varying nice levels, thus
2491 +allowing vastly different nice levels to be used.
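
The table above can be reproduced with a small standalone program that mirrors the matrix-generation loop sched_init() gains in this patch (plain arrays are used here instead of the kernel bitmap helpers; a 1 marks a skipped slot, a 0 a slot the task may run in):

	#include <stdio.h>

	#define PRIO_RANGE 40

	int main(void)
	{
		static int prio_matrix[PRIO_RANGE][PRIO_RANGE];
		int i, j, k;

		for (i = 0; i < PRIO_RANGE; i++) {
			/* Start with every slot skipped for this nice level... */
			for (k = 0; k < PRIO_RANGE; k++)
				prio_matrix[i][k] = 1;
			/* ...then clear slots at a spacing that grows with nice level */
			j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
			for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j)
				prio_matrix[i][PRIO_RANGE - 1 - (k / PRIO_RANGE)] = 0;
		}

		for (i = 0; i < PRIO_RANGE; i++) {
			printf("nice %3d ", i - 20);
			for (k = 0; k < PRIO_RANGE; k++)
				printf("%d", prio_matrix[i][k]);
			printf("\n");
		}
		return 0;
	}

The spacing j is exactly the expression used in the sched_init() hunk of this patch, so the printed rows match the dithered pattern shown above for the nice -20, 0 and 19 examples.
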
2492 +
2493 +SCHED_BATCH tasks are managed slightly differently, receiving only the top
2494 +slots from their priority bitmap, giving them the same cpu share as
2495 +SCHED_NORMAL tasks but slightly higher latencies.
2496 +
2497 +
2498 +Modelling deadline behaviour
2499 +============================
2500 +
2501 +As the accounting in this design is hard and not modified by sleep average
2502 +calculations or interactivity modifiers, it is possible to accurately
2503 +predict the maximum latency that a task may experience under different
2504 +conditions. This is a virtual deadline mechanism enforced by mandatory
2505 +timeslice expiration and not outside bandwidth measurement.
2506 +
2507 +The maximum duration a task can run during one major epoch is determined by its
2508 +nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
2509 +duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
2510 +epoch, and so on. The table in the priority matrix above demonstrates how this
2511 +is enforced.
2512 +
2513 +Therefore the maximum duration a runqueue epoch can take is determined by
2514 +the number of tasks running and their nice levels. After that, the maximum
2515 +time a task can wait before it gets scheduled is determined by the
2516 +position of its first slot on the matrix.
2517 +
2518 +In the following examples, these are _worst case scenarios_ and would rarely
2519 +occur, but can be modelled nonetheless to determine the maximum possible
2520 +latency.
2521 +
2522 +So for example, if two nice 0 tasks are running, and one has just expired as
2523 +another is activated for the first time receiving a full quota for this
2524 +runqueue rotation, the first task will wait:
2525 +
2526 +nr_tasks * max_duration + nice_difference * rr_interval
2527 +1 * 19 * RR_INTERVAL + 0 = 152ms
2528 +
2529 +In the presence of a nice 10 task, a nice 0 task would wait a maximum of
2530 +1 * 10 * RR_INTERVAL + 0 = 80ms
2531 +
2532 +In the presence of a nice 0 task, a nice 10 task would wait a maximum of
2533 +1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms
2534 +
2535 +More useful than these values, though, are the average latencies which are
2536 +a matter of determining the average distance between priority slots of
2537 +different nice values and multiplying them by the tasks' quota. For example
2538 +in the presence of a nice -10 task, a nice 0 task will wait either one or
2539 +two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
2540 +this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or
2541 +20 and 40ms respectively (on uniprocessor at 1000HZ).
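
For quick back-of-the-envelope checks, the worst-case formula quoted above can be wrapped in a trivial helper (illustrative only, everything expressed in milliseconds):

	/*
	 * Illustrative only: nr_tasks * max_duration + nice_difference * rr_interval,
	 * with max_duration expressed as slots_per_epoch * rr_interval_ms.
	 */
	static unsigned int worst_case_wait_ms(unsigned int nr_tasks,
					       unsigned int slots_per_epoch,
					       unsigned int nice_difference,
					       unsigned int rr_interval_ms)
	{
		return nr_tasks * slots_per_epoch * rr_interval_ms +
		       nice_difference * rr_interval_ms;
	}

	/*
	 * worst_case_wait_ms(1, 19, 0, 8) == 152 and
	 * worst_case_wait_ms(1, 19, 1, 8) == 160, matching the examples above.
	 */
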
2542 +
2543 +
2544 +Achieving interactivity
2545 +=======================
2546 +
2547 +A requirement of this scheduler design was to achieve good interactivity
2548 +despite being a completely fair deadline based design. The disadvantage of
2549 +designs that try to achieve interactivity is that they usually do so at
2550 +the expense of maintaining fairness. As cpu speeds increase, the requirement
2551 +for some sort of metered unfairness towards interactive tasks becomes a less
2552 +desirable phenomenon, but low latency and fairness remains mandatory to
2553 +good interactive performance.
2554 +
2555 +This design relies on the fact that interactive tasks, by their nature,
2556 +sleep often. Most fair scheduling designs end up penalising such tasks
2557 +indirectly giving them less than their fair possible share because of the
2558 +sleep, and have to use a mechanism of bonusing their priority to offset
2559 +this based on the duration they sleep. This becomes increasingly inaccurate
2560 +as the number of running tasks rises and more tasks spend time waiting on
2561 +runqueues rather than sleeping, and it is impossible to tell whether the
2562 +task that's waiting on a runqueue only intends to run for a short period and
2563 +then sleep again after that runqueue wait. Furthermore, all such designs rely
2564 +on a period of time to pass to accumulate some form of statistic on the task
2565 +before deciding on how much to give them preference. The shorter this period,
2566 +the more rapidly bursts of cpu ruin the interactive tasks' behaviour. The
2567 +longer this period, the longer it takes for interactive tasks to get low
2568 +scheduling latencies and fair cpu.
2569 +
2570 +This design does not measure sleep time at all. Interactive tasks that sleep
2571 +often will wake up having consumed very little if any of their quota for
2572 +the current major priority rotation. The longer they have slept, the less
2573 +likely they are to even be on the current major priority rotation. Once
2574 +woken up, though, they get to use up their full quota for that epoch,
2575 +whether a partial or a full quota remains. Overall, however, they
2576 +can still only run as much cpu time for that epoch as any other task of the
2577 +same nice level. This means that two tasks behaving completely differently
2578 +from fully cpu bound to waking/sleeping extremely frequently will still
2579 +get the same quota of cpu, but the latter will be using its quota for that
2580 +epoch in bursts rather than continuously. This guarantees that interactive
2581 +tasks get the same amount of cpu as cpu bound ones.
2582 +
2583 +The other requirement of interactive tasks is also to obtain low latencies
2584 +for when they are scheduled. Unlike fully cpu bound tasks and the maximum
2585 +latencies possible described in the modelling deadline behaviour section
2586 +above, tasks that sleep will wake up with quota available usually at the
2587 +current runqueue's priority_level or better. This means that the most latency
2588 +they are likely to see is one RR_INTERVAL, and often they will preempt the
2589 +current task if it is not of a sleeping nature. This then guarantees very
2590 +low latency for interactive tasks, and the lowest latencies for the least
2591 +cpu bound tasks.
2592 +
2593 +
2594 +Fri, 4 May 2007
2595 +Con Kolivas <kernel@kolivas.org>
2596 Index: linux-2.6.21-ck2/kernel/softirq.c
2597 ===================================================================
2598 --- linux-2.6.21-ck2.orig/kernel/softirq.c 2007-05-03 22:20:57.000000000 +1000
2599 +++ linux-2.6.21-ck2/kernel/softirq.c 2007-05-14 19:30:30.000000000 +1000
2600 @@ -488,7 +488,7 @@ void __init softirq_init(void)
2601
2602 static int ksoftirqd(void * __bind_cpu)
2603 {
2604 - set_user_nice(current, 19);
2605 + set_user_nice(current, 15);
2606 current->flags |= PF_NOFREEZE;
2607
2608 set_current_state(TASK_INTERRUPTIBLE);