Annotation of /trunk/kernel26-alx/patches-2.6.21-r13/0001-2.6.21-sd-0.48.patch

Revision 445 - Tue Jan 15 00:44:37 2008 UTC (16 years, 8 months ago) by niro
File size: 89890 byte(s)

-added patches for 2.6.21-alx-r13
Staircase Deadline cpu scheduler policy
================================================

Design summary
==============

A novel design which incorporates a foreground-background descending priority
system (the staircase) via a bandwidth allocation matrix according to nice
level.


Features
========
A starvation-free, strictly fair, O(1) scalable design with interactivity
as good as those restrictions can provide. There is no interactivity
estimator, no sleep/run measurements and only simple fixed accounting.
The design and its accounting are strict enough that task behaviour
can be modelled and maximum scheduling latencies can be predicted by
the virtual deadline mechanism that manages runqueues. The prime concern
in this design is to maintain fairness at all costs, as determined by nice
level, while providing as much interactivity as the constraints of strict
fairness allow.


Design description
==================

SD works off the principle of providing each task a quota of runtime that it is
allowed to run at a number of priority levels determined by its static priority
(ie. its nice level). If the task uses up its quota it has its priority
decremented to the next level determined by a priority matrix. Once the
runtime quota has been consumed at every entitled priority level, the task is
queued on the "expired" array. When no other tasks exist with quota, the
expired array is activated and fresh quotas are handed out. This is all done
in O(1).

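As a rough standalone model of that cycle (illustrative only: the 8ms
interval and the 19 entitled levels for nice 0 come from this document,
while every name in the sketch is made up rather than taken from the patch):

#include <stdio.h>

int main(void)
{
        int rr_interval_ms = 8;         /* default RR_INTERVAL */
        int entitled_levels = 19;       /* nice 0 slots per major epoch */
        int level, total_ms = 0;

        /* One full quota is consumed at each entitled priority level,
         * the task being demoted a level each time its quota runs out. */
        for (level = 0; level < entitled_levels; level++)
                total_ms += rr_interval_ms;

        /* After the last entitled level the task joins the expired array. */
        printf("nice 0 budget for one major epoch: %d ms\n", total_ms);
        return 0;       /* prints 152 ms, matching the modelling section */
}
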
Design details
==============

Each task keeps a record of its own entitlement of cpu time. Most of the rest
of these details apply to non-realtime tasks, as rt task management is
straightforward.

Each runqueue keeps a record of which major epoch it is up to in the
rq->prio_rotation field, which is incremented on each major epoch. It also
keeps a record of the current prio_level for each static priority.

Each task keeps a record of which major runqueue epoch it last ran
on in p->rotation. It also keeps a record of which priority levels it has
already been allocated quota from during this epoch in a bitmap, p->bitmap.

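Condensed from the structure changes in the diff below, that bookkeeping
amounts to roughly the following (abridged sketch; a plain array stands in
for the kernel's DECLARE_BITMAP):

#define PRIO_RANGE 40

struct rq_bookkeeping {                 /* subset of struct rq */
        unsigned long prio_rotation;    /* incremented each major epoch */
        int prio_level[PRIO_RANGE];     /* current level per static prio */
};

struct task_bookkeeping {               /* subset of struct task_struct */
        unsigned long rotation;         /* major epoch the task last ran in */
        unsigned long bitmap[2];        /* prio levels already drawn from */
};
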
The only tunable that determines all other details is the RR_INTERVAL. This
is set to 8ms, and is scaled gently upwards with more cpus. This value is
tunable via a /proc interface.

All tasks are initially given a quota based on RR_INTERVAL. This is equal to
RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
progressively larger for nice values from -7 to -20. This is assigned to
p->quota and only changes with changes in nice level.

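This mapping is implemented by rr_quota() in the patch below; the same
arithmetic as a standalone sketch, with the expected values taken from the
patch's own comment:

#include <stdio.h>

/* Mirrors the patch's rr_quota(): per-level quota in microseconds. */
static unsigned int rr_quota_us(int nice, int rr_interval_ms)
{
        int rr = rr_interval_ms;

        if (nice < -6) {
                rr *= nice * nice;              /* quadratic growth */
                rr /= 40;
        } else if (nice > 0)
                rr = rr / 2 ? rr / 2 : 1;       /* half quota, 1ms minimum */
        return rr * 1000;                       /* MS_TO_US */
}

int main(void)
{
        printf("nice -20: %u us\n", rr_quota_us(-20, 8));       /* 80000, 10x */
        printf("nice -10: %u us\n", rr_quota_us(-10, 8));       /* 20000, 2.5x */
        printf("nice   0: %u us\n", rr_quota_us(0, 8));         /* 8000 */
        printf("nice  19: %u us\n", rr_quota_us(19, 8));        /* 4000 */
        return 0;
}
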
As a task is first queued, it checks in recalc_task_prio to see if it has run
at this runqueue's current priority rotation. If it has not, it will have its
p->prio level set according to the first slot in a "priority matrix", will be
given a p->time_slice equal to p->quota, and has its allocation bit for this
prio level set in p->bitmap. It is then queued on the current active
priority array.

If a task has already been running during this major epoch, and it has
p->time_slice left and the rq->prio_quota for the task's p->prio still
has quota, it will be placed back on the active array, but no more quota
will be added.

If a task has been running during this major epoch, but does not have
p->time_slice left, it will find the next lowest priority in its bitmap that it
has not been allocated quota from. It then gets a full quota in
p->time_slice and is queued on the current active priority array at the
newly determined lower priority.

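The slot search is the bitmap operation at the heart of next_entitled_slot()
in the patch below: OR the task's used-slot bitmap with its nice level's
matrix row and take the first zero bit. A standalone sketch using plain
integers instead of the kernel bitmap helpers:

#include <stdio.h>

#define PRIO_RANGE 40

static int next_slot(unsigned long long used, unsigned long long row)
{
        unsigned long long blocked = used | row;
        int slot;

        for (slot = 0; slot < PRIO_RANGE; slot++)
                if (!(blocked & (1ULL << slot)))
                        return slot;    /* first still-entitled level */
        return -1;                      /* no slots left: expire the task */
}

int main(void)
{
        /* the nice 0 row 1010...: '1' (blocked) on even slots */
        unsigned long long row = 0x5555555555ULL;

        printf("fresh nice 0 task starts at slot %d\n", next_slot(0, row));
        printf("after using slot 1 it moves to slot %d\n",
               next_slot(1ULL << 1, row));      /* prints 1, then 3 */
        return 0;
}
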
If a task has been running during this major epoch, and has neither
entitlement left in p->bitmap nor time_slice left, it will have its
bitmap cleared, and be queued at its best prio again, but on the expired
priority array.

When a task is queued, it has its relevant bit set in the array->prio_bitmap.

p->time_slice is stored in nanoseconds and is updated via update_cpu_clock on
schedule() and scheduler_tick. If p->time_slice is below zero then
recalc_task_prio is run again and the task is rescheduled.

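A minimal sketch of that accounting step; the reprioritisation itself is
stubbed out here, since it lives in recalc_task_prio() in the patch below:

#include <stdio.h>

/* Charge the time just run against the slice; on underflow the task
 * would be reprioritised and rescheduled. */
static int charge_time_slice(long long *time_slice_ns, long long ran_ns)
{
        *time_slice_ns -= ran_ns;
        return *time_slice_ns < 0;      /* 1 = recalc prio and resched */
}

int main(void)
{
        long long slice = 8000000;      /* an 8ms quota, in nanoseconds */

        while (!charge_time_slice(&slice, 3000000))     /* 3ms of runtime */
                printf("still entitled, %lld ns left\n", slice);
        printf("slice below zero: reprioritise and reschedule\n");
        return 0;
}
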

Priority Matrix
===============

In order to minimise the latencies between tasks of different nice levels
running concurrently, the dynamic priority slots where different nice levels
are queued are dithered instead of being sequential. What this means is that
there are 40 priority slots where a task may run during one major rotation,
and the allocation of slots is dependent on nice level. In the
following table, a zero represents a slot where the task may run.

PRIORITY:0..................20.................39
nice -20 0000000000000000000000000000000000000000
nice -10 1000100010001000100010001000100010010000
nice   0 1010101010101010101010101010101010101010
nice   5 1011010110110101101101011011010110110110
nice  10 1110111011101110111011101110111011101110
nice  15 1111111011111110111111101111111011111110
nice  19 1111111111111111111111111111111111111110

As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
task runs in only one slot per major rotation. This dithered table allows for
the smallest possible maximum latencies between tasks of varying nice levels,
thus allowing vastly different nice levels to be used.

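One way to produce rows with such an even spread is Bresenham-style error
spreading; this is only an illustration, as the patch pre-generates its
prio_matrix with its own code outside the hunks shown here:

#include <stdio.h>

#define PRIO_RANGE 40

/* Spread `slots` runnable ('0') positions evenly across the 40 slots. */
static void dither_row(int slots, char row[PRIO_RANGE + 1])
{
        int i, err = 0;

        for (i = 0; i < PRIO_RANGE; i++) {
                err += slots;
                if (err >= PRIO_RANGE) {
                        err -= PRIO_RANGE;
                        row[i] = '0';   /* task may run in this slot */
                } else
                        row[i] = '1';
        }
        row[PRIO_RANGE] = '\0';
}

int main(void)
{
        char row[PRIO_RANGE + 1];

        dither_row(40, row);    /* nice -20: every slot */
        printf("nice -20 %s\n", row);
        dither_row(20, row);    /* nice 0: every second slot */
        printf("nice   0 %s\n", row);
        return 0;
}
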
SCHED_BATCH tasks are managed slightly differently, receiving only the top
slots from their priority bitmap, which gives them the same cpu share as
SCHED_NORMAL but slightly higher latencies.


Modelling deadline behaviour
============================

As the accounting in this design is fixed and not modified by sleep average
calculations or interactivity modifiers, it is possible to accurately
predict the maximum latency that a task may experience under different
conditions. This is a virtual deadline mechanism enforced by mandatory
timeslice expiration and not outside bandwidth measurement.

The maximum duration a task can run during one major epoch is determined by its
nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
epoch, and so on. The table in the priority matrix above demonstrates how this
is enforced.

Therefore the maximum duration a runqueue epoch can take is determined by
the number of tasks running and their nice levels. Beyond that, the maximum
time a task can wait before it gets scheduled is determined by the position
of its first slot on the matrix.

The following examples are _worst case scenarios_ which would rarely
occur, but they can be modelled nonetheless to determine the maximum possible
latency.

So for example, if two nice 0 tasks are running, and one has just expired as
another is activated for the first time receiving a full quota for this
runqueue rotation, the first task will wait:

nr_tasks * max_duration + nice_difference * rr_interval
1 * 19 * RR_INTERVAL + 0 = 152ms

In the presence of a nice 10 task, a nice 0 task would wait a maximum of
1 * 10 * RR_INTERVAL + 0 = 80ms

In the presence of a nice 0 task, a nice 10 task would wait a maximum of
1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms

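The same three worst cases can be checked numerically (all values straight
from the formulas above, with rr_interval = 8ms):

#include <stdio.h>

/* nr_tasks * max_duration + nice_difference * rr_interval, in ms */
static int worst_wait_ms(int nr_tasks, int levels, int extra_slots, int rr)
{
        return nr_tasks * levels * rr + extra_slots * rr;
}

int main(void)
{
        printf("two nice 0 tasks:     %d ms\n", worst_wait_ms(1, 19, 0, 8));
        printf("nice 0 vs a nice 10:  %d ms\n", worst_wait_ms(1, 10, 0, 8));
        printf("nice 10 vs a nice 0:  %d ms\n", worst_wait_ms(1, 19, 1, 8));
        return 0;       /* prints 152, 80 and 160 ms respectively */
}
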
More useful than these values, though, are the average latencies, which are
a matter of determining the average distance between priority slots of
different nice values and multiplying it by the tasks' quota. For example,
in the presence of a nice -10 task, a nice 0 task will wait either one or
two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
this means the latencies will alternate between 2.5 and 5 RR_INTERVALs, or
20 and 40ms respectively (on uniprocessor at 1000HZ).


Achieving interactivity
=======================

A requirement of this scheduler design was to achieve good interactivity
despite being a completely fair deadline based design. The disadvantage of
designs that try to achieve interactivity is that they usually do so at
the expense of maintaining fairness. As cpu speeds increase, the requirement
for some sort of metered unfairness towards interactive tasks becomes a less
desirable phenomenon, but low latency and fairness remain mandatory for
good interactive performance.

This design relies on the fact that interactive tasks, by their nature,
sleep often. Most fair scheduling designs end up penalising such tasks
indirectly, giving them less than their possible fair share because of the
sleep, and have to use a mechanism of bonusing their priority to offset
this based on the duration they sleep. This becomes increasingly inaccurate
as the number of running tasks rises and more tasks spend time waiting on
runqueues rather than sleeping, and it is impossible to tell whether the
task that's waiting on a runqueue only intends to run for a short period and
then sleep again after that runqueue wait. Furthermore, all such designs rely
on a period of time passing to accumulate some form of statistic on the task
before deciding how much preference to give it. The shorter this period,
the more rapidly bursts of cpu usage ruin an interactive task's behaviour. The
longer this period, the longer it takes for interactive tasks to get low
scheduling latencies and fair cpu.

This design does not measure sleep time at all. Interactive tasks that sleep
often will wake up having consumed very little, if any, of their quota for
the current major priority rotation. The longer they have slept, the less
likely they are to even be on the current major priority rotation. Once
woken up, though, they get to use their full quota for that epoch,
whether part of a quota remains or a full quota is granted. Overall, however,
they can still only run as much cpu time for that epoch as any other task of
the same nice level. This means that two tasks behaving completely
differently, from fully cpu bound to waking/sleeping extremely frequently,
will still get the same quota of cpu, but the latter will be using its quota
for that epoch in bursts rather than continuously. This guarantees that
interactive tasks get the same amount of cpu as cpu bound ones.

The other requirement of interactive tasks is to obtain low latencies
when they are scheduled. Unlike fully cpu bound tasks and the maximum
latencies possible described in the modelling deadline behaviour section
above, tasks that sleep will wake up with quota available, usually at the
current runqueue's priority_level or better. This means that the most latency
they are likely to see is one RR_INTERVAL, and often they will preempt the
current task if it is not of a sleeping nature. This then guarantees very
low latency for interactive tasks, and the lowest latencies for the least
cpu bound tasks.


Fri, 4 May 2007

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 Documentation/sched-design.txt  |  234 +++++++
 Documentation/sysctl/kernel.txt |   14
 fs/pipe.c                       |    7
 fs/proc/array.c                 |    2
 include/linux/init_task.h       |    4
 include/linux/sched.h           |   32 -
 kernel/sched.c                  | 1277 +++++++++++++++++++---------------------
 kernel/softirq.c                |    2
 kernel/sysctl.c                 |   26
 kernel/workqueue.c              |    2
 10 files changed, 908 insertions(+), 692 deletions(-)

Index: linux-2.6.21-ck2/kernel/workqueue.c
===================================================================
--- linux-2.6.21-ck2.orig/kernel/workqueue.c 2007-05-03 22:20:57.000000000 +1000
+++ linux-2.6.21-ck2/kernel/workqueue.c 2007-05-14 19:30:30.000000000 +1000
@@ -355,8 +355,6 @@ static int worker_thread(void *__cwq)
         if (!cwq->freezeable)
                 current->flags |= PF_NOFREEZE;

-        set_user_nice(current, -5);
-
         /* Block and flush all signals */
         sigfillset(&blocked);
         sigprocmask(SIG_BLOCK, &blocked, NULL);
Index: linux-2.6.21-ck2/fs/proc/array.c
===================================================================
--- linux-2.6.21-ck2.orig/fs/proc/array.c 2007-05-03 22:20:56.000000000 +1000
+++ linux-2.6.21-ck2/fs/proc/array.c 2007-05-14 19:30:30.000000000 +1000
@@ -165,7 +165,6 @@ static inline char * task_state(struct t
         rcu_read_lock();
         buffer += sprintf(buffer,
                 "State:\t%s\n"
-                "SleepAVG:\t%lu%%\n"
                 "Tgid:\t%d\n"
                 "Pid:\t%d\n"
                 "PPid:\t%d\n"
@@ -173,7 +172,6 @@ static inline char * task_state(struct t
                 "Uid:\t%d\t%d\t%d\t%d\n"
                 "Gid:\t%d\t%d\t%d\t%d\n",
                 get_task_state(p),
-                (p->sleep_avg/1024)*100/(1020000000/1024),
                 p->tgid, p->pid,
                 pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
                 pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,
Index: linux-2.6.21-ck2/include/linux/init_task.h
===================================================================
--- linux-2.6.21-ck2.orig/include/linux/init_task.h 2007-05-03 22:20:57.000000000 +1000
+++ linux-2.6.21-ck2/include/linux/init_task.h 2007-05-14 19:30:30.000000000 +1000
@@ -102,13 +102,15 @@ extern struct group_info init_groups;
         .prio = MAX_PRIO-20, \
         .static_prio = MAX_PRIO-20, \
         .normal_prio = MAX_PRIO-20, \
+        .rotation = 0, \
         .policy = SCHED_NORMAL, \
         .cpus_allowed = CPU_MASK_ALL, \
         .mm = NULL, \
         .active_mm = &init_mm, \
         .run_list = LIST_HEAD_INIT(tsk.run_list), \
         .ioprio = 0, \
-        .time_slice = HZ, \
+        .time_slice = 1000000000, \
+        .quota = 1000000000, \
         .tasks = LIST_HEAD_INIT(tsk.tasks), \
         .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
         .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
Index: linux-2.6.21-ck2/include/linux/sched.h
===================================================================
--- linux-2.6.21-ck2.orig/include/linux/sched.h 2007-05-03 22:20:57.000000000 +1000
+++ linux-2.6.21-ck2/include/linux/sched.h 2007-05-14 19:30:30.000000000 +1000
@@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co
 #define EXIT_ZOMBIE 16
 #define EXIT_DEAD 32
 /* in tsk->state again */
-#define TASK_NONINTERACTIVE 64
-#define TASK_DEAD 128
+#define TASK_DEAD 64

 #define __set_task_state(tsk, state_value) \
         do { (tsk)->state = (state_value); } while (0)
@@ -522,8 +521,9 @@ struct signal_struct {

 #define MAX_USER_RT_PRIO 100
 #define MAX_RT_PRIO MAX_USER_RT_PRIO
+#define PRIO_RANGE (40)

-#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)

 #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO)
 #define rt_task(p) rt_prio((p)->prio)
@@ -788,13 +788,6 @@ struct mempolicy;
 struct pipe_inode_info;
 struct uts_namespace;

-enum sleep_type {
-        SLEEP_NORMAL,
-        SLEEP_NONINTERACTIVE,
-        SLEEP_INTERACTIVE,
-        SLEEP_INTERRUPTED,
-};
-
 struct prio_array;

 struct task_struct {
@@ -814,20 +807,33 @@ struct task_struct {
         int load_weight; /* for niceness load balancing purposes */
         int prio, static_prio, normal_prio;
         struct list_head run_list;
+        /*
+         * This bitmap shows what priorities this task has received quota
+         * from for this major priority rotation on its current runqueue.
+         */
+        DECLARE_BITMAP(bitmap, PRIO_RANGE + 1);
         struct prio_array *array;
+        /* Which major runqueue rotation did this task run */
+        unsigned long rotation;

         unsigned short ioprio;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
         unsigned int btrace_seq;
 #endif
-        unsigned long sleep_avg;
         unsigned long long timestamp, last_ran;
         unsigned long long sched_time; /* sched_clock time spent running */
-        enum sleep_type sleep_type;

         unsigned long policy;
         cpumask_t cpus_allowed;
-        unsigned int time_slice, first_time_slice;
+        /*
+         * How much this task is entitled to run at the current priority
+         * before being requeued at a lower priority.
+         */
+        int time_slice;
+        /* Is this the very first time_slice this task has ever run. */
+        unsigned int first_time_slice;
+        /* How much this task receives at each priority level */
+        int quota;

 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
         struct sched_info sched_info;
Index: linux-2.6.21-ck2/kernel/sched.c
===================================================================
--- linux-2.6.21-ck2.orig/kernel/sched.c 2007-05-03 22:20:57.000000000 +1000
+++ linux-2.6.21-ck2/kernel/sched.c 2007-05-14 19:30:30.000000000 +1000
@@ -16,6 +16,7 @@
  *  by Davide Libenzi, preemptible kernel bits by Robert Love.
  *  2003-09-03 Interactivity tuning by Con Kolivas.
  *  2004-04-02 Scheduler domains code by Nick Piggin
+ *  2007-03-02 Staircase deadline scheduling policy by Con Kolivas
  */

 #include <linux/mm.h>
@@ -52,6 +53,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
 #include <linux/delayacct.h>
+#include <linux/log2.h>
 #include <asm/tlb.h>

 #include <asm/unistd.h>
@@ -83,126 +85,72 @@ unsigned long long __attribute__((weak))
 #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
 #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
 #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
+#define SCHED_PRIO(p) ((p)+MAX_RT_PRIO)

-/*
- * Some helpers for converting nanosecond timing to jiffy resolution
- */
-#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
+/* Some helpers for converting to/from various scales.*/
 #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
-
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
- * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
- * Timeslices get refilled after they expire.
- */
-#define MIN_TIMESLICE max(5 * HZ / 1000, 1)
-#define DEF_TIMESLICE (100 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT 30
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 25
-#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT (MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- * priority range a task can explore, a value of '1' means the
- * task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
-        (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
-                MAX_SLEEP_AVG)
-
-#define GRANULARITY (10 * HZ / 1000 ? : 1)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
-        (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
-                num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \
-        (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
-        (v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
-        (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \
-                INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
-        ((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define INTERACTIVE_SLEEP(p) \
-        (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
-                (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define TASK_PREEMPTS_CURR(p, rq) \
-        ((p)->prio < (rq)->curr->prio)
-
-#define SCALE_PRIO(x, prio) \
-        max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
-
-static unsigned int static_prio_timeslice(int static_prio)
-{
-        if (static_prio < NICE_TO_PRIO(0))
-                return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
-        else
-                return SCALE_PRIO(DEF_TIMESLICE, static_prio);
-}
-
-/*
- * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
- * to time slice values: [800ms ... 100ms ... 5ms]
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
+#define MS_TO_NS(TIME) ((TIME) * 1000000)
+#define MS_TO_US(TIME) ((TIME) * 1000)
+#define US_TO_MS(TIME) ((TIME) / 1000)
+
+#define TASK_PREEMPTS_CURR(p, curr) ((p)->prio < (curr)->prio)
+
+/*
+ * This is the time all tasks within the same priority round robin.
+ * Value is in ms and set to a minimum of 8ms. Scales with number of cpus.
+ * Tunable via /proc interface.
+ */
+int rr_interval __read_mostly = 8;
+
+/*
+ * This contains a bitmap for each dynamic priority level with empty slots
+ * for the valid priorities each different nice level can have. It allows
+ * us to stagger the slots where differing priorities run in a way that
+ * keeps latency differences between different nice levels at a minimum.
+ * The purpose of a pre-generated matrix is for rapid lookup of next slot in
+ * O(1) time without having to recalculate every time priority gets demoted.
+ * All nice levels use priority slot 39 as this allows less niced tasks to
+ * get all priority slots better than that before expiration is forced.
+ * ie, where 0 means a slot for that priority, priority running from left to
+ * right is from prio 0 to prio 39:
+ * nice -20 0000000000000000000000000000000000000000
+ * nice -10 1000100010001000100010001000100010010000
+ * nice   0 1010101010101010101010101010101010101010
+ * nice   5 1011010110110101101101011011010110110110
+ * nice  10 1110111011101110111011101110111011101110
+ * nice  15 1111111011111110111111101111111011111110
+ * nice  19 1111111111111111111111111111111111111110
  */
+static unsigned long prio_matrix[PRIO_RANGE][BITS_TO_LONGS(PRIO_RANGE)]
+        __read_mostly;

-static inline unsigned int task_timeslice(struct task_struct *p)
-{
-        return static_prio_timeslice(p->static_prio);
-}
+struct rq;

 /*
  * These are the runqueue data structures:
  */
-
 struct prio_array {
-        unsigned int nr_active;
-        DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
+        /* Tasks queued at each priority */
         struct list_head queue[MAX_PRIO];
+
+        /*
+         * The bitmap of priorities queued for this array. While the expired
+         * array will never have realtime tasks on it, it is simpler to have
+         * equal sized bitmaps for a cheap array swap. Include 1 bit for
+         * delimiter.
+         */
+        DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1);
+
+        /*
+         * The best static priority (of the dynamic priority tasks) queued
+         * this array.
+         */
+        int best_static_prio;
+
+#ifdef CONFIG_SMP
+        /* For convenience looks back at rq */
+        struct rq *rq;
+#endif
 };

 /*
@@ -234,14 +182,24 @@ struct rq {
          */
         unsigned long nr_uninterruptible;

-        unsigned long expired_timestamp;
         /* Cached timestamp set by update_cpu_clock() */
         unsigned long long most_recent_timestamp;
         struct task_struct *curr, *idle;
         unsigned long next_balance;
         struct mm_struct *prev_mm;
+
         struct prio_array *active, *expired, arrays[2];
-        int best_expired_prio;
+        unsigned long *dyn_bitmap, *exp_bitmap;
+
+        /*
+         * The current dynamic priority level this runqueue is at per static
+         * priority level.
+         */
+        int prio_level[PRIO_RANGE];
+
+        /* How many times we have rotated the priority queue */
+        unsigned long prio_rotation;
+
         atomic_t nr_iowait;

 #ifdef CONFIG_SMP
@@ -579,12 +537,9 @@ static inline struct rq *this_rq_lock(vo
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 /*
  * Called when a process is dequeued from the active array and given
- * the cpu. We should note that with the exception of interactive
- * tasks, the expired queue will become the active queue after the active
- * queue is empty, without explicitly dequeuing and requeuing tasks in the
- * expired queue. (Interactive tasks may be requeued directly to the
- * active queue, thus delaying tasks in the expired queue from running;
- * see scheduler_tick()).
+ * the cpu. We should note that the expired queue will become the active
+ * queue after the active queue is empty, without explicitly dequeuing and
+ * requeuing tasks in the expired queue.
  *
  * This function is only called from sched_info_arrive(), rather than
  * dequeue_task(). Even though a task may be queued and dequeued multiple
@@ -682,71 +637,227 @@ sched_info_switch(struct task_struct *pr
 #define sched_info_switch(t, next) do { } while (0)
 #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+static inline int task_queued(struct task_struct *task)
+{
+        return !list_empty(&task->run_list);
+}
+
+static inline void set_dynamic_bit(struct task_struct *p, struct rq *rq)
+{
+        __set_bit(p->prio, p->array->prio_bitmap);
+}
+
 /*
- * Adding/removing a task to/from a priority array:
+ * Removing from a runqueue.
  */
-static void dequeue_task(struct task_struct *p, struct prio_array *array)
+static void dequeue_task(struct task_struct *p, struct rq *rq)
 {
-        array->nr_active--;
-        list_del(&p->run_list);
-        if (list_empty(array->queue + p->prio))
-                __clear_bit(p->prio, array->bitmap);
+        list_del_init(&p->run_list);
+        if (list_empty(p->array->queue + p->prio))
+                __clear_bit(p->prio, p->array->prio_bitmap);
 }

-static void enqueue_task(struct task_struct *p, struct prio_array *array)
+static void reset_first_time_slice(struct task_struct *p)
 {
-        sched_info_queued(p);
-        list_add_tail(&p->run_list, array->queue + p->prio);
-        __set_bit(p->prio, array->bitmap);
-        array->nr_active++;
+        if (unlikely(p->first_time_slice))
+                p->first_time_slice = 0;
+}
+
+/*
+ * The task is being queued on a fresh array so it has its entitlement
+ * bitmap cleared.
+ */
+static void task_new_array(struct task_struct *p, struct rq *rq,
+                           struct prio_array *array)
+{
+        bitmap_zero(p->bitmap, PRIO_RANGE);
+        p->rotation = rq->prio_rotation;
+        p->time_slice = p->quota;
         p->array = array;
+        reset_first_time_slice(p);
+}
+
+/* Find the first slot from the relevant prio_matrix entry */
+static int first_prio_slot(struct task_struct *p)
+{
+        if (unlikely(p->policy == SCHED_BATCH))
+                return p->static_prio;
+        return SCHED_PRIO(find_first_zero_bit(
+                prio_matrix[USER_PRIO(p->static_prio)], PRIO_RANGE));
 }

 /*
- * Put task to the end of the run list without the overhead of dequeue
- * followed by enqueue.
+ * Find the first unused slot by this task that is also in its prio_matrix
+ * level. SCHED_BATCH tasks do not use the priority matrix. They only take
+ * priority slots from their static_prio and above.
  */
-static void requeue_task(struct task_struct *p, struct prio_array *array)
+static int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
-        list_move_tail(&p->run_list, array->queue + p->prio);
+        int search_prio = MAX_RT_PRIO, uprio = USER_PRIO(p->static_prio);
+        struct prio_array *array = rq->active;
+        DECLARE_BITMAP(tmp, PRIO_RANGE);
+
+        /*
+         * Go straight to expiration if there are higher priority tasks
+         * already expired.
+         */
+        if (p->static_prio > rq->expired->best_static_prio)
+                return MAX_PRIO;
+        if (!rq->prio_level[uprio])
+                rq->prio_level[uprio] = MAX_RT_PRIO;
+        /*
+         * Only priorities equal to the prio_level and above for their
+         * static_prio are acceptable, and only if it's not better than
+         * a queued better static_prio's prio_level.
+         */
+        if (p->static_prio < array->best_static_prio) {
+                if (likely(p->policy != SCHED_BATCH))
+                        array->best_static_prio = p->static_prio;
+        } else if (p->static_prio == array->best_static_prio) {
+                search_prio = rq->prio_level[uprio];
+        } else {
+                int i;
+
+                search_prio = rq->prio_level[uprio];
+                /* A bound O(n) function, worst case n is 40 */
+                for (i = array->best_static_prio; i <= p->static_prio ; i++) {
+                        if (!rq->prio_level[USER_PRIO(i)])
+                                rq->prio_level[USER_PRIO(i)] = MAX_RT_PRIO;
+                        search_prio = max(search_prio,
+                                          rq->prio_level[USER_PRIO(i)]);
+                }
+        }
+        if (unlikely(p->policy == SCHED_BATCH)) {
+                search_prio = max(search_prio, p->static_prio);
+                return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE,
+                        USER_PRIO(search_prio)));
+        }
+        bitmap_or(tmp, p->bitmap, prio_matrix[uprio], PRIO_RANGE);
+        return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
+                USER_PRIO(search_prio)));
+}
+
+static void queue_expired(struct task_struct *p, struct rq *rq)
+{
+        task_new_array(p, rq, rq->expired);
+        p->prio = p->normal_prio = first_prio_slot(p);
+        if (p->static_prio < rq->expired->best_static_prio)
+                rq->expired->best_static_prio = p->static_prio;
+        reset_first_time_slice(p);
 }

-static inline void
-enqueue_task_head(struct task_struct *p, struct prio_array *array)
+#ifdef CONFIG_SMP
+/*
+ * If we're waking up a task that was previously on a different runqueue,
+ * update its data appropriately. Note we may be reading data from src_rq->
+ * outside of lock, but the occasional inaccurate result should be harmless.
+ */
+ static void update_if_moved(struct task_struct *p, struct rq *rq)
+{
+        struct rq *src_rq = p->array->rq;
+
+        if (src_rq == rq)
+                return;
+        /*
+         * Only need to set p->array when p->rotation == rq->prio_rotation as
+         * they will be set in recalc_task_prio when != rq->prio_rotation.
+         */
+        if (p->rotation == src_rq->prio_rotation) {
+                p->rotation = rq->prio_rotation;
+                if (p->array == src_rq->expired)
+                        p->array = rq->expired;
+                else
+                        p->array = rq->active;
+        } else
+                p->rotation = 0;
+}
+#else
+static inline void update_if_moved(struct task_struct *p, struct rq *rq)
 {
-        list_add(&p->run_list, array->queue + p->prio);
-        __set_bit(p->prio, array->bitmap);
-        array->nr_active++;
-        p->array = array;
 }
+#endif

 /*
- * __normal_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
+ * recalc_task_prio determines what priority a non rt_task will be
+ * queued at. If the task has already been running during this runqueue's
+ * major rotation (rq->prio_rotation) then it continues at the same
+ * priority if it has tick entitlement left. If it does not have entitlement
+ * left, it finds the next priority slot according to its nice value that it
+ * has not extracted quota from. If it has not run during this major
+ * rotation, it starts at the next_entitled_slot and has its bitmap quota
+ * cleared. If it does not have any slots left it has all its slots reset and
+ * is queued on the expired at its first_prio_slot.
  */
+static void recalc_task_prio(struct task_struct *p, struct rq *rq)
+{
+        struct prio_array *array = rq->active;
+        int queue_prio;

-static inline int __normal_prio(struct task_struct *p)
+        update_if_moved(p, rq);
+        if (p->rotation == rq->prio_rotation) {
+                if (p->array == array) {
+                        if (p->time_slice > 0)
+                                return;
+                        p->time_slice = p->quota;
+                } else if (p->array == rq->expired) {
+                        queue_expired(p, rq);
+                        return;
+                } else
+                        task_new_array(p, rq, array);
+        } else
+                task_new_array(p, rq, array);
+
+        queue_prio = next_entitled_slot(p, rq);
+        if (queue_prio >= MAX_PRIO) {
+                queue_expired(p, rq);
+                return;
+        }
+        p->prio = p->normal_prio = queue_prio;
+        __set_bit(USER_PRIO(p->prio), p->bitmap);
+}
+
+/*
+ * Adding to a runqueue. The dynamic priority queue that it is added to is
+ * determined by recalc_task_prio() above.
+ */
+static inline void __enqueue_task(struct task_struct *p, struct rq *rq)
+{
+        if (rt_task(p))
+                p->array = rq->active;
+        else
+                recalc_task_prio(p, rq);
+
+        sched_info_queued(p);
+        set_dynamic_bit(p, rq);
+}
+
+static void enqueue_task(struct task_struct *p, struct rq *rq)
 {
-        int bonus, prio;
+        __enqueue_task(p, rq);
+        list_add_tail(&p->run_list, p->array->queue + p->prio);
+}

-        bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
+{
+        __enqueue_task(p, rq);
+        list_add(&p->run_list, p->array->queue + p->prio);
+}

-        prio = p->static_prio - bonus;
-        if (prio < MAX_RT_PRIO)
-                prio = MAX_RT_PRIO;
-        if (prio > MAX_PRIO-1)
-                prio = MAX_PRIO-1;
-        return prio;
+/*
+ * requeue_task is only called when p->static_prio does not change. p->prio
+ * can change with dynamic tasks.
+ */
+static void requeue_task(struct task_struct *p, struct rq *rq,
+                         struct prio_array *old_array, int old_prio)
+{
+        if (p->array == rq->expired)
+                queue_expired(p, rq);
+        list_move_tail(&p->run_list, p->array->queue + p->prio);
+        if (!rt_task(p)) {
+                if (list_empty(old_array->queue + old_prio))
+                        __clear_bit(old_prio, old_array->prio_bitmap);
+                set_dynamic_bit(p, rq);
+        }
 }

 /*
@@ -759,17 +870,24 @@ static inline int __normal_prio(struct t
  */

 /*
- * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE
- * If static_prio_timeslice() is ever changed to break this assumption then
- * this code will need modification
- */
-#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE
-#define LOAD_WEIGHT(lp) \
-        (((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
-#define PRIO_TO_LOAD_WEIGHT(prio) \
-        LOAD_WEIGHT(static_prio_timeslice(prio))
-#define RTPRIO_TO_LOAD_WEIGHT(rp) \
-        (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))
+ * task_timeslice - the total duration a task can run during one major
+ * rotation. Returns value in milliseconds as the smallest value can be 1.
+ */
+static int task_timeslice(struct task_struct *p)
+{
+        int slice = p->quota;   /* quota is in us */
+
+        if (!rt_task(p))
+                slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice;
+        return US_TO_MS(slice);
+}
+
+/*
+ * The load weight is basically the task_timeslice in ms. Realtime tasks are
+ * special cased to be proportionately larger than nice -20 by their
+ * rt_priority. The weight for rt tasks can only be arbitrary at best.
+ */
+#define RTPRIO_TO_LOAD_WEIGHT(rp) (rr_interval * 20 * (40 + rp))

 static void set_load_weight(struct task_struct *p)
 {
@@ -786,7 +904,7 @@ static void set_load_weight(struct task_
 #endif
                 p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
         } else
-                p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+                p->load_weight = task_timeslice(p);
 }

 static inline void
@@ -814,28 +932,38 @@ static inline void dec_nr_running(struct
 }

 /*
- * Calculate the expected normal priority: i.e. priority
- * without taking RT-inheritance into account. Might be
- * boosted by interactivity modifiers. Changes upon fork,
- * setprio syscalls, and whenever the interactivity
- * estimator recalculates.
+ * __activate_task - move a task to the runqueue.
  */
-static inline int normal_prio(struct task_struct *p)
+static inline void __activate_task(struct task_struct *p, struct rq *rq)
+{
+        enqueue_task(p, rq);
+        inc_nr_running(p, rq);
+}
+
+/*
+ * __activate_idle_task - move idle task to the _front_ of runqueue.
+ */
+static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
 {
-        int prio;
+        enqueue_task_head(p, rq);
+        inc_nr_running(p, rq);
+}

+static inline int normal_prio(struct task_struct *p)
+{
         if (has_rt_policy(p))
-                prio = MAX_RT_PRIO-1 - p->rt_priority;
+                return MAX_RT_PRIO-1 - p->rt_priority;
+        /* Other tasks all have normal_prio set in recalc_task_prio */
+        if (likely(p->prio >= MAX_RT_PRIO && p->prio < MAX_PRIO))
+                return p->prio;
         else
-                prio = __normal_prio(p);
-        return prio;
+                return p->static_prio;
 }

 /*
  * Calculate the current priority, i.e. the priority
  * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
+ * be boosted by RT tasks as it will be RT if the task got
  * RT-boosted. If not then it returns p->normal_prio.
  */
 static int effective_prio(struct task_struct *p)
@@ -852,111 +980,41 @@ static int effective_prio(struct task_st
 }

 /*
- * __activate_task - move a task to the runqueue.
+ * All tasks have quotas based on rr_interval. RT tasks all get rr_interval.
+ * From nice 1 to 19 they are smaller than it only if they are at least one
+ * tick still. Below nice 0 they get progressively larger.
+ * ie nice -6..0 = rr_interval. nice -10 = 2.5 * rr_interval
+ * nice -20 = 10 * rr_interval. nice 1-19 = rr_interval / 2.
+ * Value returned is in microseconds.
  */
-static void __activate_task(struct task_struct *p, struct rq *rq)
+static inline unsigned int rr_quota(struct task_struct *p)
 {
-        struct prio_array *target = rq->active;
-
-        if (batch_task(p))
-                target = rq->expired;
-        enqueue_task(p, target);
-        inc_nr_running(p, rq);
-}
+        int nice = TASK_NICE(p), rr = rr_interval;

-/*
- * __activate_idle_task - move idle task to the _front_ of runqueue.
- */
-static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
-{
-        enqueue_task_head(p, rq->active);
-        inc_nr_running(p, rq);
+        if (!rt_task(p)) {
+                if (nice < -6) {
+                        rr *= nice * nice;
+                        rr /= 40;
+                } else if (nice > 0)
+                        rr = rr / 2 ? : 1;
+        }
+        return MS_TO_US(rr);
 }

-/*
- * Recalculate p->normal_prio and p->prio after having slept,
- * updating the sleep-average too:
- */
-static int recalc_task_prio(struct task_struct *p, unsigned long long now)
+/* Every time we set the quota we need to set the load weight */
+static void set_quota(struct task_struct *p)
 {
-        /* Caller must always ensure 'now >= p->timestamp' */
-        unsigned long sleep_time = now - p->timestamp;
-
-        if (batch_task(p))
-                sleep_time = 0;
-
-        if (likely(sleep_time > 0)) {
-                /*
-                 * This ceiling is set to the lowest priority that would allow
-                 * a task to be reinserted into the active array on timeslice
-                 * completion.
-                 */
-                unsigned long ceiling = INTERACTIVE_SLEEP(p);
-
-                if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) {
-                        /*
-                         * Prevents user tasks from achieving best priority
-                         * with one single large enough sleep.
-                         */
-                        p->sleep_avg = ceiling;
-                        /*
-                         * Using INTERACTIVE_SLEEP() as a ceiling places a
-                         * nice(0) task 1ms sleep away from promotion, and
-                         * gives it 700ms to round-robin with no chance of
-                         * being demoted. This is more than generous, so
-                         * mark this sleep as non-interactive to prevent the
-                         * on-runqueue bonus logic from intervening should
-                         * this task not receive cpu immediately.
-                         */
-                        p->sleep_type = SLEEP_NONINTERACTIVE;
-                } else {
-                        /*
-                         * Tasks waking from uninterruptible sleep are
-                         * limited in their sleep_avg rise as they
-                         * are likely to be waiting on I/O
-                         */
-                        if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
-                                if (p->sleep_avg >= ceiling)
-                                        sleep_time = 0;
-                                else if (p->sleep_avg + sleep_time >=
-                                         ceiling) {
-                                        p->sleep_avg = ceiling;
-                                        sleep_time = 0;
-                                }
-                        }
-
-                        /*
-                         * This code gives a bonus to interactive tasks.
-                         *
-                         * The boost works by updating the 'average sleep time'
-                         * value here, based on ->timestamp. The more time a
-                         * task spends sleeping, the higher the average gets -
-                         * and the higher the priority boost gets as well.
-                         */
-                        p->sleep_avg += sleep_time;
-
-                }
-                if (p->sleep_avg > NS_MAX_SLEEP_AVG)
-                        p->sleep_avg = NS_MAX_SLEEP_AVG;
-        }
-
-        return effective_prio(p);
+        p->quota = rr_quota(p);
+        set_load_weight(p);
 }

 /*
  * activate_task - move a task to the runqueue and do priority recalculation
- *
- * Update all the scheduling statistics stuff. (sleep average
- * calculation, priority modifiers, etc.)
  */
 static void activate_task(struct task_struct *p, struct rq *rq, int local)
 {
-        unsigned long long now;
-
-        if (rt_task(p))
-                goto out;
+        unsigned long long now = sched_clock();

-        now = sched_clock();
 #ifdef CONFIG_SMP
         if (!local) {
                 /* Compensate for drifting sched_clock */
@@ -977,32 +1035,9 @@ static void activate_task(struct task_st
                         (now - p->timestamp) >> 20);
         }

-        p->prio = recalc_task_prio(p, now);
-
-        /*
-         * This checks to make sure it's not an uninterruptible task
-         * that is now waking up.
-         */
-        if (p->sleep_type == SLEEP_NORMAL) {
-                /*
-                 * Tasks which were woken up by interrupts (ie. hw events)
-                 * are most likely of interactive nature. So we give them
-                 * the credit of extending their sleep time to the period
-                 * of time they spend on the runqueue, waiting for execution
-                 * on a CPU, first time around:
-                 */
-                if (in_interrupt())
-                        p->sleep_type = SLEEP_INTERRUPTED;
-                else {
-                        /*
-                         * Normal first-time wakeups get a credit too for
-                         * on-runqueue time, but it will be weighted down:
-                         */
-                        p->sleep_type = SLEEP_INTERACTIVE;
-                }
-        }
+        set_quota(p);
+        p->prio = effective_prio(p);
         p->timestamp = now;
-out:
         __activate_task(p, rq);
 }

@@ -1012,8 +1047,7 @@ out:
 static void deactivate_task(struct task_struct *p, struct rq *rq)
 {
         dec_nr_running(p, rq);
-        dequeue_task(p, p->array);
-        p->array = NULL;
+        dequeue_task(p, rq);
 }

 /*
@@ -1095,7 +1129,7 @@ migrate_task(struct task_struct *p, int
          * If the task is not on a runqueue (and not running), then
          * it is sufficient to simply update the task's cpu field.
          */
-        if (!p->array && !task_running(rq, p)) {
+        if (!task_queued(p) && !task_running(rq, p)) {
                 set_task_cpu(p, dest_cpu);
                 return 0;
         }
@@ -1126,7 +1160,7 @@ void wait_task_inactive(struct task_stru
 repeat:
         rq = task_rq_lock(p, &flags);
         /* Must be off runqueue entirely, not preempted. */
-        if (unlikely(p->array || task_running(rq, p))) {
+        if (unlikely(task_queued(p) || task_running(rq, p))) {
                 /* If it's preempted, we yield. It could be a while. */
                 preempted = !task_running(rq, p);
                 task_rq_unlock(rq, &flags);
@@ -1391,6 +1425,31 @@ static inline int wake_idle(int cpu, str
 }
 #endif

+/*
+ * We need to have a special definition for an idle runqueue when testing
+ * for preemption on CONFIG_HOTPLUG_CPU as the idle task may be scheduled as
+ * a realtime task in sched_idle_next.
+ */
+#ifdef CONFIG_HOTPLUG_CPU
+#define rq_idle(rq) ((rq)->curr == (rq)->idle && !rt_task((rq)->curr))
+#else
+#define rq_idle(rq) ((rq)->curr == (rq)->idle)
+#endif
+
+static inline int task_preempts_curr(struct task_struct *p, struct rq *rq)
+{
+        struct task_struct *curr = rq->curr;
+
+        return ((p->array == task_rq(p)->active &&
+                TASK_PREEMPTS_CURR(p, curr)) || rq_idle(rq));
+}
+
+static inline void try_preempt(struct task_struct *p, struct rq *rq)
+{
+        if (task_preempts_curr(p, rq))
+                resched_task(rq->curr);
+}
+
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1422,7 +1481,7 @@ static int try_to_wake_up(struct task_st
         if (!(old_state & state))
                 goto out;

-        if (p->array)
+        if (task_queued(p))
                 goto out_running;

         cpu = task_cpu(p);
@@ -1515,7 +1574,7 @@ out_set_cpu:
         old_state = p->state;
         if (!(old_state & state))
                 goto out;
-        if (p->array)
+        if (task_queued(p))
                 goto out_running;

         this_cpu = smp_processor_id();
@@ -1524,25 +1583,9 @@ out_set_cpu:

 out_activate:
 #endif /* CONFIG_SMP */
-        if (old_state == TASK_UNINTERRUPTIBLE) {
+        if (old_state == TASK_UNINTERRUPTIBLE)
                 rq->nr_uninterruptible--;
-                /*
-                 * Tasks on involuntary sleep don't earn
-                 * sleep_avg beyond just interactive state.
-                 */
-                p->sleep_type = SLEEP_NONINTERACTIVE;
-        } else
-
-        /*
-         * Tasks that have marked their sleep as noninteractive get
-         * woken up with their sleep average not weighted in an
-         * interactive way.
-         */
-        if (old_state & TASK_NONINTERACTIVE)
-                p->sleep_type = SLEEP_NONINTERACTIVE;
-

-        activate_task(p, rq, cpu == this_cpu);
         /*
          * Sync wakeups (i.e. those types of wakeups where the waker
          * has indicated that it will leave the CPU in short order)
@@ -1551,10 +1594,9 @@ out_activate:
          * the waker guarantees that the freshly woken up task is going
          * to be considered on this CPU.)
          */
-        if (!sync || cpu != this_cpu) {
-                if (TASK_PREEMPTS_CURR(p, rq))
-                        resched_task(rq->curr);
-        }
+        activate_task(p, rq, cpu == this_cpu);
+        if (!sync || cpu != this_cpu)
+                try_preempt(p, rq);
         success = 1;

 out_running:
1265 | |||
1266 | -static void task_running_tick(struct rq *rq, struct task_struct *p); | ||
1267 | /* | ||
1268 | * Perform scheduler related setup for a newly forked process p. | ||
1269 | * p is forked by current. | ||
1270 | @@ -1605,7 +1646,6 @@ void fastcall sched_fork(struct task_str | ||
1271 | p->prio = current->normal_prio; | ||
1272 | |||
1273 | INIT_LIST_HEAD(&p->run_list); | ||
1274 | - p->array = NULL; | ||
1275 | #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) | ||
1276 | if (unlikely(sched_info_on())) | ||
1277 | memset(&p->sched_info, 0, sizeof(p->sched_info)); | ||
1278 | @@ -1617,30 +1657,31 @@ void fastcall sched_fork(struct task_str | ||
1279 | /* Want to start with kernel preemption disabled. */ | ||
1280 | task_thread_info(p)->preempt_count = 1; | ||
1281 | #endif | ||
1282 | + if (unlikely(p->policy == SCHED_FIFO)) | ||
1283 | + goto out; | ||
1284 | /* | ||
1285 | * Share the timeslice between parent and child, thus the | ||
1286 | * total amount of pending timeslices in the system doesn't change, | ||
1287 | * resulting in more scheduling fairness. | ||
1288 | */ | ||
1289 | local_irq_disable(); | ||
1290 | - p->time_slice = (current->time_slice + 1) >> 1; | ||
1291 | - /* | ||
1292 | - * The remainder of the first timeslice might be recovered by | ||
1293 | - * the parent if the child exits early enough. | ||
1294 | - */ | ||
1295 | - p->first_time_slice = 1; | ||
1296 | - current->time_slice >>= 1; | ||
1297 | - p->timestamp = sched_clock(); | ||
1298 | - if (unlikely(!current->time_slice)) { | ||
1299 | + if (current->time_slice > 0) { | ||
1300 | + current->time_slice /= 2; | ||
1301 | + if (current->time_slice) | ||
1302 | + p->time_slice = current->time_slice; | ||
1303 | + else | ||
1304 | + p->time_slice = 1; | ||
1305 | /* | ||
1306 | - * This case is rare, it happens when the parent has only | ||
1307 | - * a single jiffy left from its timeslice. Taking the | ||
1308 | - * runqueue lock is not a problem. | ||
1309 | + * The remainder of the first timeslice might be recovered by | ||
1310 | + * the parent if the child exits early enough. | ||
1311 | */ | ||
1312 | - current->time_slice = 1; | ||
1313 | - task_running_tick(cpu_rq(cpu), current); | ||
1314 | - } | ||
1315 | + p->first_time_slice = 1; | ||
1316 | + } else | ||
1317 | + p->time_slice = 0; | ||
1318 | + | ||
1319 | + p->timestamp = sched_clock(); | ||
1320 | local_irq_enable(); | ||
1321 | +out: | ||
1322 | put_cpu(); | ||
1323 | } | ||
1324 | |||
1325 | @@ -1662,38 +1703,16 @@ void fastcall wake_up_new_task(struct ta | ||
1326 | this_cpu = smp_processor_id(); | ||
1327 | cpu = task_cpu(p); | ||
1328 | |||
1329 | - /* | ||
1330 | - * We decrease the sleep average of forking parents | ||
1331 | - * and children as well, to keep max-interactive tasks | ||
1332 | - * from forking tasks that are max-interactive. The parent | ||
1333 | - * (current) is done further down, under its lock. | ||
1334 | - */ | ||
1335 | - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * | ||
1336 | - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); | ||
1337 | - | ||
1338 | - p->prio = effective_prio(p); | ||
1339 | - | ||
1340 | if (likely(cpu == this_cpu)) { | ||
1341 | + activate_task(p, rq, 1); | ||
1342 | if (!(clone_flags & CLONE_VM)) { | ||
1343 | /* | ||
1344 | * The VM isn't cloned, so we're in a good position to | ||
1345 | * do child-runs-first in anticipation of an exec. This | ||
1346 | * usually avoids a lot of COW overhead. | ||
1347 | */ | ||
1348 | - if (unlikely(!current->array)) | ||
1349 | - __activate_task(p, rq); | ||
1350 | - else { | ||
1351 | - p->prio = current->prio; | ||
1352 | - p->normal_prio = current->normal_prio; | ||
1353 | - list_add_tail(&p->run_list, ¤t->run_list); | ||
1354 | - p->array = current->array; | ||
1355 | - p->array->nr_active++; | ||
1356 | - inc_nr_running(p, rq); | ||
1357 | - } | ||
1358 | set_need_resched(); | ||
1359 | - } else | ||
1360 | - /* Run child last */ | ||
1361 | - __activate_task(p, rq); | ||
1362 | + } | ||
1363 | /* | ||
1364 | * We skip the following code due to cpu == this_cpu | ||
1365 | * | ||
1366 | @@ -1710,19 +1729,16 @@ void fastcall wake_up_new_task(struct ta | ||
1367 | */ | ||
1368 | p->timestamp = (p->timestamp - this_rq->most_recent_timestamp) | ||
1369 | + rq->most_recent_timestamp; | ||
1370 | - __activate_task(p, rq); | ||
1371 | - if (TASK_PREEMPTS_CURR(p, rq)) | ||
1372 | - resched_task(rq->curr); | ||
1373 | + activate_task(p, rq, 0); | ||
1374 | + try_preempt(p, rq); | ||
1375 | |||
1376 | /* | ||
1377 | * Parent and child are on different CPUs, now get the | ||
1378 | - * parent runqueue to update the parent's ->sleep_avg: | ||
1379 | + * parent runqueue to update the parent's ->flags: | ||
1380 | */ | ||
1381 | task_rq_unlock(rq, &flags); | ||
1382 | this_rq = task_rq_lock(current, &flags); | ||
1383 | } | ||
1384 | - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * | ||
1385 | - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); | ||
1386 | task_rq_unlock(this_rq, &flags); | ||
1387 | } | ||
1388 | |||
1389 | @@ -1737,23 +1753,17 @@ void fastcall wake_up_new_task(struct ta | ||
1390 | */ | ||
1391 | void fastcall sched_exit(struct task_struct *p) | ||
1392 | { | ||
1393 | + struct task_struct *parent; | ||
1394 | unsigned long flags; | ||
1395 | struct rq *rq; | ||
1396 | |||
1397 | - /* | ||
1398 | - * If the child was a (relative-) CPU hog then decrease | ||
1399 | - * the sleep_avg of the parent as well. | ||
1400 | - */ | ||
1401 | - rq = task_rq_lock(p->parent, &flags); | ||
1402 | - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) { | ||
1403 | - p->parent->time_slice += p->time_slice; | ||
1404 | - if (unlikely(p->parent->time_slice > task_timeslice(p))) | ||
1405 | - p->parent->time_slice = task_timeslice(p); | ||
1406 | - } | ||
1407 | - if (p->sleep_avg < p->parent->sleep_avg) | ||
1408 | - p->parent->sleep_avg = p->parent->sleep_avg / | ||
1409 | - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / | ||
1410 | - (EXIT_WEIGHT + 1); | ||
1411 | + parent = p->parent; | ||
1412 | + rq = task_rq_lock(parent, &flags); | ||
1413 | + if (p->first_time_slice > 0 && task_cpu(p) == task_cpu(parent)) { | ||
1414 | + parent->time_slice += p->time_slice; | ||
1415 | + if (unlikely(parent->time_slice > parent->quota)) | ||
1416 | + parent->time_slice = parent->quota; | ||
1417 | + } | ||
1418 | task_rq_unlock(rq, &flags); | ||
1419 | } | ||
1420 | |||
1421 | @@ -2085,23 +2095,17 @@ void sched_exec(void) | ||
1422 | * pull_task - move a task from a remote runqueue to the local runqueue. | ||
1423 | * Both runqueues must be locked. | ||
1424 | */ | ||
1425 | -static void pull_task(struct rq *src_rq, struct prio_array *src_array, | ||
1426 | - struct task_struct *p, struct rq *this_rq, | ||
1427 | - struct prio_array *this_array, int this_cpu) | ||
1428 | +static void pull_task(struct rq *src_rq, struct task_struct *p, | ||
1429 | + struct rq *this_rq, int this_cpu) | ||
1430 | { | ||
1431 | - dequeue_task(p, src_array); | ||
1432 | + dequeue_task(p, src_rq); | ||
1433 | dec_nr_running(p, src_rq); | ||
1434 | set_task_cpu(p, this_cpu); | ||
1435 | inc_nr_running(p, this_rq); | ||
1436 | - enqueue_task(p, this_array); | ||
1437 | + enqueue_task(p, this_rq); | ||
1438 | p->timestamp = (p->timestamp - src_rq->most_recent_timestamp) | ||
1439 | + this_rq->most_recent_timestamp; | ||
1440 | - /* | ||
1441 | - * Note that idle threads have a prio of MAX_PRIO, for this test | ||
1442 | - * to be always true for them. | ||
1443 | - */ | ||
1444 | - if (TASK_PREEMPTS_CURR(p, this_rq)) | ||
1445 | - resched_task(this_rq->curr); | ||
1446 | + try_preempt(p, this_rq); | ||
1447 | } | ||
1448 | |||
1449 | /* | ||
1450 | @@ -2144,7 +2148,16 @@ int can_migrate_task(struct task_struct | ||
1451 | return 1; | ||
1452 | } | ||
1453 | |||
1454 | -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio) | ||
1455 | +static inline int rq_best_prio(struct rq *rq) | ||
1456 | +{ | ||
1457 | + int best_prio, exp_prio; | ||
1458 | + | ||
1459 | + best_prio = sched_find_first_bit(rq->dyn_bitmap); | ||
1460 | + exp_prio = find_next_bit(rq->exp_bitmap, MAX_PRIO, MAX_RT_PRIO); | ||
1461 | + if (unlikely(best_prio > exp_prio)) | ||
1462 | + best_prio = exp_prio; | ||
1463 | + return best_prio; | ||
1464 | +} | ||
1465 | |||
1466 | /* | ||
1467 | * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted | ||
1468 | @@ -2160,7 +2173,7 @@ static int move_tasks(struct rq *this_rq | ||
1469 | { | ||
1470 | int idx, pulled = 0, pinned = 0, this_best_prio, best_prio, | ||
1471 | best_prio_seen, skip_for_load; | ||
1472 | - struct prio_array *array, *dst_array; | ||
1473 | + struct prio_array *array; | ||
1474 | struct list_head *head, *curr; | ||
1475 | struct task_struct *tmp; | ||
1476 | long rem_load_move; | ||
1477 | @@ -2187,26 +2200,21 @@ static int move_tasks(struct rq *this_rq | ||
1478 | * be cache-cold, thus switching CPUs has the least effect | ||
1479 | * on them. | ||
1480 | */ | ||
1481 | - if (busiest->expired->nr_active) { | ||
1482 | - array = busiest->expired; | ||
1483 | - dst_array = this_rq->expired; | ||
1484 | - } else { | ||
1485 | - array = busiest->active; | ||
1486 | - dst_array = this_rq->active; | ||
1487 | - } | ||
1488 | - | ||
1489 | + array = busiest->expired; | ||
1490 | new_array: | ||
1491 | - /* Start searching at priority 0: */ | ||
1492 | - idx = 0; | ||
1493 | + /* Expired arrays don't have RT tasks so they're always MAX_RT_PRIO+ */ | ||
1494 | + if (array == busiest->expired) | ||
1495 | + idx = MAX_RT_PRIO; | ||
1496 | + else | ||
1497 | + idx = 0; | ||
1498 | skip_bitmap: | ||
1499 | if (!idx) | ||
1500 | - idx = sched_find_first_bit(array->bitmap); | ||
1501 | + idx = sched_find_first_bit(array->prio_bitmap); | ||
1502 | else | ||
1503 | - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); | ||
1504 | + idx = find_next_bit(array->prio_bitmap, MAX_PRIO, idx); | ||
1505 | if (idx >= MAX_PRIO) { | ||
1506 | - if (array == busiest->expired && busiest->active->nr_active) { | ||
1507 | + if (array == busiest->expired) { | ||
1508 | array = busiest->active; | ||
1509 | - dst_array = this_rq->active; | ||
1510 | goto new_array; | ||
1511 | } | ||
1512 | goto out; | ||
1513 | @@ -2237,7 +2245,7 @@ skip_queue: | ||
1514 | goto skip_bitmap; | ||
1515 | } | ||
1516 | |||
1517 | - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu); | ||
1518 | + pull_task(busiest, tmp, this_rq, this_cpu); | ||
1519 | pulled++; | ||
1520 | rem_load_move -= tmp->load_weight; | ||
1521 | |||
1522 | @@ -3013,11 +3021,36 @@ EXPORT_PER_CPU_SYMBOL(kstat); | ||
1523 | /* | ||
1524 | * This is called on clock ticks and on context switches. | ||
1525 | * Bank in p->sched_time the ns elapsed since the last tick or switch. | ||
1526 | + * CPU scheduler quota accounting is also performed here in microseconds. | ||
1527 | + * The value returned from sched_clock() is occasionally bogus, so | ||
1528 | + * some sanity checking is required. | ||
1529 | */ | ||
1530 | -static inline void | ||
1531 | -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now) | ||
1532 | +static void | ||
1533 | +update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now, | ||
1534 | + int tick) | ||
1535 | { | ||
1536 | - p->sched_time += now - p->last_ran; | ||
1537 | + long time_diff = now - p->last_ran; | ||
1538 | + | ||
1539 | + if (tick) { | ||
1540 | + /* | ||
1541 | + * Called from scheduler_tick() there should be less than two | ||
1542 | + * jiffies worth, and not negative/overflow. | ||
1543 | + */ | ||
1544 | + if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0) | ||
1545 | + time_diff = JIFFIES_TO_NS(1); | ||
1546 | + } else { | ||
1547 | + /* | ||
1548 | + * Called from context_switch there should be less than one | ||
1549 | + * jiffy worth, and not negative/overflow. There should be | ||
1550 | + * some time banked here so use a nominal 1us. | ||
1551 | + */ | ||
1552 | + if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1) | ||
1553 | + time_diff = 1000; | ||
1554 | + } | ||
1555 | + /* time_slice accounting is done in usecs to avoid overflow on 32bit */ | ||
1556 | + if (p != rq->idle && p->policy != SCHED_FIFO) | ||
1557 | + p->time_slice -= time_diff / 1000; | ||
1558 | + p->sched_time += time_diff; | ||
1559 | p->last_ran = rq->most_recent_timestamp = now; | ||
1560 | } | ||
1561 | |||
1562 | @@ -3038,27 +3071,6 @@ unsigned long long current_sched_time(co | ||
1563 | } | ||
1564 | |||
1565 | /* | ||
1566 | - * We place interactive tasks back into the active array, if possible. | ||
1567 | - * | ||
1568 | - * To guarantee that this does not starve expired tasks we ignore the | ||
1569 | - * interactivity of a task if the first expired task had to wait more | ||
1570 | - * than a 'reasonable' amount of time. This deadline timeout is | ||
1571 | - * load-dependent, as the frequency of array switched decreases with | ||
1572 | - * increasing number of running tasks. We also ignore the interactivity | ||
1573 | - * if a better static_prio task has expired: | ||
1574 | - */ | ||
1575 | -static inline int expired_starving(struct rq *rq) | ||
1576 | -{ | ||
1577 | - if (rq->curr->static_prio > rq->best_expired_prio) | ||
1578 | - return 1; | ||
1579 | - if (!STARVATION_LIMIT || !rq->expired_timestamp) | ||
1580 | - return 0; | ||
1581 | - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running) | ||
1582 | - return 1; | ||
1583 | - return 0; | ||
1584 | -} | ||
1585 | - | ||
1586 | -/* | ||
1587 | * Account user cpu time to a process. | ||
1588 | * @p: the process that the cpu time gets accounted to | ||
1589 | * @hardirq_offset: the offset to subtract from hardirq_count() | ||
1590 | @@ -3131,87 +3143,47 @@ void account_steal_time(struct task_stru | ||
1591 | cpustat->steal = cputime64_add(cpustat->steal, tmp); | ||
1592 | } | ||
1593 | |||
1594 | -static void task_running_tick(struct rq *rq, struct task_struct *p) | ||
1595 | +/* | ||
1596 | + * The task has used up its quota of running in this prio_level so it must be | ||
1597 | + * dropped a priority level, all managed by recalc_task_prio(). | ||
1598 | + */ | ||
1599 | +static void task_expired_entitlement(struct rq *rq, struct task_struct *p) | ||
1600 | { | ||
1601 | - if (p->array != rq->active) { | ||
1602 | - /* Task has expired but was not scheduled yet */ | ||
1603 | - set_tsk_need_resched(p); | ||
1604 | + int overrun; | ||
1605 | + | ||
1606 | + reset_first_time_slice(p); | ||
1607 | + if (rt_task(p)) { | ||
1608 | + p->time_slice += p->quota; | ||
1609 | + list_move_tail(&p->run_list, p->array->queue + p->prio); | ||
1610 | return; | ||
1611 | } | ||
1612 | - spin_lock(&rq->lock); | ||
1613 | + overrun = p->time_slice; | ||
1614 | + dequeue_task(p, rq); | ||
1615 | + enqueue_task(p, rq); | ||
1616 | /* | ||
1617 | - * The task was running during this tick - update the | ||
1618 | - * time slice counter. Note: we do not update a thread's | ||
1619 | - * priority until it either goes to sleep or uses up its | ||
1620 | - * timeslice. This makes it possible for interactive tasks | ||
1621 | - * to use up their timeslices at their highest priority levels. | ||
1622 | + * Subtract any extra time this task ran over its time_slice; ie | ||
1623 | + * overrun will either be 0 or negative. | ||
1624 | */ | ||
1625 | - if (rt_task(p)) { | ||
1626 | - /* | ||
1627 | - * RR tasks need a special form of timeslice management. | ||
1628 | - * FIFO tasks have no timeslices. | ||
1629 | - */ | ||
1630 | - if ((p->policy == SCHED_RR) && !--p->time_slice) { | ||
1631 | - p->time_slice = task_timeslice(p); | ||
1632 | - p->first_time_slice = 0; | ||
1633 | - set_tsk_need_resched(p); | ||
1634 | - | ||
1635 | - /* put it at the end of the queue: */ | ||
1636 | - requeue_task(p, rq->active); | ||
1637 | - } | ||
1638 | - goto out_unlock; | ||
1639 | - } | ||
1640 | - if (!--p->time_slice) { | ||
1641 | - dequeue_task(p, rq->active); | ||
1642 | - set_tsk_need_resched(p); | ||
1643 | - p->prio = effective_prio(p); | ||
1644 | - p->time_slice = task_timeslice(p); | ||
1645 | - p->first_time_slice = 0; | ||
1646 | - | ||
1647 | - if (!rq->expired_timestamp) | ||
1648 | - rq->expired_timestamp = jiffies; | ||
1649 | - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) { | ||
1650 | - enqueue_task(p, rq->expired); | ||
1651 | - if (p->static_prio < rq->best_expired_prio) | ||
1652 | - rq->best_expired_prio = p->static_prio; | ||
1653 | - } else | ||
1654 | - enqueue_task(p, rq->active); | ||
1655 | - } else { | ||
1656 | - /* | ||
1657 | - * Prevent a too long timeslice allowing a task to monopolize | ||
1658 | - * the CPU. We do this by splitting up the timeslice into | ||
1659 | - * smaller pieces. | ||
1660 | - * | ||
1661 | - * Note: this does not mean the task's timeslices expire or | ||
1662 | - * get lost in any way, they just might be preempted by | ||
1663 | - * another task of equal priority. (one with higher | ||
1664 | - * priority would have preempted this task already.) We | ||
1665 | - * requeue this task to the end of the list on this priority | ||
1666 | - * level, which is in essence a round-robin of tasks with | ||
1667 | - * equal priority. | ||
1668 | - * | ||
1669 | - * This only applies to tasks in the interactive | ||
1670 | - * delta range with at least TIMESLICE_GRANULARITY to requeue. | ||
1671 | - */ | ||
1672 | - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - | ||
1673 | - p->time_slice) % TIMESLICE_GRANULARITY(p)) && | ||
1674 | - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && | ||
1675 | - (p->array == rq->active)) { | ||
1676 | + p->time_slice += overrun; | ||
1677 | +} | ||
1678 | |||
1679 | - requeue_task(p, rq->active); | ||
1680 | - set_tsk_need_resched(p); | ||
1681 | - } | ||
1682 | - } | ||
1683 | -out_unlock: | ||
1684 | +/* This manages tasks that have run out of timeslice during a scheduler_tick */ | ||
1685 | +static void task_running_tick(struct rq *rq, struct task_struct *p) | ||
1686 | +{ | ||
1687 | + /* SCHED_FIFO tasks never run out of timeslice. */ | ||
1688 | + if (p->time_slice > 0 || p->policy == SCHED_FIFO) | ||
1689 | + return; | ||
1690 | + /* p->time_slice <= 0 */ | ||
1691 | + spin_lock(&rq->lock); | ||
1692 | + if (likely(task_queued(p))) | ||
1693 | + task_expired_entitlement(rq, p); | ||
1694 | + set_tsk_need_resched(p); | ||
1695 | spin_unlock(&rq->lock); | ||
1696 | } | ||
1697 | |||
1698 | /* | ||
1699 | * This function gets called by the timer code, with HZ frequency. | ||
1700 | * We call it with interrupts disabled. | ||
1701 | - * | ||
1702 | - * It also gets called by the fork code, when changing the parent's | ||
1703 | - * timeslices. | ||
1704 | */ | ||
1705 | void scheduler_tick(void) | ||
1706 | { | ||
1707 | @@ -3220,7 +3192,7 @@ void scheduler_tick(void) | ||
1708 | int cpu = smp_processor_id(); | ||
1709 | struct rq *rq = cpu_rq(cpu); | ||
1710 | |||
1711 | - update_cpu_clock(p, rq, now); | ||
1712 | + update_cpu_clock(p, rq, now, 1); | ||
1713 | |||
1714 | if (p != rq->idle) | ||
1715 | task_running_tick(rq, p); | ||
1716 | @@ -3269,10 +3241,55 @@ EXPORT_SYMBOL(sub_preempt_count); | ||
1717 | |||
1718 | #endif | ||
1719 | |||
1720 | -static inline int interactive_sleep(enum sleep_type sleep_type) | ||
1721 | +static void reset_prio_levels(struct rq *rq) | ||
1722 | { | ||
1723 | - return (sleep_type == SLEEP_INTERACTIVE || | ||
1724 | - sleep_type == SLEEP_INTERRUPTED); | ||
1725 | + rq->active->best_static_prio = MAX_PRIO - 1; | ||
1726 | + rq->expired->best_static_prio = MAX_PRIO - 1; | ||
1727 | + memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE); | ||
1728 | +} | ||
1729 | + | ||
1730 | +/* | ||
1731 | + * next_dynamic_task finds the next suitable dynamic task. | ||
1732 | + */ | ||
1733 | +static inline struct task_struct *next_dynamic_task(struct rq *rq, int idx) | ||
1734 | +{ | ||
1735 | + struct prio_array *array = rq->active; | ||
1736 | + struct task_struct *next; | ||
1737 | + struct list_head *queue; | ||
1738 | + int nstatic; | ||
1739 | + | ||
1740 | +retry: | ||
1741 | + if (idx >= MAX_PRIO) { | ||
1742 | + /* There are no more tasks in the active array. Swap arrays */ | ||
1743 | + array = rq->expired; | ||
1744 | + rq->expired = rq->active; | ||
1745 | + rq->active = array; | ||
1746 | + rq->exp_bitmap = rq->expired->prio_bitmap; | ||
1747 | + rq->dyn_bitmap = rq->active->prio_bitmap; | ||
1748 | + rq->prio_rotation++; | ||
1749 | + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO); | ||
1750 | + reset_prio_levels(rq); | ||
1751 | + } | ||
1752 | + queue = array->queue + idx; | ||
1753 | + next = list_entry(queue->next, struct task_struct, run_list); | ||
1754 | + if (unlikely(next->time_slice <= 0)) { | ||
1755 | + /* | ||
1756 | + * This task was unlucky enough to run out of time_slice | ||
1757 | + * before it hit a scheduler_tick, so reassess its | ||
1758 | + * priority and choose another task (possibly | ||
1759 | + * the same one). | ||
1760 | + */ | ||
1761 | + task_expired_entitlement(rq, next); | ||
1762 | + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO); | ||
1763 | + goto retry; | ||
1764 | + } | ||
1765 | + next->rotation = rq->prio_rotation; | ||
1766 | + nstatic = next->static_prio; | ||
1767 | + if (nstatic < array->best_static_prio) | ||
1768 | + array->best_static_prio = nstatic; | ||
1769 | + if (idx > rq->prio_level[USER_PRIO(nstatic)]) | ||
1770 | + rq->prio_level[USER_PRIO(nstatic)] = idx; | ||
1771 | + return next; | ||
1772 | } | ||
1773 | |||
1774 | /* | ||
1775 | @@ -3281,13 +3298,11 @@ static inline int interactive_sleep(enum | ||
1776 | asmlinkage void __sched schedule(void) | ||
1777 | { | ||
1778 | struct task_struct *prev, *next; | ||
1779 | - struct prio_array *array; | ||
1780 | struct list_head *queue; | ||
1781 | unsigned long long now; | ||
1782 | - unsigned long run_time; | ||
1783 | - int cpu, idx, new_prio; | ||
1784 | long *switch_count; | ||
1785 | struct rq *rq; | ||
1786 | + int cpu, idx; | ||
1787 | |||
1788 | /* | ||
1789 | * Test if we are atomic. Since do_exit() needs to call into | ||
1790 | @@ -3323,18 +3338,6 @@ need_resched_nonpreemptible: | ||
1791 | |||
1792 | schedstat_inc(rq, sched_cnt); | ||
1793 | now = sched_clock(); | ||
1794 | - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { | ||
1795 | - run_time = now - prev->timestamp; | ||
1796 | - if (unlikely((long long)(now - prev->timestamp) < 0)) | ||
1797 | - run_time = 0; | ||
1798 | - } else | ||
1799 | - run_time = NS_MAX_SLEEP_AVG; | ||
1800 | - | ||
1801 | - /* | ||
1802 | - * Tasks charged proportionately less run_time at high sleep_avg to | ||
1803 | - * delay them losing their interactive status | ||
1804 | - */ | ||
1805 | - run_time /= (CURRENT_BONUS(prev) ? : 1); | ||
1806 | |||
1807 | spin_lock_irq(&rq->lock); | ||
1808 | |||
1809 | @@ -3356,59 +3359,29 @@ need_resched_nonpreemptible: | ||
1810 | idle_balance(cpu, rq); | ||
1811 | if (!rq->nr_running) { | ||
1812 | next = rq->idle; | ||
1813 | - rq->expired_timestamp = 0; | ||
1814 | goto switch_tasks; | ||
1815 | } | ||
1816 | } | ||
1817 | |||
1818 | - array = rq->active; | ||
1819 | - if (unlikely(!array->nr_active)) { | ||
1820 | - /* | ||
1821 | - * Switch the active and expired arrays. | ||
1822 | - */ | ||
1823 | - schedstat_inc(rq, sched_switch); | ||
1824 | - rq->active = rq->expired; | ||
1825 | - rq->expired = array; | ||
1826 | - array = rq->active; | ||
1827 | - rq->expired_timestamp = 0; | ||
1828 | - rq->best_expired_prio = MAX_PRIO; | ||
1829 | - } | ||
1830 | - | ||
1831 | - idx = sched_find_first_bit(array->bitmap); | ||
1832 | - queue = array->queue + idx; | ||
1833 | - next = list_entry(queue->next, struct task_struct, run_list); | ||
1834 | - | ||
1835 | - if (!rt_task(next) && interactive_sleep(next->sleep_type)) { | ||
1836 | - unsigned long long delta = now - next->timestamp; | ||
1837 | - if (unlikely((long long)(now - next->timestamp) < 0)) | ||
1838 | - delta = 0; | ||
1839 | - | ||
1840 | - if (next->sleep_type == SLEEP_INTERACTIVE) | ||
1841 | - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; | ||
1842 | - | ||
1843 | - array = next->array; | ||
1844 | - new_prio = recalc_task_prio(next, next->timestamp + delta); | ||
1845 | - | ||
1846 | - if (unlikely(next->prio != new_prio)) { | ||
1847 | - dequeue_task(next, array); | ||
1848 | - next->prio = new_prio; | ||
1849 | - enqueue_task(next, array); | ||
1850 | - } | ||
1851 | + idx = sched_find_first_bit(rq->dyn_bitmap); | ||
1852 | + if (!rt_prio(idx)) | ||
1853 | + next = next_dynamic_task(rq, idx); | ||
1854 | + else { | ||
1855 | + queue = rq->active->queue + idx; | ||
1856 | + next = list_entry(queue->next, struct task_struct, run_list); | ||
1857 | } | ||
1858 | - next->sleep_type = SLEEP_NORMAL; | ||
1859 | switch_tasks: | ||
1860 | - if (next == rq->idle) | ||
1861 | + if (next == rq->idle) { | ||
1862 | + reset_prio_levels(rq); | ||
1863 | + rq->prio_rotation++; | ||
1864 | schedstat_inc(rq, sched_goidle); | ||
1865 | + } | ||
1866 | prefetch(next); | ||
1867 | prefetch_stack(next); | ||
1868 | clear_tsk_need_resched(prev); | ||
1869 | rcu_qsctr_inc(task_cpu(prev)); | ||
1870 | |||
1871 | - update_cpu_clock(prev, rq, now); | ||
1872 | - | ||
1873 | - prev->sleep_avg -= run_time; | ||
1874 | - if ((long)prev->sleep_avg <= 0) | ||
1875 | - prev->sleep_avg = 0; | ||
1876 | + update_cpu_clock(prev, rq, now, 0); | ||
1877 | prev->timestamp = prev->last_ran = now; | ||
1878 | |||
1879 | sched_info_switch(prev, next); | ||
1880 | @@ -3844,29 +3817,22 @@ EXPORT_SYMBOL(sleep_on_timeout); | ||
1881 | */ | ||
1882 | void rt_mutex_setprio(struct task_struct *p, int prio) | ||
1883 | { | ||
1884 | - struct prio_array *array; | ||
1885 | unsigned long flags; | ||
1886 | + int queued, oldprio; | ||
1887 | struct rq *rq; | ||
1888 | - int oldprio; | ||
1889 | |||
1890 | BUG_ON(prio < 0 || prio > MAX_PRIO); | ||
1891 | |||
1892 | rq = task_rq_lock(p, &flags); | ||
1893 | |||
1894 | oldprio = p->prio; | ||
1895 | - array = p->array; | ||
1896 | - if (array) | ||
1897 | - dequeue_task(p, array); | ||
1898 | + queued = task_queued(p); | ||
1899 | + if (queued) | ||
1900 | + dequeue_task(p, rq); | ||
1901 | p->prio = prio; | ||
1902 | |||
1903 | - if (array) { | ||
1904 | - /* | ||
1905 | - * If changing to an RT priority then queue it | ||
1906 | - * in the active array! | ||
1907 | - */ | ||
1908 | - if (rt_task(p)) | ||
1909 | - array = rq->active; | ||
1910 | - enqueue_task(p, array); | ||
1911 | + if (queued) { | ||
1912 | + enqueue_task(p, rq); | ||
1913 | /* | ||
1914 | * Reschedule if we are currently running on this runqueue and | ||
1915 | * our priority decreased, or if we are not currently running on | ||
1916 | @@ -3875,8 +3841,8 @@ void rt_mutex_setprio(struct task_struct | ||
1917 | if (task_running(rq, p)) { | ||
1918 | if (p->prio > oldprio) | ||
1919 | resched_task(rq->curr); | ||
1920 | - } else if (TASK_PREEMPTS_CURR(p, rq)) | ||
1921 | - resched_task(rq->curr); | ||
1922 | + } else | ||
1923 | + try_preempt(p, rq); | ||
1924 | } | ||
1925 | task_rq_unlock(rq, &flags); | ||
1926 | } | ||
1927 | @@ -3885,8 +3851,7 @@ void rt_mutex_setprio(struct task_struct | ||
1928 | |||
1929 | void set_user_nice(struct task_struct *p, long nice) | ||
1930 | { | ||
1931 | - struct prio_array *array; | ||
1932 | - int old_prio, delta; | ||
1933 | + int queued, old_prio, delta; | ||
1934 | unsigned long flags; | ||
1935 | struct rq *rq; | ||
1936 | |||
1937 | @@ -3907,20 +3872,20 @@ void set_user_nice(struct task_struct *p | ||
1938 | p->static_prio = NICE_TO_PRIO(nice); | ||
1939 | goto out_unlock; | ||
1940 | } | ||
1941 | - array = p->array; | ||
1942 | - if (array) { | ||
1943 | - dequeue_task(p, array); | ||
1944 | + queued = task_queued(p); | ||
1945 | + if (queued) { | ||
1946 | + dequeue_task(p, rq); | ||
1947 | dec_raw_weighted_load(rq, p); | ||
1948 | } | ||
1949 | |||
1950 | p->static_prio = NICE_TO_PRIO(nice); | ||
1951 | - set_load_weight(p); | ||
1952 | old_prio = p->prio; | ||
1953 | p->prio = effective_prio(p); | ||
1954 | + set_quota(p); | ||
1955 | delta = p->prio - old_prio; | ||
1956 | |||
1957 | - if (array) { | ||
1958 | - enqueue_task(p, array); | ||
1959 | + if (queued) { | ||
1960 | + enqueue_task(p, rq); | ||
1961 | inc_raw_weighted_load(rq, p); | ||
1962 | /* | ||
1963 | * If the task increased its priority or is running and | ||
1964 | @@ -3996,7 +3961,7 @@ asmlinkage long sys_nice(int increment) | ||
1965 | * | ||
1966 | * This is the priority value as seen by users in /proc. | ||
1967 | * RT tasks are offset by -200. Normal tasks are centered | ||
1968 | - * around 0, value goes from -16 to +15. | ||
1969 | + * around 0, value goes from 0 to +39. | ||
1970 | */ | ||
1971 | int task_prio(const struct task_struct *p) | ||
1972 | { | ||
1973 | @@ -4043,19 +4008,14 @@ static inline struct task_struct *find_p | ||
1974 | /* Actually do priority change: must hold rq lock. */ | ||
1975 | static void __setscheduler(struct task_struct *p, int policy, int prio) | ||
1976 | { | ||
1977 | - BUG_ON(p->array); | ||
1978 | + BUG_ON(task_queued(p)); | ||
1979 | |||
1980 | p->policy = policy; | ||
1981 | p->rt_priority = prio; | ||
1982 | p->normal_prio = normal_prio(p); | ||
1983 | /* we are holding p->pi_lock already */ | ||
1984 | p->prio = rt_mutex_getprio(p); | ||
1985 | - /* | ||
1986 | - * SCHED_BATCH tasks are treated as perpetual CPU hogs: | ||
1987 | - */ | ||
1988 | - if (policy == SCHED_BATCH) | ||
1989 | - p->sleep_avg = 0; | ||
1990 | - set_load_weight(p); | ||
1991 | + set_quota(p); | ||
1992 | } | ||
1993 | |||
1994 | /** | ||
1995 | @@ -4069,8 +4029,7 @@ static void __setscheduler(struct task_s | ||
1996 | int sched_setscheduler(struct task_struct *p, int policy, | ||
1997 | struct sched_param *param) | ||
1998 | { | ||
1999 | - int retval, oldprio, oldpolicy = -1; | ||
2000 | - struct prio_array *array; | ||
2001 | + int queued, retval, oldprio, oldpolicy = -1; | ||
2002 | unsigned long flags; | ||
2003 | struct rq *rq; | ||
2004 | |||
2005 | @@ -4144,12 +4103,12 @@ recheck: | ||
2006 | spin_unlock_irqrestore(&p->pi_lock, flags); | ||
2007 | goto recheck; | ||
2008 | } | ||
2009 | - array = p->array; | ||
2010 | - if (array) | ||
2011 | + queued = task_queued(p); | ||
2012 | + if (queued) | ||
2013 | deactivate_task(p, rq); | ||
2014 | oldprio = p->prio; | ||
2015 | __setscheduler(p, policy, param->sched_priority); | ||
2016 | - if (array) { | ||
2017 | + if (queued) { | ||
2018 | __activate_task(p, rq); | ||
2019 | /* | ||
2020 | * Reschedule if we are currently running on this runqueue and | ||
2021 | @@ -4159,8 +4118,8 @@ recheck: | ||
2022 | if (task_running(rq, p)) { | ||
2023 | if (p->prio > oldprio) | ||
2024 | resched_task(rq->curr); | ||
2025 | - } else if (TASK_PREEMPTS_CURR(p, rq)) | ||
2026 | - resched_task(rq->curr); | ||
2027 | + } else | ||
2028 | + try_preempt(p, rq); | ||
2029 | } | ||
2030 | __task_rq_unlock(rq); | ||
2031 | spin_unlock_irqrestore(&p->pi_lock, flags); | ||
2032 | @@ -4433,40 +4392,27 @@ asmlinkage long sys_sched_getaffinity(pi | ||
2033 | * sys_sched_yield - yield the current processor to other threads. | ||
2034 | * | ||
2035 | * This function yields the current CPU by moving the calling thread | ||
2036 | - * to the expired array. If there are no other threads running on this | ||
2037 | - * CPU then this function will return. | ||
2038 | + * to the expired array if SCHED_NORMAL or the end of its current priority | ||
2039 | + * queue if a realtime task. If there are no other threads running on this | ||
2040 | + * cpu this function will return. | ||
2041 | */ | ||
2042 | asmlinkage long sys_sched_yield(void) | ||
2043 | { | ||
2044 | struct rq *rq = this_rq_lock(); | ||
2045 | - struct prio_array *array = current->array, *target = rq->expired; | ||
2046 | + struct task_struct *p = current; | ||
2047 | |||
2048 | schedstat_inc(rq, yld_cnt); | ||
2049 | - /* | ||
2050 | - * We implement yielding by moving the task into the expired | ||
2051 | - * queue. | ||
2052 | - * | ||
2053 | - * (special rule: RT tasks will just roundrobin in the active | ||
2054 | - * array.) | ||
2055 | - */ | ||
2056 | - if (rt_task(current)) | ||
2057 | - target = rq->active; | ||
2058 | + if (rq->nr_running == 1) | ||
2059 | + schedstat_inc(rq, yld_both_empty); | ||
2060 | + else { | ||
2061 | + struct prio_array *old_array = p->array; | ||
2062 | + int old_prio = p->prio; | ||
2063 | |||
2064 | - if (array->nr_active == 1) { | ||
2065 | - schedstat_inc(rq, yld_act_empty); | ||
2066 | - if (!rq->expired->nr_active) | ||
2067 | - schedstat_inc(rq, yld_both_empty); | ||
2068 | - } else if (!rq->expired->nr_active) | ||
2069 | - schedstat_inc(rq, yld_exp_empty); | ||
2070 | - | ||
2071 | - if (array != target) { | ||
2072 | - dequeue_task(current, array); | ||
2073 | - enqueue_task(current, target); | ||
2074 | - } else | ||
2075 | - /* | ||
2076 | - * requeue_task is cheaper so perform that if possible. | ||
2077 | - */ | ||
2078 | - requeue_task(current, array); | ||
2079 | + /* p->prio will be updated in requeue_task via queue_expired */ | ||
2080 | + if (!rt_task(p)) | ||
2081 | + p->array = rq->expired; | ||
2082 | + requeue_task(p, rq, old_array, old_prio); | ||
2083 | + } | ||
2084 | |||
2085 | /* | ||
2086 | * Since we are going to call schedule() anyway, there's | ||
2087 | @@ -4676,8 +4622,8 @@ long sys_sched_rr_get_interval(pid_t pid | ||
2088 | if (retval) | ||
2089 | goto out_unlock; | ||
2090 | |||
2091 | - jiffies_to_timespec(p->policy == SCHED_FIFO ? | ||
2092 | - 0 : task_timeslice(p), &t); | ||
2093 | + t = ns_to_timespec(p->policy == SCHED_FIFO ? 0 : | ||
2094 | + MS_TO_NS(task_timeslice(p))); | ||
2095 | read_unlock(&tasklist_lock); | ||
2096 | retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0; | ||
2097 | out_nounlock: | ||
2098 | @@ -4771,10 +4717,10 @@ void __cpuinit init_idle(struct task_str | ||
2099 | struct rq *rq = cpu_rq(cpu); | ||
2100 | unsigned long flags; | ||
2101 | |||
2102 | - idle->timestamp = sched_clock(); | ||
2103 | - idle->sleep_avg = 0; | ||
2104 | - idle->array = NULL; | ||
2105 | - idle->prio = idle->normal_prio = MAX_PRIO; | ||
2106 | + bitmap_zero(idle->bitmap, PRIO_RANGE); | ||
2107 | + idle->timestamp = idle->last_ran = sched_clock(); | ||
2108 | + idle->array = rq->active; | ||
2109 | + idle->prio = idle->normal_prio = NICE_TO_PRIO(0); | ||
2110 | idle->state = TASK_RUNNING; | ||
2111 | idle->cpus_allowed = cpumask_of_cpu(cpu); | ||
2112 | set_task_cpu(idle, cpu); | ||
2113 | @@ -4893,7 +4839,7 @@ static int __migrate_task(struct task_st | ||
2114 | goto out; | ||
2115 | |||
2116 | set_task_cpu(p, dest_cpu); | ||
2117 | - if (p->array) { | ||
2118 | + if (task_queued(p)) { | ||
2119 | /* | ||
2120 | * Sync timestamp with rq_dest's before activating. | ||
2121 | * The same thing could be achieved by doing this step | ||
2122 | @@ -4904,8 +4850,7 @@ static int __migrate_task(struct task_st | ||
2123 | + rq_dest->most_recent_timestamp; | ||
2124 | deactivate_task(p, rq_src); | ||
2125 | __activate_task(p, rq_dest); | ||
2126 | - if (TASK_PREEMPTS_CURR(p, rq_dest)) | ||
2127 | - resched_task(rq_dest->curr); | ||
2128 | + try_preempt(p, rq_dest); | ||
2129 | } | ||
2130 | ret = 1; | ||
2131 | out: | ||
2132 | @@ -5194,7 +5139,7 @@ migration_call(struct notifier_block *nf | ||
2133 | /* Idle task back to normal (off runqueue, low prio) */ | ||
2134 | rq = task_rq_lock(rq->idle, &flags); | ||
2135 | deactivate_task(rq->idle, rq); | ||
2136 | - rq->idle->static_prio = MAX_PRIO; | ||
2137 | + rq->idle->static_prio = NICE_TO_PRIO(0); | ||
2138 | __setscheduler(rq->idle, SCHED_NORMAL, 0); | ||
2139 | migrate_dead_tasks(cpu); | ||
2140 | task_rq_unlock(rq, &flags); | ||
2141 | @@ -6706,6 +6651,13 @@ void __init sched_init_smp(void) | ||
2142 | /* Move init over to a non-isolated CPU */ | ||
2143 | if (set_cpus_allowed(current, non_isolated_cpus) < 0) | ||
2144 | BUG(); | ||
2145 | + | ||
2146 | + /* | ||
2147 | + * Assume that every added cpu gives us slightly less overall latency | ||
2148 | + * allowing us to increase the base rr_interval, but in a non linear | ||
2149 | + * fashion. | ||
2150 | + */ | ||
2151 | + rr_interval *= 1 + ilog2(num_online_cpus()); | ||
2152 | } | ||
2153 | #else | ||
2154 | void __init sched_init_smp(void) | ||
2155 | @@ -6727,6 +6679,16 @@ void __init sched_init(void) | ||
2156 | { | ||
2157 | int i, j, k; | ||
2158 | |||
2159 | + /* Generate the priority matrix */ | ||
2160 | + for (i = 0; i < PRIO_RANGE; i++) { | ||
2161 | + bitmap_fill(prio_matrix[i], PRIO_RANGE); | ||
2162 | + j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i); | ||
2163 | + for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j) { | ||
2164 | + __clear_bit(PRIO_RANGE - 1 - (k / PRIO_RANGE), | ||
2165 | + prio_matrix[i]); | ||
2166 | + } | ||
2167 | + } | ||
2168 | + | ||
2169 | for_each_possible_cpu(i) { | ||
2170 | struct prio_array *array; | ||
2171 | struct rq *rq; | ||
2172 | @@ -6735,11 +6697,16 @@ void __init sched_init(void) | ||
2173 | spin_lock_init(&rq->lock); | ||
2174 | lockdep_set_class(&rq->lock, &rq->rq_lock_key); | ||
2175 | rq->nr_running = 0; | ||
2176 | + rq->prio_rotation = 0; | ||
2177 | rq->active = rq->arrays; | ||
2178 | rq->expired = rq->arrays + 1; | ||
2179 | - rq->best_expired_prio = MAX_PRIO; | ||
2180 | + reset_prio_levels(rq); | ||
2181 | + rq->dyn_bitmap = rq->active->prio_bitmap; | ||
2182 | + rq->exp_bitmap = rq->expired->prio_bitmap; | ||
2183 | |||
2184 | #ifdef CONFIG_SMP | ||
2185 | + rq->active->rq = rq; | ||
2186 | + rq->expired->rq = rq; | ||
2187 | rq->sd = NULL; | ||
2188 | for (j = 1; j < 3; j++) | ||
2189 | rq->cpu_load[j] = 0; | ||
2190 | @@ -6752,16 +6719,16 @@ void __init sched_init(void) | ||
2191 | atomic_set(&rq->nr_iowait, 0); | ||
2192 | |||
2193 | for (j = 0; j < 2; j++) { | ||
2194 | + | ||
2195 | array = rq->arrays + j; | ||
2196 | - for (k = 0; k < MAX_PRIO; k++) { | ||
2197 | + for (k = 0; k < MAX_PRIO; k++) | ||
2198 | INIT_LIST_HEAD(array->queue + k); | ||
2199 | - __clear_bit(k, array->bitmap); | ||
2200 | - } | ||
2201 | - // delimiter for bitsearch | ||
2202 | - __set_bit(MAX_PRIO, array->bitmap); | ||
2203 | + bitmap_zero(array->prio_bitmap, MAX_PRIO); | ||
2204 | + /* delimiter for bitsearch */ | ||
2205 | + __set_bit(MAX_PRIO, array->prio_bitmap); | ||
2206 | } | ||
2207 | - } | ||
2208 | |||
2209 | + } | ||
2210 | set_load_weight(&init_task); | ||
2211 | |||
2212 | #ifdef CONFIG_SMP | ||
2213 | @@ -6815,10 +6782,10 @@ EXPORT_SYMBOL(__might_sleep); | ||
2214 | #ifdef CONFIG_MAGIC_SYSRQ | ||
2215 | void normalize_rt_tasks(void) | ||
2216 | { | ||
2217 | - struct prio_array *array; | ||
2218 | struct task_struct *p; | ||
2219 | unsigned long flags; | ||
2220 | struct rq *rq; | ||
2221 | + int queued; | ||
2222 | |||
2223 | read_lock_irq(&tasklist_lock); | ||
2224 | for_each_process(p) { | ||
2225 | @@ -6828,11 +6795,11 @@ void normalize_rt_tasks(void) | ||
2226 | spin_lock_irqsave(&p->pi_lock, flags); | ||
2227 | rq = __task_rq_lock(p); | ||
2228 | |||
2229 | - array = p->array; | ||
2230 | - if (array) | ||
2231 | + queued = task_queued(p); | ||
2232 | + if (queued) | ||
2233 | deactivate_task(p, task_rq(p)); | ||
2234 | __setscheduler(p, SCHED_NORMAL, 0); | ||
2235 | - if (array) { | ||
2236 | + if (queued) { | ||
2237 | __activate_task(p, task_rq(p)); | ||
2238 | resched_task(rq->curr); | ||
2239 | } | ||
2240 | Index: linux-2.6.21-ck2/Documentation/sysctl/kernel.txt | ||
2241 | =================================================================== | ||
2242 | --- linux-2.6.21-ck2.orig/Documentation/sysctl/kernel.txt 2007-02-05 22:51:59.000000000 +1100 | ||
2243 | +++ linux-2.6.21-ck2/Documentation/sysctl/kernel.txt 2007-05-14 19:30:30.000000000 +1000 | ||
2244 | @@ -43,6 +43,7 @@ show up in /proc/sys/kernel: | ||
2245 | - printk | ||
2246 | - real-root-dev ==> Documentation/initrd.txt | ||
2247 | - reboot-cmd [ SPARC only ] | ||
2248 | +- rr_interval | ||
2249 | - rtsig-max | ||
2250 | - rtsig-nr | ||
2251 | - sem | ||
2252 | @@ -288,6 +289,19 @@ rebooting. ??? | ||
2253 | |||
2254 | ============================================================== | ||
2255 | |||
2256 | +rr_interval: | ||
2257 | + | ||
2258 | +This is the smallest duration that any cpu process scheduling unit | ||
2259 | +will run for. Increasing this value can increase throughput of cpu | ||
2260 | +bound tasks substantially but at the expense of increased latencies | ||
2261 | +overall. This value is in milliseconds and the default value chosen | ||
2262 | +depends on the number of cpus available at scheduler initialisation | ||
2263 | +with a minimum of 8. | ||
2264 | + | ||
2265 | +Valid values are from 1-5000. | ||
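+
+For example, writing an illustrative value of 16 to
+/proc/sys/kernel/rr_interval raises the interval to 16ms at runtime.
+With the cpu scaling applied at scheduler initialisation, the default
+works out to roughly 8 * (1 + ilog2(number of online cpus))
+milliseconds: 8ms on 1 cpu, 16ms on 2 cpus, 24ms on 4 cpus and so on.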
2266 | + | ||
2267 | +============================================================== | ||
2268 | + | ||
2269 | rtsig-max & rtsig-nr: | ||
2270 | |||
2271 | The file rtsig-max can be used to tune the maximum number | ||
2272 | Index: linux-2.6.21-ck2/kernel/sysctl.c | ||
2273 | =================================================================== | ||
2274 | --- linux-2.6.21-ck2.orig/kernel/sysctl.c 2007-05-03 22:20:57.000000000 +1000 | ||
2275 | +++ linux-2.6.21-ck2/kernel/sysctl.c 2007-05-14 19:30:30.000000000 +1000 | ||
2276 | @@ -76,6 +76,7 @@ extern int pid_max_min, pid_max_max; | ||
2277 | extern int sysctl_drop_caches; | ||
2278 | extern int percpu_pagelist_fraction; | ||
2279 | extern int compat_log; | ||
2280 | +extern int rr_interval; | ||
2281 | |||
2282 | /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ | ||
2283 | static int maxolduid = 65535; | ||
2284 | @@ -159,6 +160,14 @@ int sysctl_legacy_va_layout; | ||
2285 | #endif | ||
2286 | |||
2287 | |||
2288 | +/* Constants for minimum and maximum testing. | ||
2289 | + We use these as one-element integer vectors. */ | ||
2290 | +static int __read_mostly zero; | ||
2291 | +static int __read_mostly one = 1; | ||
2292 | +static int __read_mostly one_hundred = 100; | ||
2293 | +static int __read_mostly five_thousand = 5000; | ||
2294 | + | ||
2295 | + | ||
2296 | /* The default sysctl tables: */ | ||
2297 | |||
2298 | static ctl_table root_table[] = { | ||
2299 | @@ -499,6 +508,17 @@ static ctl_table kern_table[] = { | ||
2300 | .mode = 0444, | ||
2301 | .proc_handler = &proc_dointvec, | ||
2302 | }, | ||
2303 | + { | ||
2304 | + .ctl_name = CTL_UNNUMBERED, | ||
2305 | + .procname = "rr_interval", | ||
2306 | + .data = &rr_interval, | ||
2307 | + .maxlen = sizeof (int), | ||
2308 | + .mode = 0644, | ||
2309 | + .proc_handler = &proc_dointvec_minmax, | ||
2310 | + .strategy = &sysctl_intvec, | ||
2311 | + .extra1 = &one, | ||
2312 | + .extra2 = &five_thousand, | ||
2313 | + }, | ||
2314 | #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) | ||
2315 | { | ||
2316 | .ctl_name = KERN_UNKNOWN_NMI_PANIC, | ||
2317 | @@ -607,12 +627,6 @@ static ctl_table kern_table[] = { | ||
2318 | { .ctl_name = 0 } | ||
2319 | }; | ||
2320 | |||
2321 | -/* Constants for minimum and maximum testing in vm_table. | ||
2322 | - We use these as one-element integer vectors. */ | ||
2323 | -static int zero; | ||
2324 | -static int one_hundred = 100; | ||
2325 | - | ||
2326 | - | ||
2327 | static ctl_table vm_table[] = { | ||
2328 | { | ||
2329 | .ctl_name = VM_OVERCOMMIT_MEMORY, | ||
2330 | Index: linux-2.6.21-ck2/fs/pipe.c | ||
2331 | =================================================================== | ||
2332 | --- linux-2.6.21-ck2.orig/fs/pipe.c 2007-05-03 22:20:56.000000000 +1000 | ||
2333 | +++ linux-2.6.21-ck2/fs/pipe.c 2007-05-14 19:30:30.000000000 +1000 | ||
2334 | @@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p | ||
2335 | { | ||
2336 | DEFINE_WAIT(wait); | ||
2337 | |||
2338 | - /* | ||
2339 | - * Pipes are system-local resources, so sleeping on them | ||
2340 | - * is considered a noninteractive wait: | ||
2341 | - */ | ||
2342 | - prepare_to_wait(&pipe->wait, &wait, | ||
2343 | - TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE); | ||
2344 | + prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); | ||
2345 | if (pipe->inode) | ||
2346 | mutex_unlock(&pipe->inode->i_mutex); | ||
2347 | schedule(); | ||
2348 | Index: linux-2.6.21-ck2/Documentation/sched-design.txt | ||
2349 | =================================================================== | ||
2350 | --- linux-2.6.21-ck2.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.000000000 +1100 | ||
2351 | +++ linux-2.6.21-ck2/Documentation/sched-design.txt 2007-05-14 19:30:30.000000000 +1000 | ||
2352 | @@ -1,11 +1,14 @@ | ||
2353 | - Goals, Design and Implementation of the | ||
2354 | - new ultra-scalable O(1) scheduler | ||
2355 | + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by | ||
2356 | + Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by | ||
2357 | + Con Kolivas. | ||
2358 | |||
2359 | |||
2360 | - This is an edited version of an email Ingo Molnar sent to | ||
2361 | - lkml on 4 Jan 2002. It describes the goals, design, and | ||
2362 | - implementation of Ingo's new ultra-scalable O(1) scheduler. | ||
2363 | - Last Updated: 18 April 2002. | ||
2364 | + This was originally an edited version of an email Ingo Molnar sent to | ||
2365 | + lkml on 4 Jan 2002. It describes the goals, design, and implementation | ||
2366 | + of Ingo's ultra-scalable O(1) scheduler. It now contains a description | ||
2367 | + of the Staircase Deadline priority scheduler that was built on this | ||
2368 | + design. | ||
2369 | + Last Updated: Fri, 4 May 2007 | ||
2370 | |||
2371 | |||
2372 | Goal | ||
2373 | @@ -163,3 +166,222 @@ certain code paths and data constructs. | ||
2374 | code is smaller than the old one. | ||
2375 | |||
2376 | Ingo | ||
2377 | + | ||
2378 | + | ||
2379 | +Staircase Deadline cpu scheduler policy | ||
2380 | +================================================ | ||
2381 | + | ||
2382 | +Design summary | ||
2383 | +============== | ||
2384 | + | ||
2385 | +A novel design which incorporates a foreground-background descending priority | ||
2386 | +system (the staircase) via a bandwidth allocation matrix according to nice | ||
2387 | +level. | ||
2388 | + | ||
2389 | + | ||
2390 | +Features | ||
2391 | +======== | ||
2392 | + | ||
2393 | +A starvation free, strict fairness O(1) scalable design with interactivity | ||
2394 | +as good as the above restrictions can provide. There is no interactivity | ||
2395 | +estimator, no sleep/run measurements and only simple fixed accounting. | ||
2396 | +The design has strict enough a design and accounting that task behaviour | ||
2397 | +can be modelled and maximum scheduling latencies can be predicted by | ||
2398 | +the virtual deadline mechanism that manages runqueues. The prime concern | ||
2399 | +in this design is to maintain fairness at all costs determined by nice level, | ||
2400 | +yet to maintain as good interactivity as can be allowed within the | ||
2401 | +constraints of strict fairness. | ||
2402 | + | ||
2403 | + | ||
2404 | +Design description | ||
2405 | +================== | ||
2406 | + | ||
2407 | +SD works off the principle of providing each task a quota of runtime that it is | ||
2408 | +allowed to run at a number of priority levels determined by its static priority | ||
2409 | +(ie. its nice level). If the task uses up its quota it has its priority | ||
2410 | +decremented to the next level determined by a priority matrix. Once every | ||
2411 | +runtime quota has been consumed of every priority level, a task is queued on the | ||
2412 | +"expired" array. When no other tasks exist with quota, the expired array is | ||
2413 | +activated and fresh quotas are handed out. This is all done in O(1). | ||
2414 | + | ||
2415 | +Design details | ||
2416 | +============== | ||
2417 | + | ||
2418 | +Each task keeps a record of its own entitlement of cpu time. Most of the rest of | ||
2419 | +these details apply to non-realtime tasks as rt task management is straight | ||
2420 | +forward. | ||
2421 | + | ||
2422 | +Each runqueue keeps a record of what major epoch it is up to in the | ||
2423 | +rq->prio_rotation field which is incremented on each major epoch. It also | ||
2424 | +keeps a record of the current prio_level for each static priority task. | ||
2425 | + | ||
2426 | +Each task keeps a record of what major runqueue epoch it was last running | ||
2427 | +on in p->rotation. It also keeps a record of what priority levels it has | ||
2428 | +already been allocated quota from during this epoch in a bitmap p->bitmap. | ||
2429 | + | ||
2430 | +The only tunable that determines all other details is the RR_INTERVAL. This | ||
2431 | +is set to 8ms, and is scaled gently upwards with more cpus. This value is | ||
2432 | +tunable via a /proc interface. | ||
2433 | + | ||
2434 | +All tasks are initially given a quota based on RR_INTERVAL. This is equal to | ||
2435 | +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and | ||
2436 | +progressively larger for nice values from -1 to -20. This is assigned to | ||
2437 | +p->quota and only changes with changes in nice level. | ||
2438 | + | ||
2439 | +As a task is first queued, it checks in recalc_task_prio to see if it has run at | ||
2440 | +this runqueue's current priority rotation. If it has not, it will have its | ||
2441 | +p->prio level set according to the first slot in a "priority matrix" and will be | ||
2442 | +given a p->time_slice equal to p->quota, and will have its allocation bitmap bit | ||
2443 | +set in p->bitmap for this prio level. It is then queued on the current active | ||
2444 | +priority array. | ||
2445 | + | ||
2446 | +If a task has already been running during this major epoch, and it has | ||
2447 | +p->time_slice left and the rq->prio_quota for the task's p->prio still | ||
2448 | +has quota, it will be placed back on the active array, but no more quota | ||
2449 | +will be added. | ||
2450 | + | ||
2451 | +If a task has been running during this major epoch, but does not have | ||
2452 | +p->time_slice left, it will find the next lowest priority in its bitmap that it | ||
2453 | +has not been allocated quota from. It then gets a full quota in | ||
2454 | +p->time_slice. It is then queued on the current active priority array at the | ||
2455 | +newly determined lower priority. | ||
2456 | + | ||
2457 | +If a task has been running during this major epoch, and does not have | ||
2458 | +any entitlement left in p->bitmap and no time_slice left, it will have its | ||
2459 | +bitmap cleared, and be queued at its best prio again, but on the expired | ||
2460 | +priority array. | ||
2461 | + | ||
2462 | +When a task is queued, it has its relevant bit set in the array->prio_bitmap. | ||
2463 | + | ||
2464 | +p->time_slice is stored in microseconds and is updated via update_cpu_clock on | ||
2465 | +schedule() and scheduler_tick. If p->time_slice falls below zero, the task's | ||
2466 | +priority is reassessed via recalc_task_prio and the task is rescheduled. | ||
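+
+The requeueing rules above can be condensed into the following C sketch.
+This is an illustrative summary only, not code from this patch:
+first_prio_slot(), next_entitled_slot(), all_entitlement_used() and
+enqueue() are hypothetical helpers standing in for the real
+recalc_task_prio() and enqueue_task() logic.
+
+static void sd_requeue(struct rq *rq, struct task_struct *p)
+{
+	if (p->rotation != rq->prio_rotation) {
+		/* First queuing this major epoch: fresh entitlement. */
+		bitmap_zero(p->bitmap, PRIO_RANGE);
+		p->prio = first_prio_slot(p->static_prio);
+		p->time_slice = p->quota;
+		__set_bit(USER_PRIO(p->prio), p->bitmap);
+		enqueue(rq->active, p);
+	} else if (p->time_slice > 0) {
+		/* Quota left at this prio level: no more is added. */
+		enqueue(rq->active, p);
+	} else if (!all_entitlement_used(p)) {
+		/* Slice gone: drop to the next entitled prio level. */
+		p->prio = next_entitled_slot(p);
+		p->time_slice = p->quota;
+		__set_bit(USER_PRIO(p->prio), p->bitmap);
+		enqueue(rq->active, p);
+	} else {
+		/* All entitlement used: restart on the expired array. */
+		bitmap_zero(p->bitmap, PRIO_RANGE);
+		p->prio = first_prio_slot(p->static_prio);
+		p->time_slice = p->quota;
+		enqueue(rq->expired, p);
+	}
+}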
2467 | + | ||
2468 | + | ||
2469 | +Priority Matrix | ||
2470 | +=============== | ||
2471 | + | ||
2472 | +In order to minimise the latencies between tasks of different nice levels | ||
2473 | +running concurrently, the dynamic priority slots where different nice levels | ||
2474 | +are queued are dithered instead of being sequential. What this means is that | ||
2475 | +there are 40 priority slots where a task may run during one major rotation, | ||
2476 | +and the allocation of slots is dependent on nice level. In the | ||
2477 | +following table, a zero represents a slot where the task may run. | ||
2478 | + | ||
2479 | +PRIORITY:0..................20.................39 | ||
2480 | +nice -20 0000000000000000000000000000000000000000 | ||
2481 | +nice -10 1000100010001000100010001000100010010000 | ||
2482 | +nice 0 1010101010101010101010101010101010101010 | ||
2483 | +nice 5 1011010110110101101101011011010110110110 | ||
2484 | +nice 10 1110111011101110111011101110111011101110 | ||
2485 | +nice 15 1111111011111110111111101111111011111110 | ||
2486 | +nice 19 1111111111111111111111111111111111111110 | ||
2487 | + | ||
2488 | +As can be seen, a nice -20 task runs in every priority slot whereas a nice 19 | ||
2489 | +task only runs in one slot per major rotation. This dithered table allows for the | ||
2490 | +smallest possible maximum latencies between tasks of varying nice levels, thus | ||
2491 | +allowing vastly different nice levels to be used. | ||
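+
+For reference, the table above can be reproduced with the prio_matrix
+generation loop this patch adds to sched_init(). The following is a
+minimal standalone userspace rendering of that loop, not kernel code;
+a '1' marks a skipped slot and a '0' a slot the task may run in:
+
+#include <stdio.h>
+#include <string.h>
+
+#define PRIO_RANGE 40
+
+int main(void)
+{
+	static char matrix[PRIO_RANGE][PRIO_RANGE];
+	int i, j, k;
+
+	for (i = 0; i < PRIO_RANGE; i++) {
+		/* Start with every slot skipped... */
+		memset(matrix[i], '1', PRIO_RANGE);
+		/* ...then clear evenly spaced slots, more for low nice. */
+		j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
+		for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j)
+			matrix[i][PRIO_RANGE - 1 - (k / PRIO_RANGE)] = '0';
+		printf("nice %3d %.40s\n", i - 20, matrix[i]);
+	}
+	return 0;
+}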
2492 | + | ||
2493 | +SCHED_BATCH tasks are managed slightly differently, receiving only the top | ||
2494 | +slots from their priority bitmap, giving them the same cpu share as | ||
2495 | +SCHED_NORMAL tasks but slightly higher latencies. | ||
2496 | + | ||
2497 | + | ||
2498 | +Modelling deadline behaviour | ||
2499 | +============================ | ||
2500 | + | ||
2501 | +As the accounting in this design is hard and not modified by sleep average | ||
2502 | +calculations or interactivity modifiers, it is possible to accurately | ||
2503 | +predict the maximum latency that a task may experience under different | ||
2504 | +conditions. This is a virtual deadline mechanism enforced by mandatory | ||
2505 | +timeslice expiration and not outside bandwidth measurement. | ||
2506 | + | ||
2507 | +The maximum duration a task can run during one major epoch is determined by its | ||
2508 | +nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL | ||
2509 | +duration during each epoch. Nice 10 tasks can run at 9 priority levels for each | ||
2510 | +epoch, and so on. The table in the priority matrix above demonstrates how this | ||
2511 | +is enforced. | ||
2512 | + | ||
2513 | +Therefore the maximum duration a runqueue epoch can take is determined by | ||
2514 | +the number of tasks running, and their nice levels. Beyond that, the maximum | ||
2515 | +duration a task can wait before it gets scheduled is determined by the | ||
2516 | +position of its first slot on the matrix. | ||
2517 | + | ||
2518 | +The following examples are _worst case scenarios_ which would rarely | ||
2519 | +occur, but they can be modelled nonetheless to determine the maximum possible | ||
2520 | +latency. | ||
2521 | + | ||
2522 | +So for example, if two nice 0 tasks are running, and one has just expired as | ||
2523 | +another is activated for the first time receiving a full quota for this | ||
2524 | +runqueue rotation, the first task will wait: | ||
2525 | + | ||
2526 | +nr_tasks * max_duration + nice_difference * rr_interval | ||
2527 | +1 * 19 * RR_INTERVAL + 0 = 152ms | ||
2528 | + | ||
2529 | +In the presence of a nice 10 task, a nice 0 task would wait a maximum of | ||
2530 | +1 * 10 * RR_INTERVAL + 0 = 80ms | ||
2531 | + | ||
2532 | +In the presence of a nice 0 task, a nice 10 task would wait a maximum of | ||
2533 | +1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms | ||
2534 | + | ||
2535 | +More useful than these values, though, are the average latencies which are | ||
2536 | +a matter of determining the average distance between priority slots of | ||
2537 | +different nice values and multiplying them by the tasks' quota. For example | ||
2538 | +in the presence of a nice -10 task, a nice 0 task will wait either one or | ||
2539 | +two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL, | ||
2540 | +this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or | ||
2541 | +20 and 40ms respectively (on uniprocessor at 1000HZ). | ||
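+
+Expressed as a tiny C helper (a sketch assuming the default 8ms
+rr_interval on uniprocessor; the slots per epoch are read off the
+priority matrix above, e.g. 19 for nice 0 and 10 for nice 10):
+
+#define RR_INTERVAL_MS	8
+
+/* Worst-case wait, in ms, behind nr_other_tasks tasks that may each
+ * run slots_per_epoch slots, plus the waiter's own slot offset. */
+static int worst_case_wait_ms(int nr_other_tasks, int slots_per_epoch,
+			      int nice_difference)
+{
+	return (nr_other_tasks * slots_per_epoch + nice_difference) *
+		RR_INTERVAL_MS;
+}
+
+/*
+ * worst_case_wait_ms(1, 19, 0) == 152	(two nice 0 tasks)
+ * worst_case_wait_ms(1, 10, 0) == 80	(nice 0 behind a nice 10 task)
+ * worst_case_wait_ms(1, 19, 1) == 160	(nice 10 behind a nice 0 task)
+ */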
2542 | + | ||
2543 | + | ||
2544 | +Achieving interactivity | ||
2545 | +======================= | ||
2546 | + | ||
2547 | +A requirement of this scheduler design was to achieve good interactivity | ||
2548 | +despite being a completely fair deadline based design. The disadvantage of | ||
2549 | +designs that try to achieve interactivity is that they usually do so at | ||
2550 | +the expense of maintaining fairness. As cpu speeds increase, the need for | ||
2551 | +some sort of metered unfairness towards interactive tasks becomes less | ||
2552 | +desirable, but low latency and fairness remain mandatory for good | ||
2553 | +interactive performance. | ||
2554 | + | ||
2555 | +This design relies on the fact that interactive tasks, by their nature, | ||
2556 | +sleep often. Most fair scheduling designs end up indirectly penalising such | ||
2557 | +tasks, giving them less than their possible fair share because of the | ||
2558 | +sleep, and have to use a mechanism of bonusing their priority to offset | ||
2559 | +this based on the duration they sleep. This becomes increasingly inaccurate | ||
2560 | +as the number of running tasks rises and more tasks spend time waiting on | ||
2561 | +runqueues rather than sleeping, and it is impossible to tell whether a | ||
2562 | +task that's waiting on a runqueue only intends to run for a short period and | ||
2563 | +then sleep again after that runqueue wait. Furthermore, all such designs rely | ||
2564 | +on a period of time passing to accumulate some form of statistic on the task | ||
2565 | +before deciding how much preference to give it. The shorter this period, | ||
2566 | +the more rapidly bursts of cpu usage ruin an interactive task's behaviour. | ||
2567 | +The longer this period, the longer it takes for interactive tasks to get low | ||
2568 | +scheduling latencies and their fair share of cpu. | ||
2569 | + | ||
2570 | +This design does not measure sleep time at all. Interactive tasks that sleep | ||
2571 | +often will wake up having consumed very little, if any, of their quota for | ||
2572 | +the current major priority rotation. The longer they have slept, the less | ||
2573 | +likely they are to even be on the current major priority rotation. Once | ||
2574 | +woken up, though, they get to use up their quota for that epoch, whether | ||
2575 | +only part of a quota remains or a full one. Overall, however, they | ||
2576 | +can still only run as much cpu time for that epoch as any other task of the | ||
2577 | +same nice level. This means that two tasks behaving completely differently, | ||
2578 | +from fully cpu bound to waking/sleeping extremely frequently, will still | ||
2579 | +get the same quota of cpu, but the latter will use its quota for that | ||
2580 | +epoch in bursts rather than continuously. This guarantees that interactive | ||
2581 | +tasks get the same amount of cpu as cpu bound ones. | ||
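+
+A simplified sketch of the bookkeeping this implies on wakeup (not the
+actual patch code; the struct layout and helper name are assumptions, but
+the rotation, bitmap and quota fields follow the design description):
+
+	struct sd_task {
+		unsigned long rotation;	/* last major epoch task ran in */
+		unsigned long bitmap;	/* levels already used this epoch */
+		int time_slice;		/* quota left at current level */
+		int quota;		/* full per-level quota from nice */
+	};
+
+	struct sd_rq {
+		unsigned long prio_rotation;	/* current major epoch */
+	};
+
+	static void sketch_wakeup_quota(struct sd_task *p, struct sd_rq *rq)
+	{
+		if (p->rotation != rq->prio_rotation) {
+			/* Slept across an epoch boundary: fresh entitlement. */
+			p->bitmap = 0;
+			p->time_slice = p->quota;
+			p->rotation = rq->prio_rotation;
+		}
+		/* Otherwise the task resumes with whatever quota remains. */
+	}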
2582 | + | ||
2583 | +The other requirement of interactive tasks is to obtain low latencies when | ||
2584 | +they are scheduled. Unlike fully cpu bound tasks subject to the maximum | ||
2585 | +possible latencies described in the modelling deadline behaviour section | ||
2586 | +above, tasks that sleep will wake up with quota available, usually at the | ||
2587 | +current runqueue's priority_level or better. This means that the most latency | ||
2588 | +they are likely to see is one RR_INTERVAL, and often they will preempt the | ||
2589 | +current task if it is not also of a sleeping nature. This guarantees very | ||
2590 | +low latency for interactive tasks, and the lowest latencies for the least | ||
2591 | +cpu bound tasks. | ||
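+
+That wakeup path can be caricatured as follows (illustrative pseudo-kernel
+C; the helper name and its arguments are assumptions, not the patch's code):
+
+	/* A freshly woken task preempts the running one when its priority
+	 * slot is at least as good (numerically no higher) and the running
+	 * task is cpu bound, bounding the sleeper's wait to ~1 RR_INTERVAL. */
+	static int sketch_wakeup_preempts(int woken_prio, int curr_prio,
+					  int curr_is_cpu_bound)
+	{
+		return curr_is_cpu_bound && woken_prio <= curr_prio;
+	}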
2592 | + | ||
2593 | + | ||
2594 | +Fri, 4 May 2007 | ||
2595 | +Con Kolivas <kernel@kolivas.org> | ||
2596 | Index: linux-2.6.21-ck2/kernel/softirq.c | ||
2597 | =================================================================== | ||
2598 | --- linux-2.6.21-ck2.orig/kernel/softirq.c 2007-05-03 22:20:57.000000000 +1000 | ||
2599 | +++ linux-2.6.21-ck2/kernel/softirq.c 2007-05-14 19:30:30.000000000 +1000 | ||
2600 | @@ -488,7 +488,7 @@ void __init softirq_init(void) | ||
2601 | |||
2602 | static int ksoftirqd(void * __bind_cpu) | ||
2603 | { | ||
2604 | - set_user_nice(current, 19); | ||
2605 | + set_user_nice(current, 15); | ||
2606 | current->flags |= PF_NOFREEZE; | ||
2607 | |||
2608 | set_current_state(TASK_INTERRUPTIBLE); |