Contents of /trunk/kernel26-alx/patches-2.6.21-r14/0001-2.6.21-sd-0.48.patch
Revision 447
Tue Jan 22 17:55:52 2008 UTC (16 years, 9 months ago) by niro
File size: 89890 byte(s)
-2.6.21-alx-r14 - fixed some natsemi errors on wys terminals
1 | Staircase Deadline cpu scheduler policy |
2 | ================================================ |
3 | |
4 | Design summary |
5 | ============== |
6 | |
7 | A novel design which incorporates a foreground-background descending priority |
8 | system (the staircase) via a bandwidth allocation matrix according to nice |
9 | level. |
10 | |
11 | |
12 | Features |
13 | ======== |
14 | |
15 | A starvation free, strict fairness O(1) scalable design with interactivity |
16 | as good as the above restrictions can provide. There is no interactivity |
17 | estimator, no sleep/run measurements and only simple fixed accounting. |
18 | The design and accounting are strict enough that task behaviour |
19 | can be modelled and maximum scheduling latencies can be predicted by |
20 | the virtual deadline mechanism that manages runqueues. The prime concern |
21 | in this design is to maintain fairness at all costs determined by nice level, |
22 | yet to maintain as good interactivity as can be allowed within the |
23 | constraints of strict fairness. |
24 | |
25 | |
26 | Design description |
27 | ================== |
28 | |
29 | SD works off the principle of providing each task a quota of runtime that it is |
30 | allowed to run at a number of priority levels determined by its static priority |
31 | (ie. its nice level). If the task uses up its quota it has its priority |
32 | decremented to the next level determined by a priority matrix. Once the |
33 | runtime quota of every priority level has been consumed, a task is queued on the |
34 | "expired" array. When no other tasks exist with quota, the expired array is |
35 | activated and fresh quotas are handed out. This is all done in O(1). |
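
   A minimal sketch of that epoch rollover, using the field names from the
   patch; the emptiness check itself is hypothetical, since the schedule()
   hunk is not part of this excerpt:

	/* Hypothetical sketch only: no_queued_tasks_with_quota() is an
	 * invented stand-in for the real check in schedule(). */
	if (no_queued_tasks_with_quota(rq)) {
		struct prio_array *tmp = rq->active;

		rq->active = rq->expired;	/* expired array becomes active */
		rq->expired = tmp;
		rq->prio_rotation++;		/* new major epoch, fresh quotas */
	}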
36 | |
37 | Design details |
38 | ============== |
39 | |
40 | Each task keeps a record of its own entitlement of cpu time. Most of the rest of |
41 | these details apply to non-realtime tasks, as rt task management is |
42 | straightforward. |
43 | |
44 | Each runqueue keeps a record of what major epoch it is up to in the |
45 | rq->prio_rotation field which is incremented on each major epoch. It also |
46 | keeps a record of the current prio_level for each static priority task. |
47 | |
48 | Each task keeps a record of what major runqueue epoch it was last running |
49 | on in p->rotation. It also keeps a record of what priority levels it has |
50 | already been allocated quota from during this epoch in a bitmap p->bitmap. |
51 | |
52 | The only tunable that determines all other details is the RR_INTERVAL. This |
53 | is set to 8ms, and is scaled gently upwards with more cpus. This value is |
54 | tunable via a /proc interface. |
55 | |
56 | All tasks are initially given a quota based on RR_INTERVAL. This is equal to |
57 | RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and |
58 | progressively larger for nice values from -7 to -20. This is assigned to |
59 | p->quota and only changes with changes in nice level. |
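
   This mapping is implemented by rr_quota() in the kernel/sched.c hunk
   further down. A condensed standalone sketch of the same calculation
   (the function name here is illustrative only):

	/* Quota per nice level (cf. rr_quota() in the patch below).
	 * rr_interval is in milliseconds; the result is in microseconds. */
	static unsigned int nice_to_quota_us(int nice, int rr_interval)
	{
		int rr = rr_interval;

		if (nice < -6) {		/* nice -7..-20: grows as nice^2 */
			rr *= nice * nice;
			rr /= 40;
		} else if (nice > 0)		/* nice 1..19: half, minimum 1ms */
			rr = rr / 2 ? : 1;
		/* nice -6..0 keep the full rr_interval */
		return rr * 1000;		/* MS_TO_US */
	}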
60 | |
61 | As a task is first queued, it checks in recalc_task_prio to see if it has run at |
62 | this runqueue's current priority rotation. If it has not, it will have its |
63 | p->prio level set according to the first slot in a "priority matrix" and will be |
64 | given a p->time_slice equal to the p->quota, and has its allocation bitmap bit |
65 | set in p->bitmap for this prio level. It is then queued on the current active |
66 | priority array. |
67 | |
68 | If a task has already been running during this major epoch, and it has |
69 | p->time_slice left and the rq->prio_quota for the task's p->prio still |
70 | has quota, it will be placed back on the active array, but no more quota |
71 | will be added. |
72 | |
73 | If a task has been running during this major epoch, but does not have |
74 | p->time_slice left, it will find the next lowest priority in its bitmap that it |
75 | has not been allocated quota from. It then gets a full quota in |
76 | p->time_slice. It is then queued on the current active priority array at the |
77 | newly determined lower priority. |
78 | |
79 | If a task has been running during this major epoch, and does not have |
80 | any entitlement left in p->bitmap and no time_slice left, it will have its |
81 | bitmap cleared, and be queued at its best prio again, but on the expired |
82 | priority array. |
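
   These three cases correspond directly to the branches of
   recalc_task_prio() in the kernel/sched.c hunk further down; condensed
   and slightly flattened here for reference:

	/* Condensed from recalc_task_prio() in the patch below */
	int queue_prio;

	if (p->rotation == rq->prio_rotation && p->array == rq->active) {
		if (p->time_slice > 0)
			return;			/* keep prio, no new quota added */
		p->time_slice = p->quota;	/* quota used: refill, drop a level */
	} else if (p->rotation == rq->prio_rotation && p->array == rq->expired) {
		queue_expired(p, rq);		/* already expired this epoch */
		return;
	} else
		task_new_array(p, rq, rq->active); /* first run this epoch */

	queue_prio = next_entitled_slot(p, rq);
	if (queue_prio >= MAX_PRIO) {
		queue_expired(p, rq);		/* no entitled slots left: expire */
		return;
	}
	p->prio = p->normal_prio = queue_prio;
	__set_bit(USER_PRIO(p->prio), p->bitmap);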
83 | |
84 | When a task is queued, it has its relevant bit set in the array->prio_bitmap. |
85 | |
86 | p->time_slice is stored in nanoseconds and is updated via update_cpu_clock on |
87 | schedule() and scheduler_tick. If p->time_slice falls below zero, the task's |
88 | priority is readjusted via recalc_task_prio and the task is rescheduled. |
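
   A hypothetical sketch of that accounting; the real update_cpu_clock()
   and scheduler_tick() code is outside this excerpt, so treat this only
   as an illustration of the rule stated above:

	/* Hypothetical sketch only, not taken from the patch */
	p->time_slice -= now - p->last_ran;	/* both in nanoseconds */
	if (p->time_slice <= 0 && task_queued(p)) {
		dequeue_task(p, rq);
		enqueue_task(p, rq);	/* __enqueue_task -> recalc_task_prio */
		set_tsk_need_resched(p);
	}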
89 | |
90 | |
91 | Priority Matrix |
92 | =============== |
93 | |
94 | In order to minimise the latencies between tasks of different nice levels |
95 | running concurrently, the dynamic priority slots where different nice levels |
96 | are queued are dithered instead of being sequential. What this means is that |
97 | there are 40 priority slots where a task may run during one major rotation, |
98 | and the allocation of slots is dependent on nice level. In the |
99 | following table, a zero represents a slot where the task may run. |
100 | |
101 | PRIORITY:0..................20.................39 |
102 | nice -20 0000000000000000000000000000000000000000 |
103 | nice -10 1000100010001000100010001000100010010000 |
104 | nice 0 1010101010101010101010101010101010101010 |
105 | nice 5 1011010110110101101101011011010110110110 |
106 | nice 10 1110111011101110111011101110111011101110 |
107 | nice 15 1111111011111110111111101111111011111110 |
108 | nice 19 1111111111111111111111111111111111111110 |
109 | |
110 | As can be seen, a nice -20 task runs in every priority slot whereas a nice 19 |
111 | task runs in only one slot per major rotation. This dithered table allows for the |
112 | smallest possible maximum latencies between tasks of varying nice levels, thus |
113 | allowing vastly different nice levels to be used. |
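
   In the patch, the slot lookup for a normal task is a pair of bitmap
   operations taken from next_entitled_slot(); search_prio, the starting
   point, is the runqueue's current prio_level computed earlier in that
   function:

	/* From next_entitled_slot() in the patch: a clear bit in the matrix
	 * row means "this nice level may run in that slot"; a set bit in
	 * p->bitmap means "quota already taken from that slot this epoch". */
	DECLARE_BITMAP(tmp, PRIO_RANGE);

	bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
		  PRIO_RANGE);
	return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
			  USER_PRIO(search_prio)));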
114 | |
115 | SCHED_BATCH tasks are managed slightly differently, receiving only the top |
116 | slots from their priority bitmap. This gives them the same cpu share as |
117 | SCHED_NORMAL tasks, but with slightly higher latencies. |
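
   In the patch this is the SCHED_BATCH special case in
   next_entitled_slot(), which bypasses the matrix and only searches from
   the task's own static_prio upwards:

	/* SCHED_BATCH special case from next_entitled_slot() in the patch */
	if (unlikely(p->policy == SCHED_BATCH)) {
		search_prio = max(search_prio, p->static_prio);
		return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE,
			USER_PRIO(search_prio)));
	}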
118 | |
119 | |
120 | Modelling deadline behaviour |
121 | ============================ |
122 | |
123 | As the accounting in this design is hard and not modified by sleep average |
124 | calculations or interactivity modifiers, it is possible to accurately |
125 | predict the maximum latency that a task may experience under different |
126 | conditions. This is a virtual deadline mechanism enforced by mandatory |
127 | timeslice expiration and not outside bandwidth measurement. |
128 | |
129 | The maximum duration a task can run during one major epoch is determined by its |
130 | nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL |
131 | duration during each epoch. Nice 10 tasks can run at 9 priority levels for each |
132 | epoch, and so on. The table in the priority matrix above demonstrates how this |
133 | is enforced. |
134 | |
135 | Therefore the maximum duration a runqueue epoch can take is determined by |
136 | the number of tasks running, and their nice level. Beyond that, the maximum |
137 | duration a task can wait before it gets scheduled is |
138 | determined by the position of its first slot on the matrix. |
139 | |
140 | The following examples are _worst case scenarios_ that would rarely |
141 | occur, but can be modelled nonetheless to determine the maximum possible |
142 | latency. |
143 | |
144 | So for example, if two nice 0 tasks are running, and one has just expired as |
145 | another is activated for the first time receiving a full quota for this |
146 | runqueue rotation, the first task will wait: |
147 | |
148 | nr_tasks * max_duration + nice_difference * rr_interval |
149 | 1 * 19 * RR_INTERVAL + 0 = 152ms |
150 | |
151 | In the presence of a nice 10 task, a nice 0 task would wait a maximum of |
152 | 1 * 10 * RR_INTERVAL + 0 = 80ms |
153 | |
154 | In the presence of a nice 0 task, a nice 10 task would wait a maximum of |
155 | 1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms |
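
   For illustration only (this helper is not part of the patch), the three
   worst-case figures above follow from the stated formula with
   RR_INTERVAL = 8ms:

	/* nr_tasks * max_duration + nice_difference * rr_interval */
	static int worst_case_ms(int nr_tasks, int prio_slots, int nice_diff_slots)
	{
		return nr_tasks * prio_slots * 8 + nice_diff_slots * 8;
	}

	/* worst_case_ms(1, 19, 0) == 152, worst_case_ms(1, 10, 0) == 80,
	 * worst_case_ms(1, 19, 1) == 160 */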
156 | |
157 | More useful than these values, though, are the average latencies which are |
158 | a matter of determining the average distance between priority slots of |
159 | different nice values and multiplying them by the tasks' quota. For example |
160 | in the presence of a nice -10 task, a nice 0 task will wait either one or |
161 | two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL, |
162 | this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or |
163 | 20 and 40ms respectively (on uniprocessor at 1000HZ). |
164 | |
165 | |
166 | Achieving interactivity |
167 | ======================= |
168 | |
169 | A requirement of this scheduler design was to achieve good interactivity |
170 | despite being a completely fair deadline based design. The disadvantage of |
171 | designs that try to achieve interactivity is that they usually do so at |
172 | the expense of maintaining fairness. As cpu speeds increase, the requirement |
173 | for some sort of metered unfairness towards interactive tasks becomes a less |
174 | desirable phenomenon, but low latency and fairness remain mandatory for |
175 | good interactive performance. |
176 | |
177 | This design relies on the fact that interactive tasks, by their nature, |
178 | sleep often. Most fair scheduling designs end up penalising such tasks |
179 | indirectly, giving them less than their possible fair share because of the |
180 | sleep, and have to use a mechanism of bonusing their priority to offset |
181 | this based on the duration they sleep. This becomes increasingly inaccurate |
182 | as the number of running tasks rises and more tasks spend time waiting on |
183 | runqueues rather than sleeping, and it is impossible to tell whether the |
184 | task that's waiting on a runqueue only intends to run for a short period and |
185 | then sleep again after that runqueue wait. Furthermore, all such designs rely |
186 | on a period of time to pass to accumulate some form of statistic on the task |
187 | before deciding on how much to give them preference. The shorter this period, |
188 | the more rapidly bursts of cpu use ruin the interactive tasks' behaviour. The |
189 | longer this period, the longer it takes for interactive tasks to get low |
190 | scheduling latencies and fair cpu. |
191 | |
192 | This design does not measure sleep time at all. Interactive tasks that sleep |
193 | often will wake up having consumed very little if any of their quota for |
194 | the current major priority rotation. The longer they have slept, the less |
195 | likely they are to even be on the current major priority rotation. Once |
196 | woken up, though, they get to use up their full quota for that epoch, |
197 | whether only part of a quota remains or a full quota is due. Overall, however, they |
198 | can still only run as much cpu time for that epoch as any other task of the |
199 | same nice level. This means that two tasks behaving completely differently |
200 | from fully cpu bound to waking/sleeping extremely frequently will still |
201 | get the same quota of cpu, but the latter will be using its quota for that |
202 | epoch in bursts rather than continuously. This guarantees that interactive |
203 | tasks get the same amount of cpu as cpu bound ones. |
204 | |
205 | The other requirement of interactive tasks is to obtain low latencies |
206 | when they are scheduled. Unlike the maximum latencies possible for fully |
207 | cpu bound tasks described in the modelling deadline behaviour section |
208 | above, tasks that sleep will wake up with quota available usually at the |
209 | current runqueue's prio_level or better. This means that the most latency |
210 | they are likely to see is one RR_INTERVAL, and often they will preempt the |
211 | current task if it is not of a sleeping nature. This then guarantees very |
212 | low latency for interactive tasks, and the lowest latencies for the least |
213 | cpu bound tasks. |
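
   The wakeup preemption test that provides this behaviour appears in the
   kernel/sched.c hunk further down:

	/* From the patch: a woken task preempts the running task when it is
	 * on the active array with a better dynamic priority, or the cpu is
	 * effectively idle. */
	static inline int task_preempts_curr(struct task_struct *p, struct rq *rq)
	{
		struct task_struct *curr = rq->curr;

		return ((p->array == task_rq(p)->active &&
			 TASK_PREEMPTS_CURR(p, curr)) || rq_idle(rq));
	}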
214 | |
215 | |
216 | Fri, 4 May 2007 |
217 | |
218 | Signed-off-by: Con Kolivas <kernel@kolivas.org> |
219 | |
220 | --- |
221 | Documentation/sched-design.txt | 234 +++++++ |
222 | Documentation/sysctl/kernel.txt | 14 |
223 | fs/pipe.c | 7 |
224 | fs/proc/array.c | 2 |
225 | include/linux/init_task.h | 4 |
226 | include/linux/sched.h | 32 - |
227 | kernel/sched.c | 1277 +++++++++++++++++++--------------------- |
228 | kernel/softirq.c | 2 |
229 | kernel/sysctl.c | 26 |
230 | kernel/workqueue.c | 2 |
231 | 10 files changed, 908 insertions(+), 692 deletions(-) |
232 | |
233 | Index: linux-2.6.21-ck2/kernel/workqueue.c |
234 | =================================================================== |
235 | --- linux-2.6.21-ck2.orig/kernel/workqueue.c 2007-05-03 22:20:57.000000000 +1000 |
236 | +++ linux-2.6.21-ck2/kernel/workqueue.c 2007-05-14 19:30:30.000000000 +1000 |
237 | @@ -355,8 +355,6 @@ static int worker_thread(void *__cwq) |
238 | if (!cwq->freezeable) |
239 | current->flags |= PF_NOFREEZE; |
240 | |
241 | - set_user_nice(current, -5); |
242 | - |
243 | /* Block and flush all signals */ |
244 | sigfillset(&blocked); |
245 | sigprocmask(SIG_BLOCK, &blocked, NULL); |
246 | Index: linux-2.6.21-ck2/fs/proc/array.c |
247 | =================================================================== |
248 | --- linux-2.6.21-ck2.orig/fs/proc/array.c 2007-05-03 22:20:56.000000000 +1000 |
249 | +++ linux-2.6.21-ck2/fs/proc/array.c 2007-05-14 19:30:30.000000000 +1000 |
250 | @@ -165,7 +165,6 @@ static inline char * task_state(struct t |
251 | rcu_read_lock(); |
252 | buffer += sprintf(buffer, |
253 | "State:\t%s\n" |
254 | - "SleepAVG:\t%lu%%\n" |
255 | "Tgid:\t%d\n" |
256 | "Pid:\t%d\n" |
257 | "PPid:\t%d\n" |
258 | @@ -173,7 +172,6 @@ static inline char * task_state(struct t |
259 | "Uid:\t%d\t%d\t%d\t%d\n" |
260 | "Gid:\t%d\t%d\t%d\t%d\n", |
261 | get_task_state(p), |
262 | - (p->sleep_avg/1024)*100/(1020000000/1024), |
263 | p->tgid, p->pid, |
264 | pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0, |
265 | pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0, |
266 | Index: linux-2.6.21-ck2/include/linux/init_task.h |
267 | =================================================================== |
268 | --- linux-2.6.21-ck2.orig/include/linux/init_task.h 2007-05-03 22:20:57.000000000 +1000 |
269 | +++ linux-2.6.21-ck2/include/linux/init_task.h 2007-05-14 19:30:30.000000000 +1000 |
270 | @@ -102,13 +102,15 @@ extern struct group_info init_groups; |
271 | .prio = MAX_PRIO-20, \ |
272 | .static_prio = MAX_PRIO-20, \ |
273 | .normal_prio = MAX_PRIO-20, \ |
274 | + .rotation = 0, \ |
275 | .policy = SCHED_NORMAL, \ |
276 | .cpus_allowed = CPU_MASK_ALL, \ |
277 | .mm = NULL, \ |
278 | .active_mm = &init_mm, \ |
279 | .run_list = LIST_HEAD_INIT(tsk.run_list), \ |
280 | .ioprio = 0, \ |
281 | - .time_slice = HZ, \ |
282 | + .time_slice = 1000000000, \ |
283 | + .quota = 1000000000, \ |
284 | .tasks = LIST_HEAD_INIT(tsk.tasks), \ |
285 | .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \ |
286 | .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \ |
287 | Index: linux-2.6.21-ck2/include/linux/sched.h |
288 | =================================================================== |
289 | --- linux-2.6.21-ck2.orig/include/linux/sched.h 2007-05-03 22:20:57.000000000 +1000 |
290 | +++ linux-2.6.21-ck2/include/linux/sched.h 2007-05-14 19:30:30.000000000 +1000 |
291 | @@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co |
292 | #define EXIT_ZOMBIE 16 |
293 | #define EXIT_DEAD 32 |
294 | /* in tsk->state again */ |
295 | -#define TASK_NONINTERACTIVE 64 |
296 | -#define TASK_DEAD 128 |
297 | +#define TASK_DEAD 64 |
298 | |
299 | #define __set_task_state(tsk, state_value) \ |
300 | do { (tsk)->state = (state_value); } while (0) |
301 | @@ -522,8 +521,9 @@ struct signal_struct { |
302 | |
303 | #define MAX_USER_RT_PRIO 100 |
304 | #define MAX_RT_PRIO MAX_USER_RT_PRIO |
305 | +#define PRIO_RANGE (40) |
306 | |
307 | -#define MAX_PRIO (MAX_RT_PRIO + 40) |
308 | +#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE) |
309 | |
310 | #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO) |
311 | #define rt_task(p) rt_prio((p)->prio) |
312 | @@ -788,13 +788,6 @@ struct mempolicy; |
313 | struct pipe_inode_info; |
314 | struct uts_namespace; |
315 | |
316 | -enum sleep_type { |
317 | - SLEEP_NORMAL, |
318 | - SLEEP_NONINTERACTIVE, |
319 | - SLEEP_INTERACTIVE, |
320 | - SLEEP_INTERRUPTED, |
321 | -}; |
322 | - |
323 | struct prio_array; |
324 | |
325 | struct task_struct { |
326 | @@ -814,20 +807,33 @@ struct task_struct { |
327 | int load_weight; /* for niceness load balancing purposes */ |
328 | int prio, static_prio, normal_prio; |
329 | struct list_head run_list; |
330 | + /* |
331 | + * This bitmap shows what priorities this task has received quota |
332 | + * from for this major priority rotation on its current runqueue. |
333 | + */ |
334 | + DECLARE_BITMAP(bitmap, PRIO_RANGE + 1); |
335 | struct prio_array *array; |
336 | + /* Which major runqueue rotation did this task run */ |
337 | + unsigned long rotation; |
338 | |
339 | unsigned short ioprio; |
340 | #ifdef CONFIG_BLK_DEV_IO_TRACE |
341 | unsigned int btrace_seq; |
342 | #endif |
343 | - unsigned long sleep_avg; |
344 | unsigned long long timestamp, last_ran; |
345 | unsigned long long sched_time; /* sched_clock time spent running */ |
346 | - enum sleep_type sleep_type; |
347 | |
348 | unsigned long policy; |
349 | cpumask_t cpus_allowed; |
350 | - unsigned int time_slice, first_time_slice; |
351 | + /* |
352 | + * How much this task is entitled to run at the current priority |
353 | + * before being requeued at a lower priority. |
354 | + */ |
355 | + int time_slice; |
356 | + /* Is this the very first time_slice this task has ever run. */ |
357 | + unsigned int first_time_slice; |
358 | + /* How much this task receives at each priority level */ |
359 | + int quota; |
360 | |
361 | #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) |
362 | struct sched_info sched_info; |
363 | Index: linux-2.6.21-ck2/kernel/sched.c |
364 | =================================================================== |
365 | --- linux-2.6.21-ck2.orig/kernel/sched.c 2007-05-03 22:20:57.000000000 +1000 |
366 | +++ linux-2.6.21-ck2/kernel/sched.c 2007-05-14 19:30:30.000000000 +1000 |
367 | @@ -16,6 +16,7 @@ |
368 | * by Davide Libenzi, preemptible kernel bits by Robert Love. |
369 | * 2003-09-03 Interactivity tuning by Con Kolivas. |
370 | * 2004-04-02 Scheduler domains code by Nick Piggin |
371 | + * 2007-03-02 Staircase deadline scheduling policy by Con Kolivas |
372 | */ |
373 | |
374 | #include <linux/mm.h> |
375 | @@ -52,6 +53,7 @@ |
376 | #include <linux/tsacct_kern.h> |
377 | #include <linux/kprobes.h> |
378 | #include <linux/delayacct.h> |
379 | +#include <linux/log2.h> |
380 | #include <asm/tlb.h> |
381 | |
382 | #include <asm/unistd.h> |
383 | @@ -83,126 +85,72 @@ unsigned long long __attribute__((weak)) |
384 | #define USER_PRIO(p) ((p)-MAX_RT_PRIO) |
385 | #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio) |
386 | #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO)) |
387 | +#define SCHED_PRIO(p) ((p)+MAX_RT_PRIO) |
388 | |
389 | -/* |
390 | - * Some helpers for converting nanosecond timing to jiffy resolution |
391 | - */ |
392 | -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ)) |
393 | +/* Some helpers for converting to/from various scales.*/ |
394 | #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ)) |
395 | - |
396 | -/* |
397 | - * These are the 'tuning knobs' of the scheduler: |
398 | - * |
399 | - * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger), |
400 | - * default timeslice is 100 msecs, maximum timeslice is 800 msecs. |
401 | - * Timeslices get refilled after they expire. |
402 | - */ |
403 | -#define MIN_TIMESLICE max(5 * HZ / 1000, 1) |
404 | -#define DEF_TIMESLICE (100 * HZ / 1000) |
405 | -#define ON_RUNQUEUE_WEIGHT 30 |
406 | -#define CHILD_PENALTY 95 |
407 | -#define PARENT_PENALTY 100 |
408 | -#define EXIT_WEIGHT 3 |
409 | -#define PRIO_BONUS_RATIO 25 |
410 | -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) |
411 | -#define INTERACTIVE_DELTA 2 |
412 | -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS) |
413 | -#define STARVATION_LIMIT (MAX_SLEEP_AVG) |
414 | -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) |
415 | - |
416 | -/* |
417 | - * If a task is 'interactive' then we reinsert it in the active |
418 | - * array after it has expired its current timeslice. (it will not |
419 | - * continue to run immediately, it will still roundrobin with |
420 | - * other interactive tasks.) |
421 | - * |
422 | - * This part scales the interactivity limit depending on niceness. |
423 | - * |
424 | - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. |
425 | - * Here are a few examples of different nice levels: |
426 | - * |
427 | - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] |
428 | - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] |
429 | - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] |
430 | - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] |
431 | - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] |
432 | - * |
433 | - * (the X axis represents the possible -5 ... 0 ... +5 dynamic |
434 | - * priority range a task can explore, a value of '1' means the |
435 | - * task is rated interactive.) |
436 | - * |
437 | - * Ie. nice +19 tasks can never get 'interactive' enough to be |
438 | - * reinserted into the active array. And only heavily CPU-hog nice -20 |
439 | - * tasks will be expired. Default nice 0 tasks are somewhere between, |
440 | - * it takes some effort for them to get interactive, but it's not |
441 | - * too hard. |
442 | - */ |
443 | - |
444 | -#define CURRENT_BONUS(p) \ |
445 | - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ |
446 | - MAX_SLEEP_AVG) |
447 | - |
448 | -#define GRANULARITY (10 * HZ / 1000 ? : 1) |
449 | - |
450 | -#ifdef CONFIG_SMP |
451 | -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ |
452 | - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ |
453 | - num_online_cpus()) |
454 | -#else |
455 | -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ |
456 | - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) |
457 | -#endif |
458 | - |
459 | -#define SCALE(v1,v1_max,v2_max) \ |
460 | - (v1) * (v2_max) / (v1_max) |
461 | - |
462 | -#define DELTA(p) \ |
463 | - (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \ |
464 | - INTERACTIVE_DELTA) |
465 | - |
466 | -#define TASK_INTERACTIVE(p) \ |
467 | - ((p)->prio <= (p)->static_prio - DELTA(p)) |
468 | - |
469 | -#define INTERACTIVE_SLEEP(p) \ |
470 | - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ |
471 | - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) |
472 | - |
473 | -#define TASK_PREEMPTS_CURR(p, rq) \ |
474 | - ((p)->prio < (rq)->curr->prio) |
475 | - |
476 | -#define SCALE_PRIO(x, prio) \ |
477 | - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) |
478 | - |
479 | -static unsigned int static_prio_timeslice(int static_prio) |
480 | -{ |
481 | - if (static_prio < NICE_TO_PRIO(0)) |
482 | - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); |
483 | - else |
484 | - return SCALE_PRIO(DEF_TIMESLICE, static_prio); |
485 | -} |
486 | - |
487 | -/* |
488 | - * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] |
489 | - * to time slice values: [800ms ... 100ms ... 5ms] |
490 | - * |
491 | - * The higher a thread's priority, the bigger timeslices |
492 | - * it gets during one round of execution. But even the lowest |
493 | - * priority thread gets MIN_TIMESLICE worth of execution time. |
494 | +#define MS_TO_NS(TIME) ((TIME) * 1000000) |
495 | +#define MS_TO_US(TIME) ((TIME) * 1000) |
496 | +#define US_TO_MS(TIME) ((TIME) / 1000) |
497 | + |
498 | +#define TASK_PREEMPTS_CURR(p, curr) ((p)->prio < (curr)->prio) |
499 | + |
500 | +/* |
501 | + * This is the time all tasks within the same priority round robin. |
502 | + * Value is in ms and set to a minimum of 8ms. Scales with number of cpus. |
503 | + * Tunable via /proc interface. |
504 | + */ |
505 | +int rr_interval __read_mostly = 8; |
506 | + |
507 | +/* |
508 | + * This contains a bitmap for each dynamic priority level with empty slots |
509 | + * for the valid priorities each different nice level can have. It allows |
510 | + * us to stagger the slots where differing priorities run in a way that |
511 | + * keeps latency differences between different nice levels at a minimum. |
512 | + * The purpose of a pre-generated matrix is for rapid lookup of next slot in |
513 | + * O(1) time without having to recalculate every time priority gets demoted. |
514 | + * All nice levels use priority slot 39 as this allows less niced tasks to |
515 | + * get all priority slots better than that before expiration is forced. |
516 | + * ie, where 0 means a slot for that priority, priority running from left to |
517 | + * right is from prio 0 to prio 39: |
518 | + * nice -20 0000000000000000000000000000000000000000 |
519 | + * nice -10 1000100010001000100010001000100010010000 |
520 | + * nice 0 1010101010101010101010101010101010101010 |
521 | + * nice 5 1011010110110101101101011011010110110110 |
522 | + * nice 10 1110111011101110111011101110111011101110 |
523 | + * nice 15 1111111011111110111111101111111011111110 |
524 | + * nice 19 1111111111111111111111111111111111111110 |
525 | */ |
526 | +static unsigned long prio_matrix[PRIO_RANGE][BITS_TO_LONGS(PRIO_RANGE)] |
527 | + __read_mostly; |
528 | |
529 | -static inline unsigned int task_timeslice(struct task_struct *p) |
530 | -{ |
531 | - return static_prio_timeslice(p->static_prio); |
532 | -} |
533 | +struct rq; |
534 | |
535 | /* |
536 | * These are the runqueue data structures: |
537 | */ |
538 | - |
539 | struct prio_array { |
540 | - unsigned int nr_active; |
541 | - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */ |
542 | + /* Tasks queued at each priority */ |
543 | struct list_head queue[MAX_PRIO]; |
544 | + |
545 | + /* |
546 | + * The bitmap of priorities queued for this array. While the expired |
547 | + * array will never have realtime tasks on it, it is simpler to have |
548 | + * equal sized bitmaps for a cheap array swap. Include 1 bit for |
549 | + * delimiter. |
550 | + */ |
551 | + DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1); |
552 | + |
553 | + /* |
554 | + * The best static priority (of the dynamic priority tasks) queued |
555 | + * this array. |
556 | + */ |
557 | + int best_static_prio; |
558 | + |
559 | +#ifdef CONFIG_SMP |
560 | + /* For convenience looks back at rq */ |
561 | + struct rq *rq; |
562 | +#endif |
563 | }; |
564 | |
565 | /* |
566 | @@ -234,14 +182,24 @@ struct rq { |
567 | */ |
568 | unsigned long nr_uninterruptible; |
569 | |
570 | - unsigned long expired_timestamp; |
571 | /* Cached timestamp set by update_cpu_clock() */ |
572 | unsigned long long most_recent_timestamp; |
573 | struct task_struct *curr, *idle; |
574 | unsigned long next_balance; |
575 | struct mm_struct *prev_mm; |
576 | + |
577 | struct prio_array *active, *expired, arrays[2]; |
578 | - int best_expired_prio; |
579 | + unsigned long *dyn_bitmap, *exp_bitmap; |
580 | + |
581 | + /* |
582 | + * The current dynamic priority level this runqueue is at per static |
583 | + * priority level. |
584 | + */ |
585 | + int prio_level[PRIO_RANGE]; |
586 | + |
587 | + /* How many times we have rotated the priority queue */ |
588 | + unsigned long prio_rotation; |
589 | + |
590 | atomic_t nr_iowait; |
591 | |
592 | #ifdef CONFIG_SMP |
593 | @@ -579,12 +537,9 @@ static inline struct rq *this_rq_lock(vo |
594 | #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) |
595 | /* |
596 | * Called when a process is dequeued from the active array and given |
597 | - * the cpu. We should note that with the exception of interactive |
598 | - * tasks, the expired queue will become the active queue after the active |
599 | - * queue is empty, without explicitly dequeuing and requeuing tasks in the |
600 | - * expired queue. (Interactive tasks may be requeued directly to the |
601 | - * active queue, thus delaying tasks in the expired queue from running; |
602 | - * see scheduler_tick()). |
603 | + * the cpu. We should note that the expired queue will become the active |
604 | + * queue after the active queue is empty, without explicitly dequeuing and |
605 | + * requeuing tasks in the expired queue. |
606 | * |
607 | * This function is only called from sched_info_arrive(), rather than |
608 | * dequeue_task(). Even though a task may be queued and dequeued multiple |
609 | @@ -682,71 +637,227 @@ sched_info_switch(struct task_struct *pr |
610 | #define sched_info_switch(t, next) do { } while (0) |
611 | #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ |
612 | |
613 | +static inline int task_queued(struct task_struct *task) |
614 | +{ |
615 | + return !list_empty(&task->run_list); |
616 | +} |
617 | + |
618 | +static inline void set_dynamic_bit(struct task_struct *p, struct rq *rq) |
619 | +{ |
620 | + __set_bit(p->prio, p->array->prio_bitmap); |
621 | +} |
622 | + |
623 | /* |
624 | - * Adding/removing a task to/from a priority array: |
625 | + * Removing from a runqueue. |
626 | */ |
627 | -static void dequeue_task(struct task_struct *p, struct prio_array *array) |
628 | +static void dequeue_task(struct task_struct *p, struct rq *rq) |
629 | { |
630 | - array->nr_active--; |
631 | - list_del(&p->run_list); |
632 | - if (list_empty(array->queue + p->prio)) |
633 | - __clear_bit(p->prio, array->bitmap); |
634 | + list_del_init(&p->run_list); |
635 | + if (list_empty(p->array->queue + p->prio)) |
636 | + __clear_bit(p->prio, p->array->prio_bitmap); |
637 | } |
638 | |
639 | -static void enqueue_task(struct task_struct *p, struct prio_array *array) |
640 | +static void reset_first_time_slice(struct task_struct *p) |
641 | { |
642 | - sched_info_queued(p); |
643 | - list_add_tail(&p->run_list, array->queue + p->prio); |
644 | - __set_bit(p->prio, array->bitmap); |
645 | - array->nr_active++; |
646 | + if (unlikely(p->first_time_slice)) |
647 | + p->first_time_slice = 0; |
648 | +} |
649 | + |
650 | +/* |
651 | + * The task is being queued on a fresh array so it has its entitlement |
652 | + * bitmap cleared. |
653 | + */ |
654 | +static void task_new_array(struct task_struct *p, struct rq *rq, |
655 | + struct prio_array *array) |
656 | +{ |
657 | + bitmap_zero(p->bitmap, PRIO_RANGE); |
658 | + p->rotation = rq->prio_rotation; |
659 | + p->time_slice = p->quota; |
660 | p->array = array; |
661 | + reset_first_time_slice(p); |
662 | +} |
663 | + |
664 | +/* Find the first slot from the relevant prio_matrix entry */ |
665 | +static int first_prio_slot(struct task_struct *p) |
666 | +{ |
667 | + if (unlikely(p->policy == SCHED_BATCH)) |
668 | + return p->static_prio; |
669 | + return SCHED_PRIO(find_first_zero_bit( |
670 | + prio_matrix[USER_PRIO(p->static_prio)], PRIO_RANGE)); |
671 | } |
672 | |
673 | /* |
674 | - * Put task to the end of the run list without the overhead of dequeue |
675 | - * followed by enqueue. |
676 | + * Find the first unused slot by this task that is also in its prio_matrix |
677 | + * level. SCHED_BATCH tasks do not use the priority matrix. They only take |
678 | + * priority slots from their static_prio and above. |
679 | */ |
680 | -static void requeue_task(struct task_struct *p, struct prio_array *array) |
681 | +static int next_entitled_slot(struct task_struct *p, struct rq *rq) |
682 | { |
683 | - list_move_tail(&p->run_list, array->queue + p->prio); |
684 | + int search_prio = MAX_RT_PRIO, uprio = USER_PRIO(p->static_prio); |
685 | + struct prio_array *array = rq->active; |
686 | + DECLARE_BITMAP(tmp, PRIO_RANGE); |
687 | + |
688 | + /* |
689 | + * Go straight to expiration if there are higher priority tasks |
690 | + * already expired. |
691 | + */ |
692 | + if (p->static_prio > rq->expired->best_static_prio) |
693 | + return MAX_PRIO; |
694 | + if (!rq->prio_level[uprio]) |
695 | + rq->prio_level[uprio] = MAX_RT_PRIO; |
696 | + /* |
697 | + * Only priorities equal to the prio_level and above for their |
698 | + * static_prio are acceptable, and only if it's not better than |
699 | + * a queued better static_prio's prio_level. |
700 | + */ |
701 | + if (p->static_prio < array->best_static_prio) { |
702 | + if (likely(p->policy != SCHED_BATCH)) |
703 | + array->best_static_prio = p->static_prio; |
704 | + } else if (p->static_prio == array->best_static_prio) { |
705 | + search_prio = rq->prio_level[uprio]; |
706 | + } else { |
707 | + int i; |
708 | + |
709 | + search_prio = rq->prio_level[uprio]; |
710 | + /* A bound O(n) function, worst case n is 40 */ |
711 | + for (i = array->best_static_prio; i <= p->static_prio ; i++) { |
712 | + if (!rq->prio_level[USER_PRIO(i)]) |
713 | + rq->prio_level[USER_PRIO(i)] = MAX_RT_PRIO; |
714 | + search_prio = max(search_prio, |
715 | + rq->prio_level[USER_PRIO(i)]); |
716 | + } |
717 | + } |
718 | + if (unlikely(p->policy == SCHED_BATCH)) { |
719 | + search_prio = max(search_prio, p->static_prio); |
720 | + return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE, |
721 | + USER_PRIO(search_prio))); |
722 | + } |
723 | + bitmap_or(tmp, p->bitmap, prio_matrix[uprio], PRIO_RANGE); |
724 | + return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE, |
725 | + USER_PRIO(search_prio))); |
726 | +} |
727 | + |
728 | +static void queue_expired(struct task_struct *p, struct rq *rq) |
729 | +{ |
730 | + task_new_array(p, rq, rq->expired); |
731 | + p->prio = p->normal_prio = first_prio_slot(p); |
732 | + if (p->static_prio < rq->expired->best_static_prio) |
733 | + rq->expired->best_static_prio = p->static_prio; |
734 | + reset_first_time_slice(p); |
735 | } |
736 | |
737 | -static inline void |
738 | -enqueue_task_head(struct task_struct *p, struct prio_array *array) |
739 | +#ifdef CONFIG_SMP |
740 | +/* |
741 | + * If we're waking up a task that was previously on a different runqueue, |
742 | + * update its data appropriately. Note we may be reading data from src_rq-> |
743 | + * outside of lock, but the occasional inaccurate result should be harmless. |
744 | + */ |
745 | + static void update_if_moved(struct task_struct *p, struct rq *rq) |
746 | +{ |
747 | + struct rq *src_rq = p->array->rq; |
748 | + |
749 | + if (src_rq == rq) |
750 | + return; |
751 | + /* |
752 | + * Only need to set p->array when p->rotation == rq->prio_rotation as |
753 | + * they will be set in recalc_task_prio when != rq->prio_rotation. |
754 | + */ |
755 | + if (p->rotation == src_rq->prio_rotation) { |
756 | + p->rotation = rq->prio_rotation; |
757 | + if (p->array == src_rq->expired) |
758 | + p->array = rq->expired; |
759 | + else |
760 | + p->array = rq->active; |
761 | + } else |
762 | + p->rotation = 0; |
763 | +} |
764 | +#else |
765 | +static inline void update_if_moved(struct task_struct *p, struct rq *rq) |
766 | { |
767 | - list_add(&p->run_list, array->queue + p->prio); |
768 | - __set_bit(p->prio, array->bitmap); |
769 | - array->nr_active++; |
770 | - p->array = array; |
771 | } |
772 | +#endif |
773 | |
774 | /* |
775 | - * __normal_prio - return the priority that is based on the static |
776 | - * priority but is modified by bonuses/penalties. |
777 | - * |
778 | - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] |
779 | - * into the -5 ... 0 ... +5 bonus/penalty range. |
780 | - * |
781 | - * We use 25% of the full 0...39 priority range so that: |
782 | - * |
783 | - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. |
784 | - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. |
785 | - * |
786 | - * Both properties are important to certain workloads. |
787 | + * recalc_task_prio determines what priority a non rt_task will be |
788 | + * queued at. If the task has already been running during this runqueue's |
789 | + * major rotation (rq->prio_rotation) then it continues at the same |
790 | + * priority if it has tick entitlement left. If it does not have entitlement |
791 | + * left, it finds the next priority slot according to its nice value that it |
792 | + * has not extracted quota from. If it has not run during this major |
793 | + * rotation, it starts at the next_entitled_slot and has its bitmap quota |
794 | + * cleared. If it does not have any slots left it has all its slots reset and |
795 | + * is queued on the expired at its first_prio_slot. |
796 | */ |
797 | +static void recalc_task_prio(struct task_struct *p, struct rq *rq) |
798 | +{ |
799 | + struct prio_array *array = rq->active; |
800 | + int queue_prio; |
801 | |
802 | -static inline int __normal_prio(struct task_struct *p) |
803 | + update_if_moved(p, rq); |
804 | + if (p->rotation == rq->prio_rotation) { |
805 | + if (p->array == array) { |
806 | + if (p->time_slice > 0) |
807 | + return; |
808 | + p->time_slice = p->quota; |
809 | + } else if (p->array == rq->expired) { |
810 | + queue_expired(p, rq); |
811 | + return; |
812 | + } else |
813 | + task_new_array(p, rq, array); |
814 | + } else |
815 | + task_new_array(p, rq, array); |
816 | + |
817 | + queue_prio = next_entitled_slot(p, rq); |
818 | + if (queue_prio >= MAX_PRIO) { |
819 | + queue_expired(p, rq); |
820 | + return; |
821 | + } |
822 | + p->prio = p->normal_prio = queue_prio; |
823 | + __set_bit(USER_PRIO(p->prio), p->bitmap); |
824 | +} |
825 | + |
826 | +/* |
827 | + * Adding to a runqueue. The dynamic priority queue that it is added to is |
828 | + * determined by recalc_task_prio() above. |
829 | + */ |
830 | +static inline void __enqueue_task(struct task_struct *p, struct rq *rq) |
831 | +{ |
832 | + if (rt_task(p)) |
833 | + p->array = rq->active; |
834 | + else |
835 | + recalc_task_prio(p, rq); |
836 | + |
837 | + sched_info_queued(p); |
838 | + set_dynamic_bit(p, rq); |
839 | +} |
840 | + |
841 | +static void enqueue_task(struct task_struct *p, struct rq *rq) |
842 | { |
843 | - int bonus, prio; |
844 | + __enqueue_task(p, rq); |
845 | + list_add_tail(&p->run_list, p->array->queue + p->prio); |
846 | +} |
847 | |
848 | - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2; |
849 | +static inline void enqueue_task_head(struct task_struct *p, struct rq *rq) |
850 | +{ |
851 | + __enqueue_task(p, rq); |
852 | + list_add(&p->run_list, p->array->queue + p->prio); |
853 | +} |
854 | |
855 | - prio = p->static_prio - bonus; |
856 | - if (prio < MAX_RT_PRIO) |
857 | - prio = MAX_RT_PRIO; |
858 | - if (prio > MAX_PRIO-1) |
859 | - prio = MAX_PRIO-1; |
860 | - return prio; |
861 | +/* |
862 | + * requeue_task is only called when p->static_prio does not change. p->prio |
863 | + * can change with dynamic tasks. |
864 | + */ |
865 | +static void requeue_task(struct task_struct *p, struct rq *rq, |
866 | + struct prio_array *old_array, int old_prio) |
867 | +{ |
868 | + if (p->array == rq->expired) |
869 | + queue_expired(p, rq); |
870 | + list_move_tail(&p->run_list, p->array->queue + p->prio); |
871 | + if (!rt_task(p)) { |
872 | + if (list_empty(old_array->queue + old_prio)) |
873 | + __clear_bit(old_prio, old_array->prio_bitmap); |
874 | + set_dynamic_bit(p, rq); |
875 | + } |
876 | } |
877 | |
878 | /* |
879 | @@ -759,17 +870,24 @@ static inline int __normal_prio(struct t |
880 | */ |
881 | |
882 | /* |
883 | - * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE |
884 | - * If static_prio_timeslice() is ever changed to break this assumption then |
885 | - * this code will need modification |
886 | - */ |
887 | -#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE |
888 | -#define LOAD_WEIGHT(lp) \ |
889 | - (((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO) |
890 | -#define PRIO_TO_LOAD_WEIGHT(prio) \ |
891 | - LOAD_WEIGHT(static_prio_timeslice(prio)) |
892 | -#define RTPRIO_TO_LOAD_WEIGHT(rp) \ |
893 | - (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp)) |
894 | + * task_timeslice - the total duration a task can run during one major |
895 | + * rotation. Returns value in milliseconds as the smallest value can be 1. |
896 | + */ |
897 | +static int task_timeslice(struct task_struct *p) |
898 | +{ |
899 | + int slice = p->quota; /* quota is in us */ |
900 | + |
901 | + if (!rt_task(p)) |
902 | + slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice; |
903 | + return US_TO_MS(slice); |
904 | +} |
905 | + |
906 | +/* |
907 | + * The load weight is basically the task_timeslice in ms. Realtime tasks are |
908 | + * special cased to be proportionately larger than nice -20 by their |
909 | + * rt_priority. The weight for rt tasks can only be arbitrary at best. |
910 | + */ |
911 | +#define RTPRIO_TO_LOAD_WEIGHT(rp) (rr_interval * 20 * (40 + rp)) |
912 | |
913 | static void set_load_weight(struct task_struct *p) |
914 | { |
915 | @@ -786,7 +904,7 @@ static void set_load_weight(struct task_ |
916 | #endif |
917 | p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority); |
918 | } else |
919 | - p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio); |
920 | + p->load_weight = task_timeslice(p); |
921 | } |
922 | |
923 | static inline void |
924 | @@ -814,28 +932,38 @@ static inline void dec_nr_running(struct |
925 | } |
926 | |
927 | /* |
928 | - * Calculate the expected normal priority: i.e. priority |
929 | - * without taking RT-inheritance into account. Might be |
930 | - * boosted by interactivity modifiers. Changes upon fork, |
931 | - * setprio syscalls, and whenever the interactivity |
932 | - * estimator recalculates. |
933 | + * __activate_task - move a task to the runqueue. |
934 | */ |
935 | -static inline int normal_prio(struct task_struct *p) |
936 | +static inline void __activate_task(struct task_struct *p, struct rq *rq) |
937 | +{ |
938 | + enqueue_task(p, rq); |
939 | + inc_nr_running(p, rq); |
940 | +} |
941 | + |
942 | +/* |
943 | + * __activate_idle_task - move idle task to the _front_ of runqueue. |
944 | + */ |
945 | +static inline void __activate_idle_task(struct task_struct *p, struct rq *rq) |
946 | { |
947 | - int prio; |
948 | + enqueue_task_head(p, rq); |
949 | + inc_nr_running(p, rq); |
950 | +} |
951 | |
952 | +static inline int normal_prio(struct task_struct *p) |
953 | +{ |
954 | if (has_rt_policy(p)) |
955 | - prio = MAX_RT_PRIO-1 - p->rt_priority; |
956 | + return MAX_RT_PRIO-1 - p->rt_priority; |
957 | + /* Other tasks all have normal_prio set in recalc_task_prio */ |
958 | + if (likely(p->prio >= MAX_RT_PRIO && p->prio < MAX_PRIO)) |
959 | + return p->prio; |
960 | else |
961 | - prio = __normal_prio(p); |
962 | - return prio; |
963 | + return p->static_prio; |
964 | } |
965 | |
966 | /* |
967 | * Calculate the current priority, i.e. the priority |
968 | * taken into account by the scheduler. This value might |
969 | - * be boosted by RT tasks, or might be boosted by |
970 | - * interactivity modifiers. Will be RT if the task got |
971 | + * be boosted by RT tasks as it will be RT if the task got |
972 | * RT-boosted. If not then it returns p->normal_prio. |
973 | */ |
974 | static int effective_prio(struct task_struct *p) |
975 | @@ -852,111 +980,41 @@ static int effective_prio(struct task_st |
976 | } |
977 | |
978 | /* |
979 | - * __activate_task - move a task to the runqueue. |
980 | + * All tasks have quotas based on rr_interval. RT tasks all get rr_interval. |
981 | + * From nice 1 to 19 they are smaller than it only if they are at least one |
982 | + * tick still. Below nice 0 they get progressively larger. |
983 | + * ie nice -6..0 = rr_interval. nice -10 = 2.5 * rr_interval |
984 | + * nice -20 = 10 * rr_interval. nice 1-19 = rr_interval / 2. |
985 | + * Value returned is in microseconds. |
986 | */ |
987 | -static void __activate_task(struct task_struct *p, struct rq *rq) |
988 | +static inline unsigned int rr_quota(struct task_struct *p) |
989 | { |
990 | - struct prio_array *target = rq->active; |
991 | - |
992 | - if (batch_task(p)) |
993 | - target = rq->expired; |
994 | - enqueue_task(p, target); |
995 | - inc_nr_running(p, rq); |
996 | -} |
997 | + int nice = TASK_NICE(p), rr = rr_interval; |
998 | |
999 | -/* |
1000 | - * __activate_idle_task - move idle task to the _front_ of runqueue. |
1001 | - */ |
1002 | -static inline void __activate_idle_task(struct task_struct *p, struct rq *rq) |
1003 | -{ |
1004 | - enqueue_task_head(p, rq->active); |
1005 | - inc_nr_running(p, rq); |
1006 | + if (!rt_task(p)) { |
1007 | + if (nice < -6) { |
1008 | + rr *= nice * nice; |
1009 | + rr /= 40; |
1010 | + } else if (nice > 0) |
1011 | + rr = rr / 2 ? : 1; |
1012 | + } |
1013 | + return MS_TO_US(rr); |
1014 | } |
1015 | |
1016 | -/* |
1017 | - * Recalculate p->normal_prio and p->prio after having slept, |
1018 | - * updating the sleep-average too: |
1019 | - */ |
1020 | -static int recalc_task_prio(struct task_struct *p, unsigned long long now) |
1021 | +/* Every time we set the quota we need to set the load weight */ |
1022 | +static void set_quota(struct task_struct *p) |
1023 | { |
1024 | - /* Caller must always ensure 'now >= p->timestamp' */ |
1025 | - unsigned long sleep_time = now - p->timestamp; |
1026 | - |
1027 | - if (batch_task(p)) |
1028 | - sleep_time = 0; |
1029 | - |
1030 | - if (likely(sleep_time > 0)) { |
1031 | - /* |
1032 | - * This ceiling is set to the lowest priority that would allow |
1033 | - * a task to be reinserted into the active array on timeslice |
1034 | - * completion. |
1035 | - */ |
1036 | - unsigned long ceiling = INTERACTIVE_SLEEP(p); |
1037 | - |
1038 | - if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) { |
1039 | - /* |
1040 | - * Prevents user tasks from achieving best priority |
1041 | - * with one single large enough sleep. |
1042 | - */ |
1043 | - p->sleep_avg = ceiling; |
1044 | - /* |
1045 | - * Using INTERACTIVE_SLEEP() as a ceiling places a |
1046 | - * nice(0) task 1ms sleep away from promotion, and |
1047 | - * gives it 700ms to round-robin with no chance of |
1048 | - * being demoted. This is more than generous, so |
1049 | - * mark this sleep as non-interactive to prevent the |
1050 | - * on-runqueue bonus logic from intervening should |
1051 | - * this task not receive cpu immediately. |
1052 | - */ |
1053 | - p->sleep_type = SLEEP_NONINTERACTIVE; |
1054 | - } else { |
1055 | - /* |
1056 | - * Tasks waking from uninterruptible sleep are |
1057 | - * limited in their sleep_avg rise as they |
1058 | - * are likely to be waiting on I/O |
1059 | - */ |
1060 | - if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) { |
1061 | - if (p->sleep_avg >= ceiling) |
1062 | - sleep_time = 0; |
1063 | - else if (p->sleep_avg + sleep_time >= |
1064 | - ceiling) { |
1065 | - p->sleep_avg = ceiling; |
1066 | - sleep_time = 0; |
1067 | - } |
1068 | - } |
1069 | - |
1070 | - /* |
1071 | - * This code gives a bonus to interactive tasks. |
1072 | - * |
1073 | - * The boost works by updating the 'average sleep time' |
1074 | - * value here, based on ->timestamp. The more time a |
1075 | - * task spends sleeping, the higher the average gets - |
1076 | - * and the higher the priority boost gets as well. |
1077 | - */ |
1078 | - p->sleep_avg += sleep_time; |
1079 | - |
1080 | - } |
1081 | - if (p->sleep_avg > NS_MAX_SLEEP_AVG) |
1082 | - p->sleep_avg = NS_MAX_SLEEP_AVG; |
1083 | - } |
1084 | - |
1085 | - return effective_prio(p); |
1086 | + p->quota = rr_quota(p); |
1087 | + set_load_weight(p); |
1088 | } |
1089 | |
1090 | /* |
1091 | * activate_task - move a task to the runqueue and do priority recalculation |
1092 | - * |
1093 | - * Update all the scheduling statistics stuff. (sleep average |
1094 | - * calculation, priority modifiers, etc.) |
1095 | */ |
1096 | static void activate_task(struct task_struct *p, struct rq *rq, int local) |
1097 | { |
1098 | - unsigned long long now; |
1099 | - |
1100 | - if (rt_task(p)) |
1101 | - goto out; |
1102 | + unsigned long long now = sched_clock(); |
1103 | |
1104 | - now = sched_clock(); |
1105 | #ifdef CONFIG_SMP |
1106 | if (!local) { |
1107 | /* Compensate for drifting sched_clock */ |
1108 | @@ -977,32 +1035,9 @@ static void activate_task(struct task_st |
1109 | (now - p->timestamp) >> 20); |
1110 | } |
1111 | |
1112 | - p->prio = recalc_task_prio(p, now); |
1113 | - |
1114 | - /* |
1115 | - * This checks to make sure it's not an uninterruptible task |
1116 | - * that is now waking up. |
1117 | - */ |
1118 | - if (p->sleep_type == SLEEP_NORMAL) { |
1119 | - /* |
1120 | - * Tasks which were woken up by interrupts (ie. hw events) |
1121 | - * are most likely of interactive nature. So we give them |
1122 | - * the credit of extending their sleep time to the period |
1123 | - * of time they spend on the runqueue, waiting for execution |
1124 | - * on a CPU, first time around: |
1125 | - */ |
1126 | - if (in_interrupt()) |
1127 | - p->sleep_type = SLEEP_INTERRUPTED; |
1128 | - else { |
1129 | - /* |
1130 | - * Normal first-time wakeups get a credit too for |
1131 | - * on-runqueue time, but it will be weighted down: |
1132 | - */ |
1133 | - p->sleep_type = SLEEP_INTERACTIVE; |
1134 | - } |
1135 | - } |
1136 | + set_quota(p); |
1137 | + p->prio = effective_prio(p); |
1138 | p->timestamp = now; |
1139 | -out: |
1140 | __activate_task(p, rq); |
1141 | } |
1142 | |
1143 | @@ -1012,8 +1047,7 @@ out: |
1144 | static void deactivate_task(struct task_struct *p, struct rq *rq) |
1145 | { |
1146 | dec_nr_running(p, rq); |
1147 | - dequeue_task(p, p->array); |
1148 | - p->array = NULL; |
1149 | + dequeue_task(p, rq); |
1150 | } |
1151 | |
1152 | /* |
1153 | @@ -1095,7 +1129,7 @@ migrate_task(struct task_struct *p, int |
1154 | * If the task is not on a runqueue (and not running), then |
1155 | * it is sufficient to simply update the task's cpu field. |
1156 | */ |
1157 | - if (!p->array && !task_running(rq, p)) { |
1158 | + if (!task_queued(p) && !task_running(rq, p)) { |
1159 | set_task_cpu(p, dest_cpu); |
1160 | return 0; |
1161 | } |
1162 | @@ -1126,7 +1160,7 @@ void wait_task_inactive(struct task_stru |
1163 | repeat: |
1164 | rq = task_rq_lock(p, &flags); |
1165 | /* Must be off runqueue entirely, not preempted. */ |
1166 | - if (unlikely(p->array || task_running(rq, p))) { |
1167 | + if (unlikely(task_queued(p) || task_running(rq, p))) { |
1168 | /* If it's preempted, we yield. It could be a while. */ |
1169 | preempted = !task_running(rq, p); |
1170 | task_rq_unlock(rq, &flags); |
1171 | @@ -1391,6 +1425,31 @@ static inline int wake_idle(int cpu, str |
1172 | } |
1173 | #endif |
1174 | |
1175 | +/* |
1176 | + * We need to have a special definition for an idle runqueue when testing |
1177 | + * for preemption on CONFIG_HOTPLUG_CPU as the idle task may be scheduled as |
1178 | + * a realtime task in sched_idle_next. |
1179 | + */ |
1180 | +#ifdef CONFIG_HOTPLUG_CPU |
1181 | +#define rq_idle(rq) ((rq)->curr == (rq)->idle && !rt_task((rq)->curr)) |
1182 | +#else |
1183 | +#define rq_idle(rq) ((rq)->curr == (rq)->idle) |
1184 | +#endif |
1185 | + |
1186 | +static inline int task_preempts_curr(struct task_struct *p, struct rq *rq) |
1187 | +{ |
1188 | + struct task_struct *curr = rq->curr; |
1189 | + |
1190 | + return ((p->array == task_rq(p)->active && |
1191 | + TASK_PREEMPTS_CURR(p, curr)) || rq_idle(rq)); |
1192 | +} |
1193 | + |
1194 | +static inline void try_preempt(struct task_struct *p, struct rq *rq) |
1195 | +{ |
1196 | + if (task_preempts_curr(p, rq)) |
1197 | + resched_task(rq->curr); |
1198 | +} |
1199 | + |
1200 | /*** |
1201 | * try_to_wake_up - wake up a thread |
1202 | * @p: the to-be-woken-up thread |
1203 | @@ -1422,7 +1481,7 @@ static int try_to_wake_up(struct task_st |
1204 | if (!(old_state & state)) |
1205 | goto out; |
1206 | |
1207 | - if (p->array) |
1208 | + if (task_queued(p)) |
1209 | goto out_running; |
1210 | |
1211 | cpu = task_cpu(p); |
1212 | @@ -1515,7 +1574,7 @@ out_set_cpu: |
1213 | old_state = p->state; |
1214 | if (!(old_state & state)) |
1215 | goto out; |
1216 | - if (p->array) |
1217 | + if (task_queued(p)) |
1218 | goto out_running; |
1219 | |
1220 | this_cpu = smp_processor_id(); |
1221 | @@ -1524,25 +1583,9 @@ out_set_cpu: |
1222 | |
1223 | out_activate: |
1224 | #endif /* CONFIG_SMP */ |
1225 | - if (old_state == TASK_UNINTERRUPTIBLE) { |
1226 | + if (old_state == TASK_UNINTERRUPTIBLE) |
1227 | rq->nr_uninterruptible--; |
1228 | - /* |
1229 | - * Tasks on involuntary sleep don't earn |
1230 | - * sleep_avg beyond just interactive state. |
1231 | - */ |
1232 | - p->sleep_type = SLEEP_NONINTERACTIVE; |
1233 | - } else |
1234 | - |
1235 | - /* |
1236 | - * Tasks that have marked their sleep as noninteractive get |
1237 | - * woken up with their sleep average not weighted in an |
1238 | - * interactive way. |
1239 | - */ |
1240 | - if (old_state & TASK_NONINTERACTIVE) |
1241 | - p->sleep_type = SLEEP_NONINTERACTIVE; |
1242 | - |
1243 | |
1244 | - activate_task(p, rq, cpu == this_cpu); |
1245 | /* |
1246 | * Sync wakeups (i.e. those types of wakeups where the waker |
1247 | * has indicated that it will leave the CPU in short order) |
1248 | @@ -1551,10 +1594,9 @@ out_activate: |
1249 | * the waker guarantees that the freshly woken up task is going |
1250 | * to be considered on this CPU.) |
1251 | */ |
1252 | - if (!sync || cpu != this_cpu) { |
1253 | - if (TASK_PREEMPTS_CURR(p, rq)) |
1254 | - resched_task(rq->curr); |
1255 | - } |
1256 | + activate_task(p, rq, cpu == this_cpu); |
1257 | + if (!sync || cpu != this_cpu) |
1258 | + try_preempt(p, rq); |
1259 | success = 1; |
1260 | |
1261 | out_running: |
1262 | @@ -1577,7 +1619,6 @@ int fastcall wake_up_state(struct task_s |
1263 | return try_to_wake_up(p, state, 0); |
1264 | } |
1265 | |
1266 | -static void task_running_tick(struct rq *rq, struct task_struct *p); |
1267 | /* |
1268 | * Perform scheduler related setup for a newly forked process p. |
1269 | * p is forked by current. |
1270 | @@ -1605,7 +1646,6 @@ void fastcall sched_fork(struct task_str |
1271 | p->prio = current->normal_prio; |
1272 | |
1273 | INIT_LIST_HEAD(&p->run_list); |
1274 | - p->array = NULL; |
1275 | #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) |
1276 | if (unlikely(sched_info_on())) |
1277 | memset(&p->sched_info, 0, sizeof(p->sched_info)); |
1278 | @@ -1617,30 +1657,31 @@ void fastcall sched_fork(struct task_str |
1279 | /* Want to start with kernel preemption disabled. */ |
1280 | task_thread_info(p)->preempt_count = 1; |
1281 | #endif |
1282 | + if (unlikely(p->policy == SCHED_FIFO)) |
1283 | + goto out; |
1284 | /* |
1285 | * Share the timeslice between parent and child, thus the |
1286 | * total amount of pending timeslices in the system doesn't change, |
1287 | * resulting in more scheduling fairness. |
1288 | */ |
1289 | local_irq_disable(); |
1290 | - p->time_slice = (current->time_slice + 1) >> 1; |
1291 | - /* |
1292 | - * The remainder of the first timeslice might be recovered by |
1293 | - * the parent if the child exits early enough. |
1294 | - */ |
1295 | - p->first_time_slice = 1; |
1296 | - current->time_slice >>= 1; |
1297 | - p->timestamp = sched_clock(); |
1298 | - if (unlikely(!current->time_slice)) { |
1299 | + if (current->time_slice > 0) { |
1300 | + current->time_slice /= 2; |
1301 | + if (current->time_slice) |
1302 | + p->time_slice = current->time_slice; |
1303 | + else |
1304 | + p->time_slice = 1; |
1305 | /* |
1306 | - * This case is rare, it happens when the parent has only |
1307 | - * a single jiffy left from its timeslice. Taking the |
1308 | - * runqueue lock is not a problem. |
1309 | + * The remainder of the first timeslice might be recovered by |
1310 | + * the parent if the child exits early enough. |
1311 | */ |
1312 | - current->time_slice = 1; |
1313 | - task_running_tick(cpu_rq(cpu), current); |
1314 | - } |
1315 | + p->first_time_slice = 1; |
1316 | + } else |
1317 | + p->time_slice = 0; |
1318 | + |
1319 | + p->timestamp = sched_clock(); |
1320 | local_irq_enable(); |
1321 | +out: |
1322 | put_cpu(); |
1323 | } |
1324 | |
1325 | @@ -1662,38 +1703,16 @@ void fastcall wake_up_new_task(struct ta |
1326 | this_cpu = smp_processor_id(); |
1327 | cpu = task_cpu(p); |
1328 | |
1329 | - /* |
1330 | - * We decrease the sleep average of forking parents |
1331 | - * and children as well, to keep max-interactive tasks |
1332 | - * from forking tasks that are max-interactive. The parent |
1333 | - * (current) is done further down, under its lock. |
1334 | - */ |
1335 | - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * |
1336 | - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); |
1337 | - |
1338 | - p->prio = effective_prio(p); |
1339 | - |
1340 | if (likely(cpu == this_cpu)) { |
1341 | + activate_task(p, rq, 1); |
1342 | if (!(clone_flags & CLONE_VM)) { |
1343 | /* |
1344 | * The VM isn't cloned, so we're in a good position to |
1345 | * do child-runs-first in anticipation of an exec. This |
1346 | * usually avoids a lot of COW overhead. |
1347 | */ |
1348 | - if (unlikely(!current->array)) |
1349 | - __activate_task(p, rq); |
1350 | - else { |
1351 | - p->prio = current->prio; |
1352 | - p->normal_prio = current->normal_prio; |
1353 | - list_add_tail(&p->run_list, ¤t->run_list); |
1354 | - p->array = current->array; |
1355 | - p->array->nr_active++; |
1356 | - inc_nr_running(p, rq); |
1357 | - } |
1358 | set_need_resched(); |
1359 | - } else |
1360 | - /* Run child last */ |
1361 | - __activate_task(p, rq); |
1362 | + } |
1363 | /* |
1364 | * We skip the following code due to cpu == this_cpu |
1365 | * |
1366 | @@ -1710,19 +1729,16 @@ void fastcall wake_up_new_task(struct ta |
1367 | */ |
1368 | p->timestamp = (p->timestamp - this_rq->most_recent_timestamp) |
1369 | + rq->most_recent_timestamp; |
1370 | - __activate_task(p, rq); |
1371 | - if (TASK_PREEMPTS_CURR(p, rq)) |
1372 | - resched_task(rq->curr); |
1373 | + activate_task(p, rq, 0); |
1374 | + try_preempt(p, rq); |
1375 | |
1376 | /* |
1377 | * Parent and child are on different CPUs, now get the |
1378 | - * parent runqueue to update the parent's ->sleep_avg: |
1379 | + * parent runqueue to update the parent's ->flags: |
1380 | */ |
1381 | task_rq_unlock(rq, &flags); |
1382 | this_rq = task_rq_lock(current, &flags); |
1383 | } |
1384 | - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * |
1385 | - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); |
1386 | task_rq_unlock(this_rq, &flags); |
1387 | } |
1388 | |
1389 | @@ -1737,23 +1753,17 @@ void fastcall wake_up_new_task(struct ta |
1390 | */ |
1391 | void fastcall sched_exit(struct task_struct *p) |
1392 | { |
1393 | + struct task_struct *parent; |
1394 | unsigned long flags; |
1395 | struct rq *rq; |
1396 | |
1397 | - /* |
1398 | - * If the child was a (relative-) CPU hog then decrease |
1399 | - * the sleep_avg of the parent as well. |
1400 | - */ |
1401 | - rq = task_rq_lock(p->parent, &flags); |
1402 | - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) { |
1403 | - p->parent->time_slice += p->time_slice; |
1404 | - if (unlikely(p->parent->time_slice > task_timeslice(p))) |
1405 | - p->parent->time_slice = task_timeslice(p); |
1406 | - } |
1407 | - if (p->sleep_avg < p->parent->sleep_avg) |
1408 | - p->parent->sleep_avg = p->parent->sleep_avg / |
1409 | - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / |
1410 | - (EXIT_WEIGHT + 1); |
1411 | + parent = p->parent; |
1412 | + rq = task_rq_lock(parent, &flags); |
1413 | + if (p->first_time_slice > 0 && task_cpu(p) == task_cpu(parent)) { |
1414 | + parent->time_slice += p->time_slice; |
1415 | + if (unlikely(parent->time_slice > parent->quota)) |
1416 | + parent->time_slice = parent->quota; |
1417 | + } |
1418 | task_rq_unlock(rq, &flags); |
1419 | } |
1420 | |
1421 | @@ -2085,23 +2095,17 @@ void sched_exec(void) |
1422 | * pull_task - move a task from a remote runqueue to the local runqueue. |
1423 | * Both runqueues must be locked. |
1424 | */ |
1425 | -static void pull_task(struct rq *src_rq, struct prio_array *src_array, |
1426 | - struct task_struct *p, struct rq *this_rq, |
1427 | - struct prio_array *this_array, int this_cpu) |
1428 | +static void pull_task(struct rq *src_rq, struct task_struct *p, |
1429 | + struct rq *this_rq, int this_cpu) |
1430 | { |
1431 | - dequeue_task(p, src_array); |
1432 | + dequeue_task(p, src_rq); |
1433 | dec_nr_running(p, src_rq); |
1434 | set_task_cpu(p, this_cpu); |
1435 | inc_nr_running(p, this_rq); |
1436 | - enqueue_task(p, this_array); |
1437 | + enqueue_task(p, this_rq); |
1438 | p->timestamp = (p->timestamp - src_rq->most_recent_timestamp) |
1439 | + this_rq->most_recent_timestamp; |
1440 | - /* |
1441 | - * Note that idle threads have a prio of MAX_PRIO, for this test |
1442 | - * to be always true for them. |
1443 | - */ |
1444 | - if (TASK_PREEMPTS_CURR(p, this_rq)) |
1445 | - resched_task(this_rq->curr); |
1446 | + try_preempt(p, this_rq); |
1447 | } |
1448 | |
1449 | /* |
1450 | @@ -2144,7 +2148,16 @@ int can_migrate_task(struct task_struct |
1451 | return 1; |
1452 | } |
1453 | |
1454 | -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio) |
1455 | +static inline int rq_best_prio(struct rq *rq) |
1456 | +{ |
1457 | + int best_prio, exp_prio; |
1458 | + |
1459 | + best_prio = sched_find_first_bit(rq->dyn_bitmap); |
1460 | + exp_prio = find_next_bit(rq->exp_bitmap, MAX_PRIO, MAX_RT_PRIO); |
1461 | + if (unlikely(best_prio > exp_prio)) |
1462 | + best_prio = exp_prio; |
1463 | + return best_prio; |
1464 | +} |
1465 | |
1466 | /* |
1467 | * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted |
1468 | @@ -2160,7 +2173,7 @@ static int move_tasks(struct rq *this_rq |
1469 | { |
1470 | int idx, pulled = 0, pinned = 0, this_best_prio, best_prio, |
1471 | best_prio_seen, skip_for_load; |
1472 | - struct prio_array *array, *dst_array; |
1473 | + struct prio_array *array; |
1474 | struct list_head *head, *curr; |
1475 | struct task_struct *tmp; |
1476 | long rem_load_move; |
1477 | @@ -2187,26 +2200,21 @@ static int move_tasks(struct rq *this_rq |
1478 | * be cache-cold, thus switching CPUs has the least effect |
1479 | * on them. |
1480 | */ |
1481 | - if (busiest->expired->nr_active) { |
1482 | - array = busiest->expired; |
1483 | - dst_array = this_rq->expired; |
1484 | - } else { |
1485 | - array = busiest->active; |
1486 | - dst_array = this_rq->active; |
1487 | - } |
1488 | - |
1489 | + array = busiest->expired; |
1490 | new_array: |
1491 | - /* Start searching at priority 0: */ |
1492 | - idx = 0; |
1493 | + /* Expired arrays don't have RT tasks so they're always MAX_RT_PRIO+ */ |
1494 | + if (array == busiest->expired) |
1495 | + idx = MAX_RT_PRIO; |
1496 | + else |
1497 | + idx = 0; |
1498 | skip_bitmap: |
1499 | if (!idx) |
1500 | - idx = sched_find_first_bit(array->bitmap); |
1501 | + idx = sched_find_first_bit(array->prio_bitmap); |
1502 | else |
1503 | - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); |
1504 | + idx = find_next_bit(array->prio_bitmap, MAX_PRIO, idx); |
1505 | if (idx >= MAX_PRIO) { |
1506 | - if (array == busiest->expired && busiest->active->nr_active) { |
1507 | + if (array == busiest->expired) { |
1508 | array = busiest->active; |
1509 | - dst_array = this_rq->active; |
1510 | goto new_array; |
1511 | } |
1512 | goto out; |
1513 | @@ -2237,7 +2245,7 @@ skip_queue: |
1514 | goto skip_bitmap; |
1515 | } |
1516 | |
1517 | - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu); |
1518 | + pull_task(busiest, tmp, this_rq, this_cpu); |
1519 | pulled++; |
1520 | rem_load_move -= tmp->load_weight; |
1521 | |
1522 | @@ -3013,11 +3021,36 @@ EXPORT_PER_CPU_SYMBOL(kstat); |
1523 | /* |
1524 | * This is called on clock ticks and on context switches. |
1525 | * Bank in p->sched_time the ns elapsed since the last tick or switch. |
1526 | + * CPU scheduler quota accounting is also performed here in microseconds. |
1527 | + * The value returned from sched_clock() occasionally gives bogus values so |
1528 | + * some sanity checking is required. |
1529 | */ |
1530 | -static inline void |
1531 | -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now) |
1532 | +static void |
1533 | +update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now, |
1534 | + int tick) |
1535 | { |
1536 | - p->sched_time += now - p->last_ran; |
1537 | + long time_diff = now - p->last_ran; |
1538 | + |
1539 | + if (tick) { |
1540 | + /* |
1541 | + * Called from scheduler_tick() there should be less than two |
1542 | + * jiffies worth, and not negative/overflow. |
1543 | + */ |
1544 | + if (time_diff > JIFFIES_TO_NS(2) || time_diff < 0) |
1545 | + time_diff = JIFFIES_TO_NS(1); |
1546 | + } else { |
1547 | + /* |
1548 | + * Called from context_switch there should be less than one |
1549 | + * jiffy worth, and not negative/overflow. There should be |
1550 | + * some time banked here so use a nominal 1us. |
1551 | + */ |
1552 | + if (time_diff > JIFFIES_TO_NS(1) || time_diff < 1) |
1553 | + time_diff = 1000; |
1554 | + } |
1555 | + /* time_slice accounting is done in usecs to avoid overflow on 32bit */ |
1556 | + if (p != rq->idle && p->policy != SCHED_FIFO) |
1557 | + p->time_slice -= time_diff / 1000; |
1558 | + p->sched_time += time_diff; |
1559 | p->last_ran = rq->most_recent_timestamp = now; |
1560 | } |
1561 | |
1562 | @@ -3038,27 +3071,6 @@ unsigned long long current_sched_time(co |
1563 | } |
1564 | |
1565 | /* |
1566 | - * We place interactive tasks back into the active array, if possible. |
1567 | - * |
1568 | - * To guarantee that this does not starve expired tasks we ignore the |
1569 | - * interactivity of a task if the first expired task had to wait more |
1570 | - * than a 'reasonable' amount of time. This deadline timeout is |
1571 | - * load-dependent, as the frequency of array switched decreases with |
1572 | - * increasing number of running tasks. We also ignore the interactivity |
1573 | - * if a better static_prio task has expired: |
1574 | - */ |
1575 | -static inline int expired_starving(struct rq *rq) |
1576 | -{ |
1577 | - if (rq->curr->static_prio > rq->best_expired_prio) |
1578 | - return 1; |
1579 | - if (!STARVATION_LIMIT || !rq->expired_timestamp) |
1580 | - return 0; |
1581 | - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running) |
1582 | - return 1; |
1583 | - return 0; |
1584 | -} |
1585 | - |
1586 | -/* |
1587 | * Account user cpu time to a process. |
1588 | * @p: the process that the cpu time gets accounted to |
1589 | * @hardirq_offset: the offset to subtract from hardirq_count() |
1590 | @@ -3131,87 +3143,47 @@ void account_steal_time(struct task_stru |
1591 | cpustat->steal = cputime64_add(cpustat->steal, tmp); |
1592 | } |
1593 | |
1594 | -static void task_running_tick(struct rq *rq, struct task_struct *p) |
1595 | +/* |
1596 | + * The task has used up its quota of running in this prio_level so it must be |
1597 | + * dropped a priority level, all managed by recalc_task_prio(). |
1598 | + */ |
1599 | +static void task_expired_entitlement(struct rq *rq, struct task_struct *p) |
1600 | { |
1601 | - if (p->array != rq->active) { |
1602 | - /* Task has expired but was not scheduled yet */ |
1603 | - set_tsk_need_resched(p); |
1604 | + int overrun; |
1605 | + |
1606 | + reset_first_time_slice(p); |
1607 | + if (rt_task(p)) { |
1608 | + p->time_slice += p->quota; |
1609 | + list_move_tail(&p->run_list, p->array->queue + p->prio); |
1610 | return; |
1611 | } |
1612 | - spin_lock(&rq->lock); |
1613 | + overrun = p->time_slice; |
1614 | + dequeue_task(p, rq); |
1615 | + enqueue_task(p, rq); |
1616 | /* |
1617 | - * The task was running during this tick - update the |
1618 | - * time slice counter. Note: we do not update a thread's |
1619 | - * priority until it either goes to sleep or uses up its |
1620 | - * timeslice. This makes it possible for interactive tasks |
1621 | - * to use up their timeslices at their highest priority levels. |
1622 | + * Subtract any extra time this task ran over its time_slice; ie |
1623 | + * overrun will either be 0 or negative. |
1624 | */ |
1625 | - if (rt_task(p)) { |
1626 | - /* |
1627 | - * RR tasks need a special form of timeslice management. |
1628 | - * FIFO tasks have no timeslices. |
1629 | - */ |
1630 | - if ((p->policy == SCHED_RR) && !--p->time_slice) { |
1631 | - p->time_slice = task_timeslice(p); |
1632 | - p->first_time_slice = 0; |
1633 | - set_tsk_need_resched(p); |
1634 | - |
1635 | - /* put it at the end of the queue: */ |
1636 | - requeue_task(p, rq->active); |
1637 | - } |
1638 | - goto out_unlock; |
1639 | - } |
1640 | - if (!--p->time_slice) { |
1641 | - dequeue_task(p, rq->active); |
1642 | - set_tsk_need_resched(p); |
1643 | - p->prio = effective_prio(p); |
1644 | - p->time_slice = task_timeslice(p); |
1645 | - p->first_time_slice = 0; |
1646 | - |
1647 | - if (!rq->expired_timestamp) |
1648 | - rq->expired_timestamp = jiffies; |
1649 | - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) { |
1650 | - enqueue_task(p, rq->expired); |
1651 | - if (p->static_prio < rq->best_expired_prio) |
1652 | - rq->best_expired_prio = p->static_prio; |
1653 | - } else |
1654 | - enqueue_task(p, rq->active); |
1655 | - } else { |
1656 | - /* |
1657 | - * Prevent a too long timeslice allowing a task to monopolize |
1658 | - * the CPU. We do this by splitting up the timeslice into |
1659 | - * smaller pieces. |
1660 | - * |
1661 | - * Note: this does not mean the task's timeslices expire or |
1662 | - * get lost in any way, they just might be preempted by |
1663 | - * another task of equal priority. (one with higher |
1664 | - * priority would have preempted this task already.) We |
1665 | - * requeue this task to the end of the list on this priority |
1666 | - * level, which is in essence a round-robin of tasks with |
1667 | - * equal priority. |
1668 | - * |
1669 | - * This only applies to tasks in the interactive |
1670 | - * delta range with at least TIMESLICE_GRANULARITY to requeue. |
1671 | - */ |
1672 | - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - |
1673 | - p->time_slice) % TIMESLICE_GRANULARITY(p)) && |
1674 | - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && |
1675 | - (p->array == rq->active)) { |
1676 | + p->time_slice += overrun; |
1677 | +} |
1678 | |
1679 | - requeue_task(p, rq->active); |
1680 | - set_tsk_need_resched(p); |
1681 | - } |
1682 | - } |
1683 | -out_unlock: |
1684 | +/* This manages tasks that have run out of timeslice during a scheduler_tick */ |
1685 | +static void task_running_tick(struct rq *rq, struct task_struct *p) |
1686 | +{ |
1687 | + /* SCHED_FIFO tasks never run out of timeslice. */ |
1688 | + if (p->time_slice > 0 || p->policy == SCHED_FIFO) |
1689 | + return; |
1690 | + /* p->time_slice <= 0 */ |
1691 | + spin_lock(&rq->lock); |
1692 | + if (likely(task_queued(p))) |
1693 | + task_expired_entitlement(rq, p); |
1694 | + set_tsk_need_resched(p); |
1695 | spin_unlock(&rq->lock); |
1696 | } |
1697 | |
1698 | /* |
1699 | * This function gets called by the timer code, with HZ frequency. |
1700 | * We call it with interrupts disabled. |
1701 | - * |
1702 | - * It also gets called by the fork code, when changing the parent's |
1703 | - * timeslices. |
1704 | */ |
1705 | void scheduler_tick(void) |
1706 | { |
1707 | @@ -3220,7 +3192,7 @@ void scheduler_tick(void) |
1708 | int cpu = smp_processor_id(); |
1709 | struct rq *rq = cpu_rq(cpu); |
1710 | |
1711 | - update_cpu_clock(p, rq, now); |
1712 | + update_cpu_clock(p, rq, now, 1); |
1713 | |
1714 | if (p != rq->idle) |
1715 | task_running_tick(rq, p); |
1716 | @@ -3269,10 +3241,55 @@ EXPORT_SYMBOL(sub_preempt_count); |
1717 | |
1718 | #endif |
1719 | |
1720 | -static inline int interactive_sleep(enum sleep_type sleep_type) |
1721 | +static void reset_prio_levels(struct rq *rq) |
1722 | { |
1723 | - return (sleep_type == SLEEP_INTERACTIVE || |
1724 | - sleep_type == SLEEP_INTERRUPTED); |
1725 | + rq->active->best_static_prio = MAX_PRIO - 1; |
1726 | + rq->expired->best_static_prio = MAX_PRIO - 1; |
1727 | + memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE); |
1728 | +} |
1729 | + |
1730 | +/* |
1731 | + * next_dynamic_task finds the next suitable dynamic task. |
1732 | + */ |
1733 | +static inline struct task_struct *next_dynamic_task(struct rq *rq, int idx) |
1734 | +{ |
1735 | + struct prio_array *array = rq->active; |
1736 | + struct task_struct *next; |
1737 | + struct list_head *queue; |
1738 | + int nstatic; |
1739 | + |
1740 | +retry: |
1741 | + if (idx >= MAX_PRIO) { |
1742 | + /* There are no more tasks in the active array. Swap arrays */ |
1743 | + array = rq->expired; |
1744 | + rq->expired = rq->active; |
1745 | + rq->active = array; |
1746 | + rq->exp_bitmap = rq->expired->prio_bitmap; |
1747 | + rq->dyn_bitmap = rq->active->prio_bitmap; |
1748 | + rq->prio_rotation++; |
1749 | + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO); |
1750 | + reset_prio_levels(rq); |
1751 | + } |
1752 | + queue = array->queue + idx; |
1753 | + next = list_entry(queue->next, struct task_struct, run_list); |
1754 | + if (unlikely(next->time_slice <= 0)) { |
1755 | + /* |
1756 | + * Unlucky enough that this task ran out of time_slice |
1757 | + * before it hit a scheduler_tick so it should have its |
1758 | + * priority reassessed and choose another task (possibly |
1759 | + * the same one) |
1760 | + */ |
1761 | + task_expired_entitlement(rq, next); |
1762 | + idx = find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO); |
1763 | + goto retry; |
1764 | + } |
1765 | + next->rotation = rq->prio_rotation; |
1766 | + nstatic = next->static_prio; |
1767 | + if (nstatic < array->best_static_prio) |
1768 | + array->best_static_prio = nstatic; |
1769 | + if (idx > rq->prio_level[USER_PRIO(nstatic)]) |
1770 | + rq->prio_level[USER_PRIO(nstatic)] = idx; |
1771 | + return next; |
1772 | } |
1773 | |
1774 | /* |
1775 | @@ -3281,13 +3298,11 @@ static inline int interactive_sleep(enum |
1776 | asmlinkage void __sched schedule(void) |
1777 | { |
1778 | struct task_struct *prev, *next; |
1779 | - struct prio_array *array; |
1780 | struct list_head *queue; |
1781 | unsigned long long now; |
1782 | - unsigned long run_time; |
1783 | - int cpu, idx, new_prio; |
1784 | long *switch_count; |
1785 | struct rq *rq; |
1786 | + int cpu, idx; |
1787 | |
1788 | /* |
1789 | * Test if we are atomic. Since do_exit() needs to call into |
1790 | @@ -3323,18 +3338,6 @@ need_resched_nonpreemptible: |
1791 | |
1792 | schedstat_inc(rq, sched_cnt); |
1793 | now = sched_clock(); |
1794 | - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { |
1795 | - run_time = now - prev->timestamp; |
1796 | - if (unlikely((long long)(now - prev->timestamp) < 0)) |
1797 | - run_time = 0; |
1798 | - } else |
1799 | - run_time = NS_MAX_SLEEP_AVG; |
1800 | - |
1801 | - /* |
1802 | - * Tasks charged proportionately less run_time at high sleep_avg to |
1803 | - * delay them losing their interactive status |
1804 | - */ |
1805 | - run_time /= (CURRENT_BONUS(prev) ? : 1); |
1806 | |
1807 | spin_lock_irq(&rq->lock); |
1808 | |
1809 | @@ -3356,59 +3359,29 @@ need_resched_nonpreemptible: |
1810 | idle_balance(cpu, rq); |
1811 | if (!rq->nr_running) { |
1812 | next = rq->idle; |
1813 | - rq->expired_timestamp = 0; |
1814 | goto switch_tasks; |
1815 | } |
1816 | } |
1817 | |
1818 | - array = rq->active; |
1819 | - if (unlikely(!array->nr_active)) { |
1820 | - /* |
1821 | - * Switch the active and expired arrays. |
1822 | - */ |
1823 | - schedstat_inc(rq, sched_switch); |
1824 | - rq->active = rq->expired; |
1825 | - rq->expired = array; |
1826 | - array = rq->active; |
1827 | - rq->expired_timestamp = 0; |
1828 | - rq->best_expired_prio = MAX_PRIO; |
1829 | - } |
1830 | - |
1831 | - idx = sched_find_first_bit(array->bitmap); |
1832 | - queue = array->queue + idx; |
1833 | - next = list_entry(queue->next, struct task_struct, run_list); |
1834 | - |
1835 | - if (!rt_task(next) && interactive_sleep(next->sleep_type)) { |
1836 | - unsigned long long delta = now - next->timestamp; |
1837 | - if (unlikely((long long)(now - next->timestamp) < 0)) |
1838 | - delta = 0; |
1839 | - |
1840 | - if (next->sleep_type == SLEEP_INTERACTIVE) |
1841 | - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; |
1842 | - |
1843 | - array = next->array; |
1844 | - new_prio = recalc_task_prio(next, next->timestamp + delta); |
1845 | - |
1846 | - if (unlikely(next->prio != new_prio)) { |
1847 | - dequeue_task(next, array); |
1848 | - next->prio = new_prio; |
1849 | - enqueue_task(next, array); |
1850 | - } |
1851 | + idx = sched_find_first_bit(rq->dyn_bitmap); |
1852 | + if (!rt_prio(idx)) |
1853 | + next = next_dynamic_task(rq, idx); |
1854 | + else { |
1855 | + queue = rq->active->queue + idx; |
1856 | + next = list_entry(queue->next, struct task_struct, run_list); |
1857 | } |
1858 | - next->sleep_type = SLEEP_NORMAL; |
1859 | switch_tasks: |
1860 | - if (next == rq->idle) |
1861 | + if (next == rq->idle) { |
1862 | + reset_prio_levels(rq); |
1863 | + rq->prio_rotation++; |
1864 | schedstat_inc(rq, sched_goidle); |
1865 | + } |
1866 | prefetch(next); |
1867 | prefetch_stack(next); |
1868 | clear_tsk_need_resched(prev); |
1869 | rcu_qsctr_inc(task_cpu(prev)); |
1870 | |
1871 | - update_cpu_clock(prev, rq, now); |
1872 | - |
1873 | - prev->sleep_avg -= run_time; |
1874 | - if ((long)prev->sleep_avg <= 0) |
1875 | - prev->sleep_avg = 0; |
1876 | + update_cpu_clock(prev, rq, now, 0); |
1877 | prev->timestamp = prev->last_ran = now; |
1878 | |
1879 | sched_info_switch(prev, next); |
1880 | @@ -3844,29 +3817,22 @@ EXPORT_SYMBOL(sleep_on_timeout); |
1881 | */ |
1882 | void rt_mutex_setprio(struct task_struct *p, int prio) |
1883 | { |
1884 | - struct prio_array *array; |
1885 | unsigned long flags; |
1886 | + int queued, oldprio; |
1887 | struct rq *rq; |
1888 | - int oldprio; |
1889 | |
1890 | BUG_ON(prio < 0 || prio > MAX_PRIO); |
1891 | |
1892 | rq = task_rq_lock(p, &flags); |
1893 | |
1894 | oldprio = p->prio; |
1895 | - array = p->array; |
1896 | - if (array) |
1897 | - dequeue_task(p, array); |
1898 | + queued = task_queued(p); |
1899 | + if (queued) |
1900 | + dequeue_task(p, rq); |
1901 | p->prio = prio; |
1902 | |
1903 | - if (array) { |
1904 | - /* |
1905 | - * If changing to an RT priority then queue it |
1906 | - * in the active array! |
1907 | - */ |
1908 | - if (rt_task(p)) |
1909 | - array = rq->active; |
1910 | - enqueue_task(p, array); |
1911 | + if (queued) { |
1912 | + enqueue_task(p, rq); |
1913 | /* |
1914 | * Reschedule if we are currently running on this runqueue and |
1915 | * our priority decreased, or if we are not currently running on |
1916 | @@ -3875,8 +3841,8 @@ void rt_mutex_setprio(struct task_struct |
1917 | if (task_running(rq, p)) { |
1918 | if (p->prio > oldprio) |
1919 | resched_task(rq->curr); |
1920 | - } else if (TASK_PREEMPTS_CURR(p, rq)) |
1921 | - resched_task(rq->curr); |
1922 | + } else |
1923 | + try_preempt(p, rq); |
1924 | } |
1925 | task_rq_unlock(rq, &flags); |
1926 | } |
1927 | @@ -3885,8 +3851,7 @@ void rt_mutex_setprio(struct task_struct |
1928 | |
1929 | void set_user_nice(struct task_struct *p, long nice) |
1930 | { |
1931 | - struct prio_array *array; |
1932 | - int old_prio, delta; |
1933 | + int queued, old_prio, delta; |
1934 | unsigned long flags; |
1935 | struct rq *rq; |
1936 | |
1937 | @@ -3907,20 +3872,20 @@ void set_user_nice(struct task_struct *p |
1938 | p->static_prio = NICE_TO_PRIO(nice); |
1939 | goto out_unlock; |
1940 | } |
1941 | - array = p->array; |
1942 | - if (array) { |
1943 | - dequeue_task(p, array); |
1944 | + queued = task_queued(p); |
1945 | + if (queued) { |
1946 | + dequeue_task(p, rq); |
1947 | dec_raw_weighted_load(rq, p); |
1948 | } |
1949 | |
1950 | p->static_prio = NICE_TO_PRIO(nice); |
1951 | - set_load_weight(p); |
1952 | old_prio = p->prio; |
1953 | p->prio = effective_prio(p); |
1954 | + set_quota(p); |
1955 | delta = p->prio - old_prio; |
1956 | |
1957 | - if (array) { |
1958 | - enqueue_task(p, array); |
1959 | + if (queued) { |
1960 | + enqueue_task(p, rq); |
1961 | inc_raw_weighted_load(rq, p); |
1962 | /* |
1963 | * If the task increased its priority or is running and |
1964 | @@ -3996,7 +3961,7 @@ asmlinkage long sys_nice(int increment) |
1965 | * |
1966 | * This is the priority value as seen by users in /proc. |
1967 | * RT tasks are offset by -200. Normal tasks are centered |
1968 | - * around 0, value goes from -16 to +15. |
1969 | + * around 0, value goes from 0 to +39. |
1970 | */ |
1971 | int task_prio(const struct task_struct *p) |
1972 | { |
1973 | @@ -4043,19 +4008,14 @@ static inline struct task_struct *find_p |
1974 | /* Actually do priority change: must hold rq lock. */ |
1975 | static void __setscheduler(struct task_struct *p, int policy, int prio) |
1976 | { |
1977 | - BUG_ON(p->array); |
1978 | + BUG_ON(task_queued(p)); |
1979 | |
1980 | p->policy = policy; |
1981 | p->rt_priority = prio; |
1982 | p->normal_prio = normal_prio(p); |
1983 | /* we are holding p->pi_lock already */ |
1984 | p->prio = rt_mutex_getprio(p); |
1985 | - /* |
1986 | - * SCHED_BATCH tasks are treated as perpetual CPU hogs: |
1987 | - */ |
1988 | - if (policy == SCHED_BATCH) |
1989 | - p->sleep_avg = 0; |
1990 | - set_load_weight(p); |
1991 | + set_quota(p); |
1992 | } |
1993 | |
1994 | /** |
1995 | @@ -4069,8 +4029,7 @@ static void __setscheduler(struct task_s |
1996 | int sched_setscheduler(struct task_struct *p, int policy, |
1997 | struct sched_param *param) |
1998 | { |
1999 | - int retval, oldprio, oldpolicy = -1; |
2000 | - struct prio_array *array; |
2001 | + int queued, retval, oldprio, oldpolicy = -1; |
2002 | unsigned long flags; |
2003 | struct rq *rq; |
2004 | |
2005 | @@ -4144,12 +4103,12 @@ recheck: |
2006 | spin_unlock_irqrestore(&p->pi_lock, flags); |
2007 | goto recheck; |
2008 | } |
2009 | - array = p->array; |
2010 | - if (array) |
2011 | + queued = task_queued(p); |
2012 | + if (queued) |
2013 | deactivate_task(p, rq); |
2014 | oldprio = p->prio; |
2015 | __setscheduler(p, policy, param->sched_priority); |
2016 | - if (array) { |
2017 | + if (queued) { |
2018 | __activate_task(p, rq); |
2019 | /* |
2020 | * Reschedule if we are currently running on this runqueue and |
2021 | @@ -4159,8 +4118,8 @@ recheck: |
2022 | if (task_running(rq, p)) { |
2023 | if (p->prio > oldprio) |
2024 | resched_task(rq->curr); |
2025 | - } else if (TASK_PREEMPTS_CURR(p, rq)) |
2026 | - resched_task(rq->curr); |
2027 | + } else |
2028 | + try_preempt(p, rq); |
2029 | } |
2030 | __task_rq_unlock(rq); |
2031 | spin_unlock_irqrestore(&p->pi_lock, flags); |
2032 | @@ -4433,40 +4392,27 @@ asmlinkage long sys_sched_getaffinity(pi |
2033 | * sys_sched_yield - yield the current processor to other threads. |
2034 | * |
2035 | * This function yields the current CPU by moving the calling thread |
2036 | - * to the expired array. If there are no other threads running on this |
2037 | - * CPU then this function will return. |
2038 | + * to the expired array if SCHED_NORMAL or the end of its current priority |
2039 | + * queue if a realtime task. If there are no other threads running on this |
2040 | + * cpu this function will return. |
2041 | */ |
2042 | asmlinkage long sys_sched_yield(void) |
2043 | { |
2044 | struct rq *rq = this_rq_lock(); |
2045 | - struct prio_array *array = current->array, *target = rq->expired; |
2046 | + struct task_struct *p = current; |
2047 | |
2048 | schedstat_inc(rq, yld_cnt); |
2049 | - /* |
2050 | - * We implement yielding by moving the task into the expired |
2051 | - * queue. |
2052 | - * |
2053 | - * (special rule: RT tasks will just roundrobin in the active |
2054 | - * array.) |
2055 | - */ |
2056 | - if (rt_task(current)) |
2057 | - target = rq->active; |
2058 | + if (rq->nr_running == 1) |
2059 | + schedstat_inc(rq, yld_both_empty); |
2060 | + else { |
2061 | + struct prio_array *old_array = p->array; |
2062 | + int old_prio = p->prio; |
2063 | |
2064 | - if (array->nr_active == 1) { |
2065 | - schedstat_inc(rq, yld_act_empty); |
2066 | - if (!rq->expired->nr_active) |
2067 | - schedstat_inc(rq, yld_both_empty); |
2068 | - } else if (!rq->expired->nr_active) |
2069 | - schedstat_inc(rq, yld_exp_empty); |
2070 | - |
2071 | - if (array != target) { |
2072 | - dequeue_task(current, array); |
2073 | - enqueue_task(current, target); |
2074 | - } else |
2075 | - /* |
2076 | - * requeue_task is cheaper so perform that if possible. |
2077 | - */ |
2078 | - requeue_task(current, array); |
2079 | + /* p->prio will be updated in requeue_task via queue_expired */ |
2080 | + if (!rt_task(p)) |
2081 | + p->array = rq->expired; |
2082 | + requeue_task(p, rq, old_array, old_prio); |
2083 | + } |
2084 | |
2085 | /* |
2086 | * Since we are going to call schedule() anyway, there's |
2087 | @@ -4676,8 +4622,8 @@ long sys_sched_rr_get_interval(pid_t pid |
2088 | if (retval) |
2089 | goto out_unlock; |
2090 | |
2091 | - jiffies_to_timespec(p->policy == SCHED_FIFO ? |
2092 | - 0 : task_timeslice(p), &t); |
2093 | + t = ns_to_timespec(p->policy == SCHED_FIFO ? 0 : |
2094 | + MS_TO_NS(task_timeslice(p))); |
2095 | read_unlock(&tasklist_lock); |
2096 | retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0; |
2097 | out_nounlock: |
2098 | @@ -4771,10 +4717,10 @@ void __cpuinit init_idle(struct task_str |
2099 | struct rq *rq = cpu_rq(cpu); |
2100 | unsigned long flags; |
2101 | |
2102 | - idle->timestamp = sched_clock(); |
2103 | - idle->sleep_avg = 0; |
2104 | - idle->array = NULL; |
2105 | - idle->prio = idle->normal_prio = MAX_PRIO; |
2106 | + bitmap_zero(idle->bitmap, PRIO_RANGE); |
2107 | + idle->timestamp = idle->last_ran = sched_clock(); |
2108 | + idle->array = rq->active; |
2109 | + idle->prio = idle->normal_prio = NICE_TO_PRIO(0); |
2110 | idle->state = TASK_RUNNING; |
2111 | idle->cpus_allowed = cpumask_of_cpu(cpu); |
2112 | set_task_cpu(idle, cpu); |
2113 | @@ -4893,7 +4839,7 @@ static int __migrate_task(struct task_st |
2114 | goto out; |
2115 | |
2116 | set_task_cpu(p, dest_cpu); |
2117 | - if (p->array) { |
2118 | + if (task_queued(p)) { |
2119 | /* |
2120 | * Sync timestamp with rq_dest's before activating. |
2121 | * The same thing could be achieved by doing this step |
2122 | @@ -4904,8 +4850,7 @@ static int __migrate_task(struct task_st |
2123 | + rq_dest->most_recent_timestamp; |
2124 | deactivate_task(p, rq_src); |
2125 | __activate_task(p, rq_dest); |
2126 | - if (TASK_PREEMPTS_CURR(p, rq_dest)) |
2127 | - resched_task(rq_dest->curr); |
2128 | + try_preempt(p, rq_dest); |
2129 | } |
2130 | ret = 1; |
2131 | out: |
2132 | @@ -5194,7 +5139,7 @@ migration_call(struct notifier_block *nf |
2133 | /* Idle task back to normal (off runqueue, low prio) */ |
2134 | rq = task_rq_lock(rq->idle, &flags); |
2135 | deactivate_task(rq->idle, rq); |
2136 | - rq->idle->static_prio = MAX_PRIO; |
2137 | + rq->idle->static_prio = NICE_TO_PRIO(0); |
2138 | __setscheduler(rq->idle, SCHED_NORMAL, 0); |
2139 | migrate_dead_tasks(cpu); |
2140 | task_rq_unlock(rq, &flags); |
2141 | @@ -6706,6 +6651,13 @@ void __init sched_init_smp(void) |
2142 | /* Move init over to a non-isolated CPU */ |
2143 | if (set_cpus_allowed(current, non_isolated_cpus) < 0) |
2144 | BUG(); |
2145 | + |
2146 | + /* |
2147 | + * Assume that every added cpu gives us slightly less overall latency |
2148 | + * allowing us to increase the base rr_interval, but in a non linear |
2149 | + * fashion. |
2150 | + */ |
2151 | + rr_interval *= 1 + ilog2(num_online_cpus()); |
2152 | } |
2153 | #else |
2154 | void __init sched_init_smp(void) |
2155 | @@ -6727,6 +6679,16 @@ void __init sched_init(void) |
2156 | { |
2157 | int i, j, k; |
2158 | |
2159 | + /* Generate the priority matrix */ |
2160 | + for (i = 0; i < PRIO_RANGE; i++) { |
2161 | + bitmap_fill(prio_matrix[i], PRIO_RANGE); |
2162 | + j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i); |
2163 | + for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j) { |
2164 | + __clear_bit(PRIO_RANGE - 1 - (k / PRIO_RANGE), |
2165 | + prio_matrix[i]); |
2166 | + } |
2167 | + } |
2168 | + |
2169 | for_each_possible_cpu(i) { |
2170 | struct prio_array *array; |
2171 | struct rq *rq; |
2172 | @@ -6735,11 +6697,16 @@ void __init sched_init(void) |
2173 | spin_lock_init(&rq->lock); |
2174 | lockdep_set_class(&rq->lock, &rq->rq_lock_key); |
2175 | rq->nr_running = 0; |
2176 | + rq->prio_rotation = 0; |
2177 | rq->active = rq->arrays; |
2178 | rq->expired = rq->arrays + 1; |
2179 | - rq->best_expired_prio = MAX_PRIO; |
2180 | + reset_prio_levels(rq); |
2181 | + rq->dyn_bitmap = rq->active->prio_bitmap; |
2182 | + rq->exp_bitmap = rq->expired->prio_bitmap; |
2183 | |
2184 | #ifdef CONFIG_SMP |
2185 | + rq->active->rq = rq; |
2186 | + rq->expired->rq = rq; |
2187 | rq->sd = NULL; |
2188 | for (j = 1; j < 3; j++) |
2189 | rq->cpu_load[j] = 0; |
2190 | @@ -6752,16 +6719,16 @@ void __init sched_init(void) |
2191 | atomic_set(&rq->nr_iowait, 0); |
2192 | |
2193 | for (j = 0; j < 2; j++) { |
2194 | + |
2195 | array = rq->arrays + j; |
2196 | - for (k = 0; k < MAX_PRIO; k++) { |
2197 | + for (k = 0; k < MAX_PRIO; k++) |
2198 | INIT_LIST_HEAD(array->queue + k); |
2199 | - __clear_bit(k, array->bitmap); |
2200 | - } |
2201 | - // delimiter for bitsearch |
2202 | - __set_bit(MAX_PRIO, array->bitmap); |
2203 | + bitmap_zero(array->prio_bitmap, MAX_PRIO); |
2204 | + /* delimiter for bitsearch */ |
2205 | + __set_bit(MAX_PRIO, array->prio_bitmap); |
2206 | } |
2207 | - } |
2208 | |
2209 | + } |
2210 | set_load_weight(&init_task); |
2211 | |
2212 | #ifdef CONFIG_SMP |
2213 | @@ -6815,10 +6782,10 @@ EXPORT_SYMBOL(__might_sleep); |
2214 | #ifdef CONFIG_MAGIC_SYSRQ |
2215 | void normalize_rt_tasks(void) |
2216 | { |
2217 | - struct prio_array *array; |
2218 | struct task_struct *p; |
2219 | unsigned long flags; |
2220 | struct rq *rq; |
2221 | + int queued; |
2222 | |
2223 | read_lock_irq(&tasklist_lock); |
2224 | for_each_process(p) { |
2225 | @@ -6828,11 +6795,11 @@ void normalize_rt_tasks(void) |
2226 | spin_lock_irqsave(&p->pi_lock, flags); |
2227 | rq = __task_rq_lock(p); |
2228 | |
2229 | - array = p->array; |
2230 | - if (array) |
2231 | + queued = task_queued(p); |
2232 | + if (queued) |
2233 | deactivate_task(p, task_rq(p)); |
2234 | __setscheduler(p, SCHED_NORMAL, 0); |
2235 | - if (array) { |
2236 | + if (queued) { |
2237 | __activate_task(p, task_rq(p)); |
2238 | resched_task(rq->curr); |
2239 | } |
2240 | Index: linux-2.6.21-ck2/Documentation/sysctl/kernel.txt |
2241 | =================================================================== |
2242 | --- linux-2.6.21-ck2.orig/Documentation/sysctl/kernel.txt 2007-02-05 22:51:59.000000000 +1100 |
2243 | +++ linux-2.6.21-ck2/Documentation/sysctl/kernel.txt 2007-05-14 19:30:30.000000000 +1000 |
2244 | @@ -43,6 +43,7 @@ show up in /proc/sys/kernel: |
2245 | - printk |
2246 | - real-root-dev ==> Documentation/initrd.txt |
2247 | - reboot-cmd [ SPARC only ] |
2248 | +- rr_interval |
2249 | - rtsig-max |
2250 | - rtsig-nr |
2251 | - sem |
2252 | @@ -288,6 +289,19 @@ rebooting. ??? |
2253 | |
2254 | ============================================================== |
2255 | |
2256 | +rr_interval: |
2257 | + |
2258 | +This is the smallest duration that any cpu process scheduling unit |
2259 | +will run for. Increasing this value can increase the throughput of cpu |
2260 | +bound tasks substantially, but at the expense of increased latencies |
2261 | +overall. This value is in milliseconds and the default value chosen |
2262 | +depends on the number of cpus available at scheduler initialisation, |
2263 | +with a minimum of 8. |
2264 | + |
2265 | +Valid values are from 1-5000. |
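+
+For illustration only, rr_interval is exported like any other integer sysctl,
+so it can be read from /proc/sys/kernel/rr_interval; a minimal (hypothetical)
+C reader might look like this:
+
+	#include <stdio.h>
+
+	int main(void)
+	{
+		FILE *f = fopen("/proc/sys/kernel/rr_interval", "r");
+		int ms;
+
+		if (f && fscanf(f, "%d", &ms) == 1)
+			printf("rr_interval = %d ms\n", ms);
+		if (f)
+			fclose(f);
+		return 0;
+	}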
2266 | + |
2267 | +============================================================== |
2268 | + |
2269 | rtsig-max & rtsig-nr: |
2270 | |
2271 | The file rtsig-max can be used to tune the maximum number |
2272 | Index: linux-2.6.21-ck2/kernel/sysctl.c |
2273 | =================================================================== |
2274 | --- linux-2.6.21-ck2.orig/kernel/sysctl.c 2007-05-03 22:20:57.000000000 +1000 |
2275 | +++ linux-2.6.21-ck2/kernel/sysctl.c 2007-05-14 19:30:30.000000000 +1000 |
2276 | @@ -76,6 +76,7 @@ extern int pid_max_min, pid_max_max; |
2277 | extern int sysctl_drop_caches; |
2278 | extern int percpu_pagelist_fraction; |
2279 | extern int compat_log; |
2280 | +extern int rr_interval; |
2281 | |
2282 | /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ |
2283 | static int maxolduid = 65535; |
2284 | @@ -159,6 +160,14 @@ int sysctl_legacy_va_layout; |
2285 | #endif |
2286 | |
2287 | |
2288 | +/* Constants for minimum and maximum testing. |
2289 | + We use these as one-element integer vectors. */ |
2290 | +static int __read_mostly zero; |
2291 | +static int __read_mostly one = 1; |
2292 | +static int __read_mostly one_hundred = 100; |
2293 | +static int __read_mostly five_thousand = 5000; |
2294 | + |
2295 | + |
2296 | /* The default sysctl tables: */ |
2297 | |
2298 | static ctl_table root_table[] = { |
2299 | @@ -499,6 +508,17 @@ static ctl_table kern_table[] = { |
2300 | .mode = 0444, |
2301 | .proc_handler = &proc_dointvec, |
2302 | }, |
2303 | + { |
2304 | + .ctl_name = CTL_UNNUMBERED, |
2305 | + .procname = "rr_interval", |
2306 | + .data = &rr_interval, |
2307 | + .maxlen = sizeof (int), |
2308 | + .mode = 0644, |
2309 | + .proc_handler = &proc_dointvec_minmax, |
2310 | + .strategy = &sysctl_intvec, |
2311 | + .extra1 = &one, |
2312 | + .extra2 = &five_thousand, |
2313 | + }, |
2314 | #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) |
2315 | { |
2316 | .ctl_name = KERN_UNKNOWN_NMI_PANIC, |
2317 | @@ -607,12 +627,6 @@ static ctl_table kern_table[] = { |
2318 | { .ctl_name = 0 } |
2319 | }; |
2320 | |
2321 | -/* Constants for minimum and maximum testing in vm_table. |
2322 | - We use these as one-element integer vectors. */ |
2323 | -static int zero; |
2324 | -static int one_hundred = 100; |
2325 | - |
2326 | - |
2327 | static ctl_table vm_table[] = { |
2328 | { |
2329 | .ctl_name = VM_OVERCOMMIT_MEMORY, |
2330 | Index: linux-2.6.21-ck2/fs/pipe.c |
2331 | =================================================================== |
2332 | --- linux-2.6.21-ck2.orig/fs/pipe.c 2007-05-03 22:20:56.000000000 +1000 |
2333 | +++ linux-2.6.21-ck2/fs/pipe.c 2007-05-14 19:30:30.000000000 +1000 |
2334 | @@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p |
2335 | { |
2336 | DEFINE_WAIT(wait); |
2337 | |
2338 | - /* |
2339 | - * Pipes are system-local resources, so sleeping on them |
2340 | - * is considered a noninteractive wait: |
2341 | - */ |
2342 | - prepare_to_wait(&pipe->wait, &wait, |
2343 | - TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE); |
2344 | + prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); |
2345 | if (pipe->inode) |
2346 | mutex_unlock(&pipe->inode->i_mutex); |
2347 | schedule(); |
2348 | Index: linux-2.6.21-ck2/Documentation/sched-design.txt |
2349 | =================================================================== |
2350 | --- linux-2.6.21-ck2.orig/Documentation/sched-design.txt 2006-11-30 11:30:31.000000000 +1100 |
2351 | +++ linux-2.6.21-ck2/Documentation/sched-design.txt 2007-05-14 19:30:30.000000000 +1000 |
2352 | @@ -1,11 +1,14 @@ |
2353 | - Goals, Design and Implementation of the |
2354 | - new ultra-scalable O(1) scheduler |
2355 | + Goals, Design and Implementation of the ultra-scalable O(1) scheduler by |
2356 | + Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by |
2357 | + Con Kolivas. |
2358 | |
2359 | |
2360 | - This is an edited version of an email Ingo Molnar sent to |
2361 | - lkml on 4 Jan 2002. It describes the goals, design, and |
2362 | - implementation of Ingo's new ultra-scalable O(1) scheduler. |
2363 | - Last Updated: 18 April 2002. |
2364 | + This was originally an edited version of an email Ingo Molnar sent to |
2365 | + lkml on 4 Jan 2002. It describes the goals, design, and implementation |
2366 | + of Ingo's ultra-scalable O(1) scheduler. It now contains a description |
2367 | + of the Staircase Deadline priority scheduler that was built on this |
2368 | + design. |
2369 | + Last Updated: Fri, 4 May 2007 |
2370 | |
2371 | |
2372 | Goal |
2373 | @@ -163,3 +166,222 @@ certain code paths and data constructs. |
2374 | code is smaller than the old one. |
2375 | |
2376 | Ingo |
2377 | + |
2378 | + |
2379 | +Staircase Deadline cpu scheduler policy |
2380 | +================================================ |
2381 | + |
2382 | +Design summary |
2383 | +============== |
2384 | + |
2385 | +A novel design which incorporates a foreground-background descending priority |
2386 | +system (the staircase) via a bandwidth allocation matrix according to nice |
2387 | +level. |
2388 | + |
2389 | + |
2390 | +Features |
2391 | +======== |
2392 | + |
2393 | +A starvation free, strict fairness O(1) scalable design with interactivity |
2394 | +as good as the above restrictions can provide. There is no interactivity |
2395 | +estimator, no sleep/run measurements and only simple fixed accounting. |
2396 | +The design has strict enough a design and accounting that task behaviour |
2397 | +can be modelled and maximum scheduling latencies can be predicted by |
2398 | +the virtual deadline mechanism that manages runqueues. The prime concern |
2399 | +in this design is to maintain fairness at all costs determined by nice level, |
2400 | +yet to maintain as good interactivity as can be allowed within the |
2401 | +constraints of strict fairness. |
2402 | + |
2403 | + |
2404 | +Design description |
2405 | +================== |
2406 | + |
2407 | +SD works off the principle of providing each task a quota of runtime that it is |
2408 | +allowed to run at a number of priority levels determined by its static priority |
2409 | +(ie. its nice level). If the task uses up its quota it has its priority |
2410 | +decremented to the next level determined by a priority matrix. Once every |
2411 | +runtime quota has been consumed of every priority level, a task is queued on the |
2412 | +"expired" array. When no other tasks exist with quota, the expired array is |
2413 | +activated and fresh quotas are handed out. This is all done in O(1). |
2414 | + |
2415 | +Design details |
2416 | +============== |
2417 | + |
2418 | +Each task keeps a record of its own entitlement of cpu time. Most of the rest of |
2419 | +these details apply to non-realtime tasks as rt task management is straight |
2420 | +forward. |
2421 | + |
2422 | +Each runqueue keeps a record of what major epoch it is up to in the |
2423 | +rq->prio_rotation field which is incremented on each major epoch. It also |
2424 | +keeps a record of the current prio_level for each static priority task. |
2425 | + |
2426 | +Each task keeps a record of what major runqueue epoch it was last running |
2427 | +on in p->rotation. It also keeps a record of what priority levels it has |
2428 | +already been allocated quota from during this epoch in a bitmap p->bitmap. |
2429 | + |
2430 | +The only tunable that determines all other details is the RR_INTERVAL. This |
2431 | +is set to 8ms, and is scaled gently upwards with more cpus. This value is |
2432 | +tunable via a /proc interface. |
2433 | + |
2434 | +All tasks are initially given a quota based on RR_INTERVAL. This is equal to |
2435 | +RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and |
2436 | +progressively larger for nice values from -1 to -20. This is assigned to |
2437 | +p->quota and only changes with changes in nice level. |
2438 | + |
2439 | +As a task is first queued, it checks in recalc_task_prio to see if it has run at |
2440 | +this runqueue's current priority rotation. If it has not, it will have its |
2441 | +p->prio level set according to the first slot in a "priority matrix" and will be |
2442 | +given a p->time_slice equal to p->quota, and will have its allocation bit |
2443 | +set in p->bitmap for this prio level. It is then queued on the current active |
2444 | +priority array. |
2445 | + |
2446 | +If a task has already been running during this major epoch, and it has |
2447 | +p->time_slice left and the rq->prio_quota for the task's p->prio still |
2448 | +has quota, it will be placed back on the active array, but no more quota |
2449 | +will be added. |
2450 | + |
2451 | +If a task has been running during this major epoch, but does not have |
2452 | +p->time_slice left, it will find the next lowest priority in its bitmap that it |
2453 | +has not yet been allocated quota from. It then receives a full quota in |
2454 | +p->time_slice and is queued on the current active priority array at the |
2455 | +newly determined lower priority. |
2456 | + |
2457 | +If a task has been running during this major epoch, has no entitlement |
2458 | +left in p->bitmap and has no time_slice left, it will have its |
2459 | +bitmap cleared, and be queued at its best prio again, but on the expired |
2460 | +priority array. |
2461 | + |
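+The three cases above can be summarised in a small decision helper. This is an
+illustrative sketch only; the type and field names below are simplified
+stand-ins rather than the kernel's own:
+
+	/* Where a requeued task should go, per the cases described above. */
+	enum requeue_target {
+		ACTIVE_SAME_PRIO,	/* quota left at this priority level */
+		ACTIVE_LOWER_PRIO,	/* fresh quota at the next allowed level */
+		EXPIRED_ARRAY		/* entitlement for this epoch used up */
+	};
+
+	struct sd_task_state {
+		int time_slice_left;	/* remaining quota at the current level */
+		int unused_slots;	/* prio levels not yet drawn from this epoch */
+	};
+
+	static enum requeue_target classify(const struct sd_task_state *p)
+	{
+		if (p->time_slice_left > 0)
+			return ACTIVE_SAME_PRIO;
+		if (p->unused_slots > 0)
+			return ACTIVE_LOWER_PRIO;
+		return EXPIRED_ARRAY;	/* bitmap cleared, best prio again */
+	}
+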
2462 | +When a task is queued, it has its relevant bit set in the array->prio_bitmap. |
2463 | + |
2464 | +p->time_slice is accounted in microseconds and is updated via update_cpu_clock |
2465 | +on schedule() and scheduler_tick. If p->time_slice reaches zero or below, the |
2466 | +task's priority is readjusted via recalc_task_prio and the task is rescheduled. |
2467 | + |
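+A minimal sketch of that accounting, assuming the microsecond units used by
+update_cpu_clock() in this patch (names are simplified and illustrative only):
+
+	#include <stdbool.h>
+
+	struct sd_quota {
+		long time_slice;	/* remaining quota at this prio level, usecs */
+		long quota;		/* full per-level quota, usecs */
+	};
+
+	/*
+	 * Charge the nanoseconds a task just ran against its time_slice and
+	 * report whether its entitlement at the current priority level is gone,
+	 * in which case recalc_task_prio would drop it a level with a new quota.
+	 */
+	static bool charge_runtime(struct sd_quota *p, long long ran_ns)
+	{
+		p->time_slice -= ran_ns / 1000;		/* ns -> usecs */
+		return p->time_slice <= 0;
+	}
+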
2468 | + |
2469 | +Priority Matrix |
2470 | +=============== |
2471 | + |
2472 | +In order to minimise the latencies between tasks of different nice levels |
2473 | +running concurrently, the dynamic priority slots where different nice levels |
2474 | +are queued are dithered instead of being sequential. What this means is that |
2475 | +there are 40 priority slots where a task may run during one major rotation, |
2476 | +and the allocation of slots is dependent on nice level. In the |
2477 | +following table, a zero represents a slot where the task may run. |
2478 | + |
2479 | +PRIORITY:0..................20.................39 |
2480 | +nice -20 0000000000000000000000000000000000000000 |
2481 | +nice -10 1000100010001000100010001000100010010000 |
2482 | +nice 0 1010101010101010101010101010101010101010 |
2483 | +nice 5 1011010110110101101101011011010110110110 |
2484 | +nice 10 1110111011101110111011101110111011101110 |
2485 | +nice 15 1111111011111110111111101111111011111110 |
2486 | +nice 19 1111111111111111111111111111111111111110 |
2487 | + |
2488 | +As can be seen, a nice -20 task runs in every priority slot whereas a nice 19 |
2489 | +task only runs one slot per major rotation. This dithered table allows for the |
2490 | +smallest possible maximum latencies between tasks of varying nice levels, thus |
2491 | +making it practical to use vastly different nice levels together. |
2492 | + |
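+The dithered allocation is generated at boot by a small loop this patch adds
+to sched_init(). The stand-alone sketch below mirrors that loop so the slots
+for any nice level can be inspected; it is illustrative only, and its exact
+spacing may differ slightly from the hand-edited table above:
+
+	#include <stdio.h>
+	#include <string.h>
+
+	#define PRIO_RANGE 40	/* nice -20..19 */
+
+	int main(void)
+	{
+		char matrix[PRIO_RANGE][PRIO_RANGE];
+		int i, j, k;
+
+		for (i = 0; i < PRIO_RANGE; i++) {
+			/* Start with every slot blocked... */
+			memset(matrix[i], '1', PRIO_RANGE);
+			/* ...then clear dithered slots; better nice clears more. */
+			j = PRIO_RANGE * PRIO_RANGE / (PRIO_RANGE - i);
+			for (k = 0; k <= PRIO_RANGE * (PRIO_RANGE - 1); k += j)
+				matrix[i][PRIO_RANGE - 1 - (k / PRIO_RANGE)] = '0';
+		}
+
+		for (i = 0; i < PRIO_RANGE; i++)
+			printf("nice %3d %.*s\n", i - 20, PRIO_RANGE, matrix[i]);
+		return 0;
+	}
+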
2493 | +SCHED_BATCH tasks are managed slightly differently, receiving only the top |
2494 | +slots from their priority bitmap, giving them the same cpu share as |
2495 | +SCHED_NORMAL tasks but slightly higher latencies. |
2496 | + |
2497 | + |
2498 | +Modelling deadline behaviour |
2499 | +============================ |
2500 | + |
2501 | +As the accounting in this design is hard and not modified by sleep average |
2502 | +calculations or interactivity modifiers, it is possible to accurately |
2503 | +predict the maximum latency that a task may experience under different |
2504 | +conditions. This is a virtual deadline mechanism enforced by mandatory |
2505 | +timeslice expiration rather than by external bandwidth measurement. |
2506 | + |
2507 | +The maximum duration a task can run during one major epoch is determined by its |
2508 | +nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL |
2509 | +duration during each epoch. Nice 10 tasks can run at 9 priority levels for each |
2510 | +epoch, and so on. The table in the priority matrix above demonstrates how this |
2511 | +is enforced. |
2512 | + |
2513 | +Therefore the maximum duration a runqueue epoch can take is determined by |
2514 | +the number of tasks running and their nice levels. Beyond that, the maximum |
2515 | +duration a task can wait before it gets scheduled is determined by the |
2516 | +position of its first slot in the matrix. |
2517 | + |
2518 | +In the following examples, these are _worst case scenarios_ and would rarely |
2519 | +occur, but can be modelled nonetheless to determine the maximum possible |
2520 | +latency. |
2521 | + |
2522 | +So for example, if two nice 0 tasks are running, and one has just expired as |
2523 | +another is activated for the first time receiving a full quota for this |
2524 | +runqueue rotation, the first task will wait: |
2525 | + |
2526 | +nr_tasks * max_duration + nice_difference * rr_interval |
2527 | +1 * 19 * RR_INTERVAL + 0 = 152ms |
2528 | + |
2529 | +In the presence of a nice 10 task, a nice 0 task would wait a maximum of |
2530 | +1 * 10 * RR_INTERVAL + 0 = 80ms |
2531 | + |
2532 | +In the presence of a nice 0 task, a nice 10 task would wait a maximum of |
2533 | +1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms |
2534 | + |
2535 | +More useful than these values, though, are the average latencies which are |
2536 | +a matter of determining the average distance between priority slots of |
2537 | +different nice values and multiplying them by the tasks' quota. For example |
2538 | +in the presence of a nice -10 task, a nice 0 task will wait either one or |
2539 | +two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL, |
2540 | +this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or |
2541 | +20 and 40ms respectively (on uniprocessor at 1000HZ). |
2542 | + |
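+The worst case figures above follow directly from the formula given; a trivial
+helper (illustrative only) makes the arithmetic explicit, taking rr_interval in
+milliseconds and slot counts from the priority matrix:
+
+	/* Worst-case wait as modelled above, in milliseconds. */
+	static int worst_case_wait_ms(int nr_other_tasks, int their_run_slots,
+				      int nice_difference_slots, int rr_interval_ms)
+	{
+		return nr_other_tasks * their_run_slots * rr_interval_ms +
+		       nice_difference_slots * rr_interval_ms;
+	}
+
+	/* e.g. worst_case_wait_ms(1, 19, 0, 8) == 152, as in the first example. */
+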
2543 | + |
2544 | +Achieving interactivity |
2545 | +======================= |
2546 | + |
2547 | +A requirement of this scheduler design was to achieve good interactivity |
2548 | +despite being a completely fair deadline based design. The disadvantage of |
2549 | +designs that try to achieve interactivity is that they usually do so at |
2550 | +the expense of maintaining fairness. As cpu speeds increase, the requirement |
2551 | +for some sort of metered unfairness towards interactive tasks becomes a less |
2552 | +desirable phenomenon, but low latency and fairness remain mandatory for |
2553 | +good interactive performance. |
2554 | + |
2555 | +This design relies on the fact that interactive tasks, by their nature, |
2556 | +sleep often. Most fair scheduling designs end up penalising such tasks |
2557 | +indirectly, giving them less than their possible fair share because of the |
2558 | +sleep, and have to use a mechanism of bonusing their priority to offset |
2559 | +this based on the duration they sleep. This becomes increasingly inaccurate |
2560 | +as the number of running tasks rises and more tasks spend time waiting on |
2561 | +runqueues rather than sleeping, and it is impossible to tell whether the |
2562 | +task that's waiting on a runqueue only intends to run for a short period and |
2563 | +then sleep again after that runqueue wait. Furthermore, all such designs rely |
2564 | +on a period of time to pass to accumulate some form of statistic on the task |
2565 | +before deciding on how much to give them preference. The shorter this period, |
2566 | +the more rapidly bursts of cpu use ruin interactive tasks' behaviour. The |
2567 | +longer this period, the longer it takes for interactive tasks to get low |
2568 | +scheduling latencies and fair cpu. |
2569 | + |
2570 | +This design does not measure sleep time at all. Interactive tasks that sleep |
2571 | +often will wake up having consumed very little if any of their quota for |
2572 | +the current major priority rotation. The longer they have slept, the less |
2573 | +likely they are to even be on the current major priority rotation. Once |
2574 | +woken up, though, they get to use up their full quota for that epoch, whether |
2575 | +that is a partial quota remaining or a fresh full one. Overall, however, they |
2576 | +can still only run as much cpu time for that epoch as any other task of the |
2577 | +same nice level. This means that two tasks behaving completely differently |
2578 | +from fully cpu bound to waking/sleeping extremely frequently will still |
2579 | +get the same quota of cpu, but the latter will be using its quota for that |
2580 | +epoch in bursts rather than continuously. This guarantees that interactive |
2581 | +tasks get the same amount of cpu as cpu bound ones. |
2582 | + |
2583 | +The other requirement of interactive tasks is also to obtain low latencies |
2584 | +for when they are scheduled. Unlike fully cpu bound tasks and the maximum |
2585 | +latencies possible described in the modelling deadline behaviour section |
2586 | +above, tasks that sleep will wake up with quota available usually at the |
2587 | +current runqueue's priority_level or better. This means that the most latency |
2588 | +they are likely to see is one RR_INTERVAL, and often they will preempt the |
2589 | +current task if it is not of a sleeping nature. This then guarantees very |
2590 | +low latency for interactive tasks, and the lowest latencies for the least |
2591 | +cpu bound tasks. |
2592 | + |
2593 | + |
2594 | +Fri, 4 May 2007 |
2595 | +Con Kolivas <kernel@kolivas.org> |
2596 | Index: linux-2.6.21-ck2/kernel/softirq.c |
2597 | =================================================================== |
2598 | --- linux-2.6.21-ck2.orig/kernel/softirq.c 2007-05-03 22:20:57.000000000 +1000 |
2599 | +++ linux-2.6.21-ck2/kernel/softirq.c 2007-05-14 19:30:30.000000000 +1000 |
2600 | @@ -488,7 +488,7 @@ void __init softirq_init(void) |
2601 | |
2602 | static int ksoftirqd(void * __bind_cpu) |
2603 | { |
2604 | - set_user_nice(current, 19); |
2605 | + set_user_nice(current, 15); |
2606 | current->flags |= PF_NOFREEZE; |
2607 | |
2608 | set_current_state(TASK_INTERRUPTIBLE); |