Magellan Linux

Contents of /trunk/kernel26-magellan/patches-2.6.16-r10/0001-2.6.16-sched-implement-smpnice.patch



Revision 70
Thu May 11 19:09:22 2006 UTC by niro
File size: 22925 bytes
import

1
2 From: Peter Williams <pwil3058@bigpond.net.au>
3
4 Problem:
5
6 The introduction of separate run queues per CPU has brought with it "nice"
7 enforcement problems that are best described by a simple example.
8
9 For the sake of argument, suppose that on a single CPU machine with a
10 nice==19 hard spinner and a nice==0 hard spinner running, the nice==0
11 task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
12 suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
13 2 nice==0 hard spinners running. The user of this system would be entitled
14 to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
15 tasks only get 5% each. However, whether this expectation is met is pretty
16 much down to luck as there are four equally likely distributions of the
17 tasks to the CPUs that the load balancing code will consider to be balanced
18 with loads of 2.0 for each CPU. Two of these distributions involve one
19 nice==0 and one nice==19 task per CPU and in these circumstances the user's
20 expectations will be met. The other two distributions involve both
21 nice==0 tasks being on one CPU and both nice==19 tasks on the other CPU; in
22 that case each task will get 50% of a CPU and the user's expectations will
23 not be met.
24
25 Solution:
26
27 The solution to this problem that is implemented in the attached patch is
28 to use weighted loads when determining if the system is balanced and, when
29 an imbalance is detected, to move an amount of weighted load between run
30 queues (as opposed to a number of tasks) to restore the balance. Once
31 again, the easiest way to explain why both of these measures are necessary
32 is to use a simple example. Suppose (in a slight variation of the
33 above example) that we have a two CPU system with 4 nice==0 and 4 nice==19
34 hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
35 the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
36 CPUs would be 4.0 and 0.2 respectively and the load balancing code would
37 move 2 tasks resulting in one CPU with a load of 2.0 and the other with
38 load of 2.2. If this was considered to be a big enough imbalance to
39 justify moving a task and that task was moved using the current
40 move_tasks() then it would move the highest priority task that it found and
41 this would result in one CPU with a load of 3.0 and the other with a load
42 of 1.2 which would result in the movement of a task in the opposite
43 direction and so on -- infinite loop. If, on the other hand, an amount of
44 load to be moved is calculated from the imbalance (in this case 0.1) and
45 move_tasks() skips tasks until it finds ones whose contributions to the
46 weighted load are less than this amount, it would move two of the nice==19
47 tasks, resulting in a system with 2 nice==0 and 2 nice==19 tasks on each
48 CPU with loads of 2.1 for each CPU.
49
50 One of the advantages of this mechanism is that on a system where all tasks
51 have nice==0 the load balancing calculations would be mathematically
52 identical to the current load balancing code.
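
To make the example arithmetic concrete, the following user-space sketch
(illustrative only, not part of the patch) evaluates the weight macros that
the diff below adds to kernel/sched.c. It assumes SCHED_LOAD_SCALE is 128,
its 2.6.16 value, so the "1.0" and "0.05" used in the prose become 128 and 6:

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL        /* assumed: the 2.6.16 value */

/* weight macros as added to kernel/sched.c by the diff below */
#define NICE_TO_LOAD_PRIO(nice) \
        ((nice >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))
#define LOAD_WEIGHT(lp) \
        (((lp) * SCHED_LOAD_SCALE) / NICE_TO_LOAD_PRIO(0))
#define NICE_TO_LOAD_WEIGHT(nice) LOAD_WEIGHT(NICE_TO_LOAD_PRIO(nice))

int main(void)
{
        unsigned long w0 = NICE_TO_LOAD_WEIGHT(0);      /* 128, the "1.0" above */
        unsigned long w19 = NICE_TO_LOAD_WEIGHT(19);    /* 6, roughly "0.05" */

        /* the two spinner placements from the Problem section */
        printf("segregated: %lu vs %lu\n", 2 * w0, 2 * w19);    /* 256 vs 12 */
        printf("mixed:      %lu vs %lu\n", w0 + w19, w0 + w19); /* 134 vs 134 */

        /* the 4+4 example: "4.0" vs "0.2" before any balancing */
        printf("4+4 start:  %lu vs %lu\n", 4 * w0, 4 * w19);    /* 512 vs 24 */
        return 0;
}

A count-based load rates the segregated and mixed placements identically (two
tasks per CPU); the weighted load is what lets the balancer tell them apart.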
53
54 Notes:
55
56 struct task_struct:
57
58 has a new field load_weight which (in a trade-off of space for speed)
59 stores the contribution that this task makes to a CPU's weighted load when
60 it is runnable.
61
62 struct runqueue:
63
64 has a new field raw_weighted_load which is the sum of the load_weight
65 values for the currently runnable tasks on this run queue. This field
66 always needs to be updated when nr_running is updated so two new inline
67 functions inc_nr_running() and dec_nr_running() have been created to make
68 sure that this happens. This also offers a convenient way to optimize away
69 this part of the smpnice mechanism when CONFIG_SMP is not defined.
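
As a toy illustration of the invariant these helpers maintain, here is a
user-space model (the real helpers take a task_t and appear verbatim in the
diff below; the weights 128 and 6 are the assumed nice==0 and nice==19 values
from the earlier sketch):

#include <stdio.h>

struct rq {
        unsigned long nr_running;
        unsigned long raw_weighted_load;
};

/* toy counterparts of the new helpers: the two fields always move together */
static void inc_nr_running(struct rq *rq, unsigned long load_weight)
{
        rq->nr_running++;
        rq->raw_weighted_load += load_weight;
}

static void dec_nr_running(struct rq *rq, unsigned long load_weight)
{
        rq->nr_running--;
        rq->raw_weighted_load -= load_weight;
}

int main(void)
{
        struct rq rq = { 0, 0 };

        inc_nr_running(&rq, 128);       /* a nice==0 task becomes runnable */
        inc_nr_running(&rq, 6);         /* a nice==19 task becomes runnable */
        dec_nr_running(&rq, 128);       /* the nice==0 task sleeps */
        printf("nr_running=%lu raw_weighted_load=%lu\n",
               rq.nr_running, rq.raw_weighted_load);    /* 1 and 6 */
        return 0;
}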
70
71 int try_to_wake_up():
72
73 in this function the value SCHED_LOAD_SCALE is used to represent the load
74 contribution of a single task in various calculations in the code that
75 decides which CPU to put the waking task on. While this would be valid
76 on a system where the nice values for the runnable tasks were distributed
77 evenly around zero, it will lead to anomalous load balancing if the
78 distribution is skewed in either direction. To overcome this problem,
79 SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
80 or by the average load_weight per task for the queue in question (as
81 appropriate).
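
A number makes the anomaly concrete. Assuming the weights above (nice==19 -> 6,
SCHED_LOAD_SCALE == 128), a queue holding four nice==19 spinners has an average
per-task load of 6, so subtracting SCHED_LOAD_SCALE for a sync wakeup would
remove roughly twenty times the load the running task actually contributes:

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL          /* assumed 2.6.16 value */

int main(void)
{
        unsigned long nr_running = 4;
        unsigned long raw_weighted_load = 4 * 6;        /* four nice==19 tasks */

        /* old per-task estimate vs. the queue's real per-task average */
        printf("old: %lu  new: %lu\n",
               SCHED_LOAD_SCALE, raw_weighted_load / nr_running);  /* 128 vs 6 */
        return 0;
}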
82
83 int move_tasks():
84
85 The modifications to this function were complicated by the fact that
86 active_load_balance() uses it to move exactly one task without checking
87 whether an imbalance actually exists. This precluded the simple
88 overloading of max_nr_move with max_load_move and necessitated the addition
89 of the latter as an extra argument to the function. The internal
90 implementation is then modified to move up to max_nr_move tasks and
91 max_load_move of weighted load. This slightly complicates the code where
92 move_tasks() is called and, if active_load_balance() is ever changed to not
93 use move_tasks(), the implementation of move_tasks() should be simplified
94 accordingly.
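
The 4 nice==0 / 4 nice==19 example from the Solution section can be replayed
against this logic in user space (weights as assumed above; the skip test
mirrors the new tmp->load_weight > rem_load_move check, and max_nr_move
mirrors minus_1_or_zero(busiest->nr_running) from the diff below):

#include <stdio.h>

int main(void)
{
        /*
         * Busiest queue after the count-based first move in the example:
         * 2 nice==0 tasks (weight 128) and 4 nice==19 tasks (weight 6),
         * highest priority first.
         */
        unsigned long weights[] = { 128, 128, 6, 6, 6, 6 };
        unsigned long busiest = 280, local = 256;       /* the "2.2" and "2.0" */
        long rem_load_move = 12;                        /* roughly the "0.1" */
        unsigned int i, pulled = 0, max_nr_move = 5;    /* nr_running - 1 */

        for (i = 0; i < sizeof(weights) / sizeof(weights[0]); i++) {
                if ((long)weights[i] > rem_load_move)
                        continue;       /* too heavy for the remaining quota */
                busiest -= weights[i];
                local += weights[i];
                rem_load_move -= weights[i];
                if (++pulled >= max_nr_move || rem_load_move <= 0)
                        break;
        }
        printf("moved %u tasks, loads now %lu vs %lu\n", pulled, busiest, local);
        /* prints: moved 2 tasks, loads now 268 vs 268 -- "2.1" each */
        return 0;
}

Only the two nice==19 tasks fit under the remaining quota, so the system
settles at the balanced split instead of oscillating.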
95
96 struct sched_group *find_busiest_group():
97
98 Similar to try_to_wake_up(), there are places in this function where
99 SCHED_LOAD_SCALE is used to represent the load contribution of a single
100 task and the same issues are created. A similar solution is adopted except
101 that it is now the average per-task contribution to a group's load (as
102 opposed to a run queue's) that is required. As this value is not directly
103 available from the group, it is calculated on the fly as the queues in the
104 groups are visited when determining the busiest group.
105
106 A key change to this function is that it is no longer necessary to scale
107 down *imbalance on exit, as move_tasks() uses the load in its scaled form.
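
A sketch of that on-the-fly calculation, with hypothetical per-queue figures
(the variable names follow the diff below; the numbers are invented purely
for illustration):

#include <stdio.h>

int main(void)
{
        /* two run queues in a hypothetical busiest group */
        unsigned long rq_weighted[] = { 268, 140 };     /* raw_weighted_load */
        unsigned long rq_nr[] = { 4, 2 };               /* nr_running */
        unsigned long sum_weighted_load = 0, sum_nr_running = 0;
        unsigned long busiest_load_per_task, imbalance = 50;
        unsigned int i;

        /* accumulated while the queues in the group are visited */
        for (i = 0; i < 2; i++) {
                sum_weighted_load += rq_weighted[i];
                sum_nr_running += rq_nr[i];
        }
        busiest_load_per_task = sum_weighted_load / sum_nr_running;    /* 68 */

        if (imbalance < busiest_load_per_task)
                printf("imbalance %lu < per-task average %lu: "
                       "consider bumping it so a task can actually move\n",
                       imbalance, busiest_load_per_task);
        return 0;
}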
108
109 void set_user_nice():
110
111 has been modified to update the task's load_weight field when its nice
112 value changes and also to ensure that its run queue's raw_weighted_load
113 field is updated if the task was runnable.
114
115 Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
116 Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
117 Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
118 Acked-by: Ingo Molnar <mingo@elte.hu>
119 Cc: Nick Piggin <nickpiggin@yahoo.com.au>
120 Signed-off-by: Con Kolivas <kernel@kolivas.org>
121 Cc: John Hawkes <hawkes@sgi.com>
122 Signed-off-by: Andrew Morton <akpm@osdl.org>
123 include/linux/sched.h | 3
124 kernel/sched.c | 230 +++++++++++++++++++++++++++++++++++++++-----------
125 2 files changed, 183 insertions(+), 50 deletions(-)
126
127 Index: linux-2.6.16-ck1/include/linux/sched.h
128 ===================================================================
129 --- linux-2.6.16-ck1.orig/include/linux/sched.h 2006-03-20 20:46:27.000000000 +1100
130 +++ linux-2.6.16-ck1/include/linux/sched.h 2006-03-20 20:46:44.000000000 +1100
131 @@ -702,6 +702,9 @@ struct task_struct {
132 int oncpu;
133 #endif
134 int prio, static_prio;
135 +#ifdef CONFIG_SMP
136 + int load_weight; /* for load balancing purposes */
137 +#endif
138 struct list_head run_list;
139 prio_array_t *array;
140
141 Index: linux-2.6.16-ck1/kernel/sched.c
142 ===================================================================
143 --- linux-2.6.16-ck1.orig/kernel/sched.c 2006-03-20 20:46:27.000000000 +1100
144 +++ linux-2.6.16-ck1/kernel/sched.c 2006-03-20 20:46:44.000000000 +1100
145 @@ -208,6 +208,7 @@ struct runqueue {
146 */
147 unsigned long nr_running;
148 #ifdef CONFIG_SMP
149 + unsigned long raw_weighted_load;
150 unsigned long cpu_load[3];
151 #endif
152 unsigned long long nr_switches;
153 @@ -661,13 +662,85 @@ static int effective_prio(task_t *p)
154 return prio;
155 }
156
157 +#ifdef CONFIG_SMP
158 +/*
159 + * To aid in avoiding the subversion of "niceness" due to uneven distribution
160 + * of tasks with abnormal "nice" values across CPUs the contribution that
161 + * each task makes to its run queue's load is weighted according to its
162 + * scheduling class and "nice" value.
163 + */
164 +
165 +/*
166 + * Priority weight for load balancing ranges from 1/20 (nice==19) to 459/20 (RT
167 + * priority of 100).
168 + */
169 +#define NICE_TO_LOAD_PRIO(nice) \
170 + ((nice >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))
171 +#define LOAD_WEIGHT(lp) \
172 + (((lp) * SCHED_LOAD_SCALE) / NICE_TO_LOAD_PRIO(0))
173 +#define NICE_TO_LOAD_WEIGHT(nice) LOAD_WEIGHT(NICE_TO_LOAD_PRIO(nice))
174 +#define PRIO_TO_LOAD_WEIGHT(prio) NICE_TO_LOAD_WEIGHT(PRIO_TO_NICE(prio))
175 +#define RTPRIO_TO_LOAD_WEIGHT(rp) \
176 + LOAD_WEIGHT(NICE_TO_LOAD_PRIO(-20) + (rp))
177 +
178 +static inline void set_load_weight(task_t *p)
179 +{
180 + if (rt_task(p)) {
181 + if (p == task_rq(p)->migration_thread)
182 + /*
183 + * The migration thread does the actual balancing.
184 + * Giving its load any weight will skew balancing
185 + * adversely.
186 + */
187 + p->load_weight = 0;
188 + else
189 + p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
190 + } else
191 + p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
192 +}
193 +
194 +static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
195 +{
196 + rq->raw_weighted_load += p->load_weight;
197 +}
198 +
199 +static inline void dec_raw_weighted_load(runqueue_t *rq, const task_t *p)
200 +{
201 + rq->raw_weighted_load -= p->load_weight;
202 +}
203 +#else
204 +static inline void set_load_weight(task_t *p)
205 +{
206 +}
207 +
208 +static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
209 +{
210 +}
211 +
212 +static inline void dec_raw_weighted_load(runqueue_t *rq, const task_t *p)
213 +{
214 +}
215 +#endif
216 +
217 +static inline void inc_nr_running(task_t *p, runqueue_t *rq)
218 +{
219 + rq->nr_running++;
220 + inc_raw_weighted_load(rq, p);
221 +}
222 +
223 +static inline void dec_nr_running(task_t *p, runqueue_t *rq)
224 +{
225 + rq->nr_running--;
226 + dec_raw_weighted_load(rq, p);
227 +}
228 +
229 /*
230 * __activate_task - move a task to the runqueue.
231 */
232 static inline void __activate_task(task_t *p, runqueue_t *rq)
233 {
234 enqueue_task(p, rq->active);
235 - rq->nr_running++;
236 + inc_nr_running(p, rq);
237 }
238
239 /*
240 @@ -676,7 +749,7 @@ static inline void __activate_task(task_
241 static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
242 {
243 enqueue_task_head(p, rq->active);
244 - rq->nr_running++;
245 + inc_nr_running(p, rq);
246 }
247
248 static int recalc_task_prio(task_t *p, unsigned long long now)
249 @@ -800,7 +873,7 @@ static void activate_task(task_t *p, run
250 */
251 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
252 {
253 - rq->nr_running--;
254 + dec_nr_running(p, rq);
255 dequeue_task(p, p->array);
256 p->array = NULL;
257 }
258 @@ -939,7 +1012,8 @@ void kick_process(task_t *p)
259 }
260
261 /*
262 - * Return a low guess at the load of a migration-source cpu.
263 + * Return a low guess at the load of a migration-source cpu weighted
264 + * according to the scheduling class and "nice" value.
265 *
266 * We want to under-estimate the load of migration sources, to
267 * balance conservatively.
268 @@ -947,24 +1021,36 @@ void kick_process(task_t *p)
269 static inline unsigned long source_load(int cpu, int type)
270 {
271 runqueue_t *rq = cpu_rq(cpu);
272 - unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE;
273 +
274 if (type == 0)
275 - return load_now;
276 + return rq->raw_weighted_load;
277
278 - return min(rq->cpu_load[type-1], load_now);
279 + return min(rq->cpu_load[type-1], rq->raw_weighted_load);
280 }
281
282 /*
283 - * Return a high guess at the load of a migration-target cpu
284 + * Return a high guess at the load of a migration-target cpu weighted
285 + * according to the scheduling class and "nice" value.
286 */
287 static inline unsigned long target_load(int cpu, int type)
288 {
289 runqueue_t *rq = cpu_rq(cpu);
290 - unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE;
291 +
292 if (type == 0)
293 - return load_now;
294 + return rq->raw_weighted_load;
295
296 - return max(rq->cpu_load[type-1], load_now);
297 + return max(rq->cpu_load[type-1], rq->raw_weighted_load);
298 +}
299 +
300 +/*
301 + * Return the average load per task on the cpu's run queue
302 + */
303 +static inline unsigned long cpu_avg_load_per_task(int cpu)
304 +{
305 + runqueue_t *rq = cpu_rq(cpu);
306 + unsigned long n = rq->nr_running;
307 +
308 + return n ? rq->raw_weighted_load / n : rq->raw_weighted_load;
309 }
310
311 /*
312 @@ -1216,17 +1302,19 @@ static int try_to_wake_up(task_t *p, uns
313
314 if (this_sd->flags & SD_WAKE_AFFINE) {
315 unsigned long tl = this_load;
316 + unsigned long tl_per_task = cpu_avg_load_per_task(this_cpu);
317 +
318 /*
319 * If sync wakeup then subtract the (maximum possible)
320 * effect of the currently running task from the load
321 * of the current CPU:
322 */
323 if (sync)
324 - tl -= SCHED_LOAD_SCALE;
325 + tl -= current->load_weight;
326
327 if ((tl <= load &&
328 - tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
329 - 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
330 + tl + target_load(cpu, idx) <= tl_per_task) ||
331 + 100*(tl + p->load_weight) <= imbalance*load) {
332 /*
333 * This domain has SD_WAKE_AFFINE and
334 * p is cache cold in this domain, and
335 @@ -1425,7 +1513,7 @@ void fastcall wake_up_new_task(task_t *p
336 list_add_tail(&p->run_list, &current->run_list);
337 p->array = current->array;
338 p->array->nr_active++;
339 - rq->nr_running++;
340 + inc_nr_running(p, rq);
341 }
342 set_need_resched();
343 } else
344 @@ -1770,9 +1858,9 @@ void pull_task(runqueue_t *src_rq, prio_
345 runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
346 {
347 dequeue_task(p, src_array);
348 - src_rq->nr_running--;
349 + dec_nr_running(p, src_rq);
350 set_task_cpu(p, this_cpu);
351 - this_rq->nr_running++;
352 + inc_nr_running(p, this_rq);
353 enqueue_task(p, this_array);
354 p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
355 + this_rq->timestamp_last_tick;
356 @@ -1820,24 +1908,27 @@ int can_migrate_task(task_t *p, runqueue
357 }
358
359 /*
360 - * move_tasks tries to move up to max_nr_move tasks from busiest to this_rq,
361 - * as part of a balancing operation within "domain". Returns the number of
362 - * tasks moved.
363 + * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
364 + * load from busiest to this_rq, as part of a balancing operation within
365 + * "domain". Returns the number of tasks moved.
366 *
367 * Called with both runqueues locked.
368 */
369 static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
370 - unsigned long max_nr_move, struct sched_domain *sd,
371 - enum idle_type idle, int *all_pinned)
372 + unsigned long max_nr_move, unsigned long max_load_move,
373 + struct sched_domain *sd, enum idle_type idle,
374 + int *all_pinned)
375 {
376 prio_array_t *array, *dst_array;
377 struct list_head *head, *curr;
378 int idx, pulled = 0, pinned = 0;
379 + long rem_load_move;
380 task_t *tmp;
381
382 - if (max_nr_move == 0)
383 + if (max_nr_move == 0 || max_load_move == 0)
384 goto out;
385
386 + rem_load_move = max_load_move;
387 pinned = 1;
388
389 /*
390 @@ -1878,7 +1969,8 @@ skip_queue:
391
392 curr = curr->prev;
393
394 - if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
395 + if (tmp->load_weight > rem_load_move ||
396 + !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
397 if (curr != head)
398 goto skip_queue;
399 idx++;
400 @@ -1892,9 +1984,13 @@ skip_queue:
401
402 pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
403 pulled++;
404 + rem_load_move -= tmp->load_weight;
405
406 - /* We only want to steal up to the prescribed number of tasks. */
407 - if (pulled < max_nr_move) {
408 + /*
409 + * We only want to steal up to the prescribed number of tasks
410 + * and the prescribed amount of weighted load.
411 + */
412 + if (pulled < max_nr_move && rem_load_move > 0) {
413 if (curr != head)
414 goto skip_queue;
415 idx++;
416 @@ -1915,7 +2011,7 @@ out:
417
418 /*
419 * find_busiest_group finds and returns the busiest CPU group within the
420 - * domain. It calculates and returns the number of tasks which should be
421 + * domain. It calculates and returns the amount of weighted load which should be
422 * moved to restore balance via the imbalance parameter.
423 */
424 static struct sched_group *
425 @@ -1925,9 +2021,13 @@ find_busiest_group(struct sched_domain *
426 struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
427 unsigned long max_load, avg_load, total_load, this_load, total_pwr;
428 unsigned long max_pull;
429 + unsigned long busiest_load_per_task, busiest_nr_running;
430 + unsigned long this_load_per_task, this_nr_running;
431 int load_idx;
432
433 max_load = this_load = total_load = total_pwr = 0;
434 + busiest_load_per_task = busiest_nr_running = 0;
435 + this_load_per_task = this_nr_running = 0;
436 if (idle == NOT_IDLE)
437 load_idx = sd->busy_idx;
438 else if (idle == NEWLY_IDLE)
439 @@ -1939,13 +2039,16 @@ find_busiest_group(struct sched_domain *
440 unsigned long load;
441 int local_group;
442 int i;
443 + unsigned long sum_nr_running, sum_weighted_load;
444
445 local_group = cpu_isset(this_cpu, group->cpumask);
446
447 /* Tally up the load of all CPUs in the group */
448 - avg_load = 0;
449 + sum_weighted_load = sum_nr_running = avg_load = 0;
450
451 for_each_cpu_mask(i, group->cpumask) {
452 + runqueue_t *rq = cpu_rq(i);
453 +
454 if (*sd_idle && !idle_cpu(i))
455 *sd_idle = 0;
456
457 @@ -1956,6 +2059,8 @@ find_busiest_group(struct sched_domain *
458 load = source_load(i, load_idx);
459
460 avg_load += load;
461 + sum_nr_running += rq->nr_running;
462 + sum_weighted_load += rq->raw_weighted_load;
463 }
464
465 total_load += avg_load;
466 @@ -1967,14 +2072,18 @@ find_busiest_group(struct sched_domain *
467 if (local_group) {
468 this_load = avg_load;
469 this = group;
470 + this_nr_running = sum_nr_running;
471 + this_load_per_task = sum_weighted_load;
472 } else if (avg_load > max_load) {
473 max_load = avg_load;
474 busiest = group;
475 + busiest_nr_running = sum_nr_running;
476 + busiest_load_per_task = sum_weighted_load;
477 }
478 group = group->next;
479 } while (group != sd->groups);
480
481 - if (!busiest || this_load >= max_load || max_load <= SCHED_LOAD_SCALE)
482 + if (!busiest || this_load >= max_load || busiest_nr_running <= 1)
483 goto out_balanced;
484
485 avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
486 @@ -1983,6 +2092,7 @@ find_busiest_group(struct sched_domain *
487 100*max_load <= sd->imbalance_pct*this_load)
488 goto out_balanced;
489
490 + busiest_load_per_task /= busiest_nr_running;
491 /*
492 * We're trying to get all the cpus to the average_load, so we don't
493 * want to push ourselves above the average load, nor do we wish to
494 @@ -1996,19 +2106,25 @@ find_busiest_group(struct sched_domain *
495 */
496
497 /* Don't want to pull so many tasks that a group would go idle */
498 - max_pull = min(max_load - avg_load, max_load - SCHED_LOAD_SCALE);
499 + max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
500
501 /* How much load to actually move to equalise the imbalance */
502 *imbalance = min(max_pull * busiest->cpu_power,
503 (avg_load - this_load) * this->cpu_power)
504 / SCHED_LOAD_SCALE;
505
506 - if (*imbalance < SCHED_LOAD_SCALE) {
507 + /*
508 + * if *imbalance is less than the average load per runnable task
509 + * there is no guarantee that any tasks will be moved so we'll have
510 + * a think about bumping its value to force at least one task to be
511 + * moved
512 + */
513 + if (*imbalance < busiest_load_per_task) {
514 unsigned long pwr_now = 0, pwr_move = 0;
515 unsigned long tmp;
516
517 - if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
518 - *imbalance = 1;
519 + if (max_load - this_load >= busiest_load_per_task*2) {
520 + *imbalance = busiest_load_per_task;
521 return busiest;
522 }
523
524 @@ -2018,35 +2134,39 @@ find_busiest_group(struct sched_domain *
525 * moving them.
526 */
527
528 - pwr_now += busiest->cpu_power*min(SCHED_LOAD_SCALE, max_load);
529 - pwr_now += this->cpu_power*min(SCHED_LOAD_SCALE, this_load);
530 + pwr_now += busiest->cpu_power *
531 + min(busiest_load_per_task, max_load);
532 + if (this_nr_running)
533 + this_load_per_task /= this_nr_running;
534 + pwr_now += this->cpu_power *
535 + min(this_load_per_task, this_load);
536 pwr_now /= SCHED_LOAD_SCALE;
537
538 /* Amount of load we'd subtract */
539 - tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/busiest->cpu_power;
540 + tmp = busiest_load_per_task*SCHED_LOAD_SCALE/busiest->cpu_power;
541 if (max_load > tmp)
542 - pwr_move += busiest->cpu_power*min(SCHED_LOAD_SCALE,
543 - max_load - tmp);
544 + pwr_move += busiest->cpu_power *
545 + min(busiest_load_per_task, max_load - tmp);
546
547 /* Amount of load we'd add */
548 if (max_load*busiest->cpu_power <
549 - SCHED_LOAD_SCALE*SCHED_LOAD_SCALE)
550 + busiest_load_per_task*SCHED_LOAD_SCALE)
551 tmp = max_load*busiest->cpu_power/this->cpu_power;
552 else
553 - tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/this->cpu_power;
554 - pwr_move += this->cpu_power*min(SCHED_LOAD_SCALE, this_load + tmp);
555 + tmp = busiest_load_per_task*SCHED_LOAD_SCALE/this->cpu_power;
556 + pwr_move += this->cpu_power*min(this_load_per_task, this_load + tmp);
557 pwr_move /= SCHED_LOAD_SCALE;
558
559 /* Move if we gain throughput */
560 - if (pwr_move <= pwr_now)
561 + if (pwr_move > pwr_now)
562 + *imbalance = busiest_load_per_task;
563 + /* or if there's a reasonable chance that *imbalance is big
564 + * enough to cause a move
565 + */
566 + else if (*imbalance <= busiest_load_per_task / 2)
567 goto out_balanced;
568 -
569 - *imbalance = 1;
570 - return busiest;
571 }
572
573 - /* Get rid of the scaling factor, rounding down as we divide */
574 - *imbalance = *imbalance / SCHED_LOAD_SCALE;
575 return busiest;
576
577 out_balanced:
578 @@ -2083,6 +2203,7 @@ static runqueue_t *find_busiest_queue(st
579 */
580 #define MAX_PINNED_INTERVAL 512
581
582 +#define minus_1_or_zero(n) ((n) > 0 ? (n) - 1 : 0)
583 /*
584 * Check this_cpu to ensure it is balanced within domain. Attempt to move
585 * tasks if there is an imbalance.
586 @@ -2130,6 +2251,7 @@ static int load_balance(int this_cpu, ru
587 */
588 double_rq_lock(this_rq, busiest);
589 nr_moved = move_tasks(this_rq, this_cpu, busiest,
590 + minus_1_or_zero(busiest->nr_running),
591 imbalance, sd, idle, &all_pinned);
592 double_rq_unlock(this_rq, busiest);
593
594 @@ -2248,6 +2370,7 @@ static int load_balance_newidle(int this
595 /* Attempt to move tasks */
596 double_lock_balance(this_rq, busiest);
597 nr_moved = move_tasks(this_rq, this_cpu, busiest,
598 + minus_1_or_zero(busiest->nr_running),
599 imbalance, sd, NEWLY_IDLE, NULL);
600 spin_unlock(&busiest->lock);
601 }
602 @@ -2328,7 +2451,8 @@ static void active_load_balance(runqueue
603
604 schedstat_inc(sd, alb_cnt);
605
606 - if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL))
607 + if (move_tasks(target_rq, target_cpu, busiest_rq, 1,
608 + RTPRIO_TO_LOAD_WEIGHT(100), sd, SCHED_IDLE, NULL))
609 schedstat_inc(sd, alb_pushed);
610 else
611 schedstat_inc(sd, alb_failed);
612 @@ -2356,7 +2480,7 @@ static void rebalance_tick(int this_cpu,
613 struct sched_domain *sd;
614 int i;
615
616 - this_load = this_rq->nr_running * SCHED_LOAD_SCALE;
617 + this_load = this_rq->raw_weighted_load;
618 /* Update our load */
619 for (i = 0; i < 3; i++) {
620 unsigned long new_load = this_load;
621 @@ -3466,17 +3590,21 @@ void set_user_nice(task_t *p, long nice)
622 goto out_unlock;
623 }
624 array = p->array;
625 - if (array)
626 + if (array) {
627 dequeue_task(p, array);
628 + dec_raw_weighted_load(rq, p);
629 + }
630
631 old_prio = p->prio;
632 new_prio = NICE_TO_PRIO(nice);
633 delta = new_prio - old_prio;
634 p->static_prio = NICE_TO_PRIO(nice);
635 + set_load_weight(p);
636 p->prio += delta;
637
638 if (array) {
639 enqueue_task(p, array);
640 + inc_raw_weighted_load(rq, p);
641 /*
642 * If the task increased its priority or is running and
643 * lowered its priority, then reschedule its CPU:
644 @@ -3612,6 +3740,7 @@ static void __setscheduler(struct task_s
645 if (policy == SCHED_BATCH)
646 p->sleep_avg = 0;
647 }
648 + set_load_weight(p);
649 }
650
651 /**
652 @@ -6049,6 +6178,7 @@ void __init sched_init(void)
653 }
654 }
655
656 + set_load_weight(&init_task);
657 /*
658 * The boot idle thread does lazy MMU switching as well:
659 */