Contents of /alx-src/tags/kernel26-2.6.12-alx-r9/Documentation/sched-design.txt
Parent Directory | Revision Log
Revision 630 -
(show annotations)
(download)
Wed Mar 4 11:03:09 2009 UTC (15 years, 6 months ago) by niro
File MIME type: text/plain
File size: 7912 byte(s)
Wed Mar 4 11:03:09 2009 UTC (15 years, 6 months ago) by niro
File MIME type: text/plain
File size: 7912 byte(s)
Tag kernel26-2.6.12-alx-r9
1 | Goals, Design and Implementation of the |
2 | new ultra-scalable O(1) scheduler |
3 | |
4 | |
5 | This is an edited version of an email Ingo Molnar sent to |
6 | lkml on 4 Jan 2002. It describes the goals, design, and |
7 | implementation of Ingo's new ultra-scalable O(1) scheduler. |
8 | Last Updated: 18 April 2002. |
9 | |
10 | |
11 | Goal |
12 | ==== |
13 | |
14 | The main goal of the new scheduler is to keep all the good things we know |
15 | and love about the current Linux scheduler: |
16 | |
17 | - good interactive performance even during high load: if the user |
18 | types or clicks then the system must react instantly and must execute |
19 | the user tasks smoothly, even during considerable background load. |
20 | |
21 | - good scheduling/wakeup performance with 1-2 runnable processes. |
22 | |
23 | - fairness: no process should stay without any timeslice for any |
24 | unreasonable amount of time. No process should get an unjustly high |
25 | amount of CPU time. |
26 | |
27 | - priorities: less important tasks can be started with lower priority, |
28 | more important tasks with higher priority. |
29 | |
30 | - SMP efficiency: no CPU should stay idle if there is work to do. |
31 | |
32 | - SMP affinity: processes which run on one CPU should stay affine to |
33 | that CPU. Processes should not bounce between CPUs too frequently. |
34 | |
35 | - plus additional scheduler features: RT scheduling, CPU binding. |
36 | |
37 | and the goal is also to add a few new things: |
38 | |
39 | - fully O(1) scheduling. Are you tired of the recalculation loop |
40 | blowing the L1 cache away every now and then? Do you think the goodness |
41 | loop is taking a bit too long to finish if there are lots of runnable |
42 | processes? This new scheduler takes no prisoners: wakeup(), schedule(), |
43 | the timer interrupt are all O(1) algorithms. There is no recalculation |
44 | loop. There is no goodness loop either. |
45 | |
46 | - 'perfect' SMP scalability. With the new scheduler there is no 'big' |
47 | runqueue_lock anymore - it's all per-CPU runqueues and locks - two |
48 | tasks on two separate CPUs can wake up, schedule and context-switch |
49 | completely in parallel, without any interlocking. All |
50 | scheduling-relevant data is structured for maximum scalability. |
51 | |
52 | - better SMP affinity. The old scheduler has a particular weakness that |
53 | causes the random bouncing of tasks between CPUs if/when higher |
54 | priority/interactive tasks, this was observed and reported by many |
55 | people. The reason is that the timeslice recalculation loop first needs |
56 | every currently running task to consume its timeslice. But when this |
57 | happens on eg. an 8-way system, then this property starves an |
58 | increasing number of CPUs from executing any process. Once the last |
59 | task that has a timeslice left has finished using up that timeslice, |
60 | the recalculation loop is triggered and other CPUs can start executing |
61 | tasks again - after having idled around for a number of timer ticks. |
62 | The more CPUs, the worse this effect. |
63 | |
64 | Furthermore, this same effect causes the bouncing effect as well: |
65 | whenever there is such a 'timeslice squeeze' of the global runqueue, |
66 | idle processors start executing tasks which are not affine to that CPU. |
67 | (because the affine tasks have finished off their timeslices already.) |
68 | |
69 | The new scheduler solves this problem by distributing timeslices on a |
70 | per-CPU basis, without having any global synchronization or |
71 | recalculation. |
72 | |
73 | - batch scheduling. A significant proportion of computing-intensive tasks |
74 | benefit from batch-scheduling, where timeslices are long and processes |
75 | are roundrobin scheduled. The new scheduler does such batch-scheduling |
76 | of the lowest priority tasks - so nice +19 jobs will get |
77 | 'batch-scheduled' automatically. With this scheduler, nice +19 jobs are |
78 | in essence SCHED_IDLE, from an interactiveness point of view. |
79 | |
80 | - handle extreme loads more smoothly, without breakdown and scheduling |
81 | storms. |
82 | |
83 | - O(1) RT scheduling. For those RT folks who are paranoid about the |
84 | O(nr_running) property of the goodness loop and the recalculation loop. |
85 | |
86 | - run fork()ed children before the parent. Andrea has pointed out the |
87 | advantages of this a few months ago, but patches for this feature |
88 | do not work with the old scheduler as well as they should, |
89 | because idle processes often steal the new child before the fork()ing |
90 | CPU gets to execute it. |
91 | |
92 | |
93 | Design |
94 | ====== |
95 | |
96 | the core of the new scheduler are the following mechanizms: |
97 | |
98 | - *two*, priority-ordered 'priority arrays' per CPU. There is an 'active' |
99 | array and an 'expired' array. The active array contains all tasks that |
100 | are affine to this CPU and have timeslices left. The expired array |
101 | contains all tasks which have used up their timeslices - but this array |
102 | is kept sorted as well. The active and expired array is not accessed |
103 | directly, it's accessed through two pointers in the per-CPU runqueue |
104 | structure. If all active tasks are used up then we 'switch' the two |
105 | pointers and from now on the ready-to-go (former-) expired array is the |
106 | active array - and the empty active array serves as the new collector |
107 | for expired tasks. |
108 | |
109 | - there is a 64-bit bitmap cache for array indices. Finding the highest |
110 | priority task is thus a matter of two x86 BSFL bit-search instructions. |
111 | |
112 | the split-array solution enables us to have an arbitrary number of active |
113 | and expired tasks, and the recalculation of timeslices can be done |
114 | immediately when the timeslice expires. Because the arrays are always |
115 | access through the pointers in the runqueue, switching the two arrays can |
116 | be done very quickly. |
117 | |
118 | this is a hybride priority-list approach coupled with roundrobin |
119 | scheduling and the array-switch method of distributing timeslices. |
120 | |
121 | - there is a per-task 'load estimator'. |
122 | |
123 | one of the toughest things to get right is good interactive feel during |
124 | heavy system load. While playing with various scheduler variants i found |
125 | that the best interactive feel is achieved not by 'boosting' interactive |
126 | tasks, but by 'punishing' tasks that want to use more CPU time than there |
127 | is available. This method is also much easier to do in an O(1) fashion. |
128 | |
129 | to establish the actual 'load' the task contributes to the system, a |
130 | complex-looking but pretty accurate method is used: there is a 4-entry |
131 | 'history' ringbuffer of the task's activities during the last 4 seconds. |
132 | This ringbuffer is operated without much overhead. The entries tell the |
133 | scheduler a pretty accurate load-history of the task: has it used up more |
134 | CPU time or less during the past N seconds. [the size '4' and the interval |
135 | of 4x 1 seconds was found by lots of experimentation - this part is |
136 | flexible and can be changed in both directions.] |
137 | |
138 | the penalty a task gets for generating more load than the CPU can handle |
139 | is a priority decrease - there is a maximum amount to this penalty |
140 | relative to their static priority, so even fully CPU-bound tasks will |
141 | observe each other's priorities, and will share the CPU accordingly. |
142 | |
143 | the SMP load-balancer can be extended/switched with additional parallel |
144 | computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs |
145 | can be supported easily by changing the load-balancer. Right now it's |
146 | tuned for my SMP systems. |
147 | |
148 | i skipped the prev->mm == next->mm advantage - no workload i know of shows |
149 | any sensitivity to this. It can be added back by sacrificing O(1) |
150 | schedule() [the current and one-lower priority list can be searched for a |
151 | that->mm == current->mm condition], but costs a fair number of cycles |
152 | during a number of important workloads, so i wanted to avoid this as much |
153 | as possible. |
154 | |
155 | - the SMP idle-task startup code was still racy and the new scheduler |
156 | triggered this. So i streamlined the idle-setup code a bit. We do not call |
157 | into schedule() before all processors have started up fully and all idle |
158 | threads are in place. |
159 | |
160 | - the patch also cleans up a number of aspects of sched.c - moves code |
161 | into other areas of the kernel where it's appropriate, and simplifies |
162 | certain code paths and data constructs. As a result, the new scheduler's |
163 | code is smaller than the old one. |
164 | |
165 | Ingo |