Contents of /alx-src/tags/kernel26-2.6.12-alx-r9/Documentation/cpusets.txt
Revision 630
Wed Mar 4 11:03:09 2009 UTC (15 years, 6 months ago) by niro
File MIME type: text/plain
File size: 16648 byte(s)
Tag kernel26-2.6.12-alx-r9
1 | CPUSETS |
2 | ------- |
3 | |
4 | Copyright (C) 2004 BULL SA. |
5 | Written by Simon.Derr@bull.net |
6 | |
7 | Portions Copyright (c) 2004 Silicon Graphics, Inc. |
8 | Modified by Paul Jackson <pj@sgi.com> |
9 | |
10 | CONTENTS: |
11 | ========= |
12 | |
13 | 1. Cpusets |
14 | 1.1 What are cpusets ? |
15 | 1.2 Why are cpusets needed ? |
16 | 1.3 How are cpusets implemented ? |
17 | 1.4 How do I use cpusets ? |
18 | 2. Usage Examples and Syntax |
19 | 2.1 Basic Usage |
20 | 2.2 Adding/removing cpus |
21 | 2.3 Setting flags |
22 | 2.4 Attaching processes |
23 | 3. Questions |
24 | 4. Contact |
25 | |
26 | 1. Cpusets |
27 | ========== |
28 | |
29 | 1.1 What are cpusets ? |
30 | ---------------------- |
31 | |
32 | Cpusets provide a mechanism for assigning a set of CPUs and Memory |
33 | Nodes to a set of tasks. |
34 | |
35 | Cpusets constrain the CPU and Memory placement of tasks to only |
36 | the resources within a task's current cpuset. They form a nested |
37 | hierarchy visible in a virtual file system. These are the essential |
38 | hooks, beyond what is already present, required to manage dynamic |
39 | job placement on large systems. |
40 | |
41 | Each task has a pointer to a cpuset. Multiple tasks may reference |
42 | the same cpuset. Requests by a task, using the sched_setaffinity(2) |
43 | system call to include CPUs in its CPU affinity mask, and using the |
44 | mbind(2) and set_mempolicy(2) system calls to include Memory Nodes |
45 | in its memory policy, are both filtered through that task's cpuset, |
46 | filtering out any CPUs or Memory Nodes not in that cpuset. The |
47 | scheduler will not schedule a task on a CPU that is not allowed in |
48 | its cpus_allowed vector, and the kernel page allocator will not |
49 | allocate a page on a node that is not allowed in the requesting task's |
50 | mems_allowed vector. |
51 | |
52 | If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct |
53 | ancestor or descendant, may share any of the same CPUs or Memory Nodes. |
54 | |
55 | User level code may create and destroy cpusets by name in the cpuset |
56 | virtual file system, manage the attributes and permissions of these |
57 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, |
58 | specify and query to which cpuset a task is assigned, and list the |
59 | task pids assigned to a cpuset. |
60 | |
61 | |
62 | 1.2 Why are cpusets needed ? |
63 | ---------------------------- |
64 | |
65 | The management of large computer systems, with many processors (CPUs), |
66 | complex memory cache hierarchies and multiple Memory Nodes having |
67 | non-uniform access times (NUMA) presents additional challenges for |
68 | the efficient scheduling and memory placement of processes. |
69 | |
70 | Frequently more modest sized systems can be operated with adequate |
71 | efficiency just by letting the operating system automatically share |
72 | the available CPU and Memory resources amongst the requesting tasks. |
73 | |
74 | But larger systems, which benefit more from careful processor and |
75 | memory placement to reduce memory access times and contention, |
76 | and which typically represent a larger investment for the customer, |
77 | can benefit from explicitly placing jobs on properly sized subsets of |
78 | the system. |
79 | |
80 | This can be especially valuable on: |
81 | |
82 | * Web Servers running multiple instances of the same web application, |
83 | * Servers running different applications (for instance, a web server |
84 | and a database), or |
85 | * NUMA systems running large HPC applications with demanding |
86 | performance characteristics. |
87 | |
88 | These subsets, or "soft partitions", must be able to be dynamically |
89 | adjusted, as the job mix changes, without impacting other concurrently |
90 | executing jobs. |
91 | |
92 | The kernel cpuset patch provides the minimum essential kernel |
93 | mechanisms required to efficiently implement such subsets. It |
94 | leverages existing CPU and Memory Placement facilities in the Linux |
95 | kernel to avoid any additional impact on the critical scheduler or |
96 | memory allocator code. |
97 | |
98 | |
99 | 1.3 How are cpusets implemented ? |
100 | --------------------------------- |
101 | |
102 | Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain |
103 | which CPUs and Memory Nodes are used by a process or set of processes. |
104 | |
105 | The Linux kernel already has a pair of mechanisms to specify on which |
106 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory |
107 | Nodes it may obtain memory (mbind, set_mempolicy). |
108 | |
109 | Cpusets extend these two mechanisms as follows: |
110 | |
111 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the |
112 | kernel. |
113 | - Each task in the system is attached to a cpuset, via a pointer |
114 | in the task structure to a reference counted cpuset structure. |
115 | - Calls to sched_setaffinity are filtered to just those CPUs |
116 | allowed in that task's cpuset. |
117 | - Calls to mbind and set_mempolicy are filtered to just |
118 | those Memory Nodes allowed in that task's cpuset. |
119 | - The root cpuset contains all the system's CPUs and Memory |
120 | Nodes. |
121 | - For any cpuset, one can define child cpusets containing a subset |
122 | of the parent's CPU and Memory Node resources. |
123 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for |
124 | browsing and manipulation from user space. |
125 | - A cpuset may be marked exclusive, which ensures that no other |
126 | cpuset (except direct ancestors and descendants) may contain |
127 | any overlapping CPUs or Memory Nodes. |
128 | - You can list all the tasks (by pid) attached to any cpuset. |
129 | |
130 | The implementation of cpusets requires a few, simple hooks |
131 | into the rest of the kernel, none in performance critical paths: |
132 | |
133 | - in init/main.c, to initialize the root cpuset at system boot. |
134 | - in fork and exit, to attach and detach a task from its cpuset. |
135 | - in sched_setaffinity, to mask the requested CPUs by what's |
136 | allowed in that task's cpuset. |
137 | - in sched.c migrate_all_tasks(), to keep migrating tasks within |
138 | the CPUs allowed by their cpuset, if possible. |
139 | - in the mbind and set_mempolicy system calls, to mask the requested |
140 | Memory Nodes by what's allowed in that task's cpuset. |
141 | - in page_alloc, to restrict memory to allowed nodes. |
142 | - in vmscan.c, to restrict page recovery to the current cpuset. |
143 | |
144 | In addition a new file system, of type "cpuset" may be mounted, |
145 | typically at /dev/cpuset, to enable browsing and modifying the cpusets |
146 | presently known to the kernel. No new system calls are added for |
147 | cpusets - all support for querying and modifying cpusets is via |
148 | this cpuset file system. |
149 | |
150 | Each task under /proc has an added file named 'cpuset', displaying |
151 | the cpuset name, as the path relative to the root of the cpuset file |
152 | system. |
153 | |
154 | The /proc/<pid>/status file for each task has two added lines, |
155 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) |
156 | and mems_allowed (on which Memory Nodes it may obtain memory), |
157 | in the format seen in the following example: |
158 | |
159 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff |
160 | Mems_allowed: ffffffff,ffffffff |
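As a rough illustration of how to read these masks, the sketch below
decodes a Cpus_allowed or Mems_allowed value into a list of CPU or
Memory Node numbers. It assumes the mask is a comma-separated sequence
of 32-bit hexadecimal words with the most significant word first. The
function name mask_to_list is made up for this example and is not part
of any cpuset tool.

```shell
# Hypothetical helper: decode a mask such as "00000000,0000000c"
# into the list "2 3".
# Assumption: comma-separated 32-bit hex words, most significant first.
mask_to_list() {
    hex=$(printf '%s' "$1" | tr -d ',')
    n=0
    out=""
    # Walk the hex digits from the least significant (rightmost) end.
    while [ -n "$hex" ]; do
        digit=${hex#"${hex%?}"}    # last character of $hex
        hex=${hex%?}               # drop that character
        val=$((0x$digit))
        # Each hex digit covers four consecutive CPU or node numbers.
        for bit in 0 1 2 3; do
            if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
                out="$out${out:+ }$n"
            fi
            n=$((n + 1))
        done
    done
    printf '%s\n' "$out"
}
```

For example, "mask_to_list 0000000c" prints "2 3", which is the mask a
task restricted to CPUs 2 and 3 would be expected to show.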
161 | |
162 | Each cpuset is represented by a directory in the cpuset file system |
163 | containing the following files describing that cpuset: |
164 | |
165 | - cpus: list of CPUs in that cpuset |
166 | - mems: list of Memory Nodes in that cpuset |
167 | - cpu_exclusive flag: is cpu placement exclusive? |
168 | - mem_exclusive flag: is memory placement exclusive? |
169 | - tasks: list of tasks (by pid) attached to that cpuset |
170 | |
171 | New cpusets are created using the mkdir system call or shell |
172 | command. The properties of a cpuset, such as its flags, allowed |
173 | CPUs and Memory Nodes, and attached tasks, are modified by writing |
174 | to the appropriate file in that cpuset's directory, as listed above. |
175 | |
176 | The named hierarchical structure of nested cpusets allows partitioning |
177 | a large system into nested, dynamically changeable, "soft-partitions". |
178 | |
179 | The attachment of each task, automatically inherited at fork by any |
180 | children of that task, to a cpuset allows organizing the work load |
181 | on a system into related sets of tasks such that each set is constrained |
182 | to using the CPUs and Memory Nodes of a particular cpuset. A task |
183 | may be re-attached to any other cpuset, if allowed by the permissions |
184 | on the necessary cpuset file system directories. |
185 | |
186 | Such management of a system "in the large" integrates smoothly with |
187 | the detailed placement done on individual tasks and memory regions |
188 | using the sched_setaffinity, mbind and set_mempolicy system calls. |
189 | |
190 | The following rules apply to each cpuset: |
191 | |
192 | - Its CPUs and Memory Nodes must be a subset of its parent's. |
193 | - It can only be marked exclusive if its parent is. |
194 | - If its cpu or memory is exclusive, they may not overlap any sibling. |
195 | |
196 | These rules, and the natural hierarchy of cpusets, enable efficient |
197 | enforcement of the exclusive guarantee, without having to scan all |
198 | cpusets every time any of them change to ensure nothing overlaps a |
199 | an exclusive cpuset. Also, the use of a Linux virtual file system (vfs) |
200 | to represent the cpuset hierarchy provides for a familiar permission |
201 | and name space for cpusets, with a minimum of additional kernel code. |
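The subset rule can also be checked from user space before writing to
a cpus or mems file. The following is a minimal sketch, assuming lists
in the "1-4" or "1,2,3,4" syntax used by those files. The names
expand_list and is_subset are hypothetical helpers, not part of the
cpuset interface, and the kernel still performs its own validation.

```shell
# Hypothetical helpers, not part of the cpuset interface.

# Expand a list such as "0-3,8" into "0 1 2 3 8".
expand_list() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # a range "lo-hi"
            *)   lo=$part; hi=$part ;;             # a single number
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "$out"
}

# Succeed if every number in list $1 also appears in list $2.
is_subset() {
    parent=" $(expand_list "$2") "
    for n in $(expand_list "$1"); do
        case $parent in
            *" $n "*) ;;          # present in the parent list
            *) return 1 ;;        # missing: not a subset
        esac
    done
    return 0
}
```

With these definitions, "is_subset 2-3 0-7" succeeds and
"is_subset 2-9 0-7" fails, mirroring the first rule above for a child
cpuset whose parent holds CPUs 0-7.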
202 | |
203 | 1.4 How do I use cpusets ? |
204 | -------------------------- |
205 | |
206 | In order to minimize the impact of cpusets on critical kernel |
207 | code, such as the scheduler, and due to the fact that the kernel |
208 | does not support one task updating the memory placement of another |
209 | task directly, the impact on a task of changing its cpuset CPU |
210 | or Memory Node placement, or of changing to which cpuset a task |
211 | is attached, is subtle. |
212 | |
213 | If a cpuset has its Memory Nodes modified, then for each task attached |
214 | to that cpuset, the next time that the kernel attempts to allocate |
215 | a page of memory for that task, the kernel will notice the change |
216 | in the task's cpuset, and update its per-task memory placement to |
217 | remain within the new cpuset's memory placement. If the task was using |
218 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with |
219 | its new cpuset, then the task will continue to use whatever subset |
220 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task |
221 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed |
222 | in the new cpuset, then the task will be essentially treated as if it |
223 | was MPOL_BIND bound to the new cpuset (even though its numa placement, |
224 | as queried by get_mempolicy(), doesn't change). If a task is moved |
225 | from one cpuset to another, then the kernel will adjust the task's |
226 | memory placement, as above, the next time that the kernel attempts |
227 | to allocate a page of memory for that task. |
228 | |
229 | If a cpuset has its CPUs modified, then each task using that |
230 | cpuset does _not_ change its behavior automatically. In order to |
231 | minimize the impact on the critical scheduling code in the kernel, |
232 | tasks will continue to use their prior CPU placement until they |
233 | are rebound to their cpuset, by rewriting their pid to the 'tasks' |
234 | file of their cpuset. If a task had been bound to some subset of its |
235 | cpuset using the sched_setaffinity() call, and if any of that subset |
236 | is still allowed in its new cpuset settings, then the task will be |
237 | restricted to the intersection of the CPUs it was allowed on before, |
238 | and its new cpuset CPU placement. If, on the other hand, there is |
239 | no overlap between a task's prior placement and its new cpuset CPU |
240 | placement, then the task will be allowed to run on any CPU allowed |
241 | in its new cpuset. If a task is moved from one cpuset to another, |
242 | its CPU placement is updated in the same way as if the task's pid is |
243 | rewritten to the 'tasks' file of its current cpuset. |
244 | |
245 | In summary, the memory placement of a task whose cpuset is changed is |
246 | updated by the kernel, on the next allocation of a page for that task, |
247 | but the processor placement is not updated, until that task's pid is |
248 | rewritten to the 'tasks' file of its cpuset. This is done to avoid |
249 | impacting the scheduler code in the kernel with a check for changes |
250 | in a task's processor placement. |
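One way to rebind every task of a cpuset after changing its 'cpus'
file is to read the pids from the 'tasks' file and write each one
back. The sketch below assumes the cpuset file system is mounted and
that the caller may write to the 'tasks' file. The function name
rebind_tasks is hypothetical.

```shell
# Hypothetical helper: rewrite every pid listed in a cpuset's 'tasks'
# file back into that same file, one pid per write.
# $1 is the cpuset directory, for example /dev/cpuset/Charlie.
rebind_tasks() {
    cs=$1
    status=0
    # Read the whole list first, then write the pids back one by one.
    pids=$(cat "$cs/tasks") || return 1
    for pid in $pids; do
        if ! /bin/echo "$pid" > "$cs/tasks"; then
            # A task may have exited since the list was read.
            echo "could not rebind pid $pid" >&2
            status=1
        fi
    done
    return $status
}
```

A typical use would be to write a new list to the 'cpus' file of
/dev/cpuset/Charlie and then run "rebind_tasks /dev/cpuset/Charlie".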
251 | |
252 | There is an exception to the above. If hotplug functionality is used |
253 | to remove all the CPUs that are currently assigned to a cpuset, |
254 | then the kernel will automatically update the cpus_allowed of all |
255 | tasks attached to CPUs in that cpuset to allow all CPUs. When memory |
256 | hotplug functionality for removing Memory Nodes is available, a |
257 | similar exception is expected to apply there as well. In general, |
258 | the kernel prefers to violate cpuset placement, over starving a task |
259 | that has had all its allowed CPUs or Memory Nodes taken offline. User |
260 | code should reconfigure cpusets to only refer to online CPUs and Memory |
261 | Nodes when using hotplug to add or remove such resources. |
262 | |
263 | There is a second exception to the above. GFP_ATOMIC requests are |
264 | kernel internal allocations that must be satisfied immediately. |
265 | The kernel may drop some request, in rare cases even panic, if a |
266 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within |
267 | the current task's cpuset, then we relax the cpuset, and look for |
268 | memory anywhere we can find it. It's better to violate the cpuset |
269 | than stress the kernel. |
270 | |
271 | To start a new job that is to be contained within a cpuset, the steps are: |
272 | |
273 | 1) mkdir /dev/cpuset |
274 | 2) mount -t cpuset none /dev/cpuset |
275 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in |
276 | the /dev/cpuset virtual file system. |
277 | 4) Start a task that will be the "founding father" of the new job. |
278 | 5) Attach that task to the new cpuset by writing its pid to the |
279 | /dev/cpuset tasks file for that cpuset. |
280 | 6) fork, exec or clone the job tasks from this founding father task. |
281 | |
282 | For example, the following sequence of commands will set up a cpuset |
283 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, |
284 | and then start a subshell 'sh' in that cpuset: |
285 | |
286 | mount -t cpuset none /dev/cpuset |
287 | cd /dev/cpuset |
288 | mkdir Charlie |
289 | cd Charlie |
290 | /bin/echo 2-3 > cpus |
291 | /bin/echo 1 > mems |
292 | /bin/echo $$ > tasks |
293 | sh |
294 | # The subshell 'sh' is now running in cpuset Charlie |
295 | # The next line should display '/Charlie' |
296 | cat /proc/self/cpuset |
297 | |
298 | In the case that a change of cpuset includes wanting to move already |
299 | allocated memory pages, consider further the work of IWAMOTO |
300 | Toshihiro <iwamoto@valinux.co.jp> for page remapping and memory |
301 | hotremoval, which can be found at: |
302 | |
303 | http://people.valinux.co.jp/~iwamoto/mh.html |
304 | |
305 | The integration of cpusets with such memory migration is not yet |
306 | available. |
307 | |
308 | In the future, a C library interface to cpusets will likely be |
309 | available. For now, the only way to query or modify cpusets is |
310 | via the cpuset file system, using the various cd, mkdir, echo, cat, |
311 | rmdir commands from the shell, or their equivalent from C. |
312 | |
313 | The sched_setaffinity calls can also be done at the shell prompt using |
314 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy |
315 | calls can be done at the shell prompt using the numactl command |
316 | (part of Andi Kleen's numa package). |
317 | |
318 | 2. Usage Examples and Syntax |
319 | ============================ |
320 | |
321 | 2.1 Basic Usage |
322 | --------------- |
323 | |
324 | Creating, modifying, and using cpusets can be done through the cpuset |
325 | virtual filesystem. |
326 | |
327 | To mount it, type: |
328 | # mount -t cpuset none /dev/cpuset |
329 | |
330 | Then under /dev/cpuset you can find a tree that corresponds to the |
331 | tree of the cpusets in the system. For instance, /dev/cpuset |
332 | is the cpuset that holds the whole system. |
333 | |
334 | If you want to create a new cpuset under /dev/cpuset: |
335 | # cd /dev/cpuset |
336 | # mkdir my_cpuset |
337 | |
338 | Now you want to do something with this cpuset. |
339 | # cd my_cpuset |
340 | |
341 | In this directory you can find several files: |
342 | # ls |
343 | cpus cpu_exclusive mems mem_exclusive tasks |
344 | |
345 | Reading them will give you information about the state of this cpuset: |
346 | the CPUs and Memory Nodes it can use, the processes that are using |
347 | it, its properties. By writing to these files you can manipulate |
348 | the cpuset. |
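To inspect a cpuset in one step, the files listed above can be read in
a loop. This is a minimal sketch that assumes only the five file names
shown by 'ls' above. The function name show_cpuset is hypothetical.

```shell
# Hypothetical helper: print the contents of each file in a cpuset
# directory, one "name: value" line per file.
# $1 is the cpuset directory, for example /dev/cpuset/my_cpuset.
show_cpuset() {
    for f in cpus mems cpu_exclusive mem_exclusive tasks; do
        if [ -r "$1/$f" ]; then
            # Join multi-line files (such as 'tasks') onto one line.
            printf '%s: %s\n' "$f" \
                "$(tr '\n' ' ' < "$1/$f" | sed 's/ *$//')"
        fi
    done
}
```

Running "show_cpuset /dev/cpuset/my_cpuset" would then print the CPUs,
Memory Nodes, flags and attached pids of that cpuset.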
349 | |
350 | Set some flags: |
351 | # /bin/echo 1 > cpu_exclusive |
352 | |
353 | Add some cpus: |
354 | # /bin/echo 0-7 > cpus |
355 | |
356 | Now attach your shell to this cpuset: |
357 | # /bin/echo $$ > tasks |
358 | |
359 | You can also create cpusets inside your cpuset by using mkdir in this |
360 | directory. |
361 | # mkdir my_sub_cs |
362 | |
363 | To remove a cpuset, just use rmdir: |
364 | # rmdir my_sub_cs |
365 | This will fail if the cpuset is in use (has cpusets inside, or has |
366 | processes attached). |
367 | |
368 | 2.2 Adding/removing cpus |
369 | ------------------------ |
370 | |
371 | This is the syntax to use when writing in the cpus or mems files |
372 | in cpuset directories: |
373 | |
374 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 |
375 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 |
376 | |
377 | 2.3 Setting flags |
378 | ----------------- |
379 | |
380 | The syntax is very simple: |
381 | |
382 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' |
383 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' |
384 | |
385 | 2.4 Attaching processes |
386 | ----------------------- |
387 | |
388 | # /bin/echo PID > tasks |
389 | |
390 | Note that it is PID, not PIDs. You can only attach ONE task at a time. |
391 | If you have several tasks to attach, you have to do it one after another: |
392 | |
393 | # /bin/echo PID1 > tasks |
394 | # /bin/echo PID2 > tasks |
395 | ... |
396 | # /bin/echo PIDn > tasks |
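The one-pid-per-write rule is easy to wrap in a small loop. The sketch
below issues a separate write for each pid and counts failures, so a
bad pid does not hide the result of the others. The function name
attach_all is hypothetical.

```shell
# Hypothetical helper: attach several pids to one cpuset,
# issuing a separate write for each pid.
# Usage: attach_all CPUSET_DIR PID...
attach_all() {
    cs=$1
    shift
    failed=0
    for pid in "$@"; do
        # /bin/echo reports a failed write through its exit status.
        if ! /bin/echo "$pid" > "$cs/tasks"; then
            echo "failed to attach pid $pid" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every pid was written without error.
    [ "$failed" -eq 0 ]
}
```

For example, "attach_all /dev/cpuset/my_cpuset PID1 PID2 PID3" is
equivalent to the three separate echo commands shown above.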
397 | |
398 | |
399 | 3. Questions |
400 | ============ |
401 | |
402 | Q: What's up with this '/bin/echo'? |
403 | A: bash's builtin 'echo' command does not check calls to write() against |
404 | errors. If you use it in the cpuset file system, you won't be |
405 | able to tell whether a command succeeded or failed. |
406 | |
407 | Q: When I attach processes, only the first pid on the line actually gets attached! |
408 | A: We can only return one error code per call to write(). So you should |
409 | write only ONE pid per call. |
410 | |
411 | 4. Contact |
412 | ========== |
413 | |
414 | Web: http://www.bullopensource.org/cpuset |