Annotation of /trunk/mkinitrd-magellan/busybox/docs/keep_data_small.txt
Revision 1123 -
Wed Aug 18 21:56:57 2010 UTC (13 years, 9 months ago) by niro
File MIME type: text/plain
File size: 8233 byte(s)
-updated to busybox-1.17.1

Keeping data small

When many applets are compiled into busybox, the rw data and bss
of all applets are concatenated, including those from libc if
busybox is built statically. When busybox is started, _all_ of this
data is allocated, not just the part used by the selected applet.

What "allocated" means exactly depends on the architecture.
On NOMMU it probably bites the most, since rwdata and bss occupy
real RAM. On i386, bss is lazily allocated via COWed zero pages.
Not sure about rwdata - also COW?

In order to keep busybox friendly to NOMMU and small-memory systems,
we should avoid large global data in our applets, and should
minimize the use of libc functions which implicitly use
such structures.

Small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" in parallel.
The busybox binary is practically an allyesconfig static one,
built against uclibc. Run on an x86-64 machine with a 64-bit kernel:

    bash-3.2# nmeter '%t %c %m %p %[pn]'
    23:17:28 .......... 168M    0  147
    23:17:29 .......... 168M    0  147
    23:17:30 U......... 168M    1  147
    23:17:31 SU........ 181M  244  391
    23:17:32 SSSSUUU... 223M  757 1147
    23:17:33 UUU....... 223M    0 1147
    23:17:34 U......... 223M    1 1147
    23:17:35 .......... 223M    0 1147
    23:17:36 .......... 223M    0 1147
    23:17:37 S......... 223M    0 1147
    23:17:38 .......... 223M    1 1147
    23:17:39 .......... 223M    0 1147
    23:17:40 .......... 223M    0 1147
    23:17:41 .......... 210M    0  906
    23:17:42 .......... 168M    1  147
    23:17:43 .......... 168M    0  147

Memory use rises from 168M to 223M, that is, the 1000 processes
require 55M of memory. Thus 1 trivial busybox applet
takes 55k of memory on a 64-bit x86 kernel.

On a 32-bit kernel we need ~26k per applet.

Script:

    i=1000; while test $i != 0; do
        echo -n .
        busybox sleep 10 &
        i=$((i - 1))
    done
    echo
    wait

(Data from NOMMU arches are sought. Provide 'size busybox' output too)


Example 1

One example of how to reduce global data usage is in
archival/libunarchive/decompress_unzip.c:

    /* This is somewhat complex-looking arrangement, but it allows
     * to place decompressor state either in bss or in
     * malloc'ed space simply by changing #defines below.
     * Sizes on i386:
     * text    data     bss     dec     hex
     * 5256       0     108    5364    14f4 - bss
     * 4915       0       0    4915    1333 - malloc
     */
    #define STATE_IN_BSS 0
    #define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
The required memory is allocated in unpack_gz_stream() [its main function]
and then passed down as a parameter to all subroutines which need
to access the 'globals'.
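
The idea can be illustrated with a minimal sketch. This is not the code from
decompress_unzip.c: state_t, its fields, push_bits() and unpack_demo() are
hypothetical names chosen only to show the entry point allocating the state
and handing it down as a parameter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical decompressor state; the fields are illustrative only. */
typedef struct state_t {
    unsigned bit_buffer;
    unsigned bit_count;
    unsigned char window[64];
} state_t;

/* Subroutines get the state as a parameter instead of touching globals. */
static void push_bits(state_t *state, unsigned bits, unsigned count)
{
    assert(count <= 8);
    state->bit_buffer |= bits << state->bit_count;
    state->bit_count += count;
}

/* Entry point: allocates the state, passes it down, frees it.
 * Returns the number of bits pushed, or -1 if allocation failed. */
int unpack_demo(void)
{
    state_t *state = calloc(1, sizeof(*state));
    int result;

    if (!state)
        return -1;
    push_bits(state, 0x5, 3);
    push_bits(state, 0x1, 2);
    result = (int)state->bit_count;
    free(state);
    return result;
}
```

With this layout the module contributes nothing to data or bss; the cost is
one extra argument on every call that needs the state.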


Example 2

In case you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order not to duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/messages.c
as struct globals *const ptr_to_globals, but struct globals itself is
NOT defined in libbb.h. You first define your own struct:

    struct globals { int a; char buf[1000]; };

and then define G as shorthand for the struct that ptr_to_globals points to:

    #define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

    SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically this is done in <applet>_main().

Now you can reference "globals" as G.a, G.buf and so on, in any function.
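
A self-contained sketch of the same pattern, buildable outside the busybox
tree, might look as follows. The names demo_ptr_to_globals, demo_xzalloc(),
remember() and demo_main() are stand-ins invented for this sketch, and the
pointer here is a plain (non-const) one assigned directly, which is a
simplification of what SET_PTR_TO_GLOBALS does for the real const pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Applet-private "globals"; the fields are illustrative. */
struct globals {
    int counter;
    char buf[1000];
};

/* Stand-in for the shared pointer that libbb provides. */
static struct globals *demo_ptr_to_globals;
#define G (*demo_ptr_to_globals)

/* Stand-in for xzalloc(): zeroed allocation that aborts on failure. */
static void *demo_xzalloc(size_t size)
{
    void *p = calloc(1, size);
    if (!p)
        abort();
    return p;
}

/* Any function can reach the data through G without extra parameters. */
static void remember(const char *msg)
{
    strncpy(G.buf, msg, sizeof(G.buf) - 1);
    G.counter++;
}

/* Plays the role of <applet>_main(): set the pointer up first.
 * The storage is deliberately kept until process exit, as in an applet. */
int demo_main(void)
{
    demo_ptr_to_globals = demo_xzalloc(sizeof(G));
    remember("hello");
    remember("world");
    assert(G.buf[sizeof(G.buf) - 1] == '\0');
    return G.counter;
}
```

Until the pointer is assigned, the applet owns only one pointer's worth of
static data; the 1000+ bytes exist only while this applet is actually running.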


bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its own needs. Library functions are prohibited from using it.

The 'G.' trick can be done using bb_common_bufsiz1 instead of a malloced buffer:

    #define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if the globals fit into bb_common_bufsiz1.
Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
from one libc to another, you have to add a compile-time check for it:

    if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
        BUG_<applet>_globals_too_big();
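
The check above works because the BUG_ function is never defined anywhere:
if the condition can be true, the call survives and the link fails. Outside
busybox, a C11 static_assert gives a similar guarantee with a clearer error
message. The sketch below uses that swapped-in technique; it is not the
busybox mechanism. demo_common_bufsiz1 and demo_spare_bytes() are
hypothetical names, and the struct fields are illustrative.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdio.h>    /* BUFSIZ, size_t */

/* Stand-in for bb_common_bufsiz1: BUFSIZ + 1 bytes, as described above. */
char demo_common_bufsiz1[BUFSIZ + 1];

/* Globals that an applet would like to overlay on the common buffer. */
struct globals {
    int dev_fd;
    long sector;
    char line[128];
};

/* Compilation stops here if a libc with a small BUFSIZ makes the
 * globals larger than the buffer they are supposed to live in. */
static_assert(sizeof(struct globals) <= sizeof(demo_common_bufsiz1),
              "globals do not fit into the common buffer");

/* How much of the common buffer is left after the globals. */
size_t demo_spare_bytes(void)
{
    return sizeof(demo_common_bufsiz1) - sizeof(struct globals);
}
```

Either form costs nothing at run time: the comparison involves only sizeof
expressions, so it is resolved entirely by the compiler.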


Drawbacks

You have to initialize the globals by hand. xzalloc() helps by clearing
the allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes the code
less readable, use #defines:

    #define dev_fd (G.dev_fd)
    #define sector (G.sector)


Word of caution

If an applet doesn't use much global data, converting it to use
one of the above methods is not worth the resulting code obfuscation.
If you have less than ~300 bytes of global data - don't bother.


Finding non-shared duplicated strings

    strings busybox | sort | uniq -c | sort -nr


gcc's data alignment problem

The following attribute, added in vi.c:

    static int tabstop;
    static struct termios term_orig __attribute__ ((aligned (4)));
    static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. asm output diff for the above example:

     tabstop:
             .zero   4
             .section        .bss.term_orig,"aw",@nobits
    -        .align 32
    +        .align 4
             .type   term_orig, @object
             .size   term_orig, 60
     term_orig:
             .zero   60
             .section        .bss.term_vi,"aw",@nobits
    -        .align 32
    +        .align 4
             .type   term_vi, @object
             .size   term_vi, 60

gcc doesn't seem to have options for altering this behaviour.

gcc 3.4.3 and 4.1.1 tested:

    char c = 1;
    // gcc aligns to 32 bytes if sizeof(struct) >= 32
    struct {
        int a,b,c,d;
        int i1,i2,i3;
    } s28 = { 1 };    // struct will be aligned to 4 bytes
    struct {
        int a,b,c,d;
        int i1,i2,i3,i4;
    } s32 = { 1 };    // struct will be aligned to 32 bytes
    // same for arrays
    char vc31[31] = { 1 }; // unaligned
    char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but will probably
break the layout of many libc structs), while s32 and vc32
are still aligned to 32 bytes.
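
One way to look at this on a given toolchain is to print where such objects
actually end up. The sketch below is only a rough probe, assuming a compiler
that accepts the GCC-style aligned attribute; s32_t, observed_alignment() and
report_alignment() are names made up for it. An address can be "more aligned"
than required purely by luck of placement, so the asm output (gcc -S) remains
the authoritative source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 32-byte aggregate, same shape as s32 in the test above. */
struct s32_t { int a, b, c, d; int i1, i2, i3, i4; };

struct s32_t s32_default = { 1 };
struct s32_t s32_forced __attribute__ ((aligned (4))) = { 1 };

/* Largest power of two, capped at 256, that divides the address. */
unsigned observed_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    unsigned align = 1;

    while (align < 256 && addr % (align * 2) == 0)
        align *= 2;
    return align;
}

/* Print the observed placement of both objects. */
void report_alignment(void)
{
    assert(observed_alignment(&s32_forced) >= 4);
    printf("default:    %u\n", observed_alignment(&s32_default));
    printf("aligned(4): %u\n", observed_alignment(&s32_forced));
}
```

Calling report_alignment() from a small test program built with the compiler
under investigation shows whether the default object is still being padded out
to a 32-byte boundary.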

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source
(sizes and alignments here are in bits, so 256 means 32 bytes):

    gcc/config/i386/i386.c
    int
    ix86_data_alignment (tree type, int align)
    {
    #if 0
      if (AGGREGATE_TYPE_P (type)
           && TYPE_SIZE (type)
           && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
           && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
               || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
        return 256;
    #endif

Result (non-static busybox built against glibc):

    # size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
       text    data     bss     dec     hex filename
     634416    2736   23856  661008   a1610 busybox
     632580    2672   22944  658196   a0b14 busybox_noalign


Keeping code small

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and see which are the biggest auto-inlined functions.
Now, set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions. In the 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

    function                 old     new   delta
    expand_vars_to_list        -    1712   +1712 win
    lzo1x_optimize             -    1429   +1429 win
    arith_apply                -    1326   +1326 win
    read_interfaces            -    1163   +1163 loss, leave w/o NOINLINE
    logdir_open                -    1148   +1148 win
    check_deps                 -    1148   +1148 loss
    rewrite                    -    1039   +1039 win
    run_pipe                 358    1396   +1038 win
    write_status_file          -    1029   +1029 almost the same, leave w/o NOINLINE
    dump_identity              -     987    +987 win
    mainQSort3                 -     921    +921 win
    parse_one_line             -     916    +916 loss
    summarize                  -     897    +897 almost the same
    do_shm                     -     884    +884 win
    cpio_o                     -     863    +863 win
    subCommand                 -     841    +841 loss
    receive                    -     834    +834 loss

855 bytes saved in total.

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select the modules which shrank.
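
For readers outside the busybox tree, marking a function looks roughly like
the sketch below. busybox supplies its own NOINLINE macro; the fallback
definition here assumes a GCC-compatible compiler, and checksum_block() and
process_demo() are hypothetical functions used only for illustration.

```c
#include <assert.h>

/* Fallback for building outside busybox; assumes GCC-style attributes. */
#ifndef NOINLINE
# define NOINLINE __attribute__((__noinline__))
#endif

/* A helper with a single caller: without NOINLINE the compiler is free
 * to merge it into that caller, which may or may not save space. */
static NOINLINE int checksum_block(const unsigned char *data, int len)
{
    int sum = 0;
    int i;

    assert(len >= 0);
    for (i = 0; i < len; i++)
        sum = (sum * 31 + data[i]) & 0xffff;
    return sum;
}

/* The only call site of checksum_block(). */
int process_demo(void)
{
    static const unsigned char block[] = { 1, 2, 3 };

    return checksum_block(block, (int)sizeof(block));
}
```

Whether the attribute is a win has to be measured per function, as the
table above shows: some functions shrink the binary when kept out of line,
others grow it.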