diff --git a/Documentation/power/tuxonice-internals.txt b/Documentation/power/tuxonice-internals.txt
new file mode 100644
index 0000000..7a96186
--- /dev/null
+++ b/Documentation/power/tuxonice-internals.txt
@@ -0,0 +1,477 @@
+		   TuxOnIce 3.0 Internal Documentation.
+			Updated to 26 March 2009
+
+1. Introduction.
+
+   TuxOnIce 3.0 is an addition to the Linux Kernel, designed to
+   allow the user to quickly shut down and quickly boot a computer, without
+   needing to close documents or programs. It is equivalent to the
+   hibernate facility in some laptops. This implementation, however,
+   requires no special BIOS or hardware support.
+
+   The code in these files is based upon the original implementation
+   prepared by Gabor Kuti and additional work by Pavel Machek and a
+   host of others. This code has been substantially reworked by Nigel
+   Cunningham, again with the help and testing of many others, not the
+   least of whom is Michael Frank. At its heart, however, the operation is
+   essentially the same as Gabor's version.
+
+2. Overview of operation.
+
+   The basic sequence of operations is as follows:
+
+   a. Quiesce all other activity.
+   b. Ensure enough memory and storage space are available, and attempt
+      to free memory/storage if necessary.
+   c. Allocate the required memory and storage space.
+   d. Write the image.
+   e. Power down.
+
+   There are a number of complicating factors which mean that things are
+   not as simple as the above would imply, however...
+
+   o The activity of each process must be stopped at a point where it will
+   not be holding locks necessary for saving the image, or unexpectedly
+   restart operations due to something like a timeout and thereby make
+   our image inconsistent.
+
+   o It is desirable that we sync outstanding I/O to disk before calculating
+   image statistics. This reduces corruption if one should suspend but
+   then not resume, and also makes later parts of the operation safer (see
+   below).
+
+   o We need to get as close as we can to an atomic copy of the data.
+   Inconsistencies in the image will result in inconsistent memory contents at
+   resume time, and thus in instability of the system and/or file system
+   corruption. This would appear to imply a maximum image size of one half of
+   the amount of RAM, but we have a solution... (again, below).
+
+   o In 2.6, we choose to play nicely with the other suspend-to-disk
+   implementations.
+
+3. Detailed description of internals.
+
+   a. Quiescing activity.
+
+   Safely quiescing the system is achieved using three separate but related
+   aspects.
+
+   First, we note that the vast majority of processes don't need to run during
+   suspend. They can be 'frozen'. We therefore implement a refrigerator
+   routine, which processes enter and in which they remain until the cycle is
+   complete. Processes enter the refrigerator via try_to_freeze() invocations
+   at appropriate places. A process cannot be frozen in any old place. It
+   must not be holding locks that will be needed for writing the image or
+   freezing other processes. For this reason, userspace processes generally
+   enter the refrigerator via the signal handling code, and kernel threads at
+   the place in their event loops where they drop locks and yield to other
+   processes or sleep.
+
+   The task of freezing processes is complicated by the fact that there can be
+   interdependencies between processes. Freezing process A before process B may
+   mean that process B cannot be frozen, because it blocks waiting for
+   process A rather than stopping in the refrigerator.
+   This issue is seen where
+   userspace waits on freezeable kernel threads or fuse filesystem threads. To
+   address this issue, we implement the following algorithm for quiescing
+   activity:
+
+   - Freeze filesystems (including fuse - userspace programs starting
+     new requests are immediately frozen; programs already running
+     requests complete their work before being frozen in the next
+     step)
+   - Freeze userspace
+   - Thaw filesystems (this is safe now that userspace is frozen and no
+     fuse requests are outstanding).
+   - Invoke sys_sync (noop on fuse).
+   - Freeze filesystems
+   - Freeze kernel threads
+
+   If we need to free memory, we thaw kernel threads and filesystems, but not
+   userspace. We can then free caches without worrying about deadlocks due to
+   swap files being on frozen filesystems or such like.
+
+   b. Ensure enough memory & storage are available.
+
+   We have a number of constraints to meet in order to be able to successfully
+   suspend and resume.
+
+   First, the image will be written in two parts, described below. One of these
+   parts needs to have an atomic copy made, which of course implies a maximum
+   size of one half of the amount of system memory. The other part ('pageset')
+   is not atomically copied, and can therefore be as large or small as desired.
+
+   Second, we have constraints on the amount of storage available. In these
+   calculations, we may also consider any compression that will be done. The
+   cryptoapi module allows the user to configure an expected compression ratio.
+
+   Third, the user can specify an arbitrary limit on the image size, in
+   megabytes. This limit is treated as a soft limit, so that we don't fail the
+   attempt to suspend if we cannot meet this constraint.
+
+   c. Allocate the required memory and storage space.
+
+   Having done the initial freeze, we determine whether the above constraints
+   are met, and seek to allocate the metadata for the image. If the constraints
+   are not met, or we fail to allocate the required space for the metadata, we
+   seek to free the amount of memory that we calculate is needed and try again.
+   We allow up to four iterations of this loop before aborting the cycle. If we
+   do fail, it should only be because of a bug in TuxOnIce's calculations.
+
+   These steps are merged together in the prepare_image function, found in
+   prepare_image.c. The functions are merged because of the cyclical nature
+   of the problem of calculating how much memory and storage is needed. Since
+   the data structures containing the information about the image must
+   themselves take memory and use storage, the amount of memory and storage
+   required changes as we prepare the image. Since the changes are not large,
+   only one or two iterations will be required to achieve a solution.
+
+   The recursive nature of the algorithm is minimised by keeping user space
+   frozen while preparing the image, and by the fact that our records of which
+   pages are to be saved and which pageset they are saved in use bitmaps (so
+   that changes in number or fragmentation of the pages to be saved don't
+   feed back via changes in the amount of memory needed for metadata). The
+   recursiveness is thus limited to any extra slab pages allocated to store the
+   extents that record storage used, and the effects of seeking to free memory.
+
+   d. Write the image.
+
+   We previously mentioned the need to create an atomic copy of the data, and
+   the half-of-memory limitation that is implied in this.
+   This limitation is
+   circumvented by dividing the memory to be saved into two parts, called
+   pagesets.
+
+   Pageset2 contains most of the page cache - the pages on the active and
+   inactive LRU lists that aren't needed or modified while TuxOnIce is
+   running, so they can be safely written without an atomic copy. They are
+   therefore saved first and reloaded last. While saving these pages,
+   TuxOnIce carefully ensures that the work of writing the pages doesn't make
+   the image inconsistent. With the support for Kernel (Video) Mode Setting
+   going into the kernel at the time of writing, we need to check for pages
+   on the LRU that are used by KMS, and exclude them from pageset2. They are
+   atomically copied as part of pageset1.
+
+   Once pageset2 has been saved, we prepare to do the atomic copy of remaining
+   memory. As part of the preparation, we power down drivers, thereby providing
+   them with the opportunity to have their state recorded in the image. The
+   amount of memory allocated by drivers for this is usually negligible, but if
+   DRI is in use, video drivers may require significant amounts. Ideally we
+   would be able to query drivers while preparing the image as to the amount of
+   memory they will need. Unfortunately no such mechanism exists at the time of
+   writing. For this reason, TuxOnIce allows the user to set an
+   'extra_pages_allowance', which is used to try to ensure that sufficient
+   memory is available for drivers at this point. TuxOnIce also lets the user
+   set this value to 0. In this case, a test driver suspend is done while
+   preparing the image, and the difference (plus a margin) used instead.
+   TuxOnIce will also automatically restart the hibernation process (twice at
+   most) if it finds that the extra pages allowance is not sufficient. It will
+   then use what was actually needed (plus a margin, again). Failure to
+   hibernate should thus be an extremely rare occurrence.
+
+   Having suspended the drivers, we save the CPU context before making an
+   atomic copy of pageset1, resuming the drivers and saving the atomic copy.
+   After saving the two pagesets, we just need to save our metadata before
+   powering down.
+
+   As we mentioned earlier, the contents of pageset2 pages aren't needed once
+   they've been saved. We therefore use them as the destination of our atomic
+   copy. In the unlikely event that pageset1 is larger, extra pages are
+   allocated while the image is being prepared. This is normally only a real
+   possibility when the system has just been booted and the page cache is
+   small.
+
+   This is where we need to be careful about syncing, however. Pageset2 will
+   probably contain filesystem metadata. If this is overwritten with pageset1
+   and then a sync occurs, the filesystem will be corrupted - at least until
+   resume time and another sync of the restored data. Since there is a
+   possibility that the user might not resume or (may it never be!) that
+   TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
+   copying pageset1.
+
+   e. Power down.
+
+   Powering down uses standard kernel routines. TuxOnIce supports powering down
+   using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
+   Supporting suspend to ram (S3) as a power off option might sound strange,
+   but it allows the user to quickly get their system up and running again if
+   the battery doesn't run out (we just need to re-read the overwritten pages)
+   and if the battery does run out (or the user removes power), they can still
+   resume.
+
+4. Data Structures.
+
+   TuxOnIce uses three main structures to store its metadata and configuration
+   information:
+
+   a) Pageflags bitmaps.
+
+   TuxOnIce records which pages will be in pageset1, pageset2, the destination
+   of the atomic copy and the source of the atomically restored image using
+   bitmaps. The code used is that written for swsusp, with small improvements
+   to match TuxOnIce's requirements.
+
+   The pageset1 bitmap is thus easily stored in the image header for use at
+   resume time.
+
+   As mentioned above, using bitmaps also means that the amount of memory and
+   storage required for recording the above information is constant. This
+   greatly simplifies the work of preparing the image. In earlier versions of
+   TuxOnIce, extents were used to record which pages would be stored. In that
+   case, however, eating memory could result in greater fragmentation of the
+   lists of pages, which in turn required more memory to store the extents and
+   more storage in the image header. These could in turn require further
+   freeing of memory, and another iteration. All of this complexity is removed
+   by having bitmaps.
+
+   Bitmaps also make a lot of sense because TuxOnIce only ever iterates
+   through the lists. There is therefore no cost to not being able to find the
+   nth page in order 0 time. We only need to worry about the cost of finding
+   the n+1th page, given the location of the nth page. Bitwise optimisations
+   help here.
+
+   b) Extents for block data.
+
+   TuxOnIce supports writing the image to multiple block devices. In the case
+   of swap, multiple partitions and/or files may be in use, and we happily use
+   them all (with the exception of compcache pages, which we allocate but do
+   not use). This use of multiple block devices is accomplished as follows:
+
+   Whatever the actual source of the allocated storage, the destination of the
+   image can be viewed in terms of one or more block devices, and on each
+   device, a list of sectors. To simplify matters, we only use contiguous,
+   PAGE_SIZE aligned sectors, like the swap code does.
+
+   Since sector numbers on each bdev may well not start at 0, it makes much
+   more sense to use extents here. Contiguous ranges of pages can thus be
+   represented in the extents by contiguous values.
+
+   Variations in block size are taken account of in transforming this data
+   into the parameters for bio submission.
+
+   We can thus implement a layer of abstraction wherein the core of TuxOnIce
+   doesn't have to worry about which device we're currently writing to or
+   where in the device we are. It simply requests that the next page in the
+   pageset or header be written, leaving the details to this lower layer.
+   The lower layer remembers where in the sequence of devices and blocks each
+   pageset starts. The header always starts at the beginning of the allocated
+   storage.
+
+   So extents are:
+
+   struct extent {
+         unsigned long minimum, maximum;
+         struct extent *next;
+   };
+
+   These are combined into chains of extents for a device:
+
+   struct extent_chain {
+         int size; /* size of the extent, i.e. sum (max - min + 1) */
+         int allocs, frees;
+         char *name;
+         struct extent *first, *last_touched;
+   };
+
+   For each bdev, we need to store a little more info:
+
+   struct suspend_bdev_info {
+         struct block_device *bdev;
+         dev_t dev_t;
+         int bmap_shift;
+         int blocks_per_page;
+   };
+
+   The dev_t is used to identify the device in the stored image. As a result,
+   we expect devices at resume time to have the same major and minor numbers
+   as they had while suspending.
+   This is primarily a concern where the user
+   utilises LVM for storage, as they will need to dmsetup their partitions in
+   such a way as to maintain this consistency at resume time.
+
+   bmap_shift and blocks_per_page apply the effects of variations in blocks
+   per page settings for the filesystem and underlying bdev. For most
+   filesystems, these are the same, but for xfs, they can have independent
+   values.
+
+   Combining these two structures together, we have everything we need to
+   record what devices and what blocks on each device are being used to
+   store the image, and to submit I/O using submit_bio.
+
+   The last elements in the picture are a means of recording how the storage
+   is being used.
+
+   We do this first and foremost by implementing a layer of abstraction on
+   top of the devices and extent chains which allows us to view however many
+   devices there might be as one long storage tape, with a single 'head' that
+   tracks a 'current position' on the tape:
+
+   struct extent_iterate_state {
+         struct extent_chain *chains;
+         int num_chains;
+         int current_chain;
+         struct extent *current_extent;
+         unsigned long current_offset;
+   };
+
+   That is, *chains points to an array of size num_chains of extent chains.
+   For the filewriter, this is always a single chain. For the swapwriter, the
+   array is of size MAX_SWAPFILES.
+
+   current_chain, current_extent and current_offset thus point to the current
+   index in the chains array (and into a matching array of struct
+   suspend_bdev_info), the current extent in that chain (to optimise access),
+   and the current value in the offset.
+
+   The image is divided into three parts:
+   - The header
+   - Pageset 1
+   - Pageset 2
+
+   The header always starts at the first device and first block. We know its
+   size before we begin to save the image because we carefully account for
+   everything that will be stored in it.
+
+   The second pageset (LRU) is stored first. It begins on the next page after
+   the end of the header.
+
+   The first pageset is stored second. Its start location is only known once
+   pageset2 has been saved, since pageset2 may be compressed as it is written.
+   This location is thus recorded at the end of saving pageset2. It is also
+   page aligned.
+
+   Since this information is needed at resume time, and the location of extents
+   in memory will differ at resume time, this needs to be stored in a portable
+   way:
+
+   struct extent_iterate_saved_state {
+         int chain_num;
+         int extent_num;
+         unsigned long offset;
+   };
+
+   We can thus implement a layer of abstraction wherein the core of TuxOnIce
+   doesn't have to worry about which device we're currently writing to or
+   where in the device we are. It simply requests that the next page in the
+   pageset or header be written, leaving the details to this layer, and
+   invokes the routines to remember and restore the position, without having
+   to worry about the details of how the data is arranged on disk or such like.
+
+   c) Modules
+
+   One aim in designing TuxOnIce was to make it flexible. We wanted to allow
+   for the implementation of different methods of transforming a page to be
+   written to disk and different methods of getting the pages stored.
+
+   In early versions (the betas and perhaps Suspend1), compression support was
+   inlined in the image writing code, and the data structures and code for
+   managing swap were intertwined with the rest of the code. A number of people
+   had expressed interest in implementing image encryption, and alternative
+   methods of storing the image.
+
+   In order to achieve this, TuxOnIce was given a modular design.
+
+   A module is a single file which encapsulates the functionality needed
+   to transform a pageset of data (encryption or compression, for example),
+   or to write the pageset to a device. The former type of module is called
+   a 'page-transformer', the latter a 'writer'.
+
+   Modules are linked together in pipeline fashion. There may be zero or more
+   page transformers in a pipeline, and there is always exactly one writer.
+   The pipeline follows this pattern:
+
+		---------------------------------
+		|          TuxOnIce Core        |
+		---------------------------------
+				|
+				|
+		---------------------------------
+		|	Page transformer 1	|
+		---------------------------------
+				|
+				|
+		---------------------------------
+		|	Page transformer 2	|
+		---------------------------------
+				|
+				|
+		---------------------------------
+		|            Writer		|
+		---------------------------------
+
+   During the writing of an image, the core code feeds pages one at a time
+   to the first module. This module performs whatever transformations it
+   implements on the incoming data, completely consuming the incoming data and
+   feeding output in a similar manner to the next module.
+
+   All routines are SMP safe, and the final result of the transformations is
+   written with an index (provided by the core) and size of the output by the
+   writer. As a result, we can have multithreaded I/O without needing to
+   worry about the sequence in which pages are written (or read).
+
+   During reading, the pipeline works in the reverse direction. The core code
+   calls the first module with the address of a buffer which should be filled.
+   (Note that the buffer size is always PAGE_SIZE at this time). This module
+   will in turn request data from the next module and so on down until the
+   writer is made to read from the stored image.
+
+   Part of the definition of the structure of a module thus looks like this:
+
+        int (*rw_init) (int rw, int stream_number);
+        int (*rw_cleanup) (int rw);
+        int (*write_chunk) (struct page *buffer_page);
+        int (*read_chunk) (struct page *buffer_page, int sync);
+
+   It should be noted that the _cleanup routine may be called before the
+   full stream of data has been read or written. While writing the image,
+   the user may (depending upon settings) choose to abort suspending, and
+   if we are in the midst of writing the last portion of the image, a portion
+   of the second pageset may be reread. This may also happen if an error
+   occurs and we seek to abort the process of writing the image.
+
+   The modular design is also useful in a number of other ways. It provides
+   a means whereby we can add support for:
+
+   - providing overall initialisation and cleanup routines;
+   - serialising configuration information in the image header;
+   - providing debugging information to the user;
+   - determining memory and image storage requirements;
+   - dis/enabling components at run-time;
+   - configuring the module (see below);
+
+   ...and routines for writers specific to their work:
+   - Parsing a resume= location;
+   - Determining whether an image exists;
+   - Marking a resume as having been attempted;
+   - Invalidating an image;
+
+   Since some parts of the core - the user interface and storage manager
+   support - have use for some of these functions, they are registered as
+   'miscellaneous' modules as well.
+
+   d) Sysfs data structures.
+
+   This brings us naturally to support for configuring TuxOnIce. We desired to
+   provide a way to make TuxOnIce as flexible and configurable as possible.
+   The user shouldn't have to reboot just because they want to now hibernate
+   to a file instead of a partition, for example.
+
+   To accomplish this, TuxOnIce implements a very generic means whereby the
+   core and modules can register new sysfs entries. All TuxOnIce entries use
+   a single _store and _show routine, both of which are found in
+   tuxonice_sysfs.c in the kernel/power directory. These routines handle the
+   most common operations - getting and setting the values of bits, integers,
+   longs, unsigned longs and strings in one place, and allow overrides for
+   customised get and set options as well as side-effect routines for all
+   reads and writes.
+
+   When combined with some simple macros, a new sysfs entry can then be
+   defined in just a couple of lines:
+
+   SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
+	2048, 0, NULL),
+
+   This defines a sysfs entry named "progress_granularity" which is rw and
+   allows the user to access an integer stored at &progress_granularity,
+   giving it a value between 1 and 2048 inclusive.
+
+   Sysfs entries are registered under /sys/power/tuxonice, and entries for
+   modules are located in a subdirectory named after the module.
+
diff --git a/Documentation/power/tuxonice.txt b/Documentation/power/tuxonice.txt
new file mode 100644
index 0000000..49199ff
--- /dev/null
+++ b/Documentation/power/tuxonice.txt
@@ -0,0 +1,750 @@
+			--- TuxOnIce, version 3.0 ---
+
+1. What is it?
+2. Why would you want it?
+3. What do you need to use it?
+4. Why not just use the version already in the kernel?
+5. How do you use it?
+6. What do all those entries in /sys/power/tuxonice do?
+7. How do you get support?
+8. I think I've found a bug. What should I do?
+9. When will XXX be supported?
+10. How does it work?
+11. Who wrote TuxOnIce?
+
+1. What is it?
+
+   Imagine you're sitting at your computer, working away. For some reason, you
+   need to turn off your computer for a while - perhaps it's time to go home
+   for the day. When you come back to your computer next, you're going to want
+   to carry on where you left off. Now imagine that you could push a button and
+   have your computer store the contents of its memory to disk and power down.
+   Then, when you next start up your computer, it loads that image back into
+   memory and you can carry on from where you were, just as if you'd never
+   turned the computer off. You have far less time to start up, no reopening
+   of applications or finding what directory you put that file in yesterday.
+   That's what TuxOnIce does.
+
+   TuxOnIce has a long heritage. It began life as work by Gabor Kuti, who,
+   with some help from Pavel Machek, got an early version going in 1999. The
+   project was then taken over by Florent Chabaud while still in alpha version
+   numbers. Nigel Cunningham came on the scene when Florent was unable to
+   continue, moving the project into betas, then 1.0, 2.0 and so on up to
+   the present series. During the 2.0 series, the name was contracted to
+   Suspend2 and the website suspend2.net created. Beginning around July 2007,
+   a transition to calling the software TuxOnIce was made, to help make it
+   clear that TuxOnIce is more concerned with hibernation than suspend to
+   ram.
+
+   Pavel Machek's swsusp code, which was merged around 2.5.17, retains the
+   original name, and was essentially a fork of the beta code until Rafael
+   Wysocki came on the scene in 2005 and began to improve it further.
+
+2. Why would you want it?
+
+   Why wouldn't you want it?
+
+   Being able to save the state of your system and quickly restore it improves
+   your productivity - you get a useful system in far less time than through
+   the normal boot process. You also get to be completely 'green', using zero
+   power, or as close to that as possible (the computer may still provide
+   minimal power to some devices, so they can initiate a power on, but that
+   will be the same amount of power as would be used if you told the computer
+   to shut down.)
+
+3. What do you need to use it?
+
+   a. Kernel Support.
+
+   i) The TuxOnIce patch.
+
+   TuxOnIce is part of the Linux Kernel. This version is not part of Linus's
+   2.6 tree at the moment, so you will need to download the kernel source and
+   apply the latest patch. Having done that, enable the appropriate options in
+   make [menu|x]config (under Power Management Options - look for "Enhanced
+   Hibernation"), compile and install your kernel. TuxOnIce works with SMP,
+   Highmem, preemption, fuse filesystems, x86-32, PPC and x86_64.
+
+   TuxOnIce patches are available from http://tuxonice.net.
+
+   ii) Compression support.
+
+   Compression support is implemented via the cryptoapi. You will therefore
+   want to select any cryptoapi transforms that you want to use on your image
+   from the cryptoapi menu while configuring your kernel. Part of the TuxOnIce
+   patch adds a new cryptoapi compression called LZF. We recommend the use of
+   this compression method - it is very fast and still achieves good
+   compression.
+
+   You can also tell TuxOnIce to write its image to an encrypted and/or
+   compressed filesystem/swap partition. In that case, you don't need to do
+   anything special for TuxOnIce when it comes to kernel configuration.
+
+   iii) Configuring other options.
+
+   While you're configuring your kernel, try to configure as much as possible
+   to build as modules. We recommend this because there are a number of drivers
+   that are still in the process of implementing proper power management
+   support. In those cases, the best way to work around their current lack is
+   to build them as modules and remove the modules while hibernating. You might
+   also bug the driver authors to get their support up to speed, or even help!
+
+   b. Storage.
+
+   i) Swap.
+
+   TuxOnIce can store the hibernation image in your swap partition, a swap file
+   or a combination thereof. Whichever combination you choose, you will
+   probably want to create enough swap space to store the largest image you
+   could have, plus the space you'd normally use for swap. A good rule of thumb
+   would be to calculate the amount of swap you'd want without using TuxOnIce,
+   and then add the amount of memory you have. This swapspace can be arranged
+   in any way you'd like. It can be in one partition or file, or spread over a
+   number. The only requirement is that they be active when you start a
+   hibernation cycle.
+
+   There is one exception to this requirement. TuxOnIce has the ability to turn
+   on one swap file or partition at the start of hibernating and turn it back
+   off at the end. If you want to ensure you have enough memory to store an
+   image when your memory is fully used, you might want to make one swap
+   partition or file for 'normal' use, and another for TuxOnIce to activate &
+   deactivate automatically. (Further details below).
+
+   ii) Normal files.
+
+   TuxOnIce includes a 'file allocator'. The file allocator can store your
+   image in a simple file. Since Linux has the concept of everything being a
+   file, this is more powerful than it initially sounds.
+   If, for example, you
+   were to set up a network block device file, you could hibernate to a network
+   server. This has been tested and works to a point, but nbd itself isn't
+   stateless enough for our purposes.
+
+   Take extra care when setting up the file allocator. If you just type
+   commands without thinking and then try to hibernate, you could cause
+   irreversible corruption on your filesystems! Make sure you have backups.
+
+   Most people will only want to hibernate to a local file. To achieve that, do
+   something along the lines of:
+
+   echo "TuxOnIce" > /hibernation-file
+   dd if=/dev/zero bs=1M count=512 >> /hibernation-file
+
+   This will create a 512MB file called /hibernation-file. To get TuxOnIce to
+   use it:
+
+   echo /hibernation-file > /sys/power/tuxonice/file/target
+
+   Then
+
+   cat /sys/power/tuxonice/resume
+
+   Put the results of this into your bootloader's configuration (see also step
+   C, below):
+
+   ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
+   # cat /sys/power/tuxonice/resume
+   file:/dev/hda2:0x1e001
+
+   In this example, we would edit the append= line of our lilo.conf|menu.lst
+   so that it included:
+
+   resume=file:/dev/hda2:0x1e001
+   ---EXAMPLE-ONLY-DON'T-COPY-AND-PASTE---
+
+   For those who are thinking 'Could I make the file sparse?', the answer is
+   'No!'. At the moment, there is no way for TuxOnIce to fill in the holes in
+   a sparse file while hibernating. In the longer term (post merge!), I'd like
+   to change things so that the file could be dynamically resized and have
+   holes filled as needed. Right now, however, that's not possible and not a
+   priority.
+
+   c. Bootloader configuration.
+
+   Using TuxOnIce also requires that you add an extra parameter to
+   your lilo.conf or equivalent. Here's an example for a swap partition:
+
+   append="resume=swap:/dev/hda1"
+
+   This would tell TuxOnIce that /dev/hda1 is a swap partition you
+   have. TuxOnIce will use the swap signature of this partition as a
+   pointer to your data when you hibernate. This means that (in this example)
+   /dev/hda1 doesn't need to be _the_ swap partition where all of your data
+   is actually stored. It just needs to be a swap partition that has a
+   valid signature.
+
+   You don't need to have a swap partition for this purpose. TuxOnIce
+   can also use a swap file, but usage is a little more complex. Having made
+   your swap file, turn it on and do
+
+   cat /sys/power/tuxonice/swap/headerlocations
+
+   (this assumes you've already compiled your kernel with TuxOnIce
+   support and booted it). The results of the cat command will tell you
+   what you need to put in lilo.conf:
+
+   For swap partitions like /dev/hda1, simply use resume=swap:/dev/hda1.
+   For swapfile `swapfile`, use resume=swap:/dev/hda2:0x242d.
+
+   If the swapfile changes for any reason (it is moved to a different
+   location, it is deleted and recreated, or the filesystem is
+   defragmented) then you will have to check
+   /sys/power/tuxonice/swap/headerlocations for a new resume_block value.
+
+   Once you've compiled and installed the kernel and adjusted your bootloader
+   configuration, you should only need to reboot for the most basic part
+   of TuxOnIce to be ready.
+
+   If you only compile in the swap allocator, or only compile in the file
+   allocator, you don't need to add the "swap:" part of the resume=
+   parameters above. resume=/dev/hda2:0x242d will work just as well. If you
+   have compiled both and your storage is on swap, you can also use this
+   format (the swap allocator is the default allocator).
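+
+   Putting the file allocator steps together, the setup can be driven from a
+   short script. This is only a sketch - the path and size are illustrative
+   assumptions, and it simply reuses the commands shown above:
+
+   #!/bin/sh
+   # Create a non-sparse 512MB hibernation file (sparse files won't work).
+   echo "TuxOnIce" > /hibernation-file
+   dd if=/dev/zero bs=1M count=512 >> /hibernation-file
+
+   # Point the file allocator at it.
+   echo /hibernation-file > /sys/power/tuxonice/file/target
+
+   # Print the parameter to append to the bootloader's kernel command line.
+   echo "resume=$(cat /sys/power/tuxonice/resume)"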
+
+   When compiling your kernel, one of the options in the 'Power Management
+   Support' menu, just above the 'Enhanced Hibernation (TuxOnIce)' entry, is
+   called 'Default resume partition'. This can be used to set a default value
+   for the resume= parameter.
+
+   d. The hibernate script.
+
+   Since the driver model in 2.6 kernels is still being developed, you may need
+   to do more than just configure TuxOnIce. Users of TuxOnIce usually start the
+   process via a script which prepares for the hibernation cycle, tells the
+   kernel to do its stuff and then restores things afterwards. This script
+   might involve:
+
+   - Switching to a text console and back if X doesn't like the video card
+     status on resume.
+   - Un/reloading drivers that don't play well with hibernation.
+
+   Note that you might not be able to unload some drivers if there are
+   processes using them. You might have to kill off processes that hold
+   devices open. Hint: if your X server accesses a USB mouse, doing a
+   'chvt' to a text console releases the device and you can unload the
+   module.
+
+   Check out the latest script (available on tuxonice.net).
+
+   e. The userspace user interface.
+
+   TuxOnIce has very limited support for displaying status if you only apply
+   the kernel patch - it can printk messages, but that is all. In addition,
+   some of the functions mentioned in this document (such as cancelling a cycle
+   or performing interactive debugging) are unavailable. To utilise these
+   functions, or simply get a nice display, you need the 'userui' component.
+   Userui comes in three flavours: usplash, fbsplash and text. Text should
+   work on any console. Usplash and fbsplash require the appropriate
+   (distro specific?) support.
+
+   To utilise a userui, TuxOnIce just needs to be told where to find the
+   userspace binary:
+
+   echo "/usr/local/sbin/tuxoniceui_fbsplash" > /sys/power/tuxonice/user_interface/program
+
+   The hibernate script can do this for you, and a default value for this
+   setting can be configured when compiling the kernel. This path is also
+   stored in the image header, so if you have an initrd or initramfs, you can
+   use the userui during the first part of resuming (prior to the atomic
+   restore) by putting the binary in the same path in your initrd/ramfs.
+   Alternatively, you can put it in a different location and do an echo
+   similar to the above prior to the echo > do_resume. The value saved in the
+   image header will then be ignored.
+
+4. Why not just use the version already in the kernel?
+
+   The version in the vanilla kernel has a number of drawbacks. The most
+   serious of these are:
+	- it has a maximum image size of 1/2 total memory;
+	- it doesn't allocate storage until after it has snapshotted memory.
+	  This means that you can't be sure hibernating will work until you
+	  see it start to write the image;
+	- it does not allow you to press escape to cancel a cycle;
+	- it does not allow you to press escape to cancel resuming;
+	- it does not allow you to automatically swapon a file when
+	  starting a cycle;
+	- it does not allow you to use multiple swap partitions or files;
+	- it does not allow you to use ordinary files;
+	- it just invalidates an image and continues to boot if you
+	  accidentally boot the wrong kernel after hibernating;
+	- it doesn't support any sort of nice display while hibernating;
+	- it is moving toward requiring that you have an initrd/initramfs
+	  to ever have a hope of resuming (uswsusp).
+	  While uswsusp will
+	  address some of the concerns above, it won't address all of them,
+	  and will be more complicated to get set up;
+	- it doesn't have support for suspend-to-both (write a hibernation
+	  image, then suspend to ram; I think this is known as ReadySafe
+	  under M$).
+
+5. How do you use it?
+
+   A hibernation cycle can be started directly by doing:
+
+   echo > /sys/power/tuxonice/do_hibernate
+
+   In practice, though, you'll probably want to use the hibernate script
+   to unload modules, configure the kernel the way you like it and so on.
+   In that case, you'd do (as root):
+
+   hibernate
+
+   See the hibernate script's man page for more details on the options it
+   takes.
+
+   If you're using the text or splash user interface modules, one feature of
+   TuxOnIce that you might find useful is that you can press Escape at any time
+   during hibernating, and the process will be aborted.
+
+   Due to the way hibernation works, this means you'll have your system back
+   and perfectly usable almost instantly. The only exception is when it's at
+   the very end of writing the image. Then it will need to reload a small
+   (usually 4-50MB, depending upon the image characteristics) portion first.
+
+   Likewise, when resuming, you can press escape and resuming will be aborted.
+   The computer will then power down again according to the settings at that
+   time for the powerdown method or rebooting.
+
+   You can change the settings for powering down while the image is being
+   written by pressing 'R' to toggle rebooting and 'O' to toggle between
+   suspending to ram and powering down completely.
+
+   If you run into problems with resuming, adding the "noresume" option to
+   the kernel command line will let you skip the resume step and recover your
+   system. This option shouldn't normally be needed, because TuxOnIce modifies
+   the image header prior to the atomic restore, and will thus prompt you
+   if it detects that you've tried to resume an image before (this flag is
+   removed if you press Escape to cancel a resume, so you won't be prompted
+   then).
+
+   Recent kernels (2.6.24 onwards) add support for resuming from a different
+   kernel to the one that was hibernated (thanks to Rafael for his work on
+   this - I've just embraced and enhanced the support for TuxOnIce). This
+   should further reduce the need for you to use the noresume option.
+
+6. What do all those entries in /sys/power/tuxonice do?
+
+   /sys/power/tuxonice is the directory which contains files you can use to
+   tune and configure TuxOnIce to your liking. The exact contents of
+   the directory will depend upon the version of TuxOnIce you're
+   running and the options you selected at compile time. In the following
+   descriptions, names in brackets refer to compile time options.
+   (Note that they're all dependent upon you having selected CONFIG_TUXONICE
+   in the first place!).
+
+   Since the values of these settings can open potential security risks, the
+   writeable ones are accessible only to the root user. You may want to
+   configure sudo to allow you to invoke your hibernate script as an ordinary
+   user.
+
+   - checksum/enabled
+
+   Use cryptoapi hashing routines to verify that Pageset2 pages don't change
+   while we're saving the first part of the image, and to get any pages that
+   do change resaved in the atomic copy. This should normally not be needed,
+   but if you're seeing issues, please enable this. If your issues stop you
+   being able to resume, enable this option, hibernate and cancel the cycle
+   after the atomic copy is done.
+   If the debugging info shows a non-zero
+   number of pages resaved, please report this to Nigel.
+
+   - compression/algorithm
+
+   Set the cryptoapi algorithm used for compressing the image.
+
+   - compression/expected_compression
+
+   These values allow you to set an expected compression ratio, which TuxOnIce
+   will use in calculating whether it meets constraints on the image size. If
+   this expected compression ratio is not attained, the hibernation cycle will
+   abort, so it is wise to allow some spare. You can see what compression
+   ratio is achieved in the logs after hibernating.
+
+   - debug_info:
+
+   This file returns information about your configuration that may be helpful
+   in diagnosing problems with hibernating.
+
+   - do_hibernate:
+
+   When anything is written to this file, the kernel side of TuxOnIce will
+   begin to attempt to write an image to disk and power down. You'll normally
+   want to run the hibernate script instead, to get modules unloaded first.
+
+   - do_resume:
+
+   When anything is written to this file TuxOnIce will attempt to read and
+   restore an image. If there is no image, it will return almost immediately.
+   If an image exists, the echo > will never return. Instead, the original
+   kernel context will be restored and the original echo > do_hibernate will
+   return.
+
+   - */enabled
+
+   These options can be used to temporarily disable various parts of TuxOnIce.
+
+   - extra_pages_allowance
+
+   When TuxOnIce does its atomic copy, it calls the driver model suspend
+   and resume methods. If you have DRI enabled with a driver such as fglrx,
+   this can result in the driver allocating a substantial amount of memory
+   for storing its state. Extra_pages_allowance tells TuxOnIce how much
+   extra memory it should ensure is available for those allocations. If
+   your attempts at hibernating end with a message in dmesg indicating that
+   insufficient extra pages were allowed, you need to increase this value.
+
+   - file/target:
+
+   Read this value to get the current setting. Write to it to point TuxOnIce
+   at a new storage location for the file allocator. See section 3.b.ii above
+   for details of how to set up the file allocator.
+
+   - freezer_test
+
+   This entry can be used to get TuxOnIce to just test the freezer and prepare
+   an image without actually doing a hibernation cycle. It is useful for
+   diagnosing freezing and image preparation issues.
+
+   - image_exists:
+
+   Can be used in a script to determine whether a valid image exists at the
+   location currently pointed to by resume=. Returns up to three lines.
+   The first is whether an image exists (-1 for unsure, otherwise 0 or 1).
+   If an image exists, additional lines will return the machine and version.
+   Echoing anything to this entry removes any current image.
+
+   - image_size_limit:
+
+   The maximum size of hibernation image written to disk, measured in megabytes
+   (1024*1024).
+
+   - last_result:
+
+   The result of the last hibernation cycle, as defined in
+   include/linux/suspend-debug.h with the values SUSPEND_ABORTED to
+   SUSPEND_KEPT_IMAGE. This is a bitmask.
+
+   - log_everything (CONFIG_PM_DEBUG):
+
+   Setting this option results in all messages printed being logged. Normally,
+   only a subset are logged, so as to not slow the process and not clutter the
+   logs. Useful for debugging. It can be toggled during a cycle by pressing
+   'L'.
+
+   - pause_between_steps (CONFIG_PM_DEBUG):
+
+   This option is used during debugging, to make TuxOnIce pause between
+   each step of the process. It is ignored when the nice display is on.
+
+   - powerdown_method:
+
+   Used to select a method by which TuxOnIce should power down after writing
+   the image. Currently:
+
+   0: Don't use ACPI to power off.
+   3: Attempt to enter Suspend-to-ram.
+   4: Attempt to enter ACPI S4 mode.
+   5: Attempt to power down via ACPI S5 mode.
+
+   Note that these options are highly dependent upon your hardware & software:
+
+   3: When successful, your machine suspends to ram instead of powering off.
+      The advantage of using this mode is that it doesn't matter whether your
+      battery has enough charge to make it through to your next resume. If it
+      lasts, you will simply resume from suspend to ram (and the image on disk
+      will be discarded). If the battery runs out, you will resume from disk
+      instead. The disadvantage is that it takes longer than a normal
+      suspend-to-ram to enter the state, since the suspend-to-disk image needs
+      to be written first.
+   4/5: When successful, your machine will be off and consume (almost) no
+      power. But it might still react to some external events like opening the
+      lid or traffic on a network or usb device. For the bios, resume is then
+      the same as a warm boot, similar to a situation where you used the
+      command `reboot' to reboot your machine. If your machine has problems on
+      warm boot or if you want to protect your machine with the bios password,
+      this is probably not the right choice. Mode 4 may be necessary on some
+      machines where ACPI wake up methods need to be run to properly
+      reinitialise hardware after a hibernation cycle.
+   0: Switch the machine completely off. The only possible wakeup is the power
+      button. For the bios, resume is then the same as a cold boot, in
+      particular you would have to provide your bios boot password if your
+      machine uses that feature for booting.
+
+   - progressbar_granularity_limit:
+
+   This option can be used to limit the granularity of the progress bar
+   displayed with a bootsplash screen. The value is the maximum number of
+   steps. That is, 10 will make the progress bar jump in 10% increments.
+
+   - reboot:
+
+   This option causes TuxOnIce to reboot rather than powering down
+   at the end of saving an image. It can be toggled during a cycle by pressing
+   'R'.
+
+   - resume_commandline:
+
+   This entry can be read after resuming to see the commandline that was used
+   when resuming began. You might use this to set up two bootloader entries
+   that are the same apart from the fact that one includes an extra append=
+   argument "at_work=1". You could then grep resume_commandline in your
+   post-resume scripts and configure networking (for example) differently
+   depending upon whether you're at home or work. resume_commandline can be
+   set to arbitrary text if you wish to remove sensitive contents.
+
+   - swap/swapfilename:
+
+   This entry is used to specify the swapfile or partition that
+   TuxOnIce will attempt to swapon/swapoff automatically. Thus, if
+   I normally use /dev/hda1 for swap, and want to use /dev/hda2 specifically
+   for my hibernation image, I would
+
+   echo /dev/hda2 > /sys/power/tuxonice/swap/swapfile
+
+   /dev/hda2 would then be automatically swapon'd and swapoff'd. Note that the
+   swapon and swapoff occur while other processes are frozen (including kswapd)
+   so this swap file will not be used up when attempting to free memory. The
+   partition/file is also given the highest priority, so other
+   swapfiles/partitions will only be used to save the image when this one is
+   filled.
+
+   The value of this file is used by headerlocations along with any currently
+   activated swapfiles/partitions.
+
+   - swap/headerlocations:
+
+   This option tells you the resume= options to use for swap devices you
+   currently have activated. It is particularly useful when you only want to
+   use a swap file to store your image. See above for further details.
+
+   - userui_program
+
+   This entry is used to tell TuxOnIce what userspace program to use for
+   providing a user interface while hibernating. The program uses a netlink
+   socket to pass messages back and forth to the kernel, allowing all of the
+   functions formerly implemented in the kernel user interface components.
+
+   - user_interface/debug_sections (CONFIG_PM_DEBUG):
+
+   This value, together with the console log level, controls what debugging
+   information is displayed. The console log level determines the level of
+   detail, and this value determines what detail is displayed. This value is
+   a bit vector, and the meaning of the bits can be found in the kernel tree
+   in include/linux/tuxonice.h. It can be overridden using the kernel's
+   command line option suspend_dbg.
+
+   - user_interface/default_console_level (CONFIG_PM_DEBUG):
+
+   This determines the value of the console log level at the start of a
+   hibernation cycle. If debugging is compiled in, the console log level can be
+   changed during a cycle by pressing the digit keys. Meanings are:
+
+   0: Nice display.
+   1: Nice display plus numerical progress.
+   2: Errors only.
+   3: Low level debugging info.
+   4: Medium level debugging info.
+   5: High level debugging info.
+   6: Verbose debugging info.
+
+   - user_interface/enable_escape:
+
+   Setting this to "1" will enable you to abort a hibernation cycle or
+   resuming by pressing escape, "0" (default) disables this feature. Note that
+   enabling this option means that you cannot initiate a hibernation cycle and
+   then walk away from your computer, expecting it to be secure. With this
+   feature disabled, you can validly have this expectation once TuxOnIce
+   begins to write the image to disk. (Prior to this point, it is possible
+   that TuxOnIce might abort because of failure to freeze all processes or
+   because constraints on its ability to save the image are not met).
+
+   - version:
+
+   The version of TuxOnIce you have compiled into the currently running kernel.
+
+7. How do you get support?
+
+   Glad you asked. TuxOnIce is being actively maintained and supported
+   by Nigel (the guy doing most of the kernel coding at the moment), Bernard
+   (who maintains the hibernate script and userspace user interface components)
+   and its users.
+
+   Resources available include HowTos, FAQs and a Wiki, all available via
+   tuxonice.net. You can find the mailing lists there.
+
+8. I think I've found a bug. What should I do?
+
+   By far and away, the most common problems people have with TuxOnIce
+   relate to drivers not having adequate power management support. In this
+   case, it is not a bug with TuxOnIce, but we can still help you. As we
+   mentioned above, such issues can usually be worked around by building the
+   functionality as modules and unloading them while hibernating. Please visit
+   the Wiki for up-to-date lists of known issues and workarounds.
+
+   If this information doesn't help, try running:
+
+   hibernate --bug-report
+
+   ...and sending the output to the users mailing list.
+
+   Good information on how to provide us with useful information from an
+   oops is found in the file REPORTING-BUGS, in the top level directory
+   of the kernel tree.
+   If you get an oops, please especially note the
+   information about running what is printed on the screen through ksymoops.
+   The raw information is useless.
+
+9. When will XXX be supported?
+
+   If there's a feature missing from TuxOnIce that you'd like, feel free to
+   ask. We try to be obliging, within reason.
+
+   Patches are welcome. Please send to the list.
+
+10. How does it work?
+
+   TuxOnIce does its work in a number of steps.
+
+   a. Freezing system activity.
+
+   The first main stage in hibernating is to stop all other activity. This is
+   achieved in stages. Processes are considered in four groups, which we will
+   describe in reverse order for clarity's sake: threads with the PF_NOFREEZE
+   flag, kernel threads without this flag, userspace processes with the
+   PF_SYNCTHREAD flag and all other processes. The first set (PF_NOFREEZE) are
+   untouched by the refrigerator code. They are allowed to run during
+   hibernating and resuming, and are used to support user interaction, storage
+   access or the like. Other kernel threads (those unneeded while hibernating)
+   are frozen last. This leaves us with userspace processes that need to be
+   frozen. When a process enters one of the *_sync system calls, we set a
+   PF_SYNCTHREAD flag on that process for the duration of that call. Processes
+   that have this flag are frozen after processes without it, so that we can
+   seek to ensure that dirty data is synced to disk as quickly as possible in
+   a situation where other processes may be submitting writes at the same
+   time. Freezing the processes that are submitting data stops new I/O from
+   being submitted. Syncthreads can then cleanly finish their work. So the
+   order is:
+
+   - Userspace processes without PF_SYNCTHREAD or PF_NOFREEZE;
+   - Userspace processes with PF_SYNCTHREAD (they won't have NOFREEZE);
+   - Kernel processes without PF_NOFREEZE.
+
+   b. Eating memory.
+
+   For a successful hibernation cycle, you need to have enough disk space to
+   store the image and enough memory to satisfy the various constraints of
+   TuxOnIce's algorithm. You can also specify a maximum image size. In order
+   to meet those constraints, TuxOnIce may 'eat' memory. If, after freezing
+   processes, the constraints aren't met, TuxOnIce will thaw all the
+   other processes and begin to eat memory until its calculations indicate
+   the constraints are met. It will then freeze processes again and recheck
+   its calculations.
+
+   c. Allocation of storage.
+
+   Next, TuxOnIce allocates the storage that will be used to save
+   the image.
+
+   The core of TuxOnIce knows nothing about how or where pages are stored. We
+   therefore request the active allocator (remember you might have compiled in
+   more than one!) to allocate enough storage for our expected image size. If
+   this request cannot be fulfilled, we eat more memory and try again. If it
+   is fulfilled, we seek to allocate additional storage, just in case our
+   expected compression ratio (if any) isn't achieved. This time, however, we
+   just continue if we can't allocate enough storage.
+
+   If these calls to our allocator change the characteristics of the image
+   such that we haven't allocated enough memory, we also loop. (The allocator
+   may well need to allocate space for its storage information).
+
+   d. Write the first part of the image.
+
+   TuxOnIce stores the image in two sets of pages called 'pagesets'.
+   Pageset 2 contains pages on the active and inactive lists; essentially
+   the page cache. Pageset 1 contains all other pages, including the kernel.
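+
+   In outline, the split between the pagesets can be pictured with the
+   following sketch. This is simplified, userspace-style C for illustration
+   only - the struct and helper names are invented here and are not the
+   kernel's real APIs:
+
+   #include <stdbool.h>
+
+   struct page_info {
+           bool on_lru;            /* on the active/inactive LRU lists? */
+           bool needed_for_image;  /* e.g. used by KMS - must not change */
+   };
+
+   enum pageset { PAGESET1, PAGESET2 };
+
+   static enum pageset classify_page(const struct page_info *p)
+   {
+           /* LRU pages that nothing touches while TuxOnIce runs can be
+            * written out early (without an atomic copy) and reloaded
+            * last. */
+           if (p->on_lru && !p->needed_for_image)
+                   return PAGESET2;
+
+           /* Everything else must be part of the atomic copy. */
+           return PAGESET1;
+   }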
+
+   We use two pagesets for one important reason: we need to make an atomic
+   copy of the kernel to ensure consistency of the image. Without a second
+   pageset, that would limit us to an image that was at most half the amount
+   of memory available. Using two pagesets allows us to store a full image.
+   Since pageset 2 pages won't be needed in saving pageset 1, we first save
+   pageset 2 pages. We can then make our atomic copy of the remaining pages
+   using both pageset 2 pages and any other pages that are free. While saving
+   both pagesets, we are careful not to corrupt the image. Among other things,
+   we use low-level block I/O routines that don't change the pagecache
+   contents.
+
+   The next step, then, is writing pageset 2.
+
+   e. Suspending drivers and storing processor context.
+
+   Having written pageset2, TuxOnIce calls the power management functions to
+   notify drivers of the hibernation, and saves the processor state in
+   preparation for the atomic copy of memory we are about to make.
+
+   f. Atomic copy.
+
+   At this stage, everything else but the TuxOnIce code is halted. Processes
+   are frozen or idling, drivers are quiesced and have stored (ideally and
+   where necessary) their configuration in memory we are about to atomically
+   copy. In our low-level architecture specific code, we have saved the CPU
+   state. We can therefore now do our atomic copy before resuming drivers etc.
+
+   g. Save the atomic copy (pageset 1).
+
+   TuxOnIce can then write the atomic copy of the remaining pages. Since we
+   have copied the pages into other locations, we can continue to use the
+   normal block I/O routines without fear of corrupting our image.
+
+   h. Save the image header.
+
+   Nearly there! We save our settings and other parameters needed for
+   reloading pageset 1 in an 'image header'. We also tell our allocator to
+   serialise its data at this stage, so that it can reread the image at resume
+   time.
+
+   i. Set the image header.
+
+   Finally, we edit the header at our resume= location. The signature is
+   changed by the allocator to reflect the fact that an image exists, and to
+   point to the start of that data if necessary (swap allocator).
+
+   j. Power down.
+
+   Or reboot if we're debugging and the appropriate option is selected.
+
+   Whew!
+
+   Reloading the image.
+   --------------------
+
+   Reloading the image is essentially the reverse of all the above. We load
+   our copy of pageset 1, being careful to choose locations that aren't going
+   to be overwritten as we copy it back (we start very early in the boot
+   process, so there are no other processes to quiesce here). We then copy
+   pageset 1 back to its original location in memory and restore the process
+   context. We are now running with the original kernel. Next, we reload the
+   pageset 2 pages, free the memory and swap used by TuxOnIce, restore
+   the pageset header and restart processes. Sounds easy in comparison to
+   hibernating, doesn't it!
+
+   There is of course more to TuxOnIce than this, but this explanation
+   should be a good start. If there's interest, I'll write further
+   documentation on range pages and the low level I/O.
+
+11. Who wrote TuxOnIce?
+
+   (Answer based on the writings of Florent Chabaud, credits in files and
+   Nigel's limited knowledge; apologies to anyone missed out!)
+
+   The main developers of TuxOnIce have been...
+ + Gabor Kuti + Pavel Machek + Florent Chabaud + Bernard Blackham + Nigel Cunningham + + Significant portions of swsusp, the code in the vanilla kernel which + TuxOnIce enhances, have been worked on by Rafael Wysocki. Thanks should + also be expressed to him. + + The above mentioned developers have been aided in their efforts by a host + of hundreds, if not thousands of testers and people who have submitted bug + fixes & suggestions. Of special note are the efforts of Michael Frank, who + had his computers repetitively hibernate and resume for literally tens of + thousands of cycles and developed scripts to stress the system and test + TuxOnIce far beyond the point most of us (Nigel included!) would consider + testing. His efforts have contributed as much to TuxOnIce as any of the + names above. diff --git a/MAINTAINERS b/MAINTAINERS index 5d460c9..a39a4fb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4343,6 +4343,13 @@ P: Maciej W. Rozycki M: macro@linux-mips.org S: Maintained +TUXONICE (ENHANCED HIBERNATION) +P: Nigel Cunningham +M: nigel@tuxonice.net +L: tuxonice-devel@tuxonice.net +W: http://tuxonice.net +S: Maintained + U14-34F SCSI DRIVER P: Dario Ballabio M: ballabio_dario@emc.com diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index 58bcaeb..f5e16a4 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -388,6 +388,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable) change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0)); } +EXPORT_SYMBOL_GPL(kernel_map_pages); #endif /* CONFIG_DEBUG_PAGEALLOC */ static int fixmaps; diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 4526b3a..6851f01 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -605,6 +605,7 @@ void machine_restart(char *cmd) { machine_ops.restart(cmd); } +EXPORT_SYMBOL_GPL(machine_restart); void machine_halt(void) { diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 7233bd7..0079a2c 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1154,6 +1154,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable) */ __flush_tlb_all(); } +EXPORT_SYMBOL_GPL(kernel_map_pages); #ifdef CONFIG_HIBERNATION @@ -1168,7 +1169,7 @@ bool kernel_page_present(struct page *page) pte = lookup_address((unsigned long)page_address(page), &level); return (pte_val(*pte) & _PAGE_PRESENT); } - +EXPORT_SYMBOL_GPL(kernel_page_present); #endif /* CONFIG_HIBERNATION */ #endif /* CONFIG_DEBUG_PAGEALLOC */ diff --git a/arch/x86/power/cpu_64.c b/arch/x86/power/cpu_64.c index e3b6cf7..bdd74a8 100644 --- a/arch/x86/power/cpu_64.c +++ b/arch/x86/power/cpu_64.c @@ -10,6 +10,7 @@ #include #include +#include #include #include #include @@ -76,6 +77,7 @@ void save_processor_state(void) { __save_processor_state(&saved_context); } +EXPORT_SYMBOL_GPL(save_processor_state); static void do_fpu_end(void) { diff --git a/arch/x86/power/hibernate_32.c b/arch/x86/power/hibernate_32.c index 81197c6..ff7e534 100644 --- a/arch/x86/power/hibernate_32.c +++ b/arch/x86/power/hibernate_32.c @@ -8,6 +8,7 @@ #include #include +#include #include #include @@ -163,6 +164,7 @@ int swsusp_arch_resume(void) restore_image(); return 0; } +EXPORT_SYMBOL_GPL(swsusp_arch_resume); /* * pfn_is_nosave - check if given pfn is in the 'nosave' section diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c index 6dd000d..b42e72a 100644 --- a/arch/x86/power/hibernate_64.c +++ b/arch/x86/power/hibernate_64.c @@ -10,6 
+10,7 @@ #include #include +#include #include #include #include @@ -117,6 +118,7 @@ int swsusp_arch_resume(void) restore_image(); return 0; } +EXPORT_SYMBOL_GPL(swsusp_arch_resume); /* * pfn_is_nosave - check if given pfn is in the 'nosave' section @@ -167,3 +169,4 @@ int arch_hibernation_header_restore(void *addr) restore_cr3 = rdr->cr3; return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL; } +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore); diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 1352312..f705dce 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -183,6 +183,7 @@ int wait_for_device_probe(void) async_synchronize_full(); return 0; } +EXPORT_SYMBOL_GPL(wait_for_device_probe); /** * driver_probe_device - attempt to bind device & driver together diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 2d14f4a..dc9ef3b 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -54,6 +54,7 @@ void device_pm_lock(void) { mutex_lock(&dpm_list_mtx); } +EXPORT_SYMBOL_GPL(device_pm_lock); /** * device_pm_unlock - unlock the list of active devices used by the PM core @@ -62,6 +63,7 @@ void device_pm_unlock(void) { mutex_unlock(&dpm_list_mtx); } +EXPORT_SYMBOL_GPL(device_pm_unlock); /** * device_pm_add - add a device to the list of active devices diff --git a/drivers/char/vt.c b/drivers/char/vt.c index 7900bd6..a5a8717 100644 --- a/drivers/char/vt.c +++ b/drivers/char/vt.c @@ -187,6 +187,7 @@ int fg_console; int last_console; int want_console = -1; int kmsg_redirect; +EXPORT_SYMBOL_GPL(kmsg_redirect); /* * For each existing display, we have a pointer to console currently visible diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 88d3368..5817513 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -136,7 +136,8 @@ drm_gem_object_alloc(struct drm_device *dev, size_t size) obj = kcalloc(1, sizeof(*obj), GFP_KERNEL); obj->dev = dev; - obj->filp = shmem_file_setup("drm mm object", size, VM_NORESERVE); + obj->filp = shmem_file_setup("drm mm object", size, + VM_NORESERVE | VM_ATOMIC_COPY); if (IS_ERR(obj->filp)) { kfree(obj); return NULL; diff --git a/drivers/md/md.c b/drivers/md/md.c index a307f87..6eb04ea 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5778,7 +5778,6 @@ void md_done_sync(mddev_t *mddev, int blocks, int ok) } } - /* md_write_start(mddev, bi) * If we need to update some array metadata (e.g. 
'active' flag
 * in superblock) before writing, schedule a superblock update
@@ -5922,6 +5921,9 @@ void md_do_sync(mddev_t *mddev)
 	mddev->curr_resync = 2;
 
  try_again:
+	while (freezer_is_on())
+		yield();
+
 	if (kthread_should_stop()) {
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 		goto skip;
@@ -5943,6 +5945,10 @@ void md_do_sync(mddev_t *mddev)
 			 * time 'round when curr_resync == 2
 			 */
 			continue;
+
+		while (freezer_is_on())
+			yield();
+
 		/* We need to wait 'interruptible' so as not to
 		 * contribute to the load average, and not to
 		 * be caught by 'softlockup'
@@ -5955,6 +5961,7 @@ void md_do_sync(mddev_t *mddev)
 				       " share one or more physical units)\n",
 				       desc, mdname(mddev), mdname(mddev2));
 				mddev_put(mddev2);
+				try_to_freeze();
 				if (signal_pending(current))
 					flush_signals(current);
 				schedule();
@@ -6038,6 +6045,10 @@ void md_do_sync(mddev_t *mddev)
 				 mddev->resync_max > j
 				 || kthread_should_stop());
 		}
+
+		while (freezer_is_on())
+			yield();
+
 		if (kthread_should_stop())
 			goto interrupted;
 		sectors = mddev->pers->sync_request(mddev, j, &skipped,
@@ -6081,6 +6092,9 @@ void md_do_sync(mddev_t *mddev)
 			last_mark = next;
 		}
 
+		while (freezer_is_on())
+			yield();
+
 		if (kthread_should_stop())
 			goto interrupted;
diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..cb8c84a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -310,6 +310,93 @@ int thaw_bdev(struct block_device *bdev, struct super_block *sb)
 }
 EXPORT_SYMBOL(thaw_bdev);
 
+#ifdef CONFIG_FS_FREEZER_DEBUG
+#define FS_PRINTK(fmt, args...) printk(fmt, ## args)
+#else
+#define FS_PRINTK(fmt, args...)
+#endif
+
+/* #define DEBUG_FS_FREEZING */
+
+/**
+ * freeze_filesystems - lock all filesystems and force them into a consistent
+ * state
+ * @which:	What combination of fuse & non-fuse to freeze.
+ */
+void freeze_filesystems(int which)
+{
+	struct super_block *sb;
+
+	lockdep_off();
+
+	/*
+	 * Freeze in reverse order so filesystems dependent upon others are
+	 * frozen in the right order (e.g. loopback on ext3).
+	 */
+	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+		FS_PRINTK(KERN_INFO "Considering %s.%s: (root %p, bdev %x)",
+			sb->s_type->name ? sb->s_type->name : "?",
+			sb->s_subtype ? sb->s_subtype : "", sb->s_root,
+			sb->s_bdev ? sb->s_bdev->bd_dev : 0);
+
+		if (sb->s_type->fs_flags & FS_IS_FUSE &&
+		    sb->s_frozen == SB_UNFROZEN &&
+		    which & FS_FREEZER_FUSE) {
+			sb->s_frozen = SB_FREEZE_TRANS;
+			sb->s_flags |= MS_FROZEN;
+			FS_PRINTK("Fuse filesystem done.\n");
+			continue;
+		}
+
+		if (!sb->s_root || !sb->s_bdev ||
+		    (sb->s_frozen == SB_FREEZE_TRANS) ||
+		    (sb->s_flags & MS_RDONLY) ||
+		    (sb->s_flags & MS_FROZEN) ||
+		    !(which & FS_FREEZER_NORMAL)) {
+			FS_PRINTK(KERN_INFO "Nope.\n");
+			continue;
+		}
+
+		FS_PRINTK(KERN_INFO "Freezing %x... ", sb->s_bdev->bd_dev);
+		freeze_bdev(sb->s_bdev);
+		sb->s_flags |= MS_FROZEN;
+		FS_PRINTK(KERN_INFO "Done.\n");
+	}
+
+	lockdep_on();
+}
+
+/**
+ * thaw_filesystems - unlock all filesystems
+ * @which:	What combination of fuse & non-fuse to thaw.
+ */
+void thaw_filesystems(int which)
+{
+	struct super_block *sb;
+
+	lockdep_off();
+
+	list_for_each_entry(sb, &super_blocks, s_list) {
+		if (!(sb->s_flags & MS_FROZEN))
+			continue;
+
+		if (sb->s_type->fs_flags & FS_IS_FUSE) {
+			if (!(which & FS_FREEZER_FUSE))
+				continue;
+
+			sb->s_frozen = SB_UNFROZEN;
+		} else {
+			if (!(which & FS_FREEZER_NORMAL))
+				continue;
+
+			thaw_bdev(sb->s_bdev, sb);
+		}
+		sb->s_flags &= ~MS_FROZEN;
+	}
+
+	lockdep_on();
+}
+
 /*
  * Various filesystems appear to want __find_get_block to be non-blocking.
 * But it's the page lock which protects the buffers.
To get around this, diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 3e5637f..f3c5cd6 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -8,6 +8,7 @@ #include #include #include +#include /* A global variable is a bit ugly, but it keeps the code simple */ int sysctl_drop_caches; @@ -33,7 +34,7 @@ static void drop_pagecache_sb(struct super_block *sb) iput(toput_inode); } -static void drop_pagecache(void) +void drop_pagecache(void) { struct super_block *sb; @@ -61,6 +62,7 @@ static void drop_slab(void) nr_objects = shrink_slab(1000, GFP_KERNEL, 1000); } while (nr_objects > 10); } +EXPORT_SYMBOL_GPL(drop_pagecache); int drop_caches_sysctl_handler(ctl_table *table, int write, struct file *file, void __user *buffer, size_t *length, loff_t *ppos) diff --git a/fs/fuse/control.c b/fs/fuse/control.c index 99c99df..cadffd8 100644 --- a/fs/fuse/control.c +++ b/fs/fuse/control.c @@ -209,6 +209,7 @@ static void fuse_ctl_kill_sb(struct super_block *sb) static struct file_system_type fuse_ctl_fs_type = { .owner = THIS_MODULE, .name = "fusectl", + .fs_flags = FS_IS_FUSE, .get_sb = fuse_ctl_get_sb, .kill_sb = fuse_ctl_kill_sb, }; diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index ba76b68..e9942d4 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -7,6 +7,7 @@ */ #include "fuse_i.h" +#include "fuse.h" #include #include @@ -16,6 +17,7 @@ #include #include #include +#include MODULE_ALIAS_MISCDEV(FUSE_MINOR); @@ -752,6 +754,8 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov, if (!fc) return -EPERM; + FUSE_MIGHT_FREEZE(file->f_mapping->host->i_sb, "fuse_dev_read"); + restart: spin_lock(&fc->lock); err = -EAGAIN; @@ -912,6 +916,9 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, const struct iovec *iov, if (!fc) return -EPERM; + FUSE_MIGHT_FREEZE(iocb->ki_filp->f_mapping->host->i_sb, + "fuse_dev_write"); + fuse_copy_init(&cs, fc, 0, NULL, iov, nr_segs); if (nbytes < sizeof(struct fuse_out_header)) return -EINVAL; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index fdff346..1fd3cd1 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -7,12 +7,14 @@ */ #include "fuse_i.h" +#include "fuse.h" #include #include #include #include #include +#include #if BITS_PER_LONG >= 64 static inline void fuse_dentry_settime(struct dentry *entry, u64 time) @@ -174,6 +176,9 @@ static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd) return 0; fc = get_fuse_conn(inode); + + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_dentry_revalidate"); + req = fuse_get_req(fc); if (IS_ERR(req)) return 0; @@ -268,6 +273,8 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, struct qstr *name, if (name->len > FUSE_NAME_MAX) goto out; + FUSE_MIGHT_FREEZE(sb, "fuse_lookup_name"); + req = fuse_get_req(fc); err = PTR_ERR(req); if (IS_ERR(req)) @@ -331,6 +338,8 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry, if (err) goto out_err; + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_lookup"); + err = -EIO; if (inode && get_node_id(inode) == FUSE_ROOT_ID) goto out_iput; @@ -402,6 +411,8 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry, int mode, if (IS_ERR(forget_req)) return PTR_ERR(forget_req); + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_create_open"); + req = fuse_get_req(fc); err = PTR_ERR(req); if (IS_ERR(req)) @@ -488,6 +499,8 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_req *req, int err; struct fuse_req *forget_req; + FUSE_MIGHT_FREEZE(dir->i_sb, "create_new_entry"); + forget_req = fuse_get_req(fc); if (IS_ERR(forget_req)) { fuse_put_request(fc, 
req); @@ -585,7 +598,11 @@ static int fuse_mkdir(struct inode *dir, struct dentry *entry, int mode) { struct fuse_mkdir_in inarg; struct fuse_conn *fc = get_fuse_conn(dir); - struct fuse_req *req = fuse_get_req(fc); + struct fuse_req *req; + + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_mkdir"); + + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -605,7 +622,11 @@ static int fuse_symlink(struct inode *dir, struct dentry *entry, { struct fuse_conn *fc = get_fuse_conn(dir); unsigned len = strlen(link) + 1; - struct fuse_req *req = fuse_get_req(fc); + struct fuse_req *req; + + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_symlink"); + + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -622,7 +643,11 @@ static int fuse_unlink(struct inode *dir, struct dentry *entry) { int err; struct fuse_conn *fc = get_fuse_conn(dir); - struct fuse_req *req = fuse_get_req(fc); + struct fuse_req *req; + + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_unlink"); + + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -655,7 +680,11 @@ static int fuse_rmdir(struct inode *dir, struct dentry *entry) { int err; struct fuse_conn *fc = get_fuse_conn(dir); - struct fuse_req *req = fuse_get_req(fc); + struct fuse_req *req; + + FUSE_MIGHT_FREEZE(dir->i_sb, "fuse_rmdir"); + + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 821d10f..4b8b86f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -7,11 +7,13 @@ */ #include "fuse_i.h" +#include "fuse.h" #include #include #include #include +#include static const struct file_operations fuse_direct_io_file_operations; @@ -23,6 +25,8 @@ static int fuse_send_open(struct inode *inode, struct file *file, int isdir, struct fuse_req *req; int err; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_send_open"); + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -279,6 +283,8 @@ static int fuse_flush(struct file *file, fl_owner_t id) if (fc->no_flush) return 0; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_flush"); + req = fuse_get_req_nofail(fc, file); memset(&inarg, 0, sizeof(inarg)); inarg.fh = ff->fh; @@ -330,6 +336,8 @@ int fuse_fsync_common(struct file *file, struct dentry *de, int datasync, if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir)) return 0; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_fsync_common"); + /* * Start writeback against all dirty pages of the inode, then * wait for all outstanding writes, before sending the FSYNC @@ -438,6 +446,8 @@ static int fuse_readpage(struct file *file, struct page *page) if (is_bad_inode(inode)) goto out; + FUSE_MIGHT_FREEZE(file->f_mapping->host->i_sb, "fuse_readpage"); + /* * Page writeback can extend beyond the liftime of the * page-cache page, so make sure we read a properly synced @@ -538,6 +548,9 @@ static int fuse_readpages_fill(void *_data, struct page *page) struct inode *inode = data->inode; struct fuse_conn *fc = get_fuse_conn(inode); + FUSE_MIGHT_FREEZE(data->file->f_mapping->host->i_sb, + "fuse_readpages_fill"); + fuse_wait_on_page_writeback(inode, page->index); if (req->num_pages && @@ -568,6 +581,8 @@ static int fuse_readpages(struct file *file, struct address_space *mapping, if (is_bad_inode(inode)) goto out; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_readpages"); + data.file = file; data.inode = inode; data.req = fuse_get_req(fc); @@ -685,6 +700,8 @@ static int fuse_buffered_write(struct file *file, struct inode *inode, if (is_bad_inode(inode)) return -EIO; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_buffered_write"); + /* * Make sure writepages on the 
same page are not mixed up with * plain writes. @@ -839,6 +856,8 @@ static ssize_t fuse_perform_write(struct file *file, struct fuse_req *req; ssize_t count; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_perform_write"); + req = fuse_get_req(fc); if (IS_ERR(req)) { err = PTR_ERR(req); @@ -973,6 +992,8 @@ static ssize_t fuse_direct_io(struct file *file, const char __user *buf, if (is_bad_inode(inode)) return -EIO; + FUSE_MIGHT_FREEZE(file->f_mapping->host->i_sb, "fuse_direct_io"); + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -1330,6 +1351,8 @@ static int fuse_getlk(struct file *file, struct file_lock *fl) struct fuse_lk_out outarg; int err; + FUSE_MIGHT_FREEZE(file->f_mapping->host->i_sb, "fuse_getlk"); + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -1365,6 +1388,8 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock) if (fl->fl_flags & FL_CLOSE) return 0; + FUSE_MIGHT_FREEZE(file->f_mapping->host->i_sb, "fuse_setlk"); + req = fuse_get_req(fc); if (IS_ERR(req)) return PTR_ERR(req); @@ -1431,6 +1456,8 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block) if (!inode->i_sb->s_bdev || fc->no_bmap) return 0; + FUSE_MIGHT_FREEZE(inode->i_sb, "fuse_bmap"); + req = fuse_get_req(fc); if (IS_ERR(req)) return 0; diff --git a/fs/fuse/fuse.h b/fs/fuse/fuse.h new file mode 100644 index 0000000..170e49a --- /dev/null +++ b/fs/fuse/fuse.h @@ -0,0 +1,13 @@ +#define FUSE_MIGHT_FREEZE(superblock, desc) \ +do { \ + int printed = 0; \ + while (superblock->s_frozen != SB_UNFROZEN) { \ + if (!printed) { \ + printk(KERN_INFO "%d frozen in " desc ".\n", \ + current->pid); \ + printed = 1; \ + } \ + try_to_freeze(); \ + yield(); \ + } \ +} while (0) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 459b73d..f4dfbf5 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -925,7 +925,7 @@ static int fuse_get_sb(struct file_system_type *fs_type, static struct file_system_type fuse_fs_type = { .owner = THIS_MODULE, .name = "fuse", - .fs_flags = FS_HAS_SUBTYPE, + .fs_flags = FS_HAS_SUBTYPE | FS_IS_FUSE, .get_sb = fuse_get_sb, .kill_sb = kill_anon_super, }; @@ -944,7 +944,7 @@ static struct file_system_type fuseblk_fs_type = { .name = "fuseblk", .get_sb = fuse_get_sb_blk, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE, + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_IS_FUSE, }; static inline int register_fuseblk(void) diff --git a/fs/namei.c b/fs/namei.c index bbc15c2..11271b2 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2210,6 +2210,8 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry) if (!dir->i_op->unlink) return -EPERM; + vfs_check_frozen(dir->i_sb, SB_FREEZE_WRITE); + DQUOT_INIT(dir); mutex_lock(&dentry->d_inode->i_mutex); diff --git a/fs/super.c b/fs/super.c index 6ce5014..5030f0f 100644 --- a/fs/super.c +++ b/fs/super.c @@ -44,6 +44,8 @@ LIST_HEAD(super_blocks); +EXPORT_SYMBOL_GPL(super_blocks); + DEFINE_SPINLOCK(sb_lock); /** diff --git a/include/linux/Kbuild b/include/linux/Kbuild index 106c3ba..17a95d2 100644 --- a/include/linux/Kbuild +++ b/include/linux/Kbuild @@ -208,6 +208,7 @@ unifdef-y += filter.h unifdef-y += flat.h unifdef-y += futex.h unifdef-y += fs.h +unifdef-y += freezer.h unifdef-y += gameport.h unifdef-y += generic_serial.h unifdef-y += hayesesp.h diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index bd7ac79..09044e6 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -172,6 +172,11 @@ wait_queue_head_t *bh_waitq_head(struct 
buffer_head *bh); int fsync_bdev(struct block_device *); struct super_block *freeze_bdev(struct block_device *); int thaw_bdev(struct block_device *, struct super_block *); +#define FS_FREEZER_FUSE 1 +#define FS_FREEZER_NORMAL 2 +#define FS_FREEZER_ALL (FS_FREEZER_FUSE | FS_FREEZER_NORMAL) +void freeze_filesystems(int which); +void thaw_filesystems(int which); int fsync_super(struct super_block *); int fsync_no_super(struct block_device *); struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block, diff --git a/include/linux/freezer.h b/include/linux/freezer.h index 5a361f8..c775cd1 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -121,6 +121,23 @@ static inline void set_freezable(void) current->flags &= ~PF_NOFREEZE; } +#ifdef CONFIG_PM_SLEEP +extern int freezer_state; +#define FREEZER_OFF 0 +#define FREEZER_FILESYSTEMS_FROZEN 1 +#define FREEZER_USERSPACE_FROZEN 2 +#define FREEZER_FULLY_ON 3 + +static inline int freezer_is_on(void) +{ + return freezer_state == FREEZER_FULLY_ON; +} +#else +static inline int freezer_is_on(void) { return 0; } +#endif + +extern void thaw_kernel_threads(void); + /* * Tell the freezer that the current task should be frozen by it and that it * should send a fake signal to the task to freeze it. @@ -172,6 +189,8 @@ static inline int freeze_processes(void) { BUG(); return 0; } static inline void thaw_processes(void) {} static inline int try_to_freeze(void) { return 0; } +static inline int freezer_is_on(void) { return 0; } +static inline void thaw_kernel_threads(void) { } static inline void freezer_do_not_count(void) {} static inline void freezer_count(void) {} diff --git a/include/linux/fs.h b/include/linux/fs.h index 92734c0..249df0f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -8,6 +8,7 @@ #include #include +#include /* * It's silly to have NR_OPEN bigger than NR_FILE, but you can change @@ -109,6 +110,7 @@ struct inodes_stat_t { #define FS_REQUIRES_DEV 1 #define FS_BINARY_MOUNTDATA 2 #define FS_HAS_SUBTYPE 4 +#define FS_IS_FUSE 8 /* Fuse filesystem - bdev freeze these too */ #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. @@ -141,6 +143,7 @@ struct inodes_stat_t { #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. 
*/ #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ #define MS_I_VERSION (1<<23) /* Update inode I_version field */ +#define MS_FROZEN (1<<24) /* Frozen by freeze_filesystems() */ #define MS_ACTIVE (1<<30) #define MS_NOUSER (1<<31) @@ -167,6 +170,8 @@ struct inodes_stat_t { #define S_NOCMTIME 128 /* Do not update file c/mtime */ #define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */ #define S_PRIVATE 512 /* Inode is fs-internal */ +#define S_ATOMIC_COPY 1024 /* Pages mapped with this inode need to be + atomically copied (gem) */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1216,8 +1221,11 @@ enum { SB_FREEZE_TRANS = 2, }; -#define vfs_check_frozen(sb, level) \ - wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level))) +#define vfs_check_frozen(sb, level) do { \ + freezer_do_not_count(); \ + wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level))); \ + freezer_count(); \ +} while (0) #define get_fs_excl() atomic_inc(¤t->fs_excl) #define put_fs_excl() atomic_dec(¤t->fs_excl) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3daa05f..8497a33 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -104,6 +104,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ +#define VM_ATOMIC_COPY 0x40000000 /* This VM should be atomically copied by TuxOnIce */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS @@ -1305,6 +1306,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, unsigned long lru_pages); +void drop_pagecache(void); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 51b09a1..a35eb65 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -24,6 +24,8 @@ /* leave room for NETLINK_DM (DM Events) */ #define NETLINK_SCSITRANSPORT 18 /* SCSI Transports */ #define NETLINK_ECRYPTFS 19 +#define NETLINK_TOI_USERUI 20 /* TuxOnIce's userui */ +#define NETLINK_TOI_USM 21 /* Userspace storage manager */ #define MAX_LINKS 32 diff --git a/include/linux/suspend.h b/include/linux/suspend.h index c7d9bb1..8e7de98 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -295,4 +295,70 @@ static inline void register_nosave_region_late(unsigned long b, unsigned long e) extern struct mutex pm_mutex; +enum { + TOI_CAN_HIBERNATE, + TOI_CAN_RESUME, + TOI_RESUME_DEVICE_OK, + TOI_NORESUME_SPECIFIED, + TOI_SANITY_CHECK_PROMPT, + TOI_CONTINUE_REQ, + TOI_RESUMED_BEFORE, + TOI_BOOT_TIME, + TOI_NOW_RESUMING, + TOI_IGNORE_LOGLEVEL, + TOI_TRYING_TO_RESUME, + TOI_LOADING_ALT_IMAGE, + TOI_STOP_RESUME, + TOI_IO_STOPPED, + TOI_NOTIFIERS_PREPARE, + TOI_CLUSTER_MODE, + TOI_BOOT_KERNEL, +}; + +#ifdef CONFIG_TOI + +/* Used in init dir files */ +extern unsigned long toi_state; +#define set_toi_state(bit) (set_bit(bit, &toi_state)) +#define clear_toi_state(bit) (clear_bit(bit, &toi_state)) +#define test_toi_state(bit) (test_bit(bit, &toi_state)) +extern int toi_running; + +#define test_action_state(bit) (test_bit(bit, &toi_bkd.toi_action)) +extern int toi_try_hibernate(void); + +#else /* !CONFIG_TOI */ + +#define toi_state (0) +#define set_toi_state(bit) do { 
} while (0) +#define clear_toi_state(bit) do { } while (0) +#define test_toi_state(bit) (0) +#define toi_running (0) + +static inline int toi_try_hibernate(void) { return 0; } +#define test_action_state(bit) (0) + +#endif /* CONFIG_TOI */ + +#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_TOI +extern void toi_try_resume(void); +#else +#define toi_try_resume() do { } while (0) +#endif + +extern int resume_attempted; +extern int software_resume(void); + +static inline void check_resume_attempted(void) +{ + if (resume_attempted) + return; + + software_resume(); +} +#else +#define check_resume_attempted() do { } while (0) +#define resume_attempted (0) +#endif #endif /* _LINUX_SUSPEND_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index d302155..8ef7de2 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -168,6 +168,7 @@ struct swap_list_t { extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; extern unsigned int nr_free_buffer_pages(void); +extern unsigned int nr_unallocated_buffer_pages(void); extern unsigned int nr_free_pagecache_pages(void); /* Definition of global_page_state not available yet */ diff --git a/init/do_mounts.c b/init/do_mounts.c index 8d4ff5a..df19830 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -142,6 +142,7 @@ fail: done: return res; } +EXPORT_SYMBOL_GPL(name_to_dev_t); static int __init root_dev_setup(char *line) { @@ -411,6 +412,8 @@ void __init prepare_namespace(void) if (is_floppy && rd_doload && rd_load_disk(0)) ROOT_DEV = Root_RAM0; + check_resume_attempted(); + mount_root(); out: sys_mount(".", "/", NULL, MS_MOVE, NULL); diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c index 614241b..f3ea292 100644 --- a/init/do_mounts_initrd.c +++ b/init/do_mounts_initrd.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include "do_mounts.h" @@ -68,6 +69,11 @@ static void __init handle_initrd(void) current->flags &= ~PF_FREEZER_SKIP; + if (!resume_attempted) + printk(KERN_ERR "TuxOnIce: No attempt was made to resume from " + "any image that might exist.\n"); + clear_toi_state(TOI_BOOT_TIME); + /* move initrd to rootfs' /old */ sys_fchdir(old_fd); sys_mount("/", ".", NULL, MS_MOVE, NULL); diff --git a/init/main.c b/init/main.c index 83697e1..3485b50 100644 --- a/init/main.c +++ b/init/main.c @@ -115,6 +115,7 @@ extern void softirq_init(void); char __initdata boot_command_line[COMMAND_LINE_SIZE]; /* Untouched saved command line (eg. 
for /proc) */ char *saved_command_line; +EXPORT_SYMBOL_GPL(saved_command_line); /* Command line for parameter parsing */ static char *static_command_line; diff --git a/kernel/cpu.c b/kernel/cpu.c index 79e40f0..eb36db0 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -415,6 +415,7 @@ int disable_nonboot_cpus(void) stop_machine_destroy(); return error; } +EXPORT_SYMBOL_GPL(disable_nonboot_cpus); void __ref enable_nonboot_cpus(void) { @@ -439,6 +440,7 @@ void __ref enable_nonboot_cpus(void) out: cpu_maps_update_done(); } +EXPORT_SYMBOL_GPL(enable_nonboot_cpus); static int alloc_frozen_cpus(void) { diff --git a/kernel/fork.c b/kernel/fork.c index 4854c2c..3bfddab 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -80,6 +80,7 @@ int max_threads; /* tunable limit on nr_threads */ DEFINE_PER_CPU(unsigned long, process_counts) = 0; __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +EXPORT_SYMBOL_GPL(tasklist_lock); DEFINE_TRACE(sched_process_fork); diff --git a/kernel/kmod.c b/kernel/kmod.c index a27a5f6..8ed006a 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -317,6 +317,7 @@ int usermodehelper_disable(void) usermodehelper_disabled = 0; return -EAGAIN; } +EXPORT_SYMBOL_GPL(usermodehelper_disable); /** * usermodehelper_enable - allow new helpers to be started again @@ -325,6 +326,7 @@ void usermodehelper_enable(void) { usermodehelper_disabled = 0; } +EXPORT_SYMBOL_GPL(usermodehelper_enable); static void helper_lock(void) { diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 23bd4da..2520b0e 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -38,6 +38,13 @@ config CAN_PM_TRACE def_bool y depends on PM_DEBUG && PM_SLEEP && EXPERIMENTAL +config FS_FREEZER_DEBUG + bool "Filesystem freezer debugging" + depends on PM_DEBUG + default n + ---help--- + This option enables debugging of the filesystem freezing code. + config PM_TRACE bool help @@ -179,6 +186,256 @@ config PM_STD_PARTITION suspended image to. It will simply pick the first available swap device. +menuconfig TOI_CORE + tristate "Enhanced Hibernation (TuxOnIce)" + depends on HIBERNATION + default y + ---help--- + TuxOnIce is the 'new and improved' suspend support. + + See the TuxOnIce home page (tuxonice.net) + for FAQs, HOWTOs and other documentation. + + comment "Image Storage (you need at least one allocator)" + depends on TOI_CORE + + config TOI_FILE + tristate "File Allocator" + depends on TOI_CORE + default y + ---help--- + This option enables support for storing an image in a + simple file. This should be possible, but we're still + testing it. + + config TOI_SWAP + tristate "Swap Allocator" + depends on TOI_CORE && SWAP + default y + ---help--- + This option enables support for storing an image in your + swap space. + + comment "General Options" + depends on TOI_CORE + + config TOI_DEFAULT_PRE_HIBERNATE + string "Default pre-hibernate command" + depends on TOI_CORE + ---help--- + This entry allows you to specify a command to be run prior + to starting a hibernation cycle. If this command returns + a non-zero result code, hibernating will be aborted. If + you're starting hibernation via the hibernate script, + this value should probably be blank. + + config TOI_DEFAULT_POST_HIBERNATE + string "Default post-resume command" + depends on TOI_CORE + ---help--- + This entry allows you to specify a command to be run after + completing a hibernation cycle. The return code of this + command is ignored. If you're starting hibernation via the + hibernate script, this value should probably be blank. 
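(Editor's note: the following is an illustrative sketch, not part of this
patch, of how the pre-hibernate command configured above might be invoked.
It assumes the CONFIG_TOI_DEFAULT_PRE_HIBERNATE string generated by Kconfig
and the kernel's usermode-helper API from linux/kmod.h; the helper name is
made up.)

	#include <linux/kmod.h>

	/* Illustrative: run the configured pre-hibernate command and
	 * abort the cycle on a non-zero exit status. */
	static int toi_run_pre_hibernate_command(void)
	{
		char *argv[] = { "/bin/sh", "-c",
				 CONFIG_TOI_DEFAULT_PRE_HIBERNATE, NULL };
		char *envp[] = { "HOME=/",
				 "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

		/* An empty command means nothing to run. */
		if (!*CONFIG_TOI_DEFAULT_PRE_HIBERNATE)
			return 0;

		return call_usermodehelper(argv[0], argv, envp,
					   UMH_WAIT_PROC);
	}

(The non-zero-result-aborts convention mirrors the help text above.)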
+
+	config TOI_CRYPTO
+		tristate "Compression support"
+		depends on TOI_CORE && CRYPTO
+		default y
+		---help---
+		  This option adds support for using cryptoapi compression
+		  algorithms. Compression is particularly useful as the LZF
+		  support that comes with the TuxOnIce patch can double your
+		  suspend and resume speed.
+
+		  You probably want this, so say Y here.
+
+	comment "No compression support available without Cryptoapi support."
+		depends on TOI_CORE && !CRYPTO
+
+	config TOI_USERUI
+		tristate "Userspace User Interface support"
+		depends on TOI_CORE && NET && (VT || SERIAL_CONSOLE)
+		default y
+		---help---
+		  This option enables support for a userspace-based user
+		  interface to TuxOnIce, which allows you to have a nice display
+		  while suspending and resuming, and also enables features such
+		  as pressing escape to cancel a cycle or interactive debugging.
+
+	config TOI_USERUI_DEFAULT_PATH
+		string "Default userui program location"
+		default "/usr/local/sbin/tuxoniceui_text"
+		depends on TOI_USERUI
+		---help---
+		  This entry allows you to specify a default path to the userui
+		  binary.
+
+	config TOI_KEEP_IMAGE
+		bool "Allow Keep Image Mode"
+		depends on TOI_CORE
+		---help---
+		  This option allows you to keep an image and reuse it. It is
+		  intended __ONLY__ for use with systems where all filesystems
+		  are mounted read-only (kiosks, for example). To use it, compile
+		  this option in and boot normally. Set the KEEP_IMAGE flag in
+		  /sys/power/tuxonice and suspend. When you resume, the image
+		  will not be removed. You will be unable to turn off swap
+		  partitions (assuming you are using the swap allocator), but
+		  future suspends simply do a power-down. The image can be
+		  updated using the kernel command line parameter suspend_act=
+		  to turn off the keep image bit. Keep image mode is a little
+		  less user-friendly on purpose - it should not be used without
+		  thought!
+
+	config TOI_REPLACE_SWSUSP
+		bool "Replace swsusp by default"
+		default y
+		depends on TOI_CORE
+		---help---
+		  TuxOnIce can replace swsusp. This option makes that the
+		  default state, requiring you to echo 0 >
+		  /sys/power/tuxonice/replace_swsusp if you want to use the
+		  vanilla kernel functionality. Note that your initrd/ramfs will
+		  need to do this before trying to resume, too. With overriding
+		  swsusp enabled, echoing disk to /sys/power/state will start a
+		  TuxOnIce cycle. If resume= doesn't specify an allocator and
+		  both the swap and file allocators are compiled in, the swap
+		  allocator will be used by default.
+
+	config TOI_IGNORE_LATE_INITCALL
+		bool "Wait for initrd/ramfs to run, by default"
+		default n
+		depends on TOI_CORE
+		---help---
+		  When booting, TuxOnIce can check for an image and start to
+		  resume prior to any initrd/ramfs running (via a late
+		  initcall).
+
+		  If you don't have an initrd/ramfs, this early resume is what
+		  you want to happen - otherwise you won't be able to safely
+		  resume - so you should set this option to 'No'.
+
+		  If, however, you want your initrd/ramfs to run anyway before
+		  resuming, you need to tell TuxOnIce to ignore that earlier
+		  opportunity to resume. This can be done either by using this
+		  compile-time option, or by overriding this option with the
+		  boot-time parameter toi_initramfs_resume_only=1.
+
+		  Note that if TuxOnIce can't resume at the earlier opportunity,
+		  the value of this option won't matter - the initramfs/initrd
+		  (if any) will run anyway.
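(Editor's note: a minimal sketch of the boot-time behaviour described in the
help text above. This is not code from the patch; only toi_try_resume() and
the toi_initramfs_resume_only= parameter are named by it, and the function
and variable names below are otherwise invented.)

	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/suspend.h>

	static int toi_initramfs_resume_only;

	/* Attempt resume via a late initcall unless the user asked to
	 * leave that to the initramfs/initrd. */
	static int __init toi_late_resume(void)
	{
		if (!toi_initramfs_resume_only)
			toi_try_resume();
		return 0;
	}
	late_initcall(toi_late_resume);

	static int __init set_resume_only(char *str)
	{
		toi_initramfs_resume_only = simple_strtol(str, NULL, 0);
		return 1;
	}
	__setup("toi_initramfs_resume_only=", set_resume_only);

(In the patch itself the deferral is a compile-time default, which this boot
parameter can override in either direction.)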
+
+	menuconfig TOI_CLUSTER
+		tristate "Cluster support"
+		default n
+		depends on TOI_CORE && NET && BROKEN
+		---help---
+		  Support for linking multiple machines in a cluster so that
+		  they suspend and resume together.
+
+	config TOI_DEFAULT_CLUSTER_INTERFACE
+		string "Default cluster interface"
+		depends on TOI_CLUSTER
+		---help---
+		  The default interface on which to communicate with other
+		  nodes in the cluster.
+
+		  If no value is set here, cluster support will be disabled by
+		  default.
+
+	config TOI_DEFAULT_CLUSTER_KEY
+		string "Default cluster key"
+		default "Default"
+		depends on TOI_CLUSTER
+		---help---
+		  The default key used by this node. All nodes in the same
+		  cluster have the same key. Multiple clusters may coexist on
+		  the same LAN by using different values for this key.
+
+	config TOI_CLUSTER_IMAGE_TIMEOUT
+		int "Timeout when checking for image"
+		default 15
+		depends on TOI_CLUSTER
+		---help---
+		  Timeout (seconds) before continuing to boot when waiting to
+		  see whether other nodes might have an image. Set to -1 to
+		  wait indefinitely. If WAIT_UNTIL_NODES is non-zero, we might
+		  continue booting sooner than this timeout.
+
+	config TOI_CLUSTER_WAIT_UNTIL_NODES
+		int "Nodes without image before continuing"
+		default 0
+		depends on TOI_CLUSTER
+		---help---
+		  When booting and no image is found, we wait to see if other
+		  nodes have an image before continuing to boot. This value
+		  lets us continue after seeing a certain number of nodes
+		  without an image, instead of continuing to wait for the
+		  timeout. Set to 0 to only use the timeout.
+
+	config TOI_DEFAULT_CLUSTER_PRE_HIBERNATE
+		string "Default pre-hibernate script"
+		depends on TOI_CLUSTER
+		---help---
+		  The default script to be called when starting to hibernate.
+
+	config TOI_DEFAULT_CLUSTER_POST_HIBERNATE
+		string "Default post-hibernate script"
+		depends on TOI_CLUSTER
+		---help---
+		  The default script to be called after resuming from
+		  hibernation.
+
+	config TOI_DEFAULT_WAIT
+		int "Default waiting time for emergency boot messages"
+		default "25"
+		range -1 32768
+		depends on TOI_CORE
+		help
+		  TuxOnIce can display warnings very early in the process of
+		  resuming, if (for example) it appears that you have booted a
+		  kernel that doesn't match an image on disk. It can then give
+		  you the opportunity to either continue booting that kernel,
+		  or reboot the machine. This option can be used to control how
+		  long to wait in such circumstances. -1 means wait forever. 0
+		  means don't wait at all (do the default action, which will
+		  generally be to continue booting and remove the image).
+		  Values of 1 or more indicate a number of seconds (up to
+		  32768) to wait before doing the default.
+
+	config TOI_DEFAULT_EXTRA_PAGES_ALLOWANCE
+		int "Default extra pages allowance"
+		default "2000"
+		range 500 32768
+		depends on TOI_CORE
+		help
+		  This value controls the default for the allowance TuxOnIce
+		  makes for drivers to allocate extra memory during the atomic
+		  copy. The default value of 2000 will be okay in most cases.
+		  If you are using DRI, the easiest way to find what value to
+		  use is to try to hibernate and look at how many pages were
+		  actually needed in the sysfs entry
+		  /sys/power/tuxonice/debug_info (first number on the last
+		  line), adding a little extra because the value is not always
+		  the same.
+
+	config TOI_CHECKSUM
+		bool "Checksum pageset2"
+		default n
+		depends on TOI_CORE
+		select CRYPTO
+		select CRYPTO_ALGAPI
+		select CRYPTO_MD4
+		---help---
+		  Adds support for checksumming pageset2 pages, to ensure you
+		  really get an atomic copy.
Since some filesystems (XFS especially) change metadata even + when there's no other activity, we need this to check for pages that have + been changed while we were saving the page cache. If your debugging output + always says no pages were resaved, you may be able to safely disable this + option. + +config TOI + bool + depends on TOI_CORE!=n + default y + +config TOI_EXPORTS + bool + depends on TOI_SWAP=m || TOI_FILE=m || \ + TOI_CRYPTO=m || TOI_CLUSTER=m || \ + TOI_USERUI=m || TOI_CORE=m + default y + config APM_EMULATION tristate "Advanced Power Management Emulation" depends on PM && SYS_SUPPORTS_APM_EMULATION diff --git a/kernel/power/Makefile b/kernel/power/Makefile index 720ea4f..c1681d0 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -3,6 +3,36 @@ ifeq ($(CONFIG_PM_DEBUG),y) EXTRA_CFLAGS += -DDEBUG endif +obj-y := main.o + +tuxonice_core-objs := tuxonice_modules.o tuxonice_sysfs.o tuxonice_highlevel.o \ + tuxonice_io.o tuxonice_pagedir.o tuxonice_prepare_image.o \ + tuxonice_extent.o tuxonice_pageflags.o tuxonice_ui.o \ + tuxonice_power_off.o tuxonice_atomic_copy.o + +obj-$(CONFIG_TOI) += tuxonice_builtin.o + +ifdef CONFIG_PM_DEBUG +tuxonice_core-objs += tuxonice_alloc.o +endif + +ifdef CONFIG_TOI_CHECKSUM +tuxonice_core-objs += tuxonice_checksum.o +endif + +ifdef CONFIG_NET +tuxonice_core-objs += tuxonice_storage.o tuxonice_netlink.o +endif + +obj-$(CONFIG_TOI_CORE) += tuxonice_core.o +obj-$(CONFIG_TOI_CRYPTO) += tuxonice_compress.o + +obj-$(CONFIG_TOI_SWAP) += tuxonice_block_io.o tuxonice_swap.o +obj-$(CONFIG_TOI_FILE) += tuxonice_block_io.o tuxonice_file.o +obj-$(CONFIG_TOI_CLUSTER) += tuxonice_cluster.o + +obj-$(CONFIG_TOI_USERUI) += tuxonice_userui.o + obj-$(CONFIG_PM) += main.o obj-$(CONFIG_PM_SLEEP) += console.o obj-$(CONFIG_FREEZER) += process.o diff --git a/kernel/power/disk.c b/kernel/power/disk.c index 4a4a206..4b8067e 100644 --- a/kernel/power/disk.c +++ b/kernel/power/disk.c @@ -24,10 +24,12 @@ #include #include "power.h" - +#include "tuxonice.h" static int noresume = 0; -static char resume_file[256] = CONFIG_PM_STD_PARTITION; +char resume_file[256] = CONFIG_PM_STD_PARTITION; +EXPORT_SYMBOL_GPL(resume_file); + dev_t swsusp_resume_device; sector_t swsusp_resume_block; @@ -113,55 +115,60 @@ static int hibernation_test(int level) { return 0; } * hibernation */ -static int platform_begin(int platform_mode) +int platform_begin(int platform_mode) { return (platform_mode && hibernation_ops) ? hibernation_ops->begin() : 0; } +EXPORT_SYMBOL_GPL(platform_begin); /** * platform_end - tell the platform driver that we've entered the * working state */ -static void platform_end(int platform_mode) +void platform_end(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->end(); } +EXPORT_SYMBOL_GPL(platform_end); /** * platform_pre_snapshot - prepare the machine for hibernation using the * platform driver if so configured and return an error code if it fails */ -static int platform_pre_snapshot(int platform_mode) +int platform_pre_snapshot(int platform_mode) { return (platform_mode && hibernation_ops) ? 
hibernation_ops->pre_snapshot() : 0; } +EXPORT_SYMBOL_GPL(platform_pre_snapshot); /** * platform_leave - prepare the machine for switching to the normal mode * of operation using the platform driver (called with interrupts disabled) */ -static void platform_leave(int platform_mode) +void platform_leave(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->leave(); } +EXPORT_SYMBOL_GPL(platform_leave); /** * platform_finish - switch the machine to the normal mode of operation * using the platform driver (must be called after platform_prepare()) */ -static void platform_finish(int platform_mode) +void platform_finish(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->finish(); } +EXPORT_SYMBOL_GPL(platform_finish); /** * platform_pre_restore - prepare the platform for the restoration from a @@ -169,11 +176,12 @@ static void platform_finish(int platform_mode) * called, platform_restore_cleanup() must be called. */ -static int platform_pre_restore(int platform_mode) +int platform_pre_restore(int platform_mode) { return (platform_mode && hibernation_ops) ? hibernation_ops->pre_restore() : 0; } +EXPORT_SYMBOL_GPL(platform_pre_restore); /** * platform_restore_cleanup - switch the platform to the normal mode of @@ -182,22 +190,24 @@ static int platform_pre_restore(int platform_mode) * regardless of the result of platform_pre_restore(). */ -static void platform_restore_cleanup(int platform_mode) +void platform_restore_cleanup(int platform_mode) { if (platform_mode && hibernation_ops) hibernation_ops->restore_cleanup(); } +EXPORT_SYMBOL_GPL(platform_restore_cleanup); /** * platform_recover - recover the platform from a failure to suspend * devices. */ -static void platform_recover(int platform_mode) +void platform_recover(int platform_mode) { if (platform_mode && hibernation_ops && hibernation_ops->recover) hibernation_ops->recover(); } +EXPORT_SYMBOL_GPL(platform_recover); /** * create_image - freeze devices that need to be frozen with interrupts @@ -407,6 +417,7 @@ int hibernation_restore(int platform_mode) pm_restore_console(); return error; } +EXPORT_SYMBOL_GPL(hibernation_platform_enter); /** * hibernation_platform_enter - enter the hibernation state using the @@ -523,6 +534,9 @@ int hibernate(void) { int error; + if (test_action_state(TOI_REPLACE_SWSUSP)) + return toi_try_hibernate(); + mutex_lock(&pm_mutex); /* The snapshot device should not be opened while we're running */ if (!atomic_add_unless(&snapshot_device_available, -1, 0)) { @@ -600,10 +614,19 @@ int hibernate(void) * */ -static int software_resume(void) +int software_resume(void) { int error; unsigned int flags; + resume_attempted = 1; + + /* + * We can't know (until an image header - if any - is loaded), whether + * we did override swsusp. We therefore ensure that both are tried. + */ + if (test_action_state(TOI_REPLACE_SWSUSP)) + printk(KERN_INFO "Replacing swsusp.\n"); + toi_try_resume(); /* * If the user said "noresume".. bail out early. 
@@ -918,6 +941,7 @@ static int __init resume_offset_setup(char *str) static int __init noresume_setup(char *str) { noresume = 1; + set_toi_state(TOI_NORESUME_SPECIFIED); return 1; } diff --git a/kernel/power/main.c b/kernel/power/main.c index c9632f8..91aa900 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -26,6 +26,7 @@ #include "power.h" DEFINE_MUTEX(pm_mutex); +EXPORT_SYMBOL_GPL(pm_mutex); unsigned int pm_flags; EXPORT_SYMBOL(pm_flags); @@ -34,7 +35,8 @@ EXPORT_SYMBOL(pm_flags); /* Routines for PM-transition notifications */ -static BLOCKING_NOTIFIER_HEAD(pm_chain_head); +BLOCKING_NOTIFIER_HEAD(pm_chain_head); +EXPORT_SYMBOL_GPL(pm_chain_head); int register_pm_notifier(struct notifier_block *nb) { @@ -204,6 +206,7 @@ void suspend_set_ops(struct platform_suspend_ops *ops) suspend_ops = ops; mutex_unlock(&pm_mutex); } +EXPORT_SYMBOL_GPL(pm_notifier_call_chain); /** * suspend_valid_only_mem - generic memory-only valid callback @@ -449,6 +452,7 @@ static int enter_state(suspend_state_t state) mutex_unlock(&pm_mutex); return error; } +EXPORT_SYMBOL_GPL(suspend_devices_and_enter); /** @@ -471,6 +475,7 @@ EXPORT_SYMBOL(pm_suspend); #endif /* CONFIG_SUSPEND */ struct kobject *power_kobj; +EXPORT_SYMBOL_GPL(power_kobj); /** * state - control system power state. diff --git a/kernel/power/power.h b/kernel/power/power.h index 46b5ec7..f946b3b 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -1,3 +1,10 @@ +/* + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + */ + +#ifndef KERNEL_POWER_POWER_H +#define KERNEL_POWER_POWER_H + #include #include #include @@ -31,8 +38,12 @@ static inline char *check_image_kernel(struct swsusp_info *info) return arch_hibernation_header_restore(info) ? "architecture specific data" : NULL; } +#else +extern char *check_image_kernel(struct swsusp_info *info); #endif /* CONFIG_ARCH_HIBERNATION_HEADER */ +extern int init_header(struct swsusp_info *info); +extern char resume_file[256]; /* * Keep some memory free so that I/O operations can succeed without paging * [Might this be more than 4 MB?] @@ -49,6 +60,7 @@ static inline char *check_image_kernel(struct swsusp_info *info) extern int hibernation_snapshot(int platform_mode); extern int hibernation_restore(int platform_mode); extern int hibernation_platform_enter(void); +extern void platform_recover(int platform_mode); #endif extern int pfn_is_nosave(unsigned long); @@ -63,6 +75,8 @@ static struct kobj_attribute _name##_attr = { \ .store = _name##_store, \ } +extern struct pbe *restore_pblist; + /* Preferred image size in bytes (default 500 MB) */ extern unsigned long image_size; extern int in_suspend; @@ -223,3 +237,89 @@ static inline void suspend_thaw_processes(void) { } #endif + +extern struct page *saveable_page(struct zone *z, unsigned long p); +#ifdef CONFIG_HIGHMEM +extern struct page *saveable_highmem_page(struct zone *z, unsigned long p); +#else +static +inline struct page *saveable_highmem_page(struct zone *z, unsigned long p) +{ + return NULL; +} +#endif + +#define PBES_PER_PAGE (PAGE_SIZE / sizeof(struct pbe)) +extern struct list_head nosave_regions; + +/** + * This structure represents a range of page frames the contents of which + * should not be saved during the suspend. 
+ */ + +struct nosave_region { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; +}; + +#ifndef PHYS_PFN_OFFSET +#define PHYS_PFN_OFFSET 0 +#endif + +#define ZONE_START(thiszone) ((thiszone)->zone_start_pfn - PHYS_PFN_OFFSET) + +#define BM_END_OF_MAP (~0UL) + +#define BM_BITS_PER_BLOCK (PAGE_SIZE << 3) + +struct bm_block { + struct list_head hook; /* hook into a list of bitmap blocks */ + unsigned long start_pfn; /* pfn represented by the first bit */ + unsigned long end_pfn; /* pfn represented by the last bit plus 1 */ + unsigned long *data; /* bitmap representing pages */ +}; + +/* struct bm_position is used for browsing memory bitmaps */ + +struct bm_position { + struct bm_block *block; + int bit; +}; + +struct memory_bitmap { + struct list_head blocks; /* list of bitmap blocks */ + struct linked_page *p_list; /* list of pages used to store zone + * bitmap objects and bitmap block + * objects + */ + struct bm_position cur; /* most recently used bit position */ + struct bm_position iter; /* most recently used bit position + * when iterating over a bitmap. + */ +}; + + +extern int memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, + int safe_needed); +extern void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); +extern void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn); +extern void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn); +extern int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn); +extern unsigned long memory_bm_next_pfn(struct memory_bitmap *bm); +extern void memory_bm_position_reset(struct memory_bitmap *bm); +extern void memory_bm_clear(struct memory_bitmap *bm); +extern void memory_bm_copy(struct memory_bitmap *source, + struct memory_bitmap *dest); +extern void memory_bm_dup(struct memory_bitmap *source, + struct memory_bitmap *dest); + +#ifdef CONFIG_TOI +struct toi_module_ops; +extern int memory_bm_read(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)); +extern int memory_bm_write(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)); +#endif + +#endif diff --git a/kernel/power/process.c b/kernel/power/process.c index ca63401..ac19ff7 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -13,6 +13,10 @@ #include #include #include +#include + +int freezer_state; +EXPORT_SYMBOL_GPL(freezer_state); /* * Timeout for stopping processes @@ -86,7 +90,8 @@ static int try_to_freeze_tasks(bool sig_only) do_each_thread(g, p) { task_lock(p); if (freezing(p) && !freezer_should_skip(p)) - printk(KERN_ERR " %s\n", p->comm); + printk(KERN_ERR " %s (%d) failed to freeze.\n", + p->comm, p->pid); cancel_freezing(p); task_unlock(p); } while_each_thread(g, p); @@ -106,22 +111,31 @@ int freeze_processes(void) { int error; - printk("Freezing user space processes ... "); + printk(KERN_INFO "Stopping fuse filesystems.\n"); + freeze_filesystems(FS_FREEZER_FUSE); + freezer_state = FREEZER_FILESYSTEMS_FROZEN; + printk(KERN_INFO "Freezing user space processes ... "); error = try_to_freeze_tasks(true); if (error) goto Exit; - printk("done.\n"); + printk(KERN_INFO "done.\n"); - printk("Freezing remaining freezable tasks ... "); + sys_sync(); + printk(KERN_INFO "Stopping normal filesystems.\n"); + freeze_filesystems(FS_FREEZER_NORMAL); + freezer_state = FREEZER_USERSPACE_FROZEN; + printk(KERN_INFO "Freezing remaining freezable tasks ... 
"); error = try_to_freeze_tasks(false); if (error) goto Exit; printk("done."); + freezer_state = FREEZER_FULLY_ON; Exit: BUG_ON(in_atomic()); printk("\n"); return error; } +EXPORT_SYMBOL_GPL(freeze_processes); static void thaw_tasks(bool nosig_only) { @@ -145,10 +159,40 @@ static void thaw_tasks(bool nosig_only) void thaw_processes(void) { - printk("Restarting tasks ... "); - thaw_tasks(true); + int old_state = freezer_state; + + if (old_state == FREEZER_OFF) + return; + + /* + * Change state beforehand because thawed tasks might submit I/O + * immediately. + */ + freezer_state = FREEZER_OFF; + + printk(KERN_INFO "Restarting all filesystems ...\n"); + thaw_filesystems(FS_FREEZER_ALL); + + printk(KERN_INFO "Restarting tasks ... "); + + if (old_state == FREEZER_FULLY_ON) + thaw_tasks(true); thaw_tasks(false); schedule(); printk("done.\n"); } +EXPORT_SYMBOL_GPL(thaw_processes); +void thaw_kernel_threads(void) +{ + freezer_state = FREEZER_USERSPACE_FROZEN; + printk(KERN_INFO "Restarting normal filesystems.\n"); + thaw_filesystems(FS_FREEZER_NORMAL); + thaw_tasks(true); +} + +/* + * It's ugly putting this EXPORT down here, but it's necessary so that it + * doesn't matter whether the fs-freezing patch is applied or not. + */ +EXPORT_SYMBOL_GPL(thaw_kernel_threads); diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index f5fc2d7..63959e7 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -34,6 +34,8 @@ #include #include "power.h" +#include "tuxonice_builtin.h" +#include "tuxonice_pagedir.h" static int swsusp_page_is_free(struct page *); static void swsusp_set_page_forbidden(struct page *); @@ -45,6 +47,10 @@ static void swsusp_unset_page_forbidden(struct page *); * directly to their "original" page frames. */ struct pbe *restore_pblist; +EXPORT_SYMBOL_GPL(restore_pblist); + +int resume_attempted; +EXPORT_SYMBOL_GPL(resume_attempted); /* Pointer to an auxiliary buffer (1 page) */ static void *buffer; @@ -87,6 +93,9 @@ static void *get_image_page(gfp_t gfp_mask, int safe_needed) unsigned long get_safe_page(gfp_t gfp_mask) { + if (toi_running) + return toi_get_nonconflicting_page(); + return (unsigned long)get_image_page(gfp_mask, PG_SAFE); } @@ -223,47 +232,24 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size) * the represented memory area. 
*/ -#define BM_END_OF_MAP (~0UL) - -#define BM_BITS_PER_BLOCK (PAGE_SIZE << 3) - -struct bm_block { - struct list_head hook; /* hook into a list of bitmap blocks */ - unsigned long start_pfn; /* pfn represented by the first bit */ - unsigned long end_pfn; /* pfn represented by the last bit plus 1 */ - unsigned long *data; /* bitmap representing pages */ -}; - static inline unsigned long bm_block_bits(struct bm_block *bb) { return bb->end_pfn - bb->start_pfn; } -/* strcut bm_position is used for browsing memory bitmaps */ - -struct bm_position { - struct bm_block *block; - int bit; -}; - -struct memory_bitmap { - struct list_head blocks; /* list of bitmap blocks */ - struct linked_page *p_list; /* list of pages used to store zone - * bitmap objects and bitmap block - * objects - */ - struct bm_position cur; /* most recently used bit position */ -}; - /* Functions that operate on memory bitmaps */ -static void memory_bm_position_reset(struct memory_bitmap *bm) +void memory_bm_position_reset(struct memory_bitmap *bm) { bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook); bm->cur.bit = 0; + + bm->iter.block = list_entry(bm->blocks.next, struct bm_block, hook); + bm->iter.bit = 0; } +EXPORT_SYMBOL_GPL(memory_bm_position_reset); -static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); +void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free); /** * create_bm_block_list - create a list of block bitmap objects @@ -374,7 +360,7 @@ static int create_mem_extents(struct list_head *list, gfp_t gfp_mask) /** * memory_bm_create - allocate memory for a memory bitmap */ -static int +int memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed) { struct chain_allocator ca; @@ -430,11 +416,12 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed) memory_bm_free(bm, PG_UNSAFE_CLEAR); goto Exit; } +EXPORT_SYMBOL_GPL(memory_bm_create); /** * memory_bm_free - free memory occupied by the memory bitmap @bm */ -static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free) +void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free) { struct bm_block *bb; @@ -446,6 +433,7 @@ static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free) INIT_LIST_HEAD(&bm->blocks); } +EXPORT_SYMBOL_GPL(memory_bm_free); /** * memory_bm_find_bit - find the bit in the bitmap @bm that corresponds @@ -484,7 +472,7 @@ static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn, return 0; } -static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn) +void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn) { void *addr; unsigned int bit; @@ -494,8 +482,9 @@ static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn) BUG_ON(error); set_bit(bit, addr); } +EXPORT_SYMBOL_GPL(memory_bm_set_bit); -static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn) +int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn) { void *addr; unsigned int bit; @@ -507,7 +496,7 @@ static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn) return error; } -static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn) +void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn) { void *addr; unsigned int bit; @@ -517,8 +506,9 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn) BUG_ON(error); clear_bit(bit, addr); } +EXPORT_SYMBOL_GPL(memory_bm_clear_bit); -static int 
memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn) +int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn) { void *addr; unsigned int bit; @@ -528,6 +518,7 @@ static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn) BUG_ON(error); return test_bit(bit, addr); } +EXPORT_SYMBOL_GPL(memory_bm_test_bit); static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn) { @@ -546,43 +537,178 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn) * this function. */ -static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm) +unsigned long memory_bm_next_pfn(struct memory_bitmap *bm) { struct bm_block *bb; int bit; - bb = bm->cur.block; + bb = bm->iter.block; do { - bit = bm->cur.bit; + bit = bm->iter.bit; bit = find_next_bit(bb->data, bm_block_bits(bb), bit); if (bit < bm_block_bits(bb)) goto Return_pfn; bb = list_entry(bb->hook.next, struct bm_block, hook); - bm->cur.block = bb; - bm->cur.bit = 0; + bm->iter.block = bb; + bm->iter.bit = 0; } while (&bb->hook != &bm->blocks); memory_bm_position_reset(bm); return BM_END_OF_MAP; Return_pfn: - bm->cur.bit = bit + 1; + bm->iter.bit = bit + 1; return bb->start_pfn + bit; } +EXPORT_SYMBOL_GPL(memory_bm_next_pfn); -/** - * This structure represents a range of page frames the contents of which - * should not be saved during the suspend. - */ +void memory_bm_clear(struct memory_bitmap *bm) +{ + unsigned long pfn; -struct nosave_region { - struct list_head list; - unsigned long start_pfn; - unsigned long end_pfn; -}; + memory_bm_position_reset(bm); + pfn = memory_bm_next_pfn(bm); + while (pfn != BM_END_OF_MAP) { + memory_bm_clear_bit(bm, pfn); + pfn = memory_bm_next_pfn(bm); + } +} +EXPORT_SYMBOL_GPL(memory_bm_clear); + +void memory_bm_copy(struct memory_bitmap *source, struct memory_bitmap *dest) +{ + unsigned long pfn; + + memory_bm_position_reset(source); + pfn = memory_bm_next_pfn(source); + while (pfn != BM_END_OF_MAP) { + memory_bm_set_bit(dest, pfn); + pfn = memory_bm_next_pfn(source); + } +} +EXPORT_SYMBOL_GPL(memory_bm_copy); + +void memory_bm_dup(struct memory_bitmap *source, struct memory_bitmap *dest) +{ + memory_bm_clear(dest); + memory_bm_copy(source, dest); +} +EXPORT_SYMBOL_GPL(memory_bm_dup); + +#ifdef CONFIG_TOI +#define DEFINE_MEMORY_BITMAP(name) \ +struct memory_bitmap *name; \ +EXPORT_SYMBOL_GPL(name) + +DEFINE_MEMORY_BITMAP(pageset1_map); +DEFINE_MEMORY_BITMAP(pageset1_copy_map); +DEFINE_MEMORY_BITMAP(pageset2_map); +DEFINE_MEMORY_BITMAP(page_resave_map); +DEFINE_MEMORY_BITMAP(io_map); +DEFINE_MEMORY_BITMAP(nosave_map); +DEFINE_MEMORY_BITMAP(free_map); -static LIST_HEAD(nosave_regions); +int memory_bm_write(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)) +{ + int result = 0; + unsigned int nr = 0; + struct bm_block *bb; + + if (!bm) + return result; + + list_for_each_entry(bb, &bm->blocks, hook) + nr++; + + result = (*rw_chunk)(WRITE, NULL, (char *) &nr, sizeof(unsigned int)); + if (result) + return result; + + list_for_each_entry(bb, &bm->blocks, hook) { + result = (*rw_chunk)(WRITE, NULL, (char *) &bb->start_pfn, + 2 * sizeof(unsigned long)); + if (result) + return result; + + result = (*rw_chunk)(WRITE, NULL, (char *) bb->data, PAGE_SIZE); + if (result) + return result; + } + + return 0; +} +EXPORT_SYMBOL_GPL(memory_bm_write); + +int memory_bm_read(struct memory_bitmap *bm, int (*rw_chunk) + (int rw, struct toi_module_ops *owner, char *buffer, int buffer_size)) +{ + int 
result = 0; + unsigned int nr, i; + struct bm_block *bb; + + if (!bm) + return result; + + result = memory_bm_create(bm, GFP_KERNEL, 0); + + if (result) + return result; + + result = (*rw_chunk)(READ, NULL, (char *) &nr, sizeof(unsigned int)); + if (result) + goto Free; + + for (i = 0; i < nr; i++) { + unsigned long pfn; + + result = (*rw_chunk)(READ, NULL, (char *) &pfn, + sizeof(unsigned long)); + if (result) + goto Free; + + list_for_each_entry(bb, &bm->blocks, hook) + if (bb->start_pfn == pfn) + break; + + if (&bb->hook == &bm->blocks) { + printk(KERN_ERR + "TuxOnIce: Failed to load memory bitmap.\n"); + result = -EINVAL; + goto Free; + } + + result = (*rw_chunk)(READ, NULL, (char *) &pfn, + sizeof(unsigned long)); + if (result) + goto Free; + + if (pfn != bb->end_pfn) { + printk(KERN_ERR + "TuxOnIce: Failed to load memory bitmap. " + "End PFN doesn't match what was saved.\n"); + result = -EINVAL; + goto Free; + } + + result = (*rw_chunk)(READ, NULL, (char *) bb->data, PAGE_SIZE); + + if (result) + goto Free; + } + + return 0; + +Free: + memory_bm_free(bm, PG_ANY); + return result; +} +EXPORT_SYMBOL_GPL(memory_bm_read); +#endif + +LIST_HEAD(nosave_regions); +EXPORT_SYMBOL_GPL(nosave_regions); /** * register_nosave_region - register a range of page frames the contents @@ -818,7 +944,7 @@ static unsigned int count_free_highmem_pages(void) * We should save the page if it isn't Nosave or NosaveFree, or Reserved, * and it isn't a part of a free chunk of pages. */ -static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) +struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) { struct page *page; @@ -837,6 +963,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn) return page; } +EXPORT_SYMBOL_GPL(saveable_highmem_page); /** * count_highmem_pages - compute the total number of saveable highmem @@ -862,11 +989,6 @@ unsigned int count_highmem_pages(void) } return n; } -#else -static inline void *saveable_highmem_page(struct zone *z, unsigned long p) -{ - return NULL; -} #endif /* CONFIG_HIGHMEM */ /** @@ -877,7 +999,7 @@ static inline void *saveable_highmem_page(struct zone *z, unsigned long p) * of pages statically defined as 'unsaveable', and it isn't a part of * a free chunk of pages. 
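 *
 * (Exported below rather than kept static so that TuxOnIce can reuse the
 * same saveability test when building its pagesets. A minimal sketch of
 * the intended use, assuming the bitmaps defined earlier in this patch:
 *
 *	if (saveable_page(zone, pfn))
 *		memory_bm_set_bit(pageset1_map, pfn);
 *
 * applied over each zone's pfn range.)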
*/ -static struct page *saveable_page(struct zone *zone, unsigned long pfn) +struct page *saveable_page(struct zone *zone, unsigned long pfn) { struct page *page; @@ -899,6 +1021,7 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn) return page; } +EXPORT_SYMBOL_GPL(saveable_page); /** * count_data_pages - compute the total number of saveable non-highmem @@ -1213,6 +1336,9 @@ asmlinkage int swsusp_save(void) { unsigned int nr_pages, nr_highmem; + if (toi_running) + return toi_post_context_save(); + printk(KERN_INFO "PM: Creating hibernation image: \n"); drain_local_pages(NULL); @@ -1253,14 +1379,14 @@ asmlinkage int swsusp_save(void) } #ifndef CONFIG_ARCH_HIBERNATION_HEADER -static int init_header_complete(struct swsusp_info *info) +int init_header_complete(struct swsusp_info *info) { memcpy(&info->uts, init_utsname(), sizeof(struct new_utsname)); info->version_code = LINUX_VERSION_CODE; return 0; } -static char *check_image_kernel(struct swsusp_info *info) +char *check_image_kernel(struct swsusp_info *info) { if (info->version_code != LINUX_VERSION_CODE) return "kernel version"; @@ -1274,6 +1400,7 @@ static char *check_image_kernel(struct swsusp_info *info) return "machine"; return NULL; } +EXPORT_SYMBOL_GPL(check_image_kernel); #endif /* CONFIG_ARCH_HIBERNATION_HEADER */ unsigned long snapshot_get_image_size(void) @@ -1281,7 +1408,7 @@ unsigned long snapshot_get_image_size(void) return nr_copy_pages + nr_meta_pages + 1; } -static int init_header(struct swsusp_info *info) +int init_header(struct swsusp_info *info) { memset(info, 0, sizeof(struct swsusp_info)); info->num_physpages = num_physpages; @@ -1291,6 +1418,7 @@ static int init_header(struct swsusp_info *info) info->size <<= PAGE_SHIFT; return init_header_complete(info); } +EXPORT_SYMBOL_GPL(init_header); /** * pack_pfns - pfns corresponding to the set bits found in the bitmap @bm diff --git a/kernel/power/tuxonice.h b/kernel/power/tuxonice.h new file mode 100644 index 0000000..267368d --- /dev/null +++ b/kernel/power/tuxonice.h @@ -0,0 +1,212 @@ +/* + * kernel/power/tuxonice.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * It contains declarations used throughout swsusp. 
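+ *
+ * Broadly: the toi_boot_kernel_data block shared with the kernel being
+ * resumed, bit helpers for the action, result and debug states, the
+ * enumeration of hibernation/resume steps, and the toi_core_fns hooks
+ * through which the rest of the kernel reaches the (possibly modular)
+ * core code.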
+ *
+ */
+
+#ifndef KERNEL_POWER_TOI_H
+#define KERNEL_POWER_TOI_H
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include "tuxonice_pageflags.h"
+
+#define TOI_CORE_VERSION "3.0.1"
+
+#define MY_BOOT_KERNEL_DATA_VERSION 1
+
+struct toi_boot_kernel_data {
+	int version;
+	int size;
+	unsigned long toi_action;
+	unsigned long toi_debug_state;
+	u32 toi_default_console_level;
+	int toi_io_time[2][2];
+	char toi_nosave_commandline[COMMAND_LINE_SIZE];
+};
+
+extern struct toi_boot_kernel_data toi_bkd;
+
+/* Location of boot kernel data struct in the kernel being resumed */
+extern unsigned long boot_kernel_data_buffer;
+
+/* == Action states == */
+
+enum {
+	TOI_REBOOT,
+	TOI_PAUSE,
+	TOI_LOGALL,
+	TOI_CAN_CANCEL,
+	TOI_KEEP_IMAGE,
+	TOI_FREEZER_TEST,
+	TOI_SINGLESTEP,
+	TOI_PAUSE_NEAR_PAGESET_END,
+	TOI_TEST_FILTER_SPEED,
+	TOI_TEST_BIO,
+	TOI_NO_PAGESET2,
+	TOI_IGNORE_ROOTFS,
+	TOI_REPLACE_SWSUSP,
+	TOI_PAGESET2_FULL,
+	TOI_ABORT_ON_RESAVE_NEEDED,
+	TOI_NO_MULTITHREADED_IO,
+	TOI_NO_DIRECT_LOAD,
+	TOI_LATE_CPU_HOTPLUG,
+	TOI_GET_MAX_MEM_ALLOCD,
+	TOI_NO_FLUSHER_THREAD,
+	TOI_NO_PS2_IF_UNNEEDED
+};
+
+#define clear_action_state(bit) (test_and_clear_bit(bit, &toi_bkd.toi_action))
+#define test_action_state(bit) (test_bit(bit, &toi_bkd.toi_action))
+
+/* == Result states == */
+
+enum {
+	TOI_ABORTED,
+	TOI_ABORT_REQUESTED,
+	TOI_NOSTORAGE_AVAILABLE,
+	TOI_INSUFFICIENT_STORAGE,
+	TOI_FREEZING_FAILED,
+	TOI_KEPT_IMAGE,
+	TOI_WOULD_EAT_MEMORY,
+	TOI_UNABLE_TO_FREE_ENOUGH_MEMORY,
+	TOI_PM_SEM,
+	TOI_DEVICE_REFUSED,
+	TOI_SYSDEV_REFUSED,
+	TOI_EXTRA_PAGES_ALLOW_TOO_SMALL,
+	TOI_UNABLE_TO_PREPARE_IMAGE,
+	TOI_FAILED_MODULE_INIT,
+	TOI_FAILED_MODULE_CLEANUP,
+	TOI_FAILED_IO,
+	TOI_OUT_OF_MEMORY,
+	TOI_IMAGE_ERROR,
+	TOI_PLATFORM_PREP_FAILED,
+	TOI_CPU_HOTPLUG_FAILED,
+	TOI_ARCH_PREPARE_FAILED,
+	TOI_RESAVE_NEEDED,
+	TOI_CANT_SUSPEND,
+	TOI_NOTIFIERS_PREPARE_FAILED,
+	TOI_PRE_SNAPSHOT_FAILED,
+	TOI_PRE_RESTORE_FAILED,
+	TOI_USERMODE_HELPERS_ERR,
+	TOI_CANT_USE_ALT_RESUME,
+	TOI_HEADER_TOO_BIG,
+	TOI_NUM_RESULT_STATES	/* Used in printing debug info only */
+};
+
+extern unsigned long toi_result;
+
+#define set_result_state(bit) (test_and_set_bit(bit, &toi_result))
+#define set_abort_result(bit) (test_and_set_bit(TOI_ABORTED, &toi_result), \
+				test_and_set_bit(bit, &toi_result))
+#define clear_result_state(bit) (test_and_clear_bit(bit, &toi_result))
+#define test_result_state(bit) (test_bit(bit, &toi_result))
+
+/* == Debug sections and levels == */
+
+/* debugging levels.
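+ *
+ * A message such as the following (taken from _toi_rw_header_chunk later
+ * in this patch) should only appear when its section (TOI_HEADER) is
+ * enabled in toi_bkd.toi_debug_state and the console level is at least
+ * its level (TOI_LOW):
+ *
+ *	toi_message(TOI_HEADER, TOI_LOW, 1, "Header: ...\n", ...);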
*/ +enum { + TOI_STATUS = 0, + TOI_ERROR = 2, + TOI_LOW, + TOI_MEDIUM, + TOI_HIGH, + TOI_VERBOSE, +}; + +enum { + TOI_ANY_SECTION, + TOI_EAT_MEMORY, + TOI_IO, + TOI_HEADER, + TOI_WRITER, + TOI_MEMORY, +}; + +#define set_debug_state(bit) (test_and_set_bit(bit, &toi_bkd.toi_debug_state)) +#define clear_debug_state(bit) \ + (test_and_clear_bit(bit, &toi_bkd.toi_debug_state)) +#define test_debug_state(bit) (test_bit(bit, &toi_bkd.toi_debug_state)) + +/* == Steps in hibernating == */ + +enum { + STEP_HIBERNATE_PREPARE_IMAGE, + STEP_HIBERNATE_SAVE_IMAGE, + STEP_HIBERNATE_POWERDOWN, + STEP_RESUME_CAN_RESUME, + STEP_RESUME_LOAD_PS1, + STEP_RESUME_DO_RESTORE, + STEP_RESUME_READ_PS2, + STEP_RESUME_GO, + STEP_RESUME_ALT_IMAGE, + STEP_CLEANUP, + STEP_QUIET_CLEANUP +}; + +/* == TuxOnIce states == + (see also include/linux/suspend.h) */ + +#define get_toi_state() (toi_state) +#define restore_toi_state(saved_state) \ + do { toi_state = saved_state; } while (0) + +/* == Module support == */ + +struct toi_core_fns { + int (*post_context_save)(void); + unsigned long (*get_nonconflicting_page)(void); + int (*try_hibernate)(void); + void (*try_resume)(void); +}; + +extern struct toi_core_fns *toi_core_fns; + +/* == All else == */ +#define KB(x) ((x) << (PAGE_SHIFT - 10)) +#define MB(x) ((x) >> (20 - PAGE_SHIFT)) + +extern int toi_start_anything(int toi_or_resume); +extern void toi_finish_anything(int toi_or_resume); + +extern int save_image_part1(void); +extern int toi_atomic_restore(void); + +extern int _toi_try_hibernate(void); +extern void __toi_try_resume(void); + +extern int __toi_post_context_save(void); + +extern unsigned int nr_hibernates; +extern char alt_resume_param[256]; + +extern void copyback_post(void); +extern int toi_hibernate(void); +extern long extra_pd1_pages_used; + +#define SECTOR_SIZE 512 + +extern void toi_early_boot_message(int can_erase_image, int default_answer, + char *warning_reason, ...); + +static inline int load_direct(struct page *page) +{ + return test_action_state(TOI_NO_DIRECT_LOAD) ? 0 : + PagePageset1Copy(page); +} + +extern int do_check_can_resume(void); +extern int do_toi_step(int step); +extern int toi_launch_userspace_program(char *command, int channel_no, + enum umh_wait wait, int debug); + +extern char *tuxonice_signature; +#endif diff --git a/kernel/power/tuxonice_alloc.c b/kernel/power/tuxonice_alloc.c new file mode 100644 index 0000000..5e3aae8 --- /dev/null +++ b/kernel/power/tuxonice_alloc.c @@ -0,0 +1,287 @@ +/* + * kernel/power/tuxonice_alloc.c + * + * Copyright (C) 2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. 
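+ *
+ * Each TuxOnIce allocation site passes a small path index ('fail_num'),
+ * letting allocations and frees be counted per call site and letting a
+ * chosen path be failed deliberately for testing. A sketch of typical
+ * use, here with path 15 ("checksum buffer") from the table below:
+ *
+ *	char *p = toi_kzalloc(15, PAGE_SIZE, TOI_ATOMIC_GFP);
+ *	if (!p)
+ *		return -ENOMEM;
+ *	...
+ *	toi_kfree(15, p);
+ *
+ * Writing a path index to this module's failure_test sysfs entry (see
+ * sysfs_params below) makes the next allocation on that path fail once.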
+ * + */ + +#ifdef CONFIG_PM_DEBUG +#include +#include +#include "tuxonice_modules.h" +#include "tuxonice_alloc.h" +#include "tuxonice_sysfs.h" +#include "tuxonice.h" + +#define TOI_ALLOC_PATHS 39 + +static DEFINE_MUTEX(toi_alloc_mutex); + +static struct toi_module_ops toi_alloc_ops; + +static int toi_fail_num; +static atomic_t toi_alloc_count[TOI_ALLOC_PATHS], + toi_free_count[TOI_ALLOC_PATHS], + toi_test_count[TOI_ALLOC_PATHS], + toi_fail_count[TOI_ALLOC_PATHS]; +static int toi_cur_allocd[TOI_ALLOC_PATHS], toi_max_allocd[TOI_ALLOC_PATHS]; +static int cur_allocd, max_allocd; + +static char *toi_alloc_desc[TOI_ALLOC_PATHS] = { + "", /* 0 */ + "get_io_info_struct", + "extent", + "extent (loading chain)", + "userui channel", + "userui arg", /* 5 */ + "attention list metadata", + "extra pagedir memory metadata", + "bdev metadata", + "extra pagedir memory", + "header_locations_read", /* 10 */ + "bio queue", + "prepare_readahead", + "i/o buffer", + "writer buffer in bio_init", + "checksum buffer", /* 15 */ + "compression buffer", + "filewriter signature op", + "set resume param alloc1", + "set resume param alloc2", + "debugging info buffer", /* 20 */ + "check can resume buffer", + "write module config buffer", + "read module config buffer", + "write image header buffer", + "read pageset1 buffer", /* 25 */ + "get_have_image_data buffer", + "checksum page", + "worker rw loop", + "get nonconflicting page", + "ps1 load addresses", /* 30 */ + "remove swap image", + "swap image exists", + "swap parse sig location", + "sysfs kobj", + "swap mark resume attempted buffer", /* 35 */ + "cluster member", + "boot kernel data buffer", + "setting swap signature" +}; + +#define MIGHT_FAIL(FAIL_NUM, FAIL_VAL) \ + do { \ + BUG_ON(FAIL_NUM >= TOI_ALLOC_PATHS); \ + \ + if (FAIL_NUM == toi_fail_num) { \ + atomic_inc(&toi_test_count[FAIL_NUM]); \ + toi_fail_num = 0; \ + return FAIL_VAL; \ + } \ + } while (0) + +static void alloc_update_stats(int fail_num, void *result) +{ + if (!result) { + atomic_inc(&toi_fail_count[fail_num]); + return; + } + + atomic_inc(&toi_alloc_count[fail_num]); + if (unlikely(test_action_state(TOI_GET_MAX_MEM_ALLOCD))) { + mutex_lock(&toi_alloc_mutex); + toi_cur_allocd[fail_num]++; + cur_allocd++; + if (unlikely(cur_allocd > max_allocd)) { + int i; + + for (i = 0; i < TOI_ALLOC_PATHS; i++) + toi_max_allocd[i] = toi_cur_allocd[i]; + max_allocd = cur_allocd; + } + mutex_unlock(&toi_alloc_mutex); + } +} + +static void free_update_stats(int fail_num) +{ + BUG_ON(fail_num >= TOI_ALLOC_PATHS); + atomic_inc(&toi_free_count[fail_num]); + if (unlikely(test_action_state(TOI_GET_MAX_MEM_ALLOCD))) { + mutex_lock(&toi_alloc_mutex); + cur_allocd--; + toi_cur_allocd[fail_num]--; + mutex_unlock(&toi_alloc_mutex); + } +} + +void *toi_kzalloc(int fail_num, size_t size, gfp_t flags) +{ + void *result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, NULL); + result = kzalloc(size, flags); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, result); + return result; +} +EXPORT_SYMBOL_GPL(toi_kzalloc); + +unsigned long toi_get_free_pages(int fail_num, gfp_t mask, + unsigned int order) +{ + unsigned long result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, 0); + result = __get_free_pages(mask, order); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result); + return result; +} +EXPORT_SYMBOL_GPL(toi_get_free_pages); + +struct page *toi_alloc_page(int fail_num, gfp_t mask) +{ + struct page *result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, NULL); + result = 
alloc_page(mask); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result); + return result; +} +EXPORT_SYMBOL_GPL(toi_alloc_page); + +unsigned long toi_get_zeroed_page(int fail_num, gfp_t mask) +{ + unsigned long result; + + if (toi_alloc_ops.enabled) + MIGHT_FAIL(fail_num, 0); + result = get_zeroed_page(mask); + if (toi_alloc_ops.enabled) + alloc_update_stats(fail_num, (void *) result); + return result; +} +EXPORT_SYMBOL_GPL(toi_get_zeroed_page); + +void toi_kfree(int fail_num, const void *arg) +{ + if (arg && toi_alloc_ops.enabled) + free_update_stats(fail_num); + + kfree(arg); +} +EXPORT_SYMBOL_GPL(toi_kfree); + +void toi_free_page(int fail_num, unsigned long virt) +{ + if (virt && toi_alloc_ops.enabled) + free_update_stats(fail_num); + + free_page(virt); +} +EXPORT_SYMBOL_GPL(toi_free_page); + +void toi__free_page(int fail_num, struct page *page) +{ + if (page && toi_alloc_ops.enabled) + free_update_stats(fail_num); + + __free_page(page); +} +EXPORT_SYMBOL_GPL(toi__free_page); + +void toi_free_pages(int fail_num, struct page *page, int order) +{ + if (page && toi_alloc_ops.enabled) + free_update_stats(fail_num); + + __free_pages(page, order); +} + +void toi_alloc_print_debug_stats(void) +{ + int i, header_done = 0; + + if (!toi_alloc_ops.enabled) + return; + + for (i = 0; i < TOI_ALLOC_PATHS; i++) + if (atomic_read(&toi_alloc_count[i]) != + atomic_read(&toi_free_count[i])) { + if (!header_done) { + printk(KERN_INFO "Idx Allocs Frees Tests " + " Fails Max Description\n"); + header_done = 1; + } + + printk(KERN_INFO "%3d %7d %7d %7d %7d %7d %s\n", i, + atomic_read(&toi_alloc_count[i]), + atomic_read(&toi_free_count[i]), + atomic_read(&toi_test_count[i]), + atomic_read(&toi_fail_count[i]), + toi_max_allocd[i], + toi_alloc_desc[i]); + } +} +EXPORT_SYMBOL_GPL(toi_alloc_print_debug_stats); + +static int toi_alloc_initialise(int starting_cycle) +{ + int i; + + if (starting_cycle && toi_alloc_ops.enabled) { + for (i = 0; i < TOI_ALLOC_PATHS; i++) { + atomic_set(&toi_alloc_count[i], 0); + atomic_set(&toi_free_count[i], 0); + atomic_set(&toi_test_count[i], 0); + atomic_set(&toi_fail_count[i], 0); + toi_cur_allocd[i] = 0; + toi_max_allocd[i] = 0; + }; + max_allocd = 0; + cur_allocd = 0; + } + + return 0; +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("failure_test", SYSFS_RW, &toi_fail_num, 0, 99, 0, NULL), + SYSFS_BIT("find_max_mem_allocated", SYSFS_RW, &toi_bkd.toi_action, + TOI_GET_MAX_MEM_ALLOCD, 0), + SYSFS_INT("enabled", SYSFS_RW, &toi_alloc_ops.enabled, 0, 1, 0, + NULL) +}; + +static struct toi_module_ops toi_alloc_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "allocation debugging", + .directory = "alloc", + .module = THIS_MODULE, + .early = 1, + .initialise = toi_alloc_initialise, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_alloc_init(void) +{ + int result = toi_register_module(&toi_alloc_ops); + toi_alloc_ops.enabled = 0; + return result; +} + +void toi_alloc_exit(void) +{ + toi_unregister_module(&toi_alloc_ops); +} +#endif diff --git a/kernel/power/tuxonice_alloc.h b/kernel/power/tuxonice_alloc.h new file mode 100644 index 0000000..a1dd8ff --- /dev/null +++ b/kernel/power/tuxonice_alloc.h @@ -0,0 +1,51 @@ +/* + * kernel/power/tuxonice_alloc.h + * + * Copyright (C) 2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. 
+ * + */ + +#define TOI_WAIT_GFP (GFP_KERNEL | __GFP_NOWARN) +#define TOI_ATOMIC_GFP (GFP_ATOMIC | __GFP_NOWARN) + +#ifdef CONFIG_PM_DEBUG +extern void *toi_kzalloc(int fail_num, size_t size, gfp_t flags); +extern void toi_kfree(int fail_num, const void *arg); + +extern unsigned long toi_get_free_pages(int fail_num, gfp_t mask, + unsigned int order); +#define toi_get_free_page(FAIL_NUM, MASK) toi_get_free_pages(FAIL_NUM, MASK, 0) +extern unsigned long toi_get_zeroed_page(int fail_num, gfp_t mask); +extern void toi_free_page(int fail_num, unsigned long buf); +extern void toi__free_page(int fail_num, struct page *page); +extern void toi_free_pages(int fail_num, struct page *page, int order); +extern struct page *toi_alloc_page(int fail_num, gfp_t mask); +extern int toi_alloc_init(void); +extern void toi_alloc_exit(void); + +extern void toi_alloc_print_debug_stats(void); + +#else /* CONFIG_PM_DEBUG */ + +#define toi_kzalloc(FAIL, SIZE, FLAGS) (kzalloc(SIZE, FLAGS)) +#define toi_kfree(FAIL, ALLOCN) (kfree(ALLOCN)) + +#define toi_get_free_pages(FAIL, FLAGS, ORDER) __get_free_pages(FLAGS, ORDER) +#define toi_get_free_page(FAIL, FLAGS) __get_free_page(FLAGS) +#define toi_get_zeroed_page(FAIL, FLAGS) get_zeroed_page(FLAGS) +#define toi_free_page(FAIL, ALLOCN) do { free_page(ALLOCN); } while (0) +#define toi__free_page(FAIL, PAGE) __free_page(PAGE) +#define toi_free_pages(FAIL, PAGE, ORDER) __free_pages(PAGE, ORDER) +#define toi_alloc_page(FAIL, MASK) alloc_page(MASK) +static inline int toi_alloc_init(void) +{ + return 0; +} + +static inline void toi_alloc_exit(void) { } + +static inline void toi_alloc_print_debug_stats(void) { } + +#endif diff --git a/kernel/power/tuxonice_atomic_copy.c b/kernel/power/tuxonice_atomic_copy.c new file mode 100644 index 0000000..aff8177 --- /dev/null +++ b/kernel/power/tuxonice_atomic_copy.c @@ -0,0 +1,420 @@ +/* + * kernel/power/tuxonice_atomic_copy.c + * + * Copyright 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * Copyright (C) 2006 Red Hat, inc. + * + * Distributed under GPLv2. + * + * Routines for doing the atomic save/restore. + */ + +#include +#include +#include +#include +#include +#include "tuxonice.h" +#include "tuxonice_storage.h" +#include "tuxonice_power_off.h" +#include "tuxonice_ui.h" +#include "power.h" +#include "tuxonice_io.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_checksum.h" +#include "tuxonice_builtin.h" +#include "tuxonice_atomic_copy.h" +#include "tuxonice_alloc.h" + +long extra_pd1_pages_used; + +/** + * free_pbe_list - free page backup entries used by the atomic copy code. + * @list: List to free. + * @highmem: Whether the list is in highmem. + * + * Normally, this function isn't used. If, however, we need to abort before + * doing the atomic copy, we use this to free the pbes previously allocated. 
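+ *
+ * The pbes are packed PBES_PER_PAGE to a page, with the last entry's
+ * ->next pointing into the following page (kmapped first in the highmem
+ * case), so each iteration frees the pages the pbes describe, then the
+ * page holding the pbes themselves, then follows the link.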
+ **/
+static void free_pbe_list(struct pbe **list, int highmem)
+{
+	while (*list) {
+		int i;
+		struct pbe *free_pbe, *next_page = NULL;
+		struct page *page;
+
+		if (highmem) {
+			page = (struct page *) *list;
+			free_pbe = (struct pbe *) kmap(page);
+		} else {
+			page = virt_to_page(*list);
+			free_pbe = *list;
+		}
+
+		for (i = 0; i < PBES_PER_PAGE; i++) {
+			if (!free_pbe)
+				break;
+			if (highmem)
+				toi__free_page(29, free_pbe->address);
+			else
+				toi_free_page(29,
+					(unsigned long) free_pbe->address);
+			free_pbe = free_pbe->next;
+		}
+
+		if (highmem) {
+			if (free_pbe)
+				next_page = free_pbe;
+			kunmap(page);
+		} else {
+			if (free_pbe)
+				next_page = free_pbe;
+		}
+
+		toi__free_page(29, page);
+		*list = (struct pbe *) next_page;
+	}
+}
+
+/**
+ * copyback_post - post atomic-restore actions
+ *
+ * After doing the atomic restore, we have a few more things to do:
+ * 1) We want to retain some values across the restore, so we now copy
+ * these from the nosave variables to the normal ones.
+ * 2) Set the status flags.
+ * 3) Resume devices.
+ * 4) Tell userui so it can redraw & restore settings.
+ * 5) Reread the page cache.
+ **/
+void copyback_post(void)
+{
+	struct toi_boot_kernel_data *bkd =
+		(struct toi_boot_kernel_data *) boot_kernel_data_buffer;
+
+	/*
+	 * The boot kernel's data may be larger (newer version) or
+	 * smaller (older version) than ours. Copy the minimum
+	 * of the two sizes, so that we don't overwrite valid values
+	 * from pre-atomic copy.
+	 */
+
+	memcpy(&toi_bkd, (char *) boot_kernel_data_buffer,
+			min_t(int, sizeof(struct toi_boot_kernel_data),
+				bkd->size));
+
+	if (toi_activate_storage(1))
+		panic("Failed to reactivate our storage.");
+
+	toi_ui_post_atomic_restore();
+
+	toi_cond_pause(1, "About to reload secondary pagedir.");
+
+	if (read_pageset2(0))
+		panic("Unable to successfully reread the page cache.");
+
+	/*
+	 * If the user wants to sleep again after resuming from full-off,
+	 * it's most likely to be in order to suspend to ram, so we'll
+	 * do this check after loading pageset2, to give them the fastest
+	 * wakeup when they are ready to use the computer again.
+	 */
+	toi_check_resleep();
+}
+
+/**
+ * toi_copy_pageset1 - do the atomic copy of pageset1
+ *
+ * Make the atomic copy of pageset1. We can't use copy_page (as we once did)
+ * because we can't be sure what side effects it has. On my old Duron, with
+ * 3DNOW, kernel_fpu_begin increments preempt count, making our preempt
+ * count at resume time 4 instead of 3.
+ *
+ * We don't want to call kmap_atomic unconditionally because it has the side
+ * effect of incrementing the preempt count, which will leave it one too high
+ * post resume (the page containing the preempt count will be copied after
+ * it's incremented, which is essentially the same problem).
+ **/
+void toi_copy_pageset1(void)
+{
+	int i;
+	unsigned long source_index, dest_index;
+
+	memory_bm_position_reset(pageset1_map);
+	memory_bm_position_reset(pageset1_copy_map);
+
+	source_index = memory_bm_next_pfn(pageset1_map);
+	dest_index = memory_bm_next_pfn(pageset1_copy_map);
+
+	for (i = 0; i < pagedir1.size; i++) {
+		unsigned long *origvirt, *copyvirt;
+		struct page *origpage, *copypage;
+		int loop = (PAGE_SIZE / sizeof(unsigned long)) - 1,
+		    was_present1, was_present2;
+
+		origpage = pfn_to_page(source_index);
+		copypage = pfn_to_page(dest_index);
+
+		origvirt = PageHighMem(origpage) ?
+			kmap_atomic(origpage, KM_USER0) :
+			page_address(origpage);
+
+		copyvirt = PageHighMem(copypage) ?
+			kmap_atomic(copypage, KM_USER1) :
+			page_address(copypage);
+
+		was_present1 = kernel_page_present(origpage);
+		if (!was_present1)
+			kernel_map_pages(origpage, 1, 1);
+
+		was_present2 = kernel_page_present(copypage);
+		if (!was_present2)
+			kernel_map_pages(copypage, 1, 1);
+
+		while (loop >= 0) {
+			*(copyvirt + loop) = *(origvirt + loop);
+			loop--;
+		}
+
+		if (!was_present1)
+			kernel_map_pages(origpage, 1, 0);
+
+		if (!was_present2)
+			kernel_map_pages(copypage, 1, 0);
+
+		if (PageHighMem(origpage))
+			kunmap_atomic(origvirt, KM_USER0);
+
+		if (PageHighMem(copypage))
+			kunmap_atomic(copyvirt, KM_USER1);
+
+		source_index = memory_bm_next_pfn(pageset1_map);
+		dest_index = memory_bm_next_pfn(pageset1_copy_map);
+	}
+}
+
+/**
+ * __toi_post_context_save - steps after saving the cpu context
+ *
+ * Steps taken after saving the CPU state to make the actual
+ * atomic copy.
+ *
+ * Called from swsusp_save in snapshot.c via toi_post_context_save.
+ **/
+int __toi_post_context_save(void)
+{
+	long old_ps1_size = pagedir1.size;
+
+	check_checksums();
+
+	free_checksum_pages();
+
+	toi_recalculate_image_contents(1);
+
+	extra_pd1_pages_used = pagedir1.size - old_ps1_size;
+
+	if (extra_pd1_pages_used > extra_pd1_pages_allowance) {
+		printk(KERN_INFO "Pageset1 has grown by %ld pages. "
+			"extra_pages_allowance is currently only %lu.\n",
+			pagedir1.size - old_ps1_size,
+			extra_pd1_pages_allowance);
+
+		/*
+		 * Highlevel code will see this, clear the state and
+		 * retry if we haven't already done so twice.
+		 */
+		set_abort_result(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL);
+		return 1;
+	}
+
+	if (!test_action_state(TOI_TEST_FILTER_SPEED) &&
+	    !test_action_state(TOI_TEST_BIO))
+		toi_copy_pageset1();
+
+	return 0;
+}
+
+/**
+ * toi_hibernate - high level code for doing the atomic copy
+ *
+ * High-level code which prepares to do the atomic copy. Loosely based
+ * on the swsusp version, but with the following twists:
+ * - We set toi_running so the swsusp code uses our code paths.
+ * - We give better feedback regarding what goes wrong if there is a
+ *   problem.
+ * - We use an extra function to call the assembly, just in case this code
+ *   is in a module (return address).
+ **/
+int toi_hibernate(void)
+{
+	int error;
+
+	toi_running = 1; /* For the swsusp code we use :< */
+
+	error = toi_lowlevel_builtin();
+
+	toi_running = 0;
+	return error;
+}
+
+/**
+ * toi_atomic_restore - prepare to do the atomic restore
+ *
+ * Get ready to do the atomic restore. This part gets us into the same
+ * state we are in prior to calling do_toi_lowlevel while hibernating:
+ * hot-unplugging secondary CPUs and freezing processes, before starting
+ * the thread that will do the restore.
+ **/
+int toi_atomic_restore(void)
+{
+	int error;
+
+	toi_running = 1;
+
+	toi_prepare_status(DONT_CLEAR_BAR, "Atomic restore.");
+
+	memcpy(&toi_bkd.toi_nosave_commandline, saved_command_line,
+		strlen(saved_command_line));
+
+	if (add_boot_kernel_data_pbe())
+		goto Failed;
+
+	toi_prepare_status(DONT_CLEAR_BAR, "Doing atomic copy/restore.");
+
+	if (toi_go_atomic(PMSG_QUIESCE, 0))
+		goto Failed;
+
+	/* We'll ignore saved state, but this gets preempt count (etc) right */
+	save_processor_state();
+
+	error = swsusp_arch_resume();
+	/*
+	 * Code below is only ever reached in case of failure. Otherwise
+	 * execution continues at the place where swsusp_arch_suspend was
+	 * called.
+	 *
+	 * We don't know whether it's safe to continue (this shouldn't
+	 * happen), so let's err on the side of caution.
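+	 * (By this point devices are quiesced and memory may already be
+	 * partially overwritten by the image, so there is presumably no
+	 * state left that is safe to return an error into; hence the
+	 * BUG() below rather than an error return.)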
+ */ + BUG(); + +Failed: + free_pbe_list(&restore_pblist, 0); +#ifdef CONFIG_HIGHMEM + free_pbe_list(&restore_highmem_pblist, 1); +#endif + toi_running = 0; + return 1; +} + +/** + * toi_go_atomic - do the actual atomic copy/restore + * @state: The state to use for device_suspend & power_down calls. + * @suspend_time: Whether we're suspending or resuming. + **/ +int toi_go_atomic(pm_message_t state, int suspend_time) +{ + if (suspend_time && platform_begin(1)) { + set_abort_result(TOI_PLATFORM_PREP_FAILED); + toi_end_atomic(ATOMIC_STEP_PLATFORM_END, suspend_time, 0); + return 1; + } + + suspend_console(); + + if (device_suspend(state)) { + set_abort_result(TOI_DEVICE_REFUSED); + toi_end_atomic(ATOMIC_STEP_DEVICE_RESUME, suspend_time, 3); + return 1; + } + + if (suspend_time && platform_pre_snapshot(1)) { + set_abort_result(TOI_PRE_SNAPSHOT_FAILED); + toi_end_atomic(ATOMIC_STEP_PLATFORM_FINISH, suspend_time, 0); + return 1; + } + + if (!suspend_time && platform_pre_restore(1)) { + set_abort_result(TOI_PRE_RESTORE_FAILED); + toi_end_atomic(ATOMIC_STEP_DEVICE_RESUME, suspend_time, 0); + return 1; + } + + if (test_action_state(TOI_LATE_CPU_HOTPLUG)) { + if (disable_nonboot_cpus()) { + set_abort_result(TOI_CPU_HOTPLUG_FAILED); + toi_end_atomic(ATOMIC_STEP_CPU_HOTPLUG, + suspend_time, 0); + return 1; + } + } + + if (suspend_time && arch_prepare_suspend()) { + set_abort_result(TOI_ARCH_PREPARE_FAILED); + toi_end_atomic(ATOMIC_STEP_CPU_HOTPLUG, suspend_time, 0); + return 1; + } + + device_pm_lock(); + local_irq_disable(); + + /* At this point, device_suspend() has been called, but *not* + * device_power_down(). We *must* device_power_down() now. + * Otherwise, drivers for some devices (e.g. interrupt controllers) + * become desynchronized with the actual state of the hardware + * at resume time, and evil weirdness ensues. + */ + + if (device_power_down(state)) { + set_abort_result(TOI_DEVICE_REFUSED); + toi_end_atomic(ATOMIC_STEP_IRQS, suspend_time, 0); + return 1; + } + + if (sysdev_suspend(state)) { + set_abort_result(TOI_SYSDEV_REFUSED); + toi_end_atomic(ATOMIC_STEP_POWER_UP, suspend_time, 0); + return 1; + } + + return 0; +} + +/** + * toi_end_atomic - post atomic copy/restore routines + * @stage: What step to start at. + * @suspend_time: Whether we're suspending or resuming. + * @error: Whether we're recovering from an error. + **/ +void toi_end_atomic(int stage, int suspend_time, int error) +{ + switch (stage) { + case ATOMIC_ALL_STEPS: + if (!suspend_time) + platform_leave(1); + sysdev_resume(); + case ATOMIC_STEP_POWER_UP: + device_power_up(suspend_time ? + (error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE); + case ATOMIC_STEP_IRQS: + local_irq_enable(); + device_pm_unlock(); + case ATOMIC_STEP_CPU_HOTPLUG: + if (test_action_state(TOI_LATE_CPU_HOTPLUG)) + enable_nonboot_cpus(); + case ATOMIC_STEP_PLATFORM_FINISH: + platform_finish(1); + case ATOMIC_STEP_DEVICE_RESUME: + if (suspend_time && (error & 2)) + platform_recover(1); + device_resume(suspend_time ? + ((error & 1) ? 
PMSG_RECOVER : PMSG_THAW) : + PMSG_RESTORE); + case ATOMIC_STEP_RESUME_CONSOLE: + resume_console(); + case ATOMIC_STEP_PLATFORM_END: + platform_end(1); + + toi_prepare_status(DONT_CLEAR_BAR, "Post atomic."); + } +} diff --git a/kernel/power/tuxonice_atomic_copy.h b/kernel/power/tuxonice_atomic_copy.h new file mode 100644 index 0000000..e907f50 --- /dev/null +++ b/kernel/power/tuxonice_atomic_copy.h @@ -0,0 +1,23 @@ +/* + * kernel/power/tuxonice_atomic_copy.h + * + * Copyright 2008 Nigel Cunningham (nigel at tuxonice net) + * + * Distributed under GPLv2. + * + * Routines for doing the atomic save/restore. + */ + +enum { + ATOMIC_ALL_STEPS, + ATOMIC_STEP_POWER_UP, + ATOMIC_STEP_IRQS, + ATOMIC_STEP_CPU_HOTPLUG, + ATOMIC_STEP_PLATFORM_FINISH, + ATOMIC_STEP_DEVICE_RESUME, + ATOMIC_STEP_RESUME_CONSOLE, + ATOMIC_STEP_PLATFORM_END, +}; + +int toi_go_atomic(pm_message_t state, int toi_time); +void toi_end_atomic(int stage, int toi_time, int error); diff --git a/kernel/power/tuxonice_block_io.c b/kernel/power/tuxonice_block_io.c new file mode 100644 index 0000000..7100da0 --- /dev/null +++ b/kernel/power/tuxonice_block_io.c @@ -0,0 +1,1336 @@ +/* + * kernel/power/tuxonice_block_io.c + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * Distributed under GPLv2. + * + * This file contains block io functions for TuxOnIce. These are + * used by the swapwriter and it is planned that they will also + * be used by the NFSwriter. + * + */ + +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_block_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_alloc.h" +#include "tuxonice_io.h" + +#define MEMORY_ONLY 1 +#define THROTTLE_WAIT 2 + +/* #define MEASURE_MUTEX_CONTENTION */ +#ifndef MEASURE_MUTEX_CONTENTION +#define my_mutex_lock(index, the_lock) mutex_lock(the_lock) +#define my_mutex_unlock(index, the_lock) mutex_unlock(the_lock) +#else +unsigned long mutex_times[2][2][NR_CPUS]; +#define my_mutex_lock(index, the_lock) do { \ + int have_mutex; \ + have_mutex = mutex_trylock(the_lock); \ + if (!have_mutex) { \ + mutex_lock(the_lock); \ + mutex_times[index][0][smp_processor_id()]++; \ + } else { \ + mutex_times[index][1][smp_processor_id()]++; \ + } + +#define my_mutex_unlock(index, the_lock) \ + mutex_unlock(the_lock); \ +} while (0) +#endif + +static int target_outstanding_io = 1024; +static int max_outstanding_writes, max_outstanding_reads; + +static struct page *bio_queue_head, *bio_queue_tail; +static atomic_t toi_bio_queue_size; +static DEFINE_SPINLOCK(bio_queue_lock); + +static int free_mem_throttle, throughput_throttle; +static int more_readahead = 1; +static struct page *readahead_list_head, *readahead_list_tail; +static DECLARE_WAIT_QUEUE_HEAD(readahead_list_wait); + +static struct page *waiting_on; + +static atomic_t toi_io_in_progress, toi_io_done; +static DECLARE_WAIT_QUEUE_HEAD(num_in_progress_wait); + +static int extra_page_forward; + +static int current_stream; +/* 0 = Header, 1 = Pageset1, 2 = Pageset2, 3 = End of PS1 */ +struct hibernate_extent_iterate_saved_state toi_writer_posn_save[4]; +EXPORT_SYMBOL_GPL(toi_writer_posn_save); + +/* Pointer to current entry being loaded/saved. 
 */
+struct toi_extent_iterate_state toi_writer_posn;
+EXPORT_SYMBOL_GPL(toi_writer_posn);
+
+/* Not static, so that the allocators can set up and complete
+ * writing the header */
+char *toi_writer_buffer;
+EXPORT_SYMBOL_GPL(toi_writer_buffer);
+
+int toi_writer_buffer_posn;
+EXPORT_SYMBOL_GPL(toi_writer_buffer_posn);
+
+static struct toi_bdev_info *toi_devinfo;
+
+static DEFINE_MUTEX(toi_bio_mutex);
+static DEFINE_MUTEX(toi_bio_readahead_mutex);
+
+static struct task_struct *toi_queue_flusher;
+static int toi_bio_queue_flush_pages(int dedicated_thread);
+
+#define TOTAL_OUTSTANDING_IO (atomic_read(&toi_io_in_progress) + \
+		atomic_read(&toi_bio_queue_size))
+
+/**
+ * set_free_mem_throttle - set the point where we pause to avoid oom.
+ *
+ * Initially, this value is zero, but when we first fail to allocate memory,
+ * we set it (plus a buffer) and thereafter throttle i/o once that limit is
+ * reached.
+ **/
+static void set_free_mem_throttle(void)
+{
+	int new_throttle = nr_unallocated_buffer_pages() + 256;
+
+	if (new_throttle > free_mem_throttle)
+		free_mem_throttle = new_throttle;
+}
+
+#define NUM_REASONS 7
+static atomic_t reasons[NUM_REASONS];
+static char *reason_name[NUM_REASONS] = {
+	"readahead not ready",
+	"bio allocation",
+	"synchronous I/O",
+	"toi_bio_get_new_page",
+	"memory low",
+	"readahead buffer allocation",
+	"throughput_throttle",
+};
+
+/**
+ * do_bio_wait - wait for some TuxOnIce I/O to complete
+ * @reason: The array index of the reason we're waiting.
+ *
+ * Wait for a particular page of I/O if we're after a particular page.
+ * If we're not after a particular page, wait instead for all in flight
+ * I/O to be completed or for us to have enough free memory to be able
+ * to submit more I/O.
+ *
+ * If we wait, we also update our statistics regarding why we waited.
+ **/
+static void do_bio_wait(int reason)
+{
+	struct page *was_waiting_on = waiting_on;
+
+	/* On SMP, waiting_on can be reset, so we make a copy */
+	if (was_waiting_on) {
+		if (PageLocked(was_waiting_on)) {
+			wait_on_page_bit(was_waiting_on, PG_locked);
+			atomic_inc(&reasons[reason]);
+		}
+	} else {
+		atomic_inc(&reasons[reason]);
+
+		wait_event(num_in_progress_wait,
+			!atomic_read(&toi_io_in_progress) ||
+			nr_unallocated_buffer_pages() > free_mem_throttle);
+	}
+}
+
+/**
+ * throttle_if_needed - wait for I/O completion if throttle points are reached
+ * @flags: What to check and how to act.
+ *
+ * Check whether we need to wait for some I/O to complete. We always check
+ * whether we have enough memory available, but may also (depending upon
+ * @flags) check if the throughput throttle limit has been reached.
+ **/
+static int throttle_if_needed(int flags)
+{
+	int free_pages = nr_unallocated_buffer_pages();
+
+	/* Getting low on memory and I/O is in progress? */
+	while (unlikely(free_pages < free_mem_throttle) &&
+			atomic_read(&toi_io_in_progress)) {
+		if (!(flags & THROTTLE_WAIT))
+			return -ENOMEM;
+		do_bio_wait(4);
+		free_pages = nr_unallocated_buffer_pages();
+	}
+
+	while (!(flags & MEMORY_ONLY) && throughput_throttle &&
+		TOTAL_OUTSTANDING_IO >= throughput_throttle) {
+		int result = toi_bio_queue_flush_pages(0);
+		if (result)
+			return result;
+		atomic_inc(&reasons[6]);
+		wait_event(num_in_progress_wait,
+			!atomic_read(&toi_io_in_progress) ||
+			TOTAL_OUTSTANDING_IO < throughput_throttle);
+	}
+
+	return 0;
+}
+
+/**
+ * update_throughput_throttle - update the raw throughput throttle
+ * @jif_index: The number of times this function has been called.
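+ *	(A worked example of the formula below, given the half-second
+ *	call interval: after three seconds jif_index == 6, so if 6000
+ *	pages of I/O have completed, the limit becomes 6000 / 6 / 5 = 200
+ *	pages in flight, i.e. a tenth of a second's worth.)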
+ *
+ * This function is called twice per second by the core, and used to limit the
+ * amount of I/O we submit at once, spreading out our waiting through the
+ * whole job and letting userui get an opportunity to do its work.
+ *
+ * We don't start limiting I/O until 1/2s has gone so that we get a
+ * decent sample for our initial limit, and keep updating it because
+ * throughput may vary (on rotating media, eg) with our block number.
+ *
+ * We throttle to 1/10s worth of I/O.
+ **/
+static void update_throughput_throttle(int jif_index)
+{
+	int done = atomic_read(&toi_io_done);
+	throughput_throttle = done / jif_index / 5;
+}
+
+/**
+ * toi_finish_all_io - wait for all outstanding i/o to complete
+ *
+ * Flush any queued but unsubmitted I/O and wait for it all to complete.
+ **/
+static int toi_finish_all_io(void)
+{
+	int result = toi_bio_queue_flush_pages(0);
+	wait_event(num_in_progress_wait, !TOTAL_OUTSTANDING_IO);
+	return result;
+}
+
+/**
+ * toi_end_bio - bio completion function.
+ * @bio: bio that has completed.
+ * @err: Error value. Yes, like end_swap_bio_read, we ignore it.
+ *
+ * Function called by the block driver from interrupt context when I/O is
+ * completed. If we were writing the page, we want to free it and will have
+ * set bio->bi_private to the parameter we should use in telling the page
+ * allocation accounting code what the page was allocated for. If we're
+ * reading the page, it will be in the singly linked list made from
+ * page->private pointers.
+ **/
+static void toi_end_bio(struct bio *bio, int err)
+{
+	struct page *page = bio->bi_io_vec[0].bv_page;
+
+	BUG_ON(!test_bit(BIO_UPTODATE, &bio->bi_flags));
+
+	unlock_page(page);
+	bio_put(bio);
+
+	if (waiting_on == page)
+		waiting_on = NULL;
+
+	put_page(page);
+
+	if (bio->bi_private)
+		toi__free_page((int) ((unsigned long) bio->bi_private), page);
+
+	bio_put(bio);
+
+	atomic_dec(&toi_io_in_progress);
+	atomic_inc(&toi_io_done);
+
+	wake_up(&num_in_progress_wait);
+}
+
+/**
+ * submit - submit BIO request
+ * @writing: READ or WRITE.
+ * @dev: The block device we're using.
+ * @first_block: The first sector we're using.
+ * @page: The page being used for I/O.
+ * @free_group: If writing, the group that was used in allocating the page
+ *	and which will be used in freeing the page from the completion
+ *	routine.
+ *
+ * Based on Patrick Mochel's pmdisk code from long ago: "Straight from the
+ * textbook - allocate and initialize the bio. If we're writing, make sure
+ * the page is marked as dirty. Then submit it and carry on."
+ *
+ * If we're just testing the speed of our own code, we fake having done all
+ * the hard work and call toi_end_bio immediately.
+ **/
+static int submit(int writing, struct block_device *dev, sector_t first_block,
+		struct page *page, int free_group)
+{
+	struct bio *bio = NULL;
+	int cur_outstanding_io, result;
+
+	/*
+	 * Shouldn't throttle if reading - can deadlock in the single
+	 * threaded case as pages are only freed when we use the
+	 * readahead.
+ */ + if (writing) { + result = throttle_if_needed(MEMORY_ONLY | THROTTLE_WAIT); + if (result) + return result; + } + + while (!bio) { + bio = bio_alloc(TOI_ATOMIC_GFP, 1); + if (!bio) { + set_free_mem_throttle(); + do_bio_wait(1); + } + } + + bio->bi_bdev = dev; + bio->bi_sector = first_block; + bio->bi_private = (void *) ((unsigned long) free_group); + bio->bi_end_io = toi_end_bio; + + if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { + printk(KERN_DEBUG "ERROR: adding page to bio at %lld\n", + (unsigned long long) first_block); + bio_put(bio); + return -EFAULT; + } + + bio_get(bio); + + cur_outstanding_io = atomic_add_return(1, &toi_io_in_progress); + if (writing) { + if (cur_outstanding_io > max_outstanding_writes) + max_outstanding_writes = cur_outstanding_io; + } else { + if (cur_outstanding_io > max_outstanding_reads) + max_outstanding_reads = cur_outstanding_io; + } + + + if (unlikely(test_action_state(TOI_TEST_FILTER_SPEED))) { + /* Fake having done the hard work */ + set_bit(BIO_UPTODATE, &bio->bi_flags); + toi_end_bio(bio, 0); + } else + submit_bio(writing | (1 << BIO_RW_SYNCIO) | + (1 << BIO_RW_UNPLUG), bio); + + return 0; +} + +/** + * toi_do_io: Prepare to do some i/o on a page and submit or batch it. + * + * @writing: Whether reading or writing. + * @bdev: The block device which we're using. + * @block0: The first sector we're reading or writing. + * @page: The page on which I/O is being done. + * @readahead_index: If doing readahead, the index (reset this flag when done). + * @syncio: Whether the i/o is being done synchronously. + * + * Prepare and start a read or write operation. + * + * Note that we always work with our own page. If writing, we might be given a + * compression buffer that will immediately be used to start compressing the + * next page. For reading, we do readahead and therefore don't know the final + * address where the data needs to go. + **/ +static int toi_do_io(int writing, struct block_device *bdev, long block0, + struct page *page, int is_readahead, int syncio, int free_group) +{ + page->private = 0; + + /* Do here so we don't race against toi_bio_get_next_page_read */ + lock_page(page); + + if (is_readahead) { + if (readahead_list_head) + readahead_list_tail->private = (unsigned long) page; + else + readahead_list_head = page; + + readahead_list_tail = page; + wake_up(&readahead_list_wait); + } + + /* Done before submitting to avoid races. */ + if (syncio) + waiting_on = page; + + /* Submit the page */ + get_page(page); + + if (submit(writing, bdev, block0, page, free_group)) + return -EFAULT; + + if (syncio) + do_bio_wait(2); + + return 0; +} + +/** + * toi_bdev_page_io - simpler interface to do directly i/o on a single page + * @writing: Whether reading or writing. + * @bdev: Block device on which we're operating. + * @pos: Sector at which page to read or write starts. + * @page: Page to be read/written. + * + * A simple interface to submit a page of I/O and wait for its completion. + * The caller must free the page used. + **/ +static int toi_bdev_page_io(int writing, struct block_device *bdev, + long pos, struct page *page) +{ + return toi_do_io(writing, bdev, pos, page, 0, 1, 0); +} + +/** + * toi_bio_memory_needed - report the amount of memory needed for block i/o + * + * We want to have at least enough memory so as to have target_outstanding_io + * or more transactions on the fly at once. If we can do more, fine. 
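+ *
+ * (As a rough worked example under the default target_outstanding_io of
+ * 1024 with 4K pages: 1024 * (4096 + sizeof(struct request) +
+ * sizeof(struct bio)) is a little over 4MB, the exact figure depending
+ * on the kernel configuration.)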
+ **/
+static int toi_bio_memory_needed(void)
+{
+	return target_outstanding_io * (PAGE_SIZE + sizeof(struct request) +
+				sizeof(struct bio));
+}
+
+/**
+ * toi_bio_print_debug_stats - put out debugging info in the buffer provided
+ * @buffer: A buffer of size @size into which text should be placed.
+ * @size: The size of @buffer.
+ *
+ * Fill a buffer with debugging info. This is used both for our debug_info
+ * sysfs entry and for recording the same info in dmesg.
+ **/
+static int toi_bio_print_debug_stats(char *buffer, int size)
+{
+	int len = scnprintf(buffer, size, "- Max outstanding reads %d. Max "
+			"writes %d.\n", max_outstanding_reads,
+			max_outstanding_writes);
+
+	len += scnprintf(buffer + len, size - len,
+		" Memory_needed: %d x (%lu + %u + %u) = %d bytes.\n",
+		target_outstanding_io,
+		PAGE_SIZE, (unsigned int) sizeof(struct request),
+		(unsigned int) sizeof(struct bio), toi_bio_memory_needed());
+
+#ifdef MEASURE_MUTEX_CONTENTION
+	{
+	int i;
+
+	len += scnprintf(buffer + len, size - len,
+		" Mutex contention while reading:\n Contended Free\n");
+
+	for_each_online_cpu(i)
+		len += scnprintf(buffer + len, size - len,
+			" %9lu %9lu\n",
+			mutex_times[0][0][i], mutex_times[0][1][i]);
+
+	len += scnprintf(buffer + len, size - len,
+		" Mutex contention while writing:\n Contended Free\n");
+
+	for_each_online_cpu(i)
+		len += scnprintf(buffer + len, size - len,
+			" %9lu %9lu\n",
+			mutex_times[1][0][i], mutex_times[1][1][i]);
+
+	}
+#endif
+
+	return len + scnprintf(buffer + len, size - len,
+		" Free mem throttle point reached %d.\n", free_mem_throttle);
+}
+
+/**
+ * toi_set_devinfo - set the bdev info used for i/o
+ * @info: Pointer to an array of struct toi_bdev_info - the list of
+ * bdevs and blocks on them in which the image is stored.
+ *
+ * Set the list of bdevs and blocks in which the image will be stored.
+ * Think of them (all together) as one long tape on which the data will be
+ * stored.
+ **/
+static void toi_set_devinfo(struct toi_bdev_info *info)
+{
+	toi_devinfo = info;
+}
+
+/**
+ * dump_block_chains - print the contents of the bdev info array.
+ **/
+static void dump_block_chains(void)
+{
+	int i;
+
+	for (i = 0; i < toi_writer_posn.num_chains; i++) {
+		struct hibernate_extent *this;
+
+		this = (toi_writer_posn.chains + i)->first;
+
+		if (!this)
+			continue;
+
+		printk(KERN_DEBUG "Chain %d:", i);
+
+		while (this) {
+			printk(" [%lu-%lu]%s", this->start,
+					this->end, this->next ? "," : "");
+			this = this->next;
+		}
+
+		printk("\n");
+	}
+
+	for (i = 0; i < 4; i++)
+		printk(KERN_DEBUG "Posn %d: Chain %d, extent %d, offset %lu.\n",
+				i, toi_writer_posn_save[i].chain_num,
+				toi_writer_posn_save[i].extent_num,
+				toi_writer_posn_save[i].offset);
+}
+
+static int total_header_bytes;
+static int unowned;
+
+static int debug_broken_header(void)
+{
+	printk(KERN_DEBUG "Image header too big for size allocated!\n");
+	print_toi_header_storage_for_modules();
+	printk(KERN_DEBUG "Page flags : %d.\n", toi_pageflags_space_needed());
+	printk(KERN_DEBUG "toi_header : %ld.\n", sizeof(struct toi_header));
+	printk(KERN_DEBUG "Total unowned : %d.\n", unowned);
+	printk(KERN_DEBUG "Total used : %d (%ld pages).\n", total_header_bytes,
+			DIV_ROUND_UP(total_header_bytes, PAGE_SIZE));
+	printk(KERN_DEBUG "Space needed now : %ld.\n",
+			get_header_storage_needed());
+	dump_block_chains();
+	abort_hibernate(TOI_HEADER_TOO_BIG, "Header reservation too small.");
+	return -EIO;
+}
+
+/**
+ * go_next_page - skip blocks to the start of the next page
+ * @writing: Whether we're reading or writing the image.
+ * @section_barrier: Whether to stop at the end of the current stream.
+ *
+ * Go forward one page, or two if extra_page_forward is set. It only gets
+ * set at the start of reading the image header, to skip the first page
+ * of the header, which is read without using the extent chains.
+ **/
+static int go_next_page(int writing, int section_barrier)
+{
+	int i, max = (toi_writer_posn.current_chain == -1) ? 1 :
+		toi_devinfo[toi_writer_posn.current_chain].blocks_per_page,
+	    compare_to = 0;
+
+	/* Have we already used the last page of the stream? */
+	switch (current_stream) {
+	case 0:
+		compare_to = 2;
+		break;
+	case 1:
+		compare_to = 3;
+		break;
+	case 2:
+		compare_to = 1;
+		break;
+	}
+
+	if (section_barrier && toi_writer_posn.current_chain ==
+			toi_writer_posn_save[compare_to].chain_num &&
+	    toi_writer_posn.current_offset ==
+			toi_writer_posn_save[compare_to].offset) {
+		if (writing) {
+			if (!current_stream)
+				return debug_broken_header();
+		} else {
+			more_readahead = 0;
+			return -ENODATA;
+		}
+	}
+
+	/* Nope. Go forward a page - or maybe two. */
+	for (i = 0; i < max; i++)
+		toi_extent_state_next(&toi_writer_posn);
+
+	if (toi_extent_state_eof(&toi_writer_posn)) {
+		/* Don't complain if readahead falls off the end */
+		if (writing && section_barrier) {
+			printk(KERN_DEBUG "Extent state eof. "
+				"Expected compression ratio too optimistic?\n");
+			dump_block_chains();
+		}
+		return -ENODATA;
+	}
+
+	if (extra_page_forward) {
+		extra_page_forward = 0;
+		return go_next_page(writing, section_barrier);
+	}
+
+	return 0;
+}
+
+/**
+ * set_extra_page_forward - make us skip an extra page on next go_next_page
+ *
+ * Used in reading header, to jump to 2nd page after getting 1st page
+ * direct from image header.
+ **/
+static void set_extra_page_forward(void)
+{
+	extra_page_forward = 1;
+}
+
+/**
+ * toi_bio_rw_page - do i/o on the next disk page in the image
+ * @writing: Whether reading or writing.
+ * @page: Page to do i/o on.
+ * @is_readahead: Whether we're doing readahead
+ * @free_group: The group used in allocating the page
+ *
+ * Submit a page for reading or writing, possibly readahead.
+ * Pass the group used in allocating the page as well, as it should
+ * be freed on completion of the bio if we're writing the page.
+ **/
+static int toi_bio_rw_page(int writing, struct page *page,
+		int is_readahead, int free_group)
+{
+	struct toi_bdev_info *dev_info;
+	int result = go_next_page(writing, 1);
+
+	if (result)
+		return result;
+
+	dev_info = &toi_devinfo[toi_writer_posn.current_chain];
+
+	return toi_do_io(writing, dev_info->bdev,
+		toi_writer_posn.current_offset <<
+			dev_info->bmap_shift,
+		page, is_readahead, 0, free_group);
+}
+
+/**
+ * toi_rw_init - prepare to read or write a stream in the image
+ * @writing: Whether reading or writing.
+ * @stream_number: Section of the image being processed.
+ *
+ * Prepare to read or write a section ('stream') in the image.
+ **/
+static int toi_rw_init(int writing, int stream_number)
+{
+	if (stream_number)
+		toi_extent_state_restore(&toi_writer_posn,
+				&toi_writer_posn_save[stream_number]);
+	else
+		toi_extent_state_goto_start(&toi_writer_posn);
+
+	atomic_set(&toi_io_done, 0);
+	toi_writer_buffer = (char *) toi_get_zeroed_page(11, TOI_ATOMIC_GFP);
+	toi_writer_buffer_posn = writing ? 0 : PAGE_SIZE;
+
+	current_stream = stream_number;
+
+	more_readahead = 1;
+
+	return toi_writer_buffer ? 0 : -ENOMEM;
+}
+
+/**
+ * toi_read_header_init - prepare to read the image header
+ *
+ * Reset readahead indices prior to starting to read a section of the image.
+ **/
+static void toi_read_header_init(void)
+{
+	toi_writer_buffer = (char *) toi_get_zeroed_page(11, TOI_ATOMIC_GFP);
+	more_readahead = 1;
+}
+
+/**
+ * toi_bio_queue_write - queue a page for writing
+ * @full_buffer: Pointer to a page to be queued
+ *
+ * Add a page to the queue to be submitted. If we're the queue flusher,
+ * we'll do this once we've dropped toi_bio_mutex, so other threads can
+ * continue to submit I/O while we're on the slow path doing the actual
+ * submission.
+ **/
+static void toi_bio_queue_write(char **full_buffer)
+{
+	struct page *page = virt_to_page(*full_buffer);
+	unsigned long flags;
+
+	page->private = 0;
+
+	spin_lock_irqsave(&bio_queue_lock, flags);
+	if (!bio_queue_head)
+		bio_queue_head = page;
+	else
+		bio_queue_tail->private = (unsigned long) page;
+
+	bio_queue_tail = page;
+	atomic_inc(&toi_bio_queue_size);
+
+	spin_unlock_irqrestore(&bio_queue_lock, flags);
+	wake_up(&toi_io_queue_flusher);
+
+	*full_buffer = NULL;
+}
+
+/**
+ * toi_rw_cleanup - Cleanup after i/o.
+ * @writing: Whether we were reading or writing.
+ *
+ * Flush all I/O and clean everything up after reading or writing a
+ * section of the image.
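+ *
+ * (Teardown order matters: any partially filled write buffer is queued,
+ * the queue is flushed, stream positions are saved, and only after all
+ * outstanding I/O completes are leftover readahead pages freed.)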
+ **/
+static int toi_rw_cleanup(int writing)
+{
+	int i, result;
+
+	if (writing) {
+		if (toi_writer_buffer_posn && !test_result_state(TOI_ABORTED))
+			toi_bio_queue_write(&toi_writer_buffer);
+
+		result = toi_bio_queue_flush_pages(0);
+
+		if (result)
+			return result;
+
+		if (current_stream == 2)
+			toi_extent_state_save(&toi_writer_posn,
+					&toi_writer_posn_save[1]);
+		else if (current_stream == 1)
+			toi_extent_state_save(&toi_writer_posn,
+					&toi_writer_posn_save[3]);
+	}
+
+	result = toi_finish_all_io();
+
+	while (readahead_list_head) {
+		void *next = (void *) readahead_list_head->private;
+		toi__free_page(12, readahead_list_head);
+		readahead_list_head = next;
+	}
+
+	readahead_list_tail = NULL;
+
+	if (!current_stream)
+		return result;
+
+	for (i = 0; i < NUM_REASONS; i++) {
+		if (!atomic_read(&reasons[i]))
+			continue;
+		printk(KERN_DEBUG "Waited for i/o due to %s %d times.\n",
+				reason_name[i], atomic_read(&reasons[i]));
+		atomic_set(&reasons[i], 0);
+	}
+
+	current_stream = 0;
+	return result;
+}
+
+/**
+ * toi_start_one_readahead - start one page of readahead
+ * @dedicated_thread: Is this a thread dedicated to doing readahead?
+ *
+ * Start one new page of readahead. If this is being called by a thread
+ * whose only job is to submit readahead, don't quit because we failed
+ * to allocate a page.
+ **/
+static int toi_start_one_readahead(int dedicated_thread)
+{
+	char *buffer = NULL;
+	int oom = 0, result;
+
+	result = throttle_if_needed(dedicated_thread ? THROTTLE_WAIT : 0);
+	if (result)
+		return result;
+
+	mutex_lock(&toi_bio_readahead_mutex);
+
+	while (!buffer) {
+		buffer = (char *) toi_get_zeroed_page(12,
+				TOI_ATOMIC_GFP);
+		if (!buffer) {
+			if (oom && !dedicated_thread) {
+				mutex_unlock(&toi_bio_readahead_mutex);
+				return -ENOMEM;
+			}
+
+			oom = 1;
+			set_free_mem_throttle();
+			do_bio_wait(5);
+		}
+	}
+
+	result = toi_bio_rw_page(READ, virt_to_page(buffer), 1, 0);
+	mutex_unlock(&toi_bio_readahead_mutex);
+	return result;
+}
+
+/**
+ * toi_start_new_readahead - start new readahead
+ * @dedicated_thread: Are we dedicated to this task?
+ *
+ * Start readahead of image pages.
+ *
+ * We can be called as a thread dedicated to this task (may be helpful on
+ * systems with lots of CPUs), in which case we don't exit until there's no
+ * more readahead.
+ *
+ * If this is not called by a dedicated thread, we top up our queue until
+ * there's no more readahead to submit, until we've submitted the number
+ * given in target_outstanding_io, or until the number in progress exceeds
+ * the target outstanding I/O value.
+ *
+ * No mutex needed because this is only ever called by the first cpu.
+ **/
+static int toi_start_new_readahead(int dedicated_thread)
+{
+	int last_result, num_submitted = 0;
+
+	/* Start a new readahead? */
+	if (!more_readahead)
+		return 0;
+
+	do {
+		last_result = toi_start_one_readahead(dedicated_thread);
+
+		if (last_result) {
+			if (last_result == -ENOMEM || last_result == -ENODATA)
+				return 0;
+
+			printk(KERN_DEBUG
+				"Begin read chunk returned %d.\n",
+				last_result);
+		} else
+			num_submitted++;
+
+	} while (more_readahead && !last_result &&
+		 (dedicated_thread ||
+		  (num_submitted < target_outstanding_io &&
+		   atomic_read(&toi_io_in_progress) < target_outstanding_io)));
+
+	return last_result;
+}
+
+/**
+ * bio_io_flusher - start the dedicated I/O flushing routine
+ * @writing: Whether we're writing the image.
+ **/ +static int bio_io_flusher(int writing) +{ + + if (writing) + return toi_bio_queue_flush_pages(1); + else + return toi_start_new_readahead(1); +} + +/** + * toi_bio_get_next_page_read - read a disk page, perhaps with readahead + * @no_readahead: Whether we can use readahead + * + * Read a page from disk, submitting readahead and cleaning up finished i/o + * while we wait for the page we're after. + **/ +static int toi_bio_get_next_page_read(int no_readahead) +{ + unsigned long *virt; + struct page *next; + + /* + * When reading the second page of the header, we have to + * delay submitting the read until after we've gotten the + * extents out of the first page. + */ + if (unlikely(no_readahead && toi_start_one_readahead(0))) { + printk(KERN_DEBUG "No readahead and toi_start_one_readahead " + "returned non-zero.\n"); + return -EIO; + } + + if (unlikely(!readahead_list_head)) { + BUG_ON(!more_readahead); + if (unlikely(toi_start_one_readahead(0))) { + printk(KERN_DEBUG "No readahead and " + "toi_start_one_readahead returned non-zero.\n"); + return -EIO; + } + } + + if (PageLocked(readahead_list_head)) { + waiting_on = readahead_list_head; + do_bio_wait(0); + } + + virt = page_address(readahead_list_head); + memcpy(toi_writer_buffer, virt, PAGE_SIZE); + + next = (struct page *) readahead_list_head->private; + toi__free_page(12, readahead_list_head); + readahead_list_head = next; + return 0; +} + +/** + * toi_bio_queue_flush_pages - flush the queue of pages queued for writing + * @dedicated_thread: Whether we're a dedicated thread + * + * Flush the queue of pages ready to be written to disk. + * + * If we're a dedicated thread, stay in here until told to leave, + * sleeping in wait_event. + * + * The first thread is normally the only one to come in here. Another + * thread can enter this routine too, though, via throttle_if_needed. + * Since that's the case, we must be careful to only have one thread + * doing this work at a time. Otherwise we have a race and could save + * pages out of order. + * + * If an error occurs, free all remaining pages without submitting them + * for I/O. + **/ + +int toi_bio_queue_flush_pages(int dedicated_thread) +{ + unsigned long flags; + int result = 0; + static int busy; + + if (busy) + return 0; + + busy = 1; + +top: + spin_lock_irqsave(&bio_queue_lock, flags); + while (bio_queue_head) { + struct page *page = bio_queue_head; + bio_queue_head = (struct page *) page->private; + if (bio_queue_tail == page) + bio_queue_tail = NULL; + atomic_dec(&toi_bio_queue_size); + spin_unlock_irqrestore(&bio_queue_lock, flags); + if (!result) + result = toi_bio_rw_page(WRITE, page, 0, 11); + if (result) + toi__free_page(11 , page); + spin_lock_irqsave(&bio_queue_lock, flags); + } + spin_unlock_irqrestore(&bio_queue_lock, flags); + + if (dedicated_thread) { + wait_event(toi_io_queue_flusher, bio_queue_head || + toi_bio_queue_flusher_should_finish); + if (likely(!toi_bio_queue_flusher_should_finish)) + goto top; + toi_bio_queue_flusher_should_finish = 0; + } + + busy = 0; + return result; +} + +/** + * toi_bio_get_new_page - get a new page for I/O + * @full_buffer: Pointer to a page to allocate. 
+ **/
+static int toi_bio_get_new_page(char **full_buffer)
+{
+	int result = throttle_if_needed(THROTTLE_WAIT);
+	if (result)
+		return result;
+
+	while (!*full_buffer) {
+		*full_buffer = (char *) toi_get_zeroed_page(11,
+				TOI_ATOMIC_GFP);
+		if (!*full_buffer) {
+			set_free_mem_throttle();
+			do_bio_wait(3);
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * toi_rw_buffer - combine smaller buffers into PAGE_SIZE I/O
+ * @writing:		Bool - whether writing (or reading).
+ * @buffer:		The start of the buffer to write or fill.
+ * @buffer_size:	The size of the buffer to write or fill.
+ * @no_readahead:	Don't try to start readahead (when getting extents).
+ **/
+static int toi_rw_buffer(int writing, char *buffer, int buffer_size,
+		int no_readahead)
+{
+	int bytes_left = buffer_size, result = 0;
+
+	while (bytes_left) {
+		char *source_start = buffer + buffer_size - bytes_left;
+		char *dest_start = toi_writer_buffer + toi_writer_buffer_posn;
+		int capacity = PAGE_SIZE - toi_writer_buffer_posn;
+		char *to = writing ? dest_start : source_start;
+		char *from = writing ? source_start : dest_start;
+
+		if (bytes_left <= capacity) {
+			memcpy(to, from, bytes_left);
+			toi_writer_buffer_posn += bytes_left;
+			return 0;
+		}
+
+		/* Complete this page and start a new one */
+		memcpy(to, from, capacity);
+		bytes_left -= capacity;
+
+		if (!writing) {
+			/*
+			 * Perform actual I/O:
+			 * read readahead_list_head into toi_writer_buffer
+			 */
+			result = toi_bio_get_next_page_read(no_readahead);
+			if (result)
+				return result;
+		} else {
+			toi_bio_queue_write(&toi_writer_buffer);
+			result = toi_bio_get_new_page(&toi_writer_buffer);
+			if (result)
+				return result;
+		}
+
+		toi_writer_buffer_posn = 0;
+		toi_cond_pause(0, NULL);
+	}
+
+	return 0;
+}
+
+/**
+ * toi_bio_read_page - read a page of the image
+ * @pfn:		The pfn where the data belongs.
+ * @buffer_page:	The page containing the (possibly compressed) data.
+ * @buf_size:		The number of bytes on @buffer_page used (PAGE_SIZE).
+ *
+ * Read a (possibly compressed) page from the image, into buffer_page,
+ * returning its pfn and the buffer size.
+ **/
+static int toi_bio_read_page(unsigned long *pfn, struct page *buffer_page,
+		unsigned int *buf_size)
+{
+	int result = 0;
+	char *buffer_virt = kmap(buffer_page);
+
+	/*
+	 * Only call start_new_readahead if we don't have a dedicated thread
+	 * and we're the queue flusher.
+	 */
+	if (current == toi_queue_flusher) {
+		int result2 = toi_start_new_readahead(0);
+		if (result2) {
+			printk(KERN_DEBUG "Queue flusher and "
+			 "toi_start_new_readahead returned non-zero.\n");
+			result = -EIO;
+			goto out;
+		}
+	}
+
+	my_mutex_lock(0, &toi_bio_mutex);
+
+	/*
+	 * Structure in the image:
+	 *	[destination pfn|page size|page data]
+	 * buf_size is PAGE_SIZE
+	 */
+	if (toi_rw_buffer(READ, (char *) pfn, sizeof(unsigned long), 0) ||
+	    toi_rw_buffer(READ, (char *) buf_size, sizeof(int), 0) ||
+	    toi_rw_buffer(READ, buffer_virt, *buf_size, 0)) {
+		abort_hibernate(TOI_FAILED_IO, "Read of data failed.");
+		result = 1;
+	}
+
+	my_mutex_unlock(0, &toi_bio_mutex);
+out:
+	kunmap(buffer_page);
+	return result;
+}
+
+/**
+ * toi_bio_write_page - write a page of the image
+ * @pfn:		The pfn where the data belongs.
+ * @buffer_page:	The page containing the (possibly compressed) data.
+ * @buf_size:	The number of bytes on @buffer_page used.
+ *
+ * Write a (possibly compressed) page to the image from the buffer, together
+ * with its index and buffer size.
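+ *
+ * Each record in the image stream is therefore laid out as
+ * [pfn (sizeof(unsigned long)) | buf_size (sizeof(int)) | buf_size bytes
+ * of page data], matching the reads performed by toi_bio_read_page().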
+ **/ +static int toi_bio_write_page(unsigned long pfn, struct page *buffer_page, + unsigned int buf_size) +{ + char *buffer_virt; + int result = 0, result2 = 0; + + if (unlikely(test_action_state(TOI_TEST_FILTER_SPEED))) + return 0; + + my_mutex_lock(1, &toi_bio_mutex); + + if (test_result_state(TOI_ABORTED)) { + my_mutex_unlock(1, &toi_bio_mutex); + return -EIO; + } + + buffer_virt = kmap(buffer_page); + + /* + * Structure in the image: + * [destination pfn|page size|page data] + * buf_size is PAGE_SIZE + */ + if (toi_rw_buffer(WRITE, (char *) &pfn, sizeof(unsigned long), 0) || + toi_rw_buffer(WRITE, (char *) &buf_size, sizeof(int), 0) || + toi_rw_buffer(WRITE, buffer_virt, buf_size, 0)) { + printk(KERN_DEBUG "toi_rw_buffer returned non-zero to " + "toi_bio_write_page.\n"); + result = -EIO; + } + + kunmap(buffer_page); + my_mutex_unlock(1, &toi_bio_mutex); + + if (current == toi_queue_flusher) + result2 = toi_bio_queue_flush_pages(0); + + return result ? result : result2; +} + +/** + * _toi_rw_header_chunk - read or write a portion of the image header + * @writing: Whether reading or writing. + * @owner: The module for which we're writing. + * Used for confirming that modules + * don't use more header space than they asked for. + * @buffer: Address of the data to write. + * @buffer_size: Size of the data buffer. + * @no_readahead: Don't try to start readhead (when getting extents). + * + * Perform PAGE_SIZE I/O. Start readahead if needed. + **/ +static int _toi_rw_header_chunk(int writing, struct toi_module_ops *owner, + char *buffer, int buffer_size, int no_readahead) +{ + int result = 0; + + if (owner) { + owner->header_used += buffer_size; + toi_message(TOI_HEADER, TOI_LOW, 1, + "Header: %s : %d bytes (%d/%d).\n", + owner->name, + buffer_size, owner->header_used, + owner->header_requested); + if (owner->header_used > owner->header_requested) { + printk(KERN_EMERG "TuxOnIce module %s is using more " + "header space (%u) than it requested (%u).\n", + owner->name, + owner->header_used, + owner->header_requested); + return buffer_size; + } + } else { + unowned += buffer_size; + toi_message(TOI_HEADER, TOI_LOW, 1, + "Header: (No owner): %d bytes (%d total so far)\n", + buffer_size, unowned); + } + + if (!writing && !no_readahead) + result = toi_start_new_readahead(0); + + if (!result) + result = toi_rw_buffer(writing, buffer, buffer_size, + no_readahead); + + total_header_bytes += buffer_size; + return result; +} + +static int toi_rw_header_chunk(int writing, struct toi_module_ops *owner, + char *buffer, int size) +{ + return _toi_rw_header_chunk(writing, owner, buffer, size, 0); +} + +static int toi_rw_header_chunk_noreadahead(int writing, + struct toi_module_ops *owner, char *buffer, int size) +{ + return _toi_rw_header_chunk(writing, owner, buffer, size, 1); +} + +/** + * write_header_chunk_finish - flush any buffered header data + **/ +static int write_header_chunk_finish(void) +{ + int result = 0; + + if (toi_writer_buffer_posn) + toi_bio_queue_write(&toi_writer_buffer); + + result = toi_finish_all_io(); + + unowned = 0; + total_header_bytes = 0; + return result; +} + +/** + * toi_bio_storage_needed - get the amount of storage needed for my fns + **/ +static int toi_bio_storage_needed(void) +{ + return sizeof(int); +} + +/** + * toi_bio_save_config_info - save block I/O config to image header + * @buf: PAGE_SIZE'd buffer into which data should be saved. 
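+ *
+ * Only target_outstanding_io is persisted; toi_bio_load_config_info()
+ * performs the matching restore at resume time.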
+ **/ +static int toi_bio_save_config_info(char *buf) +{ + int *ints = (int *) buf; + ints[0] = target_outstanding_io; + return sizeof(int); +} + +/** + * toi_bio_load_config_info - restore block I/O config + * @buf: Data to be reloaded. + * @size: Size of the buffer saved. + **/ +static void toi_bio_load_config_info(char *buf, int size) +{ + int *ints = (int *) buf; + target_outstanding_io = ints[0]; +} + +/** + * toi_bio_initialise - initialise bio code at start of some action + * @starting_cycle: Whether starting a hibernation cycle, or just reading or + * writing a sysfs value. + **/ +static int toi_bio_initialise(int starting_cycle) +{ + if (starting_cycle) { + max_outstanding_writes = 0; + max_outstanding_reads = 0; + toi_queue_flusher = current; +#ifdef MEASURE_MUTEX_CONTENTION + { + int i, j, k; + + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for_each_online_cpu(k) + mutex_times[i][j][k] = 0; + } +#endif + } + + return 0; +} + +/** + * toi_bio_cleanup - cleanup after some action + * @finishing_cycle: Whether completing a cycle. + **/ +static void toi_bio_cleanup(int finishing_cycle) +{ + if (toi_writer_buffer) { + toi_free_page(11, (unsigned long) toi_writer_buffer); + toi_writer_buffer = NULL; + } +} + +struct toi_bio_ops toi_bio_ops = { + .bdev_page_io = toi_bdev_page_io, + .finish_all_io = toi_finish_all_io, + .update_throughput_throttle = update_throughput_throttle, + .forward_one_page = go_next_page, + .set_extra_page_forward = set_extra_page_forward, + .set_devinfo = toi_set_devinfo, + .read_page = toi_bio_read_page, + .write_page = toi_bio_write_page, + .rw_init = toi_rw_init, + .rw_cleanup = toi_rw_cleanup, + .read_header_init = toi_read_header_init, + .rw_header_chunk = toi_rw_header_chunk, + .rw_header_chunk_noreadahead = toi_rw_header_chunk_noreadahead, + .write_header_chunk_finish = write_header_chunk_finish, + .io_flusher = bio_io_flusher, +}; +EXPORT_SYMBOL_GPL(toi_bio_ops); + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_INT("target_outstanding_io", SYSFS_RW, &target_outstanding_io, + 0, 16384, 0, NULL), +}; + +static struct toi_module_ops toi_blockwriter_ops = { + .name = "lowlevel i/o", + .type = MISC_HIDDEN_MODULE, + .directory = "block_io", + .module = THIS_MODULE, + .print_debug_info = toi_bio_print_debug_stats, + .memory_needed = toi_bio_memory_needed, + .storage_needed = toi_bio_storage_needed, + .save_config_info = toi_bio_save_config_info, + .load_config_info = toi_bio_load_config_info, + .initialise = toi_bio_initialise, + .cleanup = toi_bio_cleanup, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/** + * toi_block_io_load - load time routine for block I/O module + * + * Register block i/o ops and sysfs entries. 
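+ *
+ * Returns the result of toi_register_module(). When built in, this runs
+ * from late_initcall(); as a module, from module_init().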
+ **/ +static __init int toi_block_io_load(void) +{ + return toi_register_module(&toi_blockwriter_ops); +} + +#ifdef MODULE +static __exit void toi_block_io_unload(void) +{ + toi_unregister_module(&toi_blockwriter_ops); +} + +module_init(toi_block_io_load); +module_exit(toi_block_io_unload); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Nigel Cunningham"); +MODULE_DESCRIPTION("TuxOnIce block io functions"); +#else +late_initcall(toi_block_io_load); +#endif diff --git a/kernel/power/tuxonice_block_io.h b/kernel/power/tuxonice_block_io.h new file mode 100644 index 0000000..b18298c --- /dev/null +++ b/kernel/power/tuxonice_block_io.h @@ -0,0 +1,59 @@ +/* + * kernel/power/tuxonice_block_io.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * Copyright (C) 2006 Red Hat, inc. + * + * Distributed under GPLv2. + * + * This file contains declarations for functions exported from + * tuxonice_block_io.c, which contains low level io functions. + */ + +#include +#include "tuxonice_extent.h" + +struct toi_bdev_info { + struct block_device *bdev; + dev_t dev_t; + int bmap_shift; + int blocks_per_page; + int ignored; +}; + +/* + * Our exported interface so the swapwriter and filewriter don't + * need these functions duplicated. + */ +struct toi_bio_ops { + int (*bdev_page_io) (int rw, struct block_device *bdev, long pos, + struct page *page); + void (*check_io_stats) (void); + void (*reset_io_stats) (void); + void (*update_throughput_throttle) (int jif_index); + int (*finish_all_io) (void); + int (*forward_one_page) (int writing, int section_barrier); + void (*set_extra_page_forward) (void); + void (*set_devinfo) (struct toi_bdev_info *info); + int (*read_page) (unsigned long *index, struct page *buffer_page, + unsigned int *buf_size); + int (*write_page) (unsigned long index, struct page *buffer_page, + unsigned int buf_size); + void (*read_header_init) (void); + int (*rw_header_chunk) (int rw, struct toi_module_ops *owner, + char *buffer, int buffer_size); + int (*rw_header_chunk_noreadahead) (int rw, + struct toi_module_ops *owner, + char *buffer, int buffer_size); + int (*write_header_chunk_finish) (void); + int (*rw_init) (int rw, int stream_number); + int (*rw_cleanup) (int rw); + int (*io_flusher) (int rw); +}; + +extern struct toi_bio_ops toi_bio_ops; + +extern char *toi_writer_buffer; +extern int toi_writer_buffer_posn; +extern struct hibernate_extent_iterate_saved_state toi_writer_posn_save[4]; +extern struct toi_extent_iterate_state toi_writer_posn; diff --git a/kernel/power/tuxonice_builtin.c b/kernel/power/tuxonice_builtin.c new file mode 100644 index 0000000..635e602 --- /dev/null +++ b/kernel/power/tuxonice_builtin.c @@ -0,0 +1,314 @@ +/* + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice_io.h" +#include "tuxonice.h" +#include "tuxonice_extent.h" +#include "tuxonice_netlink.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_ui.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_modules.h" +#include "tuxonice_builtin.h" +#include "tuxonice_power_off.h" + +/* + * Highmem related functions (x86 only). + */ + +#ifdef CONFIG_HIGHMEM + +/** + * copyback_high: Restore highmem pages. + * + * Highmem data and pbe lists are/can be stored in highmem. 
+ * The format is slightly different to the lowmem pbe lists + * used for the assembly code: the last pbe in each page is + * a struct page * instead of struct pbe *, pointing to the + * next page where pbes are stored (or NULL if happens to be + * the end of the list). Since we don't want to generate + * unnecessary deltas against swsusp code, we use a cast + * instead of a union. + **/ + +static void copyback_high(void) +{ + struct page *pbe_page = (struct page *) restore_highmem_pblist; + struct pbe *this_pbe, *first_pbe; + unsigned long *origpage, *copypage; + int pbe_index = 1; + + if (!pbe_page) + return; + + this_pbe = (struct pbe *) kmap_atomic(pbe_page, KM_BOUNCE_READ); + first_pbe = this_pbe; + + while (this_pbe) { + int loop = (PAGE_SIZE / sizeof(unsigned long)) - 1; + + origpage = kmap_atomic((struct page *) this_pbe->orig_address, + KM_BIO_DST_IRQ); + copypage = kmap_atomic((struct page *) this_pbe->address, + KM_BIO_SRC_IRQ); + + while (loop >= 0) { + *(origpage + loop) = *(copypage + loop); + loop--; + } + + kunmap_atomic(origpage, KM_BIO_DST_IRQ); + kunmap_atomic(copypage, KM_BIO_SRC_IRQ); + + if (!this_pbe->next) + break; + + if (pbe_index < PBES_PER_PAGE) { + this_pbe++; + pbe_index++; + } else { + pbe_page = (struct page *) this_pbe->next; + kunmap_atomic(first_pbe, KM_BOUNCE_READ); + if (!pbe_page) + return; + this_pbe = (struct pbe *) kmap_atomic(pbe_page, + KM_BOUNCE_READ); + first_pbe = this_pbe; + pbe_index = 1; + } + } + kunmap_atomic(first_pbe, KM_BOUNCE_READ); +} + +#else /* CONFIG_HIGHMEM */ +static void copyback_high(void) { } +#endif + +char toi_wait_for_keypress_dev_console(int timeout) +{ + int fd, this_timeout = 255; + char key = '\0'; + struct termios t, t_backup; + + /* We should be guaranteed /dev/console exists after populate_rootfs() + * in init/main.c. + */ + fd = sys_open("/dev/console", O_RDONLY, 0); + if (fd < 0) { + printk(KERN_INFO "Couldn't open /dev/console.\n"); + return key; + } + + if (sys_ioctl(fd, TCGETS, (long)&t) < 0) + goto out_close; + + memcpy(&t_backup, &t, sizeof(t)); + + t.c_lflag &= ~(ISIG|ICANON|ECHO); + t.c_cc[VMIN] = 0; + +new_timeout: + if (timeout > 0) { + this_timeout = timeout < 26 ? timeout : 25; + timeout -= this_timeout; + this_timeout *= 10; + } + + t.c_cc[VTIME] = this_timeout; + + if (sys_ioctl(fd, TCSETS, (long)&t) < 0) + goto out_restore; + + while (1) { + if (sys_read(fd, &key, 1) <= 0) { + if (timeout) + goto new_timeout; + key = '\0'; + break; + } + key = tolower(key); + if (test_toi_state(TOI_SANITY_CHECK_PROMPT)) { + if (key == 'c') { + set_toi_state(TOI_CONTINUE_REQ); + break; + } else if (key == ' ') + break; + } else + break; + } + +out_restore: + sys_ioctl(fd, TCSETS, (long)&t_backup); +out_close: + sys_close(fd); + + return key; +} +EXPORT_SYMBOL_GPL(toi_wait_for_keypress_dev_console); + +struct toi_boot_kernel_data toi_bkd __nosavedata + __attribute__((aligned(PAGE_SIZE))) = { + MY_BOOT_KERNEL_DATA_VERSION, + 0, +#ifdef CONFIG_TOI_REPLACE_SWSUSP + (1 << TOI_REPLACE_SWSUSP) | +#endif + (1 << TOI_NO_FLUSHER_THREAD) | + (1 << TOI_PAGESET2_FULL) | (1 << TOI_LATE_CPU_HOTPLUG), +}; +EXPORT_SYMBOL_GPL(toi_bkd); + +struct block_device *toi_open_by_devnum(dev_t dev, fmode_t mode) +{ + struct block_device *bdev = bdget(dev); + int err = -ENOMEM; + if (bdev) + err = blkdev_get(bdev, mode); + return err ? 
ERR_PTR(err) : bdev; +} +EXPORT_SYMBOL_GPL(toi_open_by_devnum); + +int toi_wait = CONFIG_TOI_DEFAULT_WAIT; +EXPORT_SYMBOL_GPL(toi_wait); + +struct toi_core_fns *toi_core_fns; +EXPORT_SYMBOL_GPL(toi_core_fns); + +unsigned long toi_result; +EXPORT_SYMBOL_GPL(toi_result); + +struct pagedir pagedir1 = {1}; +EXPORT_SYMBOL_GPL(pagedir1); + +unsigned long toi_get_nonconflicting_page(void) +{ + return toi_core_fns->get_nonconflicting_page(); +} + +int toi_post_context_save(void) +{ + return toi_core_fns->post_context_save(); +} + +int toi_try_hibernate(void) +{ + if (!toi_core_fns) + return -ENODEV; + + return toi_core_fns->try_hibernate(); +} + +static int num_resume_calls; +#ifdef CONFIG_TOI_IGNORE_LATE_INITCALL +static int ignore_late_initcall = 1; +#else +static int ignore_late_initcall; +#endif + +void toi_try_resume(void) +{ + /* Don't let it wrap around eventually */ + if (num_resume_calls < 2) + num_resume_calls++; + + if (num_resume_calls == 1 && ignore_late_initcall) { + printk(KERN_INFO "TuxOnIce: Ignoring late initcall, as requested.\n"); + return; + } + + if (toi_core_fns) + toi_core_fns->try_resume(); + else + printk(KERN_INFO "TuxOnIce core not loaded yet.\n"); +} + +int toi_lowlevel_builtin(void) +{ + int error = 0; + + save_processor_state(); + error = swsusp_arch_suspend(); + if (error) + printk(KERN_ERR "Error %d hibernating\n", error); + + /* Restore control flow appears here */ + if (!toi_in_hibernate) { + copyback_high(); + set_toi_state(TOI_NOW_RESUMING); + } + + restore_processor_state(); + + return error; +} +EXPORT_SYMBOL_GPL(toi_lowlevel_builtin); + +unsigned long toi_compress_bytes_in; +EXPORT_SYMBOL_GPL(toi_compress_bytes_in); + +unsigned long toi_compress_bytes_out; +EXPORT_SYMBOL_GPL(toi_compress_bytes_out); + +unsigned long toi_state = ((1 << TOI_BOOT_TIME) | + (1 << TOI_IGNORE_LOGLEVEL) | + (1 << TOI_IO_STOPPED)); +EXPORT_SYMBOL_GPL(toi_state); + +/* The number of hibernates we have started (some may have been cancelled) */ +unsigned int nr_hibernates; +EXPORT_SYMBOL_GPL(nr_hibernates); + +int toi_running; +EXPORT_SYMBOL_GPL(toi_running); + +__nosavedata int toi_in_hibernate; +EXPORT_SYMBOL_GPL(toi_in_hibernate); + +__nosavedata struct pbe *restore_highmem_pblist; +EXPORT_SYMBOL_GPL(restore_highmem_pblist); + +static int __init toi_wait_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) { + if (value < -1 || value > 255) + printk(KERN_INFO "TuxOnIce_wait outside range -1 to " + "255.\n"); + else + toi_wait = value; + } + + return 1; +} + +__setup("toi_wait", toi_wait_setup); + +static int __init toi_ignore_late_initcall_setup(char *str) +{ + int value; + + if (sscanf(str, "=%d", &value)) + ignore_late_initcall = value; + + return 1; +} + +__setup("toi_initramfs_resume_only", toi_ignore_late_initcall_setup); diff --git a/kernel/power/tuxonice_builtin.h b/kernel/power/tuxonice_builtin.h new file mode 100644 index 0000000..49b25b7 --- /dev/null +++ b/kernel/power/tuxonice_builtin.h @@ -0,0 +1,27 @@ +/* + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. 
+ */ +#include + +extern struct toi_core_fns *toi_core_fns; +extern unsigned long toi_compress_bytes_in, toi_compress_bytes_out; +extern unsigned int nr_hibernates; +extern int toi_in_hibernate; + +extern __nosavedata struct pbe *restore_highmem_pblist; + +int toi_lowlevel_builtin(void); + +#ifdef CONFIG_HIGHMEM +extern __nosavedata struct zone_data *toi_nosave_zone_list; +extern __nosavedata unsigned long toi_nosave_max_pfn; +#endif + +extern unsigned long toi_get_nonconflicting_page(void); +extern int toi_post_context_save(void); + +extern char toi_wait_for_keypress_dev_console(int timeout); +extern struct block_device *toi_open_by_devnum(dev_t dev, fmode_t mode); +extern int toi_wait; diff --git a/kernel/power/tuxonice_checksum.c b/kernel/power/tuxonice_checksum.c new file mode 100644 index 0000000..b0adc17 --- /dev/null +++ b/kernel/power/tuxonice_checksum.c @@ -0,0 +1,375 @@ +/* + * kernel/power/tuxonice_checksum.c + * + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * Copyright (C) 2006 Red Hat, inc. + * + * This file is released under the GPLv2. + * + * This file contains data checksum routines for TuxOnIce, + * using cryptoapi. They are used to locate any modifications + * made to pageset 2 while we're saving it. + */ + +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_io.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_checksum.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_alloc.h" + +static struct toi_module_ops toi_checksum_ops; + +/* Constant at the mo, but I might allow tuning later */ +static char toi_checksum_name[32] = "md4"; +/* Bytes per checksum */ +#define CHECKSUM_SIZE (16) + +#define CHECKSUMS_PER_PAGE ((PAGE_SIZE - sizeof(void *)) / CHECKSUM_SIZE) + +struct cpu_context { + struct crypto_hash *transform; + struct hash_desc desc; + struct scatterlist sg[2]; + char *buf; +}; + +static DEFINE_PER_CPU(struct cpu_context, contexts); +static int pages_allocated; +static unsigned long page_list; + +static int toi_num_resaved; + +static unsigned long this_checksum, next_page; +static int checksum_index; + +static inline int checksum_pages_needed(void) +{ + return DIV_ROUND_UP(pagedir2.size, CHECKSUMS_PER_PAGE); +} + +/* ---- Local buffer management ---- */ + +/* + * toi_checksum_cleanup + * + * Frees memory allocated for our labours. + */ +static void toi_checksum_cleanup(int ending_cycle) +{ + int cpu; + + if (ending_cycle) { + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + if (this->transform) { + crypto_free_hash(this->transform); + this->transform = NULL; + this->desc.tfm = NULL; + } + + if (this->buf) { + toi_free_page(27, (unsigned long) this->buf); + this->buf = NULL; + } + } + } +} + +/* + * toi_crypto_initialise + * + * Prepare to do some work by allocating buffers and transforms. + * Returns: Int: Zero. Even if we can't set up checksum, we still + * seek to hibernate. 
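+ *
+ * For each online CPU we allocate a hash transform for the configured
+ * algorithm and a page-sized bounce buffer, wired into a scatterlist
+ * for the digest calls.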
+ */ +static int toi_checksum_initialise(int starting_cycle) +{ + int cpu; + + if (!(starting_cycle & SYSFS_HIBERNATE) || !toi_checksum_ops.enabled) + return 0; + + if (!*toi_checksum_name) { + printk(KERN_INFO "TuxOnIce: No checksum algorithm name set.\n"); + return 1; + } + + for_each_online_cpu(cpu) { + struct cpu_context *this = &per_cpu(contexts, cpu); + struct page *page; + + this->transform = crypto_alloc_hash(toi_checksum_name, 0, 0); + if (IS_ERR(this->transform)) { + printk(KERN_INFO "TuxOnIce: Failed to initialise the " + "%s checksum algorithm: %ld.\n", + toi_checksum_name, (long) this->transform); + this->transform = NULL; + return 1; + } + + this->desc.tfm = this->transform; + this->desc.flags = 0; + + page = toi_alloc_page(27, GFP_KERNEL); + if (!page) + return 1; + this->buf = page_address(page); + sg_init_one(&this->sg[0], this->buf, PAGE_SIZE); + } + return 0; +} + +/* + * toi_checksum_print_debug_stats + * @buffer: Pointer to a buffer into which the debug info will be printed. + * @size: Size of the buffer. + * + * Print information to be recorded for debugging purposes into a buffer. + * Returns: Number of characters written to the buffer. + */ + +static int toi_checksum_print_debug_stats(char *buffer, int size) +{ + int len; + + if (!toi_checksum_ops.enabled) + return scnprintf(buffer, size, + "- Checksumming disabled.\n"); + + len = scnprintf(buffer, size, "- Checksum method is '%s'.\n", + toi_checksum_name); + len += scnprintf(buffer + len, size - len, + " %d pages resaved in atomic copy.\n", toi_num_resaved); + return len; +} + +static int toi_checksum_memory_needed(void) +{ + return toi_checksum_ops.enabled ? + checksum_pages_needed() << PAGE_SHIFT : 0; +} + +static int toi_checksum_storage_needed(void) +{ + if (toi_checksum_ops.enabled) + return strlen(toi_checksum_name) + sizeof(int) + 1; + else + return 0; +} + +/* + * toi_checksum_save_config_info + * @buffer: Pointer to a buffer of size PAGE_SIZE. + * + * Save informaton needed when reloading the image at resume time. + * Returns: Number of bytes used for saving our data. + */ +static int toi_checksum_save_config_info(char *buffer) +{ + int namelen = strlen(toi_checksum_name) + 1; + int total_len; + + *((unsigned int *) buffer) = namelen; + strncpy(buffer + sizeof(unsigned int), toi_checksum_name, namelen); + total_len = sizeof(unsigned int) + namelen; + return total_len; +} + +/* toi_checksum_load_config_info + * @buffer: Pointer to the start of the data. + * @size: Number of bytes that were saved. + * + * Description: Reload information needed for dechecksuming the image at + * resume time. 
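+ *
+ * The saved layout is [namelen (unsigned int) | algorithm name, namelen
+ * bytes including the trailing NUL], as written by
+ * toi_checksum_save_config_info() above.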
+ */
+static void toi_checksum_load_config_info(char *buffer, int size)
+{
+	unsigned int namelen;
+
+	namelen = *((unsigned int *) (buffer));
+	strncpy(toi_checksum_name, buffer + sizeof(unsigned int),
+			namelen);
+}
+
+/*
+ * Free Checksum Memory
+ */
+
+void free_checksum_pages(void)
+{
+	while (pages_allocated) {
+		unsigned long next = *((unsigned long *) page_list);
+		ClearPageNosave(virt_to_page(page_list));
+		toi_free_page(15, (unsigned long) page_list);
+		page_list = next;
+		pages_allocated--;
+	}
+}
+
+/*
+ * Allocate Checksum Memory
+ */
+
+int allocate_checksum_pages(void)
+{
+	int pages_needed = checksum_pages_needed();
+
+	if (!toi_checksum_ops.enabled)
+		return 0;
+
+	while (pages_allocated < pages_needed) {
+		unsigned long *new_page =
+		  (unsigned long *) toi_get_zeroed_page(15, TOI_ATOMIC_GFP);
+		if (!new_page) {
+			printk(KERN_ERR "Unable to allocate checksum pages.\n");
+			return -ENOMEM;
+		}
+		SetPageNosave(virt_to_page(new_page));
+		(*new_page) = page_list;
+		page_list = (unsigned long) new_page;
+		pages_allocated++;
+	}
+
+	next_page = (unsigned long) page_list;
+	checksum_index = 0;
+
+	return 0;
+}
+
+char *tuxonice_get_next_checksum(void)
+{
+	if (!toi_checksum_ops.enabled)
+		return NULL;
+
+	if (checksum_index % CHECKSUMS_PER_PAGE)
+		this_checksum += CHECKSUM_SIZE;
+	else {
+		this_checksum = next_page + sizeof(void *);
+		next_page = *((unsigned long *) next_page);
+	}
+
+	checksum_index++;
+	return (char *) this_checksum;
+}
+
+int tuxonice_calc_checksum(struct page *page, char *checksum_locn)
+{
+	char *pa;
+	int result, cpu = smp_processor_id();
+	struct cpu_context *ctx = &per_cpu(contexts, cpu);
+
+	if (!toi_checksum_ops.enabled)
+		return 0;
+
+	pa = kmap(page);
+	memcpy(ctx->buf, pa, PAGE_SIZE);
+	kunmap(page);
+	result = crypto_hash_digest(&ctx->desc, ctx->sg, PAGE_SIZE,
+						checksum_locn);
+	return result;
+}
+
+/*
+ * Verify checksums, resaving any pages that changed while being written.
+ */
+
+void check_checksums(void)
+{
+	unsigned long pfn;
+	int index = 0, cpu = smp_processor_id();
+	char current_checksum[CHECKSUM_SIZE];
+	struct cpu_context *ctx = &per_cpu(contexts, cpu);
+
+	if (!toi_checksum_ops.enabled)
+		return;
+
+	next_page = (unsigned long) page_list;
+
+	toi_num_resaved = 0;
+	this_checksum = 0;
+
+	memory_bm_position_reset(pageset2_map);
+	for (pfn = memory_bm_next_pfn(pageset2_map); pfn != BM_END_OF_MAP;
+			pfn = memory_bm_next_pfn(pageset2_map)) {
+		int ret;
+		char *pa;
+		struct page *page = pfn_to_page(pfn);
+
+		if (index % CHECKSUMS_PER_PAGE) {
+			this_checksum += CHECKSUM_SIZE;
+		} else {
+			this_checksum = next_page + sizeof(void *);
+			next_page = *((unsigned long *) next_page);
+		}
+
+		/* Done when IRQs disabled so must be atomic */
+		pa = kmap_atomic(page, KM_USER1);
+		memcpy(ctx->buf, pa, PAGE_SIZE);
+		kunmap_atomic(pa, KM_USER1);
+		ret = crypto_hash_digest(&ctx->desc, ctx->sg, PAGE_SIZE,
+							current_checksum);
+
+		if (ret) {
+			printk(KERN_INFO "Digest failed. Returned %d.\n", ret);
+			return;
+		}
+
+		if (memcmp(current_checksum, (char *) this_checksum,
+							CHECKSUM_SIZE)) {
+			SetPageResave(pfn_to_page(pfn));
+			toi_num_resaved++;
+			if (test_action_state(TOI_ABORT_ON_RESAVE_NEEDED))
+				set_abort_result(TOI_RESAVE_NEEDED);
+		}
+
+		index++;
+	}
+}
+
+static struct toi_sysfs_data sysfs_params[] = {
+	SYSFS_INT("enabled", SYSFS_RW, &toi_checksum_ops.enabled, 0, 1, 0,
+			NULL),
+	SYSFS_BIT("abort_if_resave_needed", SYSFS_RW, &toi_bkd.toi_action,
+			TOI_ABORT_ON_RESAVE_NEEDED, 0)
+};
+
+/*
+ * Ops structure.
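+ * Registered as a MISC_MODULE; its sysfs entries appear under the
+ * "checksum" directory.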
+ */ +static struct toi_module_ops toi_checksum_ops = { + .type = MISC_MODULE, + .name = "checksumming", + .directory = "checksum", + .module = THIS_MODULE, + .initialise = toi_checksum_initialise, + .cleanup = toi_checksum_cleanup, + .print_debug_info = toi_checksum_print_debug_stats, + .save_config_info = toi_checksum_save_config_info, + .load_config_info = toi_checksum_load_config_info, + .memory_needed = toi_checksum_memory_needed, + .storage_needed = toi_checksum_storage_needed, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ +int toi_checksum_init(void) +{ + int result = toi_register_module(&toi_checksum_ops); + return result; +} + +void toi_checksum_exit(void) +{ + toi_unregister_module(&toi_checksum_ops); +} diff --git a/kernel/power/tuxonice_checksum.h b/kernel/power/tuxonice_checksum.h new file mode 100644 index 0000000..84a9174 --- /dev/null +++ b/kernel/power/tuxonice_checksum.h @@ -0,0 +1,32 @@ +/* + * kernel/power/tuxonice_checksum.h + * + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * Copyright (C) 2006 Red Hat, inc. + * + * This file is released under the GPLv2. + * + * This file contains data checksum routines for TuxOnIce, + * using cryptoapi. They are used to locate any modifications + * made to pageset 2 while we're saving it. + */ + +#if defined(CONFIG_TOI_CHECKSUM) +extern int toi_checksum_init(void); +extern void toi_checksum_exit(void); +void check_checksums(void); +int allocate_checksum_pages(void); +void free_checksum_pages(void); +char *tuxonice_get_next_checksum(void); +int tuxonice_calc_checksum(struct page *page, char *checksum_locn); +#else +static inline int toi_checksum_init(void) { return 0; } +static inline void toi_checksum_exit(void) { } +static inline void check_checksums(void) { }; +static inline int allocate_checksum_pages(void) { return 0; }; +static inline void free_checksum_pages(void) { }; +static inline char *tuxonice_get_next_checksum(void) { return NULL; }; +static inline int tuxonice_calc_checksum(struct page *page, char *checksum_locn) + { return 0; } +#endif + diff --git a/kernel/power/tuxonice_cluster.c b/kernel/power/tuxonice_cluster.c new file mode 100644 index 0000000..0fad533 --- /dev/null +++ b/kernel/power/tuxonice_cluster.c @@ -0,0 +1,1069 @@ +/* + * kernel/power/tuxonice_cluster.c + * + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * This file contains routines for cluster hibernation support. + * + * Based on ip autoconfiguration code in net/ipv4/ipconfig.c. + * + * How does it work? + * + * There is no 'master' node that tells everyone else what to do. All nodes + * send messages to the broadcast address/port, maintain a list of peers + * and figure out when to progress to the next step in hibernating or resuming. + * This makes us more fault tolerant when it comes to nodes coming and going + * (which may be more of an issue if we're hibernating when power supplies + * are being unreliable). + * + * At boot time, we start a ktuxonice thread that handles communication with + * other nodes. This node maintains a state machine that controls our progress + * through hibernating and resuming, keeping us in step with other nodes. Nodes + * are identified by their hw address. 
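+ *
+ * As implemented below, messages are bit flags: MSG_ACK and MSG_NACK
+ * occupy the low two bits and are OR'd with a state bit, so that, for
+ * example, MSG_HIBERNATE | MSG_ACK (33) acknowledges a hibernate request
+ * (see str_message()).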
+ *
+ * On startup, the node sends CLUSTER_PING on the configured interface's
+ * broadcast address, port $toi_cluster_port (see below) and begins to listen
+ * for other broadcast messages. CLUSTER_PING messages are repeated at
+ * intervals of 5 minutes, with a random offset to spread traffic out.
+ *
+ * A hibernation cycle is initiated from any node via
+ *
+ * echo > /sys/power/tuxonice/do_hibernate
+ *
+ * and (possibly) the hibernate script. At each step of the process, the node
+ * completes its work, and waits for all other nodes to signal completion of
+ * their work (or timeout) before progressing to the next step.
+ *
+ * Request/state	Action before reply	Possible reply	Next state
+ * HIBERNATE		capable, pre-script	HIBERNATE|ACK	NODE_PREP
+ *						HIBERNATE|NACK	INIT_0
+ *
+ * PREP			prepare_image		PREP|ACK	IMAGE_WRITE
+ *						PREP|NACK	INIT_0
+ *						ABORT		RUNNING
+ *
+ * IO			write image		IO|ACK		power off
+ *						ABORT		POST_RESUME
+ *
+ * (Boot time)		check for image		IMAGE|ACK	RESUME_PREP
+ *						(Note 1)
+ *						IMAGE|NACK	(Note 2)
+ *
+ * PREP			prepare read image	PREP|ACK	IMAGE_READ
+ *						PREP|NACK	(As NACK_IMAGE)
+ *
+ * IO			read image		IO|ACK		POST_RESUME
+ *
+ * POST_RESUME		thaw, post-script			RUNNING
+ *
+ * INIT_0		init 0
+ *
+ * Other messages:
+ *
+ * - PING: Request for all other live nodes to send a PONG. Used at startup to
+ *   announce presence, when a node is suspected dead and periodically, in case
+ *   segments of the network are [un]plugged.
+ *
+ * - PONG: Response to a PING.
+ *
+ * - ABORT: Request to cancel writing an image.
+ *
+ * - BYE: Notification that this node is shutting down.
+ *
+ * Note 1: Repeated at 3s intervals until we continue to boot/resume, so that
+ * nodes which are slower to start up can get state synchronised. If a node
+ * starting up sees other nodes sending RESUME_PREP or IMAGE_READ, it may send
+ * ACK_IMAGE and they will wait for it to catch up. If it sees ACK_READ, it
+ * must invalidate its image (if any) and boot normally.
+ *
+ * Note 2: May occur when one node lost power or powered off while others
+ * hibernated. This node waits for others to complete resuming (ACK_READ)
+ * before completing its boot, so that it appears as a failed node restarting.
+ *
+ * If any node has an image, then it also has a list of nodes that hibernated
+ * in synchronisation with it. The node will wait for other nodes to appear
+ * or timeout before beginning its restoration.
+ *
+ * If a node has no image, it needs to wait, in case other nodes which do have
+ * an image are going to resume, but are taking longer to announce their
+ * presence. For this reason, the user can specify a timeout value and a number
+ * of nodes detected before we just continue. (We might want to assume in a
+ * cluster of, say, 15 nodes, if 8 others have booted without finding an image,
+ * the remaining nodes will too. This might help in situations where some nodes
+ * are much slower to boot, or more subject to hardware failures or suchlike).
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "tuxonice.h"
+#include "tuxonice_modules.h"
+#include "tuxonice_sysfs.h"
+#include "tuxonice_alloc.h"
+#include "tuxonice_io.h"
+
+#if 1
+#define PRINTK(a, b...) do { printk(a, ##b); } while (0)
+#else
+#define PRINTK(a, b...) do { } while (0)
+#endif
+
+static int loopback_mode;
+static int num_local_nodes = 1;
+#define MAX_LOCAL_NODES 8
+#define SADDR (loopback_mode ?
b->sid : h->saddr) + +#define MYNAME "TuxOnIce Clustering" + +enum cluster_message { + MSG_ACK = 1, + MSG_NACK = 2, + MSG_PING = 4, + MSG_ABORT = 8, + MSG_BYE = 16, + MSG_HIBERNATE = 32, + MSG_IMAGE = 64, + MSG_IO = 128, + MSG_RUNNING = 256 +}; + +static char *str_message(int message) +{ + switch (message) { + case 4: + return "Ping"; + case 8: + return "Abort"; + case 9: + return "Abort acked"; + case 10: + return "Abort nacked"; + case 16: + return "Bye"; + case 17: + return "Bye acked"; + case 18: + return "Bye nacked"; + case 32: + return "Hibernate request"; + case 33: + return "Hibernate ack"; + case 34: + return "Hibernate nack"; + case 64: + return "Image exists?"; + case 65: + return "Image does exist"; + case 66: + return "No image here"; + case 128: + return "I/O"; + case 129: + return "I/O okay"; + case 130: + return "I/O failed"; + case 256: + return "Running"; + default: + printk(KERN_ERR "Unrecognised message %d.\n", message); + return "Unrecognised message (see dmesg)"; + } +} + +#define MSG_ACK_MASK (MSG_ACK | MSG_NACK) +#define MSG_STATE_MASK (~MSG_ACK_MASK) + +struct node_info { + struct list_head member_list; + wait_queue_head_t member_events; + spinlock_t member_list_lock; + spinlock_t receive_lock; + int peer_count, ignored_peer_count; + struct toi_sysfs_data sysfs_data; + enum cluster_message current_message; +}; + +struct node_info node_array[MAX_LOCAL_NODES]; + +struct cluster_member { + __be32 addr; + enum cluster_message message; + struct list_head list; + int ignore; +}; + +#define toi_cluster_port_send 3501 +#define toi_cluster_port_recv 3502 + +static struct net_device *net_dev; +static struct toi_module_ops toi_cluster_ops; + +static int toi_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev); + +static struct packet_type toi_cluster_packet_type = { + .type = __constant_htons(ETH_P_IP), + .func = toi_recv, +}; + +struct toi_pkt { /* BOOTP packet format */ + struct iphdr iph; /* IP header */ + struct udphdr udph; /* UDP header */ + u8 htype; /* HW address type */ + u8 hlen; /* HW address length */ + __be32 xid; /* Transaction ID */ + __be16 secs; /* Seconds since we started */ + __be16 flags; /* Just what it says */ + u8 hw_addr[16]; /* Sender's HW address */ + u16 message; /* Message */ + unsigned long sid; /* Source ID for loopback testing */ +}; + +static char toi_cluster_iface[IFNAMSIZ] = CONFIG_TOI_DEFAULT_CLUSTER_INTERFACE; + +static int added_pack; + +static int others_have_image; + +/* Key used to allow multiple clusters on the same lan */ +static char toi_cluster_key[32] = CONFIG_TOI_DEFAULT_CLUSTER_KEY; +static char pre_hibernate_script[255] = + CONFIG_TOI_DEFAULT_CLUSTER_PRE_HIBERNATE; +static char post_hibernate_script[255] = + CONFIG_TOI_DEFAULT_CLUSTER_POST_HIBERNATE; + +/* List of cluster members */ +static unsigned long continue_delay = 5 * HZ; +static unsigned long cluster_message_timeout = 3 * HZ; + +/* === Membership list === */ + +static void print_member_info(int index) +{ + struct cluster_member *this; + + printk(KERN_INFO "==> Dumping node %d.\n", index); + + list_for_each_entry(this, &node_array[index].member_list, list) + printk(KERN_INFO "%d.%d.%d.%d last message %s. %s\n", + NIPQUAD(this->addr), + str_message(this->message), + this->ignore ? 
"(Ignored)" : ""); + printk(KERN_INFO "== Done ==\n"); +} + +static struct cluster_member *__find_member(int index, __be32 addr) +{ + struct cluster_member *this; + + list_for_each_entry(this, &node_array[index].member_list, list) { + if (this->addr != addr) + continue; + + return this; + } + + return NULL; +} + +static void set_ignore(int index, __be32 addr, struct cluster_member *this) +{ + if (this->ignore) { + PRINTK("Node %d already ignoring %d.%d.%d.%d.\n", + index, NIPQUAD(addr)); + return; + } + + PRINTK("Node %d sees node %d.%d.%d.%d now being ignored.\n", + index, NIPQUAD(addr)); + this->ignore = 1; + node_array[index].ignored_peer_count++; +} + +static int __add_update_member(int index, __be32 addr, int message) +{ + struct cluster_member *this; + + this = __find_member(index, addr); + if (this) { + if (this->message != message) { + this->message = message; + if ((message & MSG_NACK) && + (message & (MSG_HIBERNATE | MSG_IMAGE | MSG_IO))) + set_ignore(index, addr, this); + PRINTK("Node %d sees node %d.%d.%d.%d now sending " + "%s.\n", index, NIPQUAD(addr), + str_message(message)); + wake_up(&node_array[index].member_events); + } + return 0; + } + + this = (struct cluster_member *) toi_kzalloc(36, + sizeof(struct cluster_member), GFP_KERNEL); + + if (!this) + return -1; + + this->addr = addr; + this->message = message; + this->ignore = 0; + INIT_LIST_HEAD(&this->list); + + node_array[index].peer_count++; + + PRINTK("Node %d sees node %d.%d.%d.%d sending %s.\n", index, + NIPQUAD(addr), str_message(message)); + + if ((message & MSG_NACK) && + (message & (MSG_HIBERNATE | MSG_IMAGE | MSG_IO))) + set_ignore(index, addr, this); + list_add_tail(&this->list, &node_array[index].member_list); + return 1; +} + +static int add_update_member(int index, __be32 addr, int message) +{ + int result; + unsigned long flags; + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + result = __add_update_member(index, addr, message); + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); + + print_member_info(index); + + wake_up(&node_array[index].member_events); + + return result; +} + +static void del_member(int index, __be32 addr) +{ + struct cluster_member *this; + unsigned long flags; + + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + this = __find_member(index, addr); + + if (this) { + list_del_init(&this->list); + toi_kfree(36, this); + node_array[index].peer_count--; + } + + spin_unlock_irqrestore(&node_array[index].member_list_lock, flags); +} + +/* === Message transmission === */ + +static void toi_send_if(int message, unsigned long my_id); + +/* + * Process received TOI packet. + */ +static int toi_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev) +{ + struct toi_pkt *b; + struct iphdr *h; + int len, result, index; + unsigned long addr, message, ack; + + /* Perform verifications before taking the lock. 
*/ + if (skb->pkt_type == PACKET_OTHERHOST) + goto drop; + + if (dev != net_dev) + goto drop; + + skb = skb_share_check(skb, GFP_ATOMIC); + if (!skb) + return NET_RX_DROP; + + if (!pskb_may_pull(skb, + sizeof(struct iphdr) + + sizeof(struct udphdr))) + goto drop; + + b = (struct toi_pkt *)skb_network_header(skb); + h = &b->iph; + + if (h->ihl != 5 || h->version != 4 || h->protocol != IPPROTO_UDP) + goto drop; + + /* Fragments are not supported */ + if (h->frag_off & htons(IP_OFFSET | IP_MF)) { + if (net_ratelimit()) + printk(KERN_ERR "TuxOnIce: Ignoring fragmented " + "cluster message.\n"); + goto drop; + } + + if (skb->len < ntohs(h->tot_len)) + goto drop; + + if (ip_fast_csum((char *) h, h->ihl)) + goto drop; + + if (b->udph.source != htons(toi_cluster_port_send) || + b->udph.dest != htons(toi_cluster_port_recv)) + goto drop; + + if (ntohs(h->tot_len) < ntohs(b->udph.len) + sizeof(struct iphdr)) + goto drop; + + len = ntohs(b->udph.len) - sizeof(struct udphdr); + + /* Ok the front looks good, make sure we can get at the rest. */ + if (!pskb_may_pull(skb, skb->len)) + goto drop; + + b = (struct toi_pkt *)skb_network_header(skb); + h = &b->iph; + + addr = SADDR; + PRINTK(">>> Message %s received from " NIPQUAD_FMT ".\n", + str_message(b->message), NIPQUAD(addr)); + + message = b->message & MSG_STATE_MASK; + ack = b->message & MSG_ACK_MASK; + + for (index = 0; index < num_local_nodes; index++) { + int new_message = node_array[index].current_message, + old_message = new_message; + + if (index == SADDR || !old_message) { + PRINTK("Ignoring node %d (offline or self).\n", index); + continue; + } + + /* One message at a time, please. */ + spin_lock(&node_array[index].receive_lock); + + result = add_update_member(index, SADDR, b->message); + if (result == -1) { + printk(KERN_INFO "Failed to add new cluster member " + NIPQUAD_FMT ".\n", + NIPQUAD(addr)); + goto drop_unlock; + } + + switch (b->message & MSG_STATE_MASK) { + case MSG_PING: + break; + case MSG_ABORT: + break; + case MSG_BYE: + break; + case MSG_HIBERNATE: + /* Can I hibernate? */ + new_message = MSG_HIBERNATE | + ((index & 1) ? MSG_NACK : MSG_ACK); + break; + case MSG_IMAGE: + /* Can I resume? */ + new_message = MSG_IMAGE | + ((index & 1) ? MSG_NACK : MSG_ACK); + if (new_message != old_message) + printk(KERN_ERR "Setting whether I can resume " + "to %d.\n", new_message); + break; + case MSG_IO: + new_message = MSG_IO | MSG_ACK; + break; + case MSG_RUNNING: + break; + default: + if (net_ratelimit()) + printk(KERN_ERR "Unrecognised TuxOnIce cluster" + " message %d from " NIPQUAD_FMT ".\n", + b->message, NIPQUAD(addr)); + }; + + if (old_message != new_message) { + node_array[index].current_message = new_message; + printk(KERN_INFO ">>> Sending new message for node " + "%d.\n", index); + toi_send_if(new_message, index); + } else if (!ack) { + printk(KERN_INFO ">>> Resending message for node %d.\n", + index); + toi_send_if(new_message, index); + } +drop_unlock: + spin_unlock(&node_array[index].receive_lock); + }; + +drop: + /* Throw the packet out. */ + kfree_skb(skb); + + return 0; +} + +/* + * Send cluster message to single interface. 
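+ *
+ * Builds a broadcast UDP packet (port 3501 -> 3502) carrying the message
+ * and our node id, then queues it directly on the bound device.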
+ */ +static void toi_send_if(int message, unsigned long my_id) +{ + struct sk_buff *skb; + struct toi_pkt *b; + int hh_len = LL_RESERVED_SPACE(net_dev); + struct iphdr *h; + + /* Allocate packet */ + skb = alloc_skb(sizeof(struct toi_pkt) + hh_len + 15, GFP_KERNEL); + if (!skb) + return; + skb_reserve(skb, hh_len); + b = (struct toi_pkt *) skb_put(skb, sizeof(struct toi_pkt)); + memset(b, 0, sizeof(struct toi_pkt)); + + /* Construct IP header */ + skb_reset_network_header(skb); + h = ip_hdr(skb); + h->version = 4; + h->ihl = 5; + h->tot_len = htons(sizeof(struct toi_pkt)); + h->frag_off = htons(IP_DF); + h->ttl = 64; + h->protocol = IPPROTO_UDP; + h->daddr = htonl(INADDR_BROADCAST); + h->check = ip_fast_csum((unsigned char *) h, h->ihl); + + /* Construct UDP header */ + b->udph.source = htons(toi_cluster_port_send); + b->udph.dest = htons(toi_cluster_port_recv); + b->udph.len = htons(sizeof(struct toi_pkt) - sizeof(struct iphdr)); + /* UDP checksum not calculated -- explicitly allowed in BOOTP RFC */ + + /* Construct message */ + b->message = message; + b->sid = my_id; + b->htype = net_dev->type; /* can cause undefined behavior */ + b->hlen = net_dev->addr_len; + memcpy(b->hw_addr, net_dev->dev_addr, net_dev->addr_len); + b->secs = htons(3); /* 3 seconds */ + + /* Chain packet down the line... */ + skb->dev = net_dev; + skb->protocol = htons(ETH_P_IP); + if ((dev_hard_header(skb, net_dev, ntohs(skb->protocol), + net_dev->broadcast, net_dev->dev_addr, skb->len) < 0) || + dev_queue_xmit(skb) < 0) + printk(KERN_INFO "E"); +} + +/* ========================================= */ + +/* kTOICluster */ + +static atomic_t num_cluster_threads; +static DECLARE_WAIT_QUEUE_HEAD(clusterd_events); + +static int kTOICluster(void *data) +{ + unsigned long my_id; + + my_id = atomic_add_return(1, &num_cluster_threads) - 1; + node_array[my_id].current_message = (unsigned long) data; + + PRINTK("kTOICluster daemon %lu starting.\n", my_id); + + current->flags |= PF_NOFREEZE; + + while (node_array[my_id].current_message) { + toi_send_if(node_array[my_id].current_message, my_id); + sleep_on_timeout(&clusterd_events, + cluster_message_timeout); + PRINTK("Link state %lu is %d.\n", my_id, + node_array[my_id].current_message); + } + + toi_send_if(MSG_BYE, my_id); + atomic_dec(&num_cluster_threads); + wake_up(&clusterd_events); + + PRINTK("kTOICluster daemon %lu exiting.\n", my_id); + __set_current_state(TASK_RUNNING); + return 0; +} + +static void kill_clusterd(void) +{ + int i; + + for (i = 0; i < num_local_nodes; i++) { + if (node_array[i].current_message) { + PRINTK("Seeking to kill clusterd %d.\n", i); + node_array[i].current_message = 0; + } + } + wait_event(clusterd_events, + !atomic_read(&num_cluster_threads)); + PRINTK("All cluster daemons have exited.\n"); +} + +static int peers_not_in_message(int index, int message, int precise) +{ + struct cluster_member *this; + unsigned long flags; + int result = 0; + + spin_lock_irqsave(&node_array[index].member_list_lock, flags); + list_for_each_entry(this, &node_array[index].member_list, list) { + if (this->ignore) + continue; + + PRINTK("Peer %d.%d.%d.%d sending %s. " + "Seeking %s.\n", + NIPQUAD(this->addr), + str_message(this->message), str_message(message)); + if ((precise ? 
this->message :
+				this->message & MSG_STATE_MASK) !=
+		    message)
+			result++;
+	}
+	spin_unlock_irqrestore(&node_array[index].member_list_lock, flags);
+	PRINTK("%d peers not in sought message.\n", result);
+	return result;
+}
+
+static void reset_ignored(int index)
+{
+	struct cluster_member *this;
+	unsigned long flags;
+
+	spin_lock_irqsave(&node_array[index].member_list_lock, flags);
+	list_for_each_entry(this, &node_array[index].member_list, list)
+		this->ignore = 0;
+	node_array[index].ignored_peer_count = 0;
+	spin_unlock_irqrestore(&node_array[index].member_list_lock, flags);
+}
+
+static int peers_in_message(int index, int message, int precise)
+{
+	return node_array[index].peer_count -
+		node_array[index].ignored_peer_count -
+		peers_not_in_message(index, message, precise);
+}
+
+static int time_to_continue(int index, unsigned long start, int message)
+{
+	int first = peers_not_in_message(index, message, 0);
+	int second = peers_in_message(index, message, 1);
+
+	PRINTK("First part returns %d, second returns %d.\n", first, second);
+
+	if (!first && !second) {
+		PRINTK("All peers answered message %d.\n",
+			message);
+		return 1;
+	}
+
+	if (time_after(jiffies, start + continue_delay)) {
+		PRINTK("Timeout reached.\n");
+		return 1;
+	}
+
+	PRINTK("Not time to continue yet (%lu < %lu).\n", jiffies,
+			start + continue_delay);
+	return 0;
+}
+
+void toi_initiate_cluster_hibernate(void)
+{
+	int result;
+	unsigned long start;
+
+	result = do_toi_step(STEP_HIBERNATE_PREPARE_IMAGE);
+	if (result)
+		return;
+
+	toi_send_if(MSG_HIBERNATE, 0);
+
+	start = jiffies;
+	wait_event(node_array[0].member_events,
+			time_to_continue(0, start, MSG_HIBERNATE));
+
+	if (test_action_state(TOI_FREEZER_TEST)) {
+		toi_send_if(MSG_ABORT, 0);
+
+		start = jiffies;
+		wait_event(node_array[0].member_events,
+			time_to_continue(0, start, MSG_RUNNING));
+
+		do_toi_step(STEP_QUIET_CLEANUP);
+		return;
+	}
+
+	toi_send_if(MSG_IO, 0);
+
+	result = do_toi_step(STEP_HIBERNATE_SAVE_IMAGE);
+	if (result)
+		return;
+
+	/* This code runs at resume time too! */
+	if (toi_in_hibernate)
+		result = do_toi_step(STEP_HIBERNATE_POWERDOWN);
+}
+EXPORT_SYMBOL_GPL(toi_initiate_cluster_hibernate);
+
+/* toi_cluster_print_debug_stats
+ *
+ * Description:	Print information to be recorded for debugging purposes into a
+ *		buffer.
+ * Arguments:	buffer: Pointer to a buffer into which the debug info will be
+ *		printed.
+ *		size: Size of the buffer.
+ * Returns:	Number of characters written to the buffer.
+ */
+static int toi_cluster_print_debug_stats(char *buffer, int size)
+{
+	int len;
+
+	if (strlen(toi_cluster_iface))
+		len = scnprintf(buffer, size,
+				"- Cluster interface is '%s'.\n",
+				toi_cluster_iface);
+	else
+		len = scnprintf(buffer, size,
+				"- Cluster support is disabled.\n");
+	return len;
+}
+
+/* toi_cluster_memory_needed
+ *
+ * Description:	Tell the caller how much memory we need to operate during
+ *		hibernate/resume.
+ * Returns:	Int. Maximum number of bytes of memory required for
+ *		operation.
+ */
+static int toi_cluster_memory_needed(void)
+{
+	return 0;
+}
+
+static int toi_cluster_storage_needed(void)
+{
+	return 1 + strlen(toi_cluster_iface);
+}
+
+/* toi_cluster_save_config_info
+ *
+ * Description:	Save information needed when reloading the image at resume
+ *		time.
+ * Arguments:	Buffer: Pointer to a buffer of size PAGE_SIZE.
+ * Returns:	Number of bytes used for saving our data.
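+ *		The interface name is stored as a NUL-terminated string,
+ *		so this is strlen(toi_cluster_iface) + 1, matching
+ *		toi_cluster_storage_needed().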
+ */
+static int toi_cluster_save_config_info(char *buffer)
+{
+	strcpy(buffer, toi_cluster_iface);
+	return strlen(toi_cluster_iface) + 1;
+}
+
+/* toi_cluster_load_config_info
+ *
+ * Description:	Reload information needed for declustering the image at
+ *		resume time.
+ * Arguments:	Buffer: Pointer to the start of the data.
+ *		Size: Number of bytes that were saved.
+ */
+static void toi_cluster_load_config_info(char *buffer, int size)
+{
+	strncpy(toi_cluster_iface, buffer, size);
+}
+
+static void cluster_startup(void)
+{
+	int have_image = do_check_can_resume(), i;
+	unsigned long start = jiffies, initial_message;
+	struct task_struct *p;
+
+	initial_message = MSG_IMAGE;
+
+	have_image = 1;
+
+	for (i = 0; i < num_local_nodes; i++) {
+		PRINTK("Starting ktoiclusterd %d.\n", i);
+		p = kthread_create(kTOICluster, (void *) initial_message,
+				"ktoiclusterd/%d", i);
+		if (IS_ERR(p)) {
+			printk(KERN_ERR "Failed to start ktoiclusterd.\n");
+			return;
+		}
+
+		wake_up_process(p);
+	}
+
+	/* Wait for delay or someone else sending first message */
+	wait_event(node_array[0].member_events, time_to_continue(0, start,
+				MSG_IMAGE));
+
+	others_have_image = peers_in_message(0, MSG_IMAGE | MSG_ACK, 1);
+
+	printk(KERN_INFO "Continuing. I %shave an image. Peers with image:"
+		" %d.\n", have_image ? "" : "don't ", others_have_image);
+
+	if (have_image) {
+		int result;
+
+		/* Start to resume */
+		printk(KERN_INFO "  === Starting to resume ===  \n");
+		node_array[0].current_message = MSG_IO;
+		toi_send_if(MSG_IO, 0);
+
+		/* result = do_toi_step(STEP_RESUME_LOAD_PS1); */
+		result = 0;
+
+		if (!result) {
+			/*
+			 * Atomic restore - we'll come back in the hibernation
+			 * path.
+			 */
+
+			/* result = do_toi_step(STEP_RESUME_DO_RESTORE); */
+			result = 0;
+
+			/* do_toi_step(STEP_QUIET_CLEANUP); */
+		}
+
+		node_array[0].current_message |= MSG_NACK;
+
+		/* For debugging - disable for real life? */
+		wait_event(node_array[0].member_events,
+				time_to_continue(0, start, MSG_IO));
+	}
+
+	if (others_have_image) {
+		/* Wait for them to resume */
+		printk(KERN_INFO "Waiting for other nodes to resume.\n");
+		start = jiffies;
+		wait_event(node_array[0].member_events,
+				time_to_continue(0, start, MSG_RUNNING));
+		if (peers_not_in_message(0, MSG_RUNNING, 0))
+			printk(KERN_INFO "Timed out while waiting for other "
+					"nodes to resume.\n");
+	}
+
+	/* Find out whether an image exists here. Send ACK_IMAGE or NACK_IMAGE
+	 * as appropriate.
+	 *
+	 * If we don't have an image:
+	 * - Wait until someone else says they have one, or conditions are met
+	 *   for continuing to boot (n machines or t seconds).
+	 * - If anyone has an image, wait for them to resume before continuing
+	 *   to boot.
+	 *
+	 * If we have an image:
+	 * - Wait until conditions are met before continuing to resume (n
+	 *   machines or t seconds). Send RESUME_PREP and freeze processes.
+	 *   NACK_PREP if freezing fails (shouldn't) and follow logic for
+	 *   us having no image above. On success, wait for [N]ACK_PREP from
+	 *   other machines. Read image (including atomic restore) until done.
+	 *   Wait for ACK_READ from others (should never fail). Thaw processes
+	 *   and do post-resume. (The section after the atomic restore is done
+	 *   via the code for hibernating).
+	 */
+
+	node_array[0].current_message = MSG_RUNNING;
+}
+
+/* toi_cluster_open_iface
+ *
+ * Description:	Prepare to use an interface.
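+ *
+ *		Looks the configured interface up in init_net, registers
+ *		our packet type, enables loopback test mode (8 local nodes)
+ *		when the loopback device is selected, then runs
+ *		cluster_startup().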
+ */
+
+static int toi_cluster_open_iface(void)
+{
+	struct net_device *dev;
+
+	rtnl_lock();
+
+	for_each_netdev(&init_net, dev) {
+		if (/* dev == &init_net.loopback_dev || */
+		    strcmp(dev->name, toi_cluster_iface))
+			continue;
+
+		net_dev = dev;
+		break;
+	}
+
+	rtnl_unlock();
+
+	if (!net_dev) {
+		printk(KERN_ERR MYNAME ": Device %s not found.\n",
+			toi_cluster_iface);
+		return -ENODEV;
+	}
+
+	dev_add_pack(&toi_cluster_packet_type);
+	added_pack = 1;
+
+	loopback_mode = (net_dev == init_net.loopback_dev);
+	num_local_nodes = loopback_mode ? 8 : 1;
+
+	PRINTK("Loopback mode is %s. Number of local nodes is %d.\n",
+			loopback_mode ? "on" : "off", num_local_nodes);
+
+	cluster_startup();
+	return 0;
+}
+
+/* toi_cluster_close_iface
+ *
+ * Description: Stop using an interface.
+ */
+
+static int toi_cluster_close_iface(void)
+{
+	kill_clusterd();
+	if (added_pack) {
+		dev_remove_pack(&toi_cluster_packet_type);
+		added_pack = 0;
+	}
+	return 0;
+}
+
+static void write_side_effect(void)
+{
+	if (toi_cluster_ops.enabled) {
+		toi_cluster_open_iface();
+		set_toi_state(TOI_CLUSTER_MODE);
+	} else {
+		toi_cluster_close_iface();
+		clear_toi_state(TOI_CLUSTER_MODE);
+	}
+}
+
+static void node_write_side_effect(void)
+{
+}
+
+/*
+ * data for our sysfs entries.
+ */
+static struct toi_sysfs_data sysfs_params[] = {
+	SYSFS_STRING("interface", SYSFS_RW, toi_cluster_iface, IFNAMSIZ, 0,
+			NULL),
+	SYSFS_INT("enabled", SYSFS_RW, &toi_cluster_ops.enabled, 0, 1, 0,
+			write_side_effect),
+	SYSFS_STRING("cluster_name", SYSFS_RW, toi_cluster_key, 32, 0, NULL),
+	SYSFS_STRING("pre-hibernate-script", SYSFS_RW, pre_hibernate_script,
+			255, 0, NULL),
+	SYSFS_STRING("post-hibernate-script", SYSFS_RW, post_hibernate_script,
+			255, 0, NULL),
+	SYSFS_UL("continue_delay", SYSFS_RW, &continue_delay, HZ / 2, 60 * HZ,
+			0)
+};
+
+/*
+ * Ops structure.
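+ * The cluster module registers as a FILTER_MODULE named "Cluster", with
+ * sysfs entries under the "cluster" directory.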
+ */
+
+static struct toi_module_ops toi_cluster_ops = {
+	.type			= FILTER_MODULE,
+	.name			= "Cluster",
+	.directory		= "cluster",
+	.module			= THIS_MODULE,
+	.memory_needed		= toi_cluster_memory_needed,
+	.print_debug_info	= toi_cluster_print_debug_stats,
+	.save_config_info	= toi_cluster_save_config_info,
+	.load_config_info	= toi_cluster_load_config_info,
+	.storage_needed		= toi_cluster_storage_needed,
+
+	.sysfs_data		= sysfs_params,
+	.num_sysfs_entries	= sizeof(sysfs_params) /
+		sizeof(struct toi_sysfs_data),
+};
+
+/* ---- Registration ---- */
+
+#ifdef MODULE
+#define INIT static __init
+#define EXIT static __exit
+#else
+#define INIT
+#define EXIT
+#endif
+
+INIT int toi_cluster_init(void)
+{
+	int temp = toi_register_module(&toi_cluster_ops), i;
+	struct kobject *kobj = toi_cluster_ops.dir_kobj;
+
+	for (i = 0; i < MAX_LOCAL_NODES; i++) {
+		node_array[i].current_message = 0;
+		INIT_LIST_HEAD(&node_array[i].member_list);
+		init_waitqueue_head(&node_array[i].member_events);
+		spin_lock_init(&node_array[i].member_list_lock);
+		spin_lock_init(&node_array[i].receive_lock);
+
+		/* Set up sysfs entry; "node_7" is the longest name needed */
+		node_array[i].sysfs_data.attr.name = toi_kzalloc(8,
+				sizeof("node_7"), GFP_KERNEL);
+		sprintf((char *) node_array[i].sysfs_data.attr.name, "node_%d",
+				i);
+		node_array[i].sysfs_data.attr.mode = SYSFS_RW;
+		node_array[i].sysfs_data.type = TOI_SYSFS_DATA_INTEGER;
+		node_array[i].sysfs_data.flags = 0;
+		node_array[i].sysfs_data.data.integer.variable =
+			(int *) &node_array[i].current_message;
+		node_array[i].sysfs_data.data.integer.minimum = 0;
+		node_array[i].sysfs_data.data.integer.maximum = INT_MAX;
+		node_array[i].sysfs_data.write_side_effect =
+			node_write_side_effect;
+		toi_register_sysfs_file(kobj, &node_array[i].sysfs_data);
+	}
+
+	toi_cluster_ops.enabled = (strlen(toi_cluster_iface) > 0);
+
+	if (toi_cluster_ops.enabled)
+		toi_cluster_open_iface();
+
+	return temp;
+}
+
+EXIT void toi_cluster_exit(void)
+{
+	int i;
+	toi_cluster_close_iface();
+
+	for (i = 0; i < MAX_LOCAL_NODES; i++)
+		toi_unregister_sysfs_file(toi_cluster_ops.dir_kobj,
+				&node_array[i].sysfs_data);
+	toi_unregister_module(&toi_cluster_ops);
+}
+
+static int __init toi_cluster_iface_setup(char *iface)
+{
+	toi_cluster_ops.enabled = (*iface &&
+			strcmp(iface, "off"));
+
+	if (toi_cluster_ops.enabled)
+		strlcpy(toi_cluster_iface, iface, IFNAMSIZ);
+
+	return 1;
+}
+
+__setup("toi_cluster=", toi_cluster_iface_setup);
+
+#ifdef MODULE
+MODULE_LICENSE("GPL");
+module_init(toi_cluster_init);
+module_exit(toi_cluster_exit);
+MODULE_AUTHOR("Nigel Cunningham");
+MODULE_DESCRIPTION("Cluster Support for TuxOnIce");
+#endif
diff --git a/kernel/power/tuxonice_cluster.h b/kernel/power/tuxonice_cluster.h
new file mode 100644
index 0000000..b0f8918
--- /dev/null
+++ b/kernel/power/tuxonice_cluster.h
@@ -0,0 +1,19 @@
+/*
+ * kernel/power/tuxonice_cluster.h
+ *
+ * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net)
+ * Copyright (C) 2006 Red Hat, inc.
+ *
+ * This file is released under the GPLv2.
diff --git a/kernel/power/tuxonice_cluster.h b/kernel/power/tuxonice_cluster.h
new file mode 100644
index 0000000..b0f8918
--- /dev/null
+++ b/kernel/power/tuxonice_cluster.h
@@ -0,0 +1,19 @@
+/*
+ * kernel/power/tuxonice_cluster.h
+ *
+ * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net)
+ * Copyright (C) 2006 Red Hat, inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifdef CONFIG_TOI_CLUSTER
+extern int toi_cluster_init(void);
+extern void toi_cluster_exit(void);
+extern void toi_initiate_cluster_hibernate(void);
+#else
+static inline int toi_cluster_init(void) { return 0; }
+static inline void toi_cluster_exit(void) { }
+static inline void toi_initiate_cluster_hibernate(void) { }
+#endif
+
diff --git a/kernel/power/tuxonice_compress.c b/kernel/power/tuxonice_compress.c
new file mode 100644
index 0000000..febb783
--- /dev/null
+++ b/kernel/power/tuxonice_compress.c
@@ -0,0 +1,448 @@
+/*
+ * kernel/power/tuxonice_compress.c
+ *
+ * Copyright (C) 2003-2008 Nigel Cunningham (nigel at tuxonice net)
+ *
+ * This file is released under the GPLv2.
+ *
+ * This file contains data compression routines for TuxOnIce,
+ * using cryptoapi.
+ */
+
+#include <linux/module.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/crypto.h>
+#include <linux/percpu.h>
+
+#include "tuxonice_builtin.h"
+#include "tuxonice.h"
+#include "tuxonice_modules.h"
+#include "tuxonice_sysfs.h"
+#include "tuxonice_io.h"
+#include "tuxonice_ui.h"
+#include "tuxonice_alloc.h"
+
+static int toi_expected_compression;
+
+static struct toi_module_ops toi_compression_ops;
+static struct toi_module_ops *next_driver;
+
+static char toi_compressor_name[32] = "lzo";
+
+static DEFINE_MUTEX(stats_lock);
+
+struct cpu_context {
+	u8 *page_buffer;
+	struct crypto_comp *transform;
+	unsigned int len;
+	char *buffer_start;
+	char *output_buffer;
+};
+
+static DEFINE_PER_CPU(struct cpu_context, contexts);
+
+static int toi_compress_prepare_result;
+
+/*
+ * toi_compress_cleanup
+ *
+ * Frees memory allocated for our labours.
+ */
+static void toi_compress_cleanup(int toi_or_resume)
+{
+	int cpu;
+
+	if (!toi_or_resume)
+		return;
+
+	for_each_online_cpu(cpu) {
+		struct cpu_context *this = &per_cpu(contexts, cpu);
+		if (this->transform) {
+			crypto_free_comp(this->transform);
+			this->transform = NULL;
+		}
+
+		if (this->page_buffer)
+			toi_free_page(16, (unsigned long) this->page_buffer);
+
+		this->page_buffer = NULL;
+
+		if (this->output_buffer)
+			vfree(this->output_buffer);
+
+		this->output_buffer = NULL;
+	}
+}
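+
+/*
+ * For reference, the cryptoapi usage below follows the usual
+ * allocate/use/free pattern. A minimal sketch (error handling elided;
+ * src and dst are caller-provided buffers; "lzo" is the default
+ * algorithm named above):
+ *
+ *	struct crypto_comp *tfm = crypto_alloc_comp("lzo", 0, 0);
+ *	unsigned int outlen = 2 * PAGE_SIZE;
+ *
+ *	if (!IS_ERR(tfm)) {
+ *		crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &outlen);
+ *		crypto_free_comp(tfm);
+ *	}
+ *
+ * toi_compress_crypto_prepare() below performs the allocation once per
+ * online CPU, so that writers running on different CPUs never share a
+ * transform.
+ */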
+
+/*
+ * toi_compress_crypto_prepare
+ *
+ * Prepare to do some work by allocating buffers and transforms.
+ */
+static int toi_compress_crypto_prepare(void)
+{
+	int cpu;
+
+	if (!*toi_compressor_name) {
+		printk(KERN_INFO "TuxOnIce: Compression enabled but no "
+				"compressor name set.\n");
+		return 1;
+	}
+
+	for_each_online_cpu(cpu) {
+		struct cpu_context *this = &per_cpu(contexts, cpu);
+		this->transform = crypto_alloc_comp(toi_compressor_name, 0, 0);
+		if (IS_ERR(this->transform)) {
+			printk(KERN_INFO "TuxOnIce: Failed to initialise the "
+					"%s compression transform.\n",
+					toi_compressor_name);
+			this->transform = NULL;
+			return 1;
+		}
+
+		this->page_buffer =
+			(char *) toi_get_zeroed_page(16, TOI_ATOMIC_GFP);
+
+		if (!this->page_buffer) {
+			printk(KERN_ERR
+			  "Failed to allocate a page buffer for the TuxOnIce "
+			  "compression driver.\n");
+			return -ENOMEM;
+		}
+
+		this->output_buffer =
+			(char *) vmalloc_32(2 * PAGE_SIZE);
+
+		if (!this->output_buffer) {
+			printk(KERN_ERR
+			  "Failed to allocate an output buffer for the "
+			  "TuxOnIce compression driver.\n");
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * toi_compress_init
+ */
+
+static int toi_compress_init(int toi_or_resume)
+{
+	if (!toi_or_resume)
+		return 0;
+
+	toi_compress_bytes_in = 0;
+	toi_compress_bytes_out = 0;
+
+	next_driver = toi_get_next_filter(&toi_compression_ops);
+
+	if (!next_driver)
+		return -ECHILD;
+
+	toi_compress_prepare_result = toi_compress_crypto_prepare();
+
+	return 0;
+}
+
+/*
+ * toi_compress_rw_init()
+ */
+
+static int toi_compress_rw_init(int rw, int stream_number)
+{
+	if (toi_compress_prepare_result) {
+		printk(KERN_ERR "Failed to initialise compression "
+				"algorithm.\n");
+		if (rw == READ) {
+			printk(KERN_INFO "Unable to read the image.\n");
+			return -ENODEV;
+		} else {
+			printk(KERN_INFO "Continuing without "
+				"compressing the image.\n");
+			toi_compression_ops.enabled = 0;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * toi_compress_write_page()
+ *
+ * Compress a page of data, buffering output and passing on filled
+ * pages to the next module in the pipeline.
+ *
+ * Buffer_page:	Pointer to a buffer of size PAGE_SIZE, containing
+ * data to be compressed.
+ *
+ * Returns:	0 on success. Otherwise the error is that returned by later
+ *		modules, -ECHILD if we have a broken pipeline or -EIO if
+ *		the compression transform errs.
+ */
+static int toi_compress_write_page(unsigned long index,
+		struct page *buffer_page, unsigned int buf_size)
+{
+	int ret, cpu = smp_processor_id();
+	struct cpu_context *ctx = &per_cpu(contexts, cpu);
+
+	if (!ctx->transform)
+		return next_driver->write_page(index, buffer_page, buf_size);
+
+	ctx->buffer_start = kmap(buffer_page);
+
+	ctx->len = buf_size;
+
+	ret = crypto_comp_compress(ctx->transform,
+			ctx->buffer_start, buf_size,
+			ctx->output_buffer, &ctx->len);
+
+	kunmap(buffer_page);
+
+	mutex_lock(&stats_lock);
+	toi_compress_bytes_in += buf_size;
+	toi_compress_bytes_out += ctx->len;
+	mutex_unlock(&stats_lock);
+
+	if (!ret && ctx->len < buf_size) { /* some compression */
+		memcpy(ctx->page_buffer, ctx->output_buffer, ctx->len);
+		return next_driver->write_page(index,
+				virt_to_page(ctx->page_buffer),
+				ctx->len);
+	} else
+		return next_driver->write_page(index, buffer_page, buf_size);
+}
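+
+/*
+ * Note the invariant established above: a page is stored compressed only
+ * when the transform succeeds and the result is strictly smaller than the
+ * original, so the reader can treat any chunk of exactly PAGE_SIZE bytes
+ * as raw data. For example (sizes illustrative, PAGE_SIZE == 4096): a
+ * page that compresses to 1800 bytes is written as 1800 bytes and
+ * re-expanded on read; one that "compresses" to 4096 or more bytes is
+ * written verbatim and returned as-is by toi_compress_read_page() below.
+ */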
+
+/*
+ * toi_compress_read_page()
+ * @buffer_page: struct page *. Pointer to a buffer of size PAGE_SIZE.
+ *
+ * Retrieve data from later modules and decompress it until the input buffer
+ * is filled.
+ * Returns: Zero if successful. Error condition from me or from downstream
+ * on failure.
+ */
+static int toi_compress_read_page(unsigned long *index,
+		struct page *buffer_page, unsigned int *buf_size)
+{
+	int ret, cpu = smp_processor_id();
+	unsigned int len;
+	unsigned int outlen = PAGE_SIZE;
+	char *buffer_start;
+	struct cpu_context *ctx = &per_cpu(contexts, cpu);
+
+	if (!ctx->transform)
+		return next_driver->read_page(index, buffer_page, buf_size);
+
+	/*
+	 * All our reads must be synchronous - we can't decompress
+	 * data that hasn't been read yet.
+	 */
+
+	*buf_size = PAGE_SIZE;
+
+	ret = next_driver->read_page(index, buffer_page, &len);
+
+	/* Error or uncompressed data */
+	if (ret || len == PAGE_SIZE)
+		return ret;
+
+	buffer_start = kmap(buffer_page);
+	memcpy(ctx->page_buffer, buffer_start, len);
+	ret = crypto_comp_decompress(
+			ctx->transform,
+			ctx->page_buffer,
+			len, buffer_start, &outlen);
+	if (ret)
+		abort_hibernate(TOI_FAILED_IO,
+				"Compress_read returned %d.\n", ret);
+	else if (outlen != PAGE_SIZE) {
+		abort_hibernate(TOI_FAILED_IO,
+			"Decompression yielded %u bytes instead of %lu.\n",
+			outlen, PAGE_SIZE);
+		printk(KERN_ERR "Decompression yielded %u bytes instead of "
+				"%lu.\n", outlen, PAGE_SIZE);
+		ret = -EIO;
+		*buf_size = outlen;
+	}
+	kunmap(buffer_page);
+	return ret;
+}
+
+/*
+ * toi_compress_print_debug_stats
+ * @buffer: Pointer to a buffer into which the debug info will be printed.
+ * @size: Size of the buffer.
+ *
+ * Print information to be recorded for debugging purposes into a buffer.
+ * Returns: Number of characters written to the buffer.
+ */
+
+static int toi_compress_print_debug_stats(char *buffer, int size)
+{
+	unsigned long pages_in = toi_compress_bytes_in >> PAGE_SHIFT,
+		      pages_out = toi_compress_bytes_out >> PAGE_SHIFT;
+	int len;
+
+	/* Output the compression ratio achieved. */
+	if (*toi_compressor_name)
+		len = scnprintf(buffer, size, "- Compressor is '%s'.\n",
+				toi_compressor_name);
+	else
+		len = scnprintf(buffer, size, "- Compressor is not set.\n");
+
+	if (pages_in)
+		len += scnprintf(buffer+len, size - len, "  Compressed "
+			"%lu bytes into %lu (%lu percent compression).\n",
+			toi_compress_bytes_in,
+			toi_compress_bytes_out,
+			(pages_in - pages_out) * 100 / pages_in);
+	return len;
+}
+
+/*
+ * toi_compress_memory_needed
+ *
+ * Tell the caller how much memory we need to operate during hibernate/resume.
+ * Returns: Int. Maximum number of bytes of memory required for
+ * operation.
+ */
+static int toi_compress_memory_needed(void)
+{
+	return 2 * PAGE_SIZE;
+}
+
+static int toi_compress_storage_needed(void)
+{
+	return 4 * sizeof(unsigned long) + strlen(toi_compressor_name) + 1;
+}
+
+/*
+ * toi_compress_save_config_info
+ * @buffer: Pointer to a buffer of size PAGE_SIZE.
+ *
+ * Save information needed when reloading the image at resume time.
+ * Returns: Number of bytes used for saving our data.
+ */
+static int toi_compress_save_config_info(char *buffer)
+{
+	int namelen = strlen(toi_compressor_name) + 1;
+	int total_len;
+
+	*((unsigned long *) buffer) = toi_compress_bytes_in;
+	*((unsigned long *) (buffer + 1 * sizeof(unsigned long))) =
+		toi_compress_bytes_out;
+	*((unsigned long *) (buffer + 2 * sizeof(unsigned long))) =
+		toi_expected_compression;
+	*((unsigned long *) (buffer + 3 * sizeof(unsigned long))) = namelen;
+	strncpy(buffer + 4 * sizeof(unsigned long), toi_compressor_name,
+			namelen);
+	total_len = 4 * sizeof(unsigned long) + namelen;
+	return total_len;
+}
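+
+/*
+ * For reference, the layout of the configuration block written above and
+ * parsed below (offsets in units of sizeof(unsigned long)):
+ *
+ *	[0] toi_compress_bytes_in
+ *	[1] toi_compress_bytes_out
+ *	[2] toi_expected_compression
+ *	[3] namelen (length of the compressor name, including the NUL)
+ *	[4] toi_compressor_name (namelen bytes)
+ */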
+
+/* toi_compress_load_config_info
+ * @buffer: Pointer to the start of the data.
+ * @size: Number of bytes that were saved.
+ *
+ * Description: Reload information needed for decompressing the image at
+ * resume time.
+ */
+static void toi_compress_load_config_info(char *buffer, int size)
+{
+	int namelen;
+
+	toi_compress_bytes_in = *((unsigned long *) buffer);
+	toi_compress_bytes_out = *((unsigned long *) (buffer + 1 *
+				sizeof(unsigned long)));
+	toi_expected_compression = *((unsigned long *) (buffer + 2 *
+				sizeof(unsigned long)));
+	namelen = *((unsigned long *) (buffer + 3 * sizeof(unsigned long)));
+	if (strncmp(toi_compressor_name, buffer + 4 * sizeof(unsigned long),
+				namelen)) {
+		toi_compress_cleanup(1);
+		strncpy(toi_compressor_name,
+				buffer + 4 * sizeof(unsigned long), namelen);
+		toi_compress_crypto_prepare();
+	}
+}
+
+/*
+ * toi_compress_expected_ratio
+ *
+ * Description: Returns the expected ratio between data passed into this module
+ * and the amount of data output when writing.
+ * Returns: 100 if the module is disabled. Otherwise the value set by the
+ * user via our sysfs entry.
+ */
+
+static int toi_compress_expected_ratio(void)
+{
+	if (!toi_compression_ops.enabled)
+		return 100;
+	else
+		return 100 - toi_expected_compression;
+}
+
+/*
+ * data for our sysfs entries.
+ */
+static struct toi_sysfs_data sysfs_params[] = {
+	SYSFS_INT("expected_compression", SYSFS_RW, &toi_expected_compression,
+			0, 99, 0, NULL),
+	SYSFS_INT("enabled", SYSFS_RW, &toi_compression_ops.enabled, 0, 1, 0,
+			NULL),
+	SYSFS_STRING("algorithm", SYSFS_RW, toi_compressor_name, 31, 0, NULL),
+};
+
+/*
+ * Ops structure.
+ */
+static struct toi_module_ops toi_compression_ops = {
+	.type			= FILTER_MODULE,
+	.name			= "compression",
+	.directory		= "compression",
+	.module			= THIS_MODULE,
+	.initialise		= toi_compress_init,
+	.cleanup		= toi_compress_cleanup,
+	.memory_needed		= toi_compress_memory_needed,
+	.print_debug_info	= toi_compress_print_debug_stats,
+	.save_config_info	= toi_compress_save_config_info,
+	.load_config_info	= toi_compress_load_config_info,
+	.storage_needed		= toi_compress_storage_needed,
+	.expected_compression	= toi_compress_expected_ratio,
+
+	.rw_init		= toi_compress_rw_init,
+
+	.write_page		= toi_compress_write_page,
+	.read_page		= toi_compress_read_page,
+
+	.sysfs_data		= sysfs_params,
+	.num_sysfs_entries	= sizeof(sysfs_params) /
+		sizeof(struct toi_sysfs_data),
+};
+
+/* ---- Registration ---- */
+
+static __init int toi_compress_load(void)
+{
+	return toi_register_module(&toi_compression_ops);
+}
+
+#ifdef MODULE
+static __exit void toi_compress_unload(void)
+{
+	toi_unregister_module(&toi_compression_ops);
+}
+
+module_init(toi_compress_load);
+module_exit(toi_compress_unload);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nigel Cunningham");
+MODULE_DESCRIPTION("Compression Support for TuxOnIce");
+#else
+late_initcall(toi_compress_load);
+#endif
diff --git a/kernel/power/tuxonice_extent.c b/kernel/power/tuxonice_extent.c
new file mode 100644
index 0000000..76877c6
--- /dev/null
+++ b/kernel/power/tuxonice_extent.c
@@ -0,0 +1,314 @@
+/*
+ * kernel/power/tuxonice_extent.c
+ *
+ * Copyright (C) 2003-2008 Nigel Cunningham (nigel at tuxonice net)
+ *
+ * Distributed under GPLv2.
+ *
+ * These functions encapsulate the manipulation of storage metadata.
+ */
+
+#include <linux/module.h>
+#include <linux/suspend.h>
+#include "tuxonice_modules.h"
+#include "tuxonice_extent.h"
+#include "tuxonice_alloc.h"
+#include "tuxonice_ui.h"
+#include "tuxonice.h"
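+
+/*
+ * A hibernate_extent_chain records runs of contiguous block numbers as
+ * [start, end] pairs. A minimal usage sketch (block numbers are
+ * illustrative; the chain must start out zeroed):
+ *
+ *	struct hibernate_extent_chain chain = { 0 };
+ *
+ *	toi_add_to_extent_chain(&chain, 8, 15);
+ *	toi_add_to_extent_chain(&chain, 16, 23);	(merges into 8-23)
+ *	toi_add_to_extent_chain(&chain, 64, 71);	(starts a new extent)
+ *	...
+ *	toi_put_extent_chain(&chain);			(frees everything)
+ *
+ * toi_extent_for_each() (tuxonice_extent.h) can then walk every block
+ * number in the chain in ascending order.
+ */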
+
+/**
+ * toi_get_extent - return a free extent
+ *
+ * May fail, returning NULL instead.
+ **/
+static struct hibernate_extent *toi_get_extent(void)
+{
+	return (struct hibernate_extent *) toi_kzalloc(2,
+			sizeof(struct hibernate_extent), TOI_ATOMIC_GFP);
+}
+
+/**
+ * toi_put_extent_chain - free a whole chain of extents
+ * @chain: Chain to free.
+ **/
+void toi_put_extent_chain(struct hibernate_extent_chain *chain)
+{
+	struct hibernate_extent *this;
+
+	this = chain->first;
+
+	while (this) {
+		struct hibernate_extent *next = this->next;
+		toi_kfree(2, this);
+		chain->num_extents--;
+		this = next;
+	}
+
+	chain->first = NULL;
+	chain->last_touched = NULL;
+	chain->size = 0;
+}
+EXPORT_SYMBOL_GPL(toi_put_extent_chain);
+
+/**
+ * toi_add_to_extent_chain - add an extent to an existing chain
+ * @chain: Chain to which the extent should be added
+ * @start: Start of the extent (first physical block)
+ * @end: End of the extent (last physical block)
+ *
+ * The chain information is updated if the insertion is successful.
+ **/
+int toi_add_to_extent_chain(struct hibernate_extent_chain *chain,
+		unsigned long start, unsigned long end)
+{
+	struct hibernate_extent *new_ext = NULL, *cur_ext = NULL;
+
+	/* Find the right place in the chain */
+	if (chain->last_touched && chain->last_touched->start < start)
+		cur_ext = chain->last_touched;
+	else if (chain->first && chain->first->start < start)
+		cur_ext = chain->first;
+
+	if (cur_ext) {
+		while (cur_ext->next && cur_ext->next->start < start)
+			cur_ext = cur_ext->next;
+
+		if (cur_ext->end == (start - 1)) {
+			struct hibernate_extent *next_ext = cur_ext->next;
+			cur_ext->end = end;
+
+			/* Merge with the following one? */
+			if (next_ext && cur_ext->end + 1 == next_ext->start) {
+				cur_ext->end = next_ext->end;
+				cur_ext->next = next_ext->next;
+				toi_kfree(2, next_ext);
+				chain->num_extents--;
+			}
+
+			chain->last_touched = cur_ext;
+			chain->size += (end - start + 1);
+
+			return 0;
+		}
+	}
+
+	new_ext = toi_get_extent();
+	if (!new_ext) {
+		printk(KERN_INFO "Error: unable to append a new extent to the "
+				"chain.\n");
+		return -ENOMEM;
+	}
+
+	chain->num_extents++;
+	chain->size += (end - start + 1);
+	new_ext->start = start;
+	new_ext->end = end;
+
+	chain->last_touched = new_ext;
+
+	if (cur_ext) {
+		new_ext->next = cur_ext->next;
+		cur_ext->next = new_ext;
+	} else {
+		if (chain->first)
+			new_ext->next = chain->first;
+		chain->first = new_ext;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(toi_add_to_extent_chain);
+
+/**
+ * toi_serialise_extent_chain - write a chain in the image
+ * @owner: Module writing the chain.
+ * @chain: Chain to write.
+ **/
+int toi_serialise_extent_chain(struct toi_module_ops *owner,
+		struct hibernate_extent_chain *chain)
+{
+	struct hibernate_extent *this;
+	int ret, i = 0;
+
+	ret = toiActiveAllocator->rw_header_chunk(WRITE, owner, (char *) chain,
+			2 * sizeof(int));
+	if (ret)
+		return ret;
+
+	this = chain->first;
+	while (this) {
+		ret = toiActiveAllocator->rw_header_chunk(WRITE, owner,
+				(char *) this, 2 * sizeof(unsigned long));
+		if (ret)
+			return ret;
+		this = this->next;
+		i++;
+	}
+
+	if (i != chain->num_extents) {
+		printk(KERN_EMERG "Saved %d extents but chain metadata says "
+				"there should be %d.\n", i, chain->num_extents);
+		return 1;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(toi_serialise_extent_chain);
+
+/**
+ * toi_load_extent_chain - read back a chain saved in the image
+ * @chain: Chain to load
+ *
+ * The linked list of extents is reconstructed from the disk. chain will point
+ * to the first entry.
+ **/ +int toi_load_extent_chain(struct hibernate_extent_chain *chain) +{ + struct hibernate_extent *this, *last = NULL; + int i, ret; + + /* Get the next page */ + ret = toiActiveAllocator->rw_header_chunk_noreadahead(READ, NULL, + (char *) chain, 2 * sizeof(int)); + if (ret) { + printk(KERN_ERR "Failed to read the size of extent chain.\n"); + return 1; + } + + for (i = 0; i < chain->num_extents; i++) { + this = toi_kzalloc(3, sizeof(struct hibernate_extent), + TOI_ATOMIC_GFP); + if (!this) { + printk(KERN_INFO "Failed to allocate a new extent.\n"); + return -ENOMEM; + } + this->next = NULL; + /* Get the next page */ + ret = toiActiveAllocator->rw_header_chunk_noreadahead(READ, + NULL, (char *) this, 2 * sizeof(unsigned long)); + if (ret) { + printk(KERN_INFO "Failed to read an extent.\n"); + return 1; + } + if (last) + last->next = this; + else + chain->first = this; + last = this; + } + return 0; +} +EXPORT_SYMBOL_GPL(toi_load_extent_chain); + +/** + * toi_extent_state_next - go to the next extent + * + * Given a state, progress to the next valid entry. We may begin in an + * invalid state, as we do when invoked after extent_state_goto_start below. + * + * When using compression and expected_compression > 0, we let the image size + * be larger than storage, so we can validly run out of data to return. + **/ +unsigned long toi_extent_state_next(struct toi_extent_iterate_state *state) +{ + if (state->current_chain == state->num_chains) + return 0; + + if (state->current_extent) { + if (state->current_offset == state->current_extent->end) { + if (state->current_extent->next) { + state->current_extent = + state->current_extent->next; + state->current_offset = + state->current_extent->start; + } else { + state->current_extent = NULL; + state->current_offset = 0; + } + } else + state->current_offset++; + } + + while (!state->current_extent) { + int chain_num = ++(state->current_chain); + + if (chain_num == state->num_chains) + return 0; + + state->current_extent = (state->chains + chain_num)->first; + + if (!state->current_extent) + continue; + + state->current_offset = state->current_extent->start; + } + + return state->current_offset; +} +EXPORT_SYMBOL_GPL(toi_extent_state_next); + +/** + * toi_extent_state_goto_start - reinitialize an extent chain iterator + * @state: Iterator to reinitialize + **/ +void toi_extent_state_goto_start(struct toi_extent_iterate_state *state) +{ + state->current_chain = -1; + state->current_extent = NULL; + state->current_offset = 0; +} +EXPORT_SYMBOL_GPL(toi_extent_state_goto_start); + +/** + * toi_extent_state_save - save state of the iterator + * @state: Current state of the chain + * @saved_state: Iterator to populate + * + * Given a state and a struct hibernate_extent_state_store, save the current + * position in a format that can be used with relocated chains (at + * resume time). 
+ **/ +void toi_extent_state_save(struct toi_extent_iterate_state *state, + struct hibernate_extent_iterate_saved_state *saved_state) +{ + struct hibernate_extent *extent; + + saved_state->chain_num = state->current_chain; + saved_state->extent_num = 0; + saved_state->offset = state->current_offset; + + if (saved_state->chain_num == -1) + return; + + extent = (state->chains + state->current_chain)->first; + + while (extent != state->current_extent) { + saved_state->extent_num++; + extent = extent->next; + } +} +EXPORT_SYMBOL_GPL(toi_extent_state_save); + +/** + * toi_extent_state_restore - restore the position saved by extent_state_save + * @state: State to populate + * @saved_state: Iterator saved to restore + **/ +void toi_extent_state_restore(struct toi_extent_iterate_state *state, + struct hibernate_extent_iterate_saved_state *saved_state) +{ + int posn = saved_state->extent_num; + + if (saved_state->chain_num == -1) { + toi_extent_state_goto_start(state); + return; + } + + state->current_chain = saved_state->chain_num; + state->current_extent = (state->chains + state->current_chain)->first; + state->current_offset = saved_state->offset; + + while (posn--) + state->current_extent = state->current_extent->next; +} +EXPORT_SYMBOL_GPL(toi_extent_state_restore); diff --git a/kernel/power/tuxonice_extent.h b/kernel/power/tuxonice_extent.h new file mode 100644 index 0000000..22ffb9b --- /dev/null +++ b/kernel/power/tuxonice_extent.h @@ -0,0 +1,72 @@ +/* + * kernel/power/tuxonice_extent.h + * + * Copyright (C) 2003-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * It contains declarations related to extents. Extents are + * TuxOnIce's method of storing some of the metadata for the image. + * See tuxonice_extent.c for more info. + * + */ + +#include "tuxonice_modules.h" + +#ifndef EXTENT_H +#define EXTENT_H + +struct hibernate_extent { + unsigned long start, end; + struct hibernate_extent *next; +}; + +struct hibernate_extent_chain { + int size; /* size of the chain ie sum (max-min+1) */ + int num_extents; + struct hibernate_extent *first, *last_touched; +}; + +struct toi_extent_iterate_state { + struct hibernate_extent_chain *chains; + int num_chains; + int current_chain; + struct hibernate_extent *current_extent; + unsigned long current_offset; +}; + +struct hibernate_extent_iterate_saved_state { + int chain_num; + int extent_num; + unsigned long offset; +}; + +#define toi_extent_state_eof(state) \ + ((state)->num_chains == (state)->current_chain) + +/* Simplify iterating through all the values in an extent chain */ +#define toi_extent_for_each(extent_chain, extentpointer, value) \ +if ((extent_chain)->first) \ + for ((extentpointer) = (extent_chain)->first, (value) = \ + (extentpointer)->start; \ + ((extentpointer) && ((extentpointer)->next || (value) <= \ + (extentpointer)->end)); \ + (((value) == (extentpointer)->end) ? \ + ((extentpointer) = (extentpointer)->next, (value) = \ + ((extentpointer) ? 
(extentpointer)->start : 0)) : \
+			(value)++))
+
+void toi_put_extent_chain(struct hibernate_extent_chain *chain);
+int toi_add_to_extent_chain(struct hibernate_extent_chain *chain,
+		unsigned long start, unsigned long end);
+int toi_serialise_extent_chain(struct toi_module_ops *owner,
+		struct hibernate_extent_chain *chain);
+int toi_load_extent_chain(struct hibernate_extent_chain *chain);
+
+void toi_extent_state_save(struct toi_extent_iterate_state *state,
+		struct hibernate_extent_iterate_saved_state *saved_state);
+void toi_extent_state_restore(struct toi_extent_iterate_state *state,
+		struct hibernate_extent_iterate_saved_state *saved_state);
+void toi_extent_state_goto_start(struct toi_extent_iterate_state *state);
+unsigned long toi_extent_state_next(struct toi_extent_iterate_state *state);
+#endif
diff --git a/kernel/power/tuxonice_file.c b/kernel/power/tuxonice_file.c
new file mode 100644
index 0000000..4181c47
--- /dev/null
+++ b/kernel/power/tuxonice_file.c
@@ -0,0 +1,1241 @@
+/*
+ * kernel/power/tuxonice_file.c
+ *
+ * Copyright (C) 2005-2008 Nigel Cunningham (nigel at tuxonice net)
+ *
+ * Distributed under GPLv2.
+ *
+ * This file encapsulates functions for usage of a simple file as a
+ * backing store. It is based upon the swapallocator, and shares the
+ * same basic working. Here, though, we have nothing to do with
+ * swapspace, and only one device to worry about.
+ *
+ * The user can just
+ *
+ * echo TuxOnIce > /path/to/my_file
+ *
+ * dd if=/dev/zero bs=1M count=<size in MB> >> /path/to/my_file
+ *
+ * and
+ *
+ * echo /path/to/my_file > /sys/power/tuxonice/file/target
+ *
+ * then put what they find in /sys/power/tuxonice/resume
+ * as their resume= parameter in lilo.conf (and rerun lilo if using it).
+ *
+ * Having done this, they're ready to hibernate and resume.
+ *
+ * TODO:
+ * - File resizing.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/stat.h>
+#include <linux/mount.h>
+#include <linux/blkdev.h>
+#include <linux/genhd.h>
+#include <linux/device.h>
+#include <linux/bitops.h>
+#include <linux/mm.h>
+#include <linux/suspend.h>
+
+#include "tuxonice.h"
+#include "tuxonice_sysfs.h"
+#include "tuxonice_modules.h"
+#include "tuxonice_ui.h"
+#include "tuxonice_extent.h"
+#include "tuxonice_io.h"
+#include "tuxonice_storage.h"
+#include "tuxonice_block_io.h"
+#include "tuxonice_alloc.h"
+#include "tuxonice_builtin.h"
+
+static struct toi_module_ops toi_fileops;
+
+/* Details of our target. */
+
+static char toi_file_target[256];
+static struct inode *target_inode;
+static struct file *target_file;
+static struct block_device *toi_file_target_bdev;
+static dev_t resume_file_dev_t;
+static int used_devt;
+static int setting_toi_file_target;
+static sector_t target_firstblock, target_header_start;
+static int target_storage_available;
+static int target_claim;
+
+/* Old signatures */
+static char HaveImage[] = "HaveImage\n";
+static char NoImage[] = "TuxOnIce\n";
+#define sig_size (sizeof(HaveImage) + 1)
+
+struct toi_file_header {
+	char sig[sig_size];
+	int resumed_before;
+	unsigned long first_header_block;
+	int have_image;
+};
+
+/* Header Page Information */
+static int header_pages_reserved;
+
+/* Main Storage Pages */
+static int main_pages_allocated, main_pages_requested;
+
+#define target_is_normal_file() (S_ISREG(target_inode->i_mode))
+
+static struct toi_bdev_info devinfo;
+
+/* Extent chain for blocks */
+static struct hibernate_extent_chain block_chain;
+
+/* Signature operations */
+enum {
+	GET_IMAGE_EXISTS,
+	INVALIDATE,
+	MARK_RESUME_ATTEMPTED,
+	UNMARK_RESUME_ATTEMPTED,
+};
+
+/**
+ * set_devinfo - populate device information
+ * @bdev: Block device on which the file is.
+ * @target_blkbits: Number of bits in the page block size of the target + * file inode. + * + * Populate the devinfo structure about the target device. + * + * Background: a sector represents a fixed amount of data (generally 512 bytes). + * The hard drive sector size and the filesystem block size may be different. + * If fs_blksize mesures the filesystem block size and hd_blksize the hard drive + * sector size: + * + * sector << (fs_blksize - hd_blksize) converts hd sector into fs block + * fs_block >> (fs_blksize - hd_blksize) converts fs block into hd sector number + * + * Here target_blkbits == fs_blksize and hd_blksize == 9, hence: + * + * (fs_blksize - hd_blksize) == devinfo.bmap_shift + * + * The memory page size is defined by PAGE_SHIFT. devinfo.blocks_per_page is the + * number of filesystem blocks per memory page. + * + * Note that blocks are stored after >>. They are used after being <<. + * We always only use PAGE_SIZE aligned blocks. + * + * Side effects: + * devinfo.bdev, devinfo.bmap_shift and devinfo.blocks_per_page are set. + */ +static void set_devinfo(struct block_device *bdev, int target_blkbits) +{ + devinfo.bdev = bdev; + if (!target_blkbits) { + devinfo.bmap_shift = 0; + devinfo.blocks_per_page = 0; + } else { + /* We are assuming a hard disk with 512 (2^9) bytes/sector */ + devinfo.bmap_shift = target_blkbits - 9; + devinfo.blocks_per_page = (1 << (PAGE_SHIFT - target_blkbits)); + } +} + +static long raw_to_real(long raw) +{ + long result; + + result = raw - (raw * (sizeof(unsigned long) + sizeof(int)) + + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int) + 1)) / + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int)); + + return result < 0 ? 0 : result; +} + +static int toi_file_storage_available(void) +{ + int result = 0; + struct block_device *bdev = toi_file_target_bdev; + + if (!target_inode) + return 0; + + switch (target_inode->i_mode & S_IFMT) { + case S_IFSOCK: + case S_IFCHR: + case S_IFIFO: /* Socket, Char, Fifo */ + return -1; + case S_IFREG: /* Regular file: current size - holes + free + space on part */ + result = target_storage_available; + break; + case S_IFBLK: /* Block device */ + if (!bdev->bd_disk) { + printk(KERN_INFO "bdev->bd_disk null.\n"); + return 0; + } + + result = (bdev->bd_part ? + bdev->bd_part->nr_sects : + get_capacity(bdev->bd_disk)) >> (PAGE_SHIFT - 9); + } + + return raw_to_real(result); +} + +static int has_contiguous_blocks(int page_num) +{ + int j; + sector_t last = 0; + + for (j = 0; j < devinfo.blocks_per_page; j++) { + sector_t this = bmap(target_inode, + page_num * devinfo.blocks_per_page + j); + + if (!this || (last && (last + 1) != this)) + break; + + last = this; + } + + return j == devinfo.blocks_per_page; +} + +static int size_ignoring_ignored_pages(void) +{ + int mappable = 0, i; + + if (!target_is_normal_file()) + return toi_file_storage_available(); + + for (i = 0; i < (target_inode->i_size >> PAGE_SHIFT) ; i++) + if (has_contiguous_blocks(i)) + mappable++; + + return mappable; +} + +/** + * __populate_block_list - add an extent to the chain + * @min: Start of the extent (first physical block = sector) + * @max: End of the extent (last physical block = sector) + * + * If TOI_TEST_BIO is set, print a debug message, outputting the min and max + * fs block numbers. 
+ **/ +static int __populate_block_list(int min, int max) +{ + if (test_action_state(TOI_TEST_BIO)) + printk(KERN_INFO "Adding extent %d-%d.\n", + min << devinfo.bmap_shift, + ((max + 1) << devinfo.bmap_shift) - 1); + + return toi_add_to_extent_chain(&block_chain, min, max); +} + +static int apply_header_reservation(void) +{ + int i; + + /* Apply header space reservation */ + toi_extent_state_goto_start(&toi_writer_posn); + + for (i = 0; i < header_pages_reserved; i++) + if (toi_bio_ops.forward_one_page(1, 0)) + return -ENOSPC; + + /* The end of header pages will be the start of pageset 2 */ + toi_extent_state_save(&toi_writer_posn, &toi_writer_posn_save[2]); + + return 0; +} + +static int populate_block_list(void) +{ + int i, extent_min = -1, extent_max = -1, got_header = 0, result = 0; + + if (block_chain.first) + toi_put_extent_chain(&block_chain); + + if (!target_is_normal_file()) { + result = (target_storage_available > 0) ? + __populate_block_list(devinfo.blocks_per_page, + (target_storage_available + 1) * + devinfo.blocks_per_page - 1) : 0; + if (result) + return result; + goto out; + } + + for (i = 0; i < (target_inode->i_size >> PAGE_SHIFT); i++) { + sector_t new_sector; + + if (!has_contiguous_blocks(i)) + continue; + + new_sector = bmap(target_inode, (i * devinfo.blocks_per_page)); + + /* + * Ignore the first block in the file. + * It gets the header. + */ + if (new_sector == target_firstblock >> devinfo.bmap_shift) { + got_header = 1; + continue; + } + + /* + * I'd love to be able to fill in holes and resize + * files, but not yet... + */ + + if (new_sector == extent_max + 1) + extent_max += devinfo.blocks_per_page; + else { + if (extent_min > -1) { + result = __populate_block_list(extent_min, + extent_max); + if (result) + return result; + } + + extent_min = new_sector; + extent_max = extent_min + + devinfo.blocks_per_page - 1; + } + } + + if (extent_min > -1) { + result = __populate_block_list(extent_min, extent_max); + if (result) + return result; + } + +out: + return apply_header_reservation(); +} + +static void toi_file_cleanup(int finishing_cycle) +{ + if (toi_file_target_bdev) { + if (target_claim) { + bd_release(toi_file_target_bdev); + target_claim = 0; + } + + if (used_devt) { + blkdev_put(toi_file_target_bdev, + FMODE_READ | FMODE_NDELAY); + used_devt = 0; + } + toi_file_target_bdev = NULL; + target_inode = NULL; + set_devinfo(NULL, 0); + target_storage_available = 0; + } + + if (target_file && !IS_ERR(target_file)) + filp_close(target_file, NULL); + + target_file = NULL; +} + +/** + * reopen_resume_devt - reset the devinfo struct + * + * Having opened resume= once, we remember the major and + * minor nodes and use them to reopen the bdev for checking + * whether an image exists (possibly when starting a resume). 
+ **/ +static void reopen_resume_devt(void) +{ + toi_file_target_bdev = toi_open_by_devnum(resume_file_dev_t, + FMODE_READ | FMODE_NDELAY); + if (IS_ERR(toi_file_target_bdev)) { + printk(KERN_INFO "Got a dev_num (%lx) but failed to open it.\n", + (unsigned long) resume_file_dev_t); + return; + } + target_inode = toi_file_target_bdev->bd_inode; + set_devinfo(toi_file_target_bdev, target_inode->i_blkbits); +} + +static void toi_file_get_target_info(char *target, int get_size, + int resume_param) +{ + if (target_file) + toi_file_cleanup(0); + + if (!target || !strlen(target)) + return; + + target_file = filp_open(target, O_RDONLY|O_LARGEFILE, 0); + + if (IS_ERR(target_file) || !target_file) { + + if (!resume_param) { + printk(KERN_INFO "Open file %s returned %p.\n", + target, target_file); + target_file = NULL; + return; + } + + target_file = NULL; + wait_for_device_probe(); + resume_file_dev_t = name_to_dev_t(target); + if (!resume_file_dev_t) { + struct kstat stat; + int error = vfs_stat(target, &stat); + printk(KERN_INFO "Open file %s returned %p and " + "name_to_devt failed.\n", target, + target_file); + if (error) + printk(KERN_INFO "Stating the file also failed." + " Nothing more we can do.\n"); + else + resume_file_dev_t = stat.rdev; + return; + } + + toi_file_target_bdev = toi_open_by_devnum(resume_file_dev_t, + FMODE_READ | FMODE_NDELAY); + if (IS_ERR(toi_file_target_bdev)) { + printk(KERN_INFO "Got a dev_num (%lx) but failed to " + "open it.\n", + (unsigned long) resume_file_dev_t); + return; + } + used_devt = 1; + target_inode = toi_file_target_bdev->bd_inode; + } else + target_inode = target_file->f_mapping->host; + + if (S_ISLNK(target_inode->i_mode) || S_ISDIR(target_inode->i_mode) || + S_ISSOCK(target_inode->i_mode) || S_ISFIFO(target_inode->i_mode)) { + printk(KERN_INFO "File support works with regular files," + " character files and block devices.\n"); + goto cleanup; + } + + if (!used_devt) { + if (S_ISBLK(target_inode->i_mode)) { + toi_file_target_bdev = I_BDEV(target_inode); + if (!bd_claim(toi_file_target_bdev, &toi_fileops)) + target_claim = 1; + } else + toi_file_target_bdev = target_inode->i_sb->s_bdev; + resume_file_dev_t = toi_file_target_bdev->bd_dev; + } + + set_devinfo(toi_file_target_bdev, target_inode->i_blkbits); + + if (get_size) + target_storage_available = size_ignoring_ignored_pages(); + + if (!resume_param) + target_firstblock = bmap(target_inode, 0) << devinfo.bmap_shift; + + return; +cleanup: + target_inode = NULL; + if (target_file) { + filp_close(target_file, NULL); + target_file = NULL; + } + set_devinfo(NULL, 0); + target_storage_available = 0; +} + +static void toi_file_noresume_reset(void) +{ + toi_bio_ops.rw_cleanup(READ); +} + +/** + * parse_signature - check if the file is suitable for resuming + * @header: Signature of the file + * + * Given a file header, check the content of the file. Return true if it + * contains a valid hibernate image. + * TOI_RESUMED_BEFORE is set accordingly. 
+ **/ +static int parse_signature(struct toi_file_header *header) +{ + int have_image = !memcmp(HaveImage, header->sig, sizeof(HaveImage) - 1); + int no_image_header = !memcmp(NoImage, header->sig, + sizeof(NoImage) - 1); + int binary_sig = !memcmp(tuxonice_signature, header->sig, + sizeof(tuxonice_signature)); + + if (no_image_header || (binary_sig && !header->have_image)) + return 0; + + if (!have_image && !binary_sig) + return -1; + + if (header->resumed_before) + set_toi_state(TOI_RESUMED_BEFORE); + else + clear_toi_state(TOI_RESUMED_BEFORE); + + target_header_start = header->first_header_block; + return 1; +} + +/** + * prepare_signature - populate the signature structure + * @current_header: Signature structure to populate + * @first_header_block: Sector with the header containing the extents + **/ +static int prepare_signature(struct toi_file_header *current_header, + unsigned long first_header_block) +{ + memcpy(current_header->sig, tuxonice_signature, + sizeof(tuxonice_signature)); + current_header->resumed_before = 0; + current_header->first_header_block = first_header_block; + current_header->have_image = 1; + return 0; +} + +static int toi_file_storage_allocated(void) +{ + if (!target_inode) + return 0; + + if (target_is_normal_file()) + return (int) raw_to_real(target_storage_available); + else + return (int) raw_to_real(main_pages_requested); +} + +/** + * toi_file_release_storage - deallocate the block chain + **/ +static int toi_file_release_storage(void) +{ + toi_put_extent_chain(&block_chain); + + header_pages_reserved = 0; + main_pages_allocated = 0; + main_pages_requested = 0; + return 0; +} + +static void toi_file_reserve_header_space(int request) +{ + header_pages_reserved = request; +} + +static int toi_file_allocate_storage(int main_space_requested) +{ + int result = 0; + + int extra_pages = DIV_ROUND_UP(main_space_requested * + (sizeof(unsigned long) + sizeof(int)), PAGE_SIZE); + int pages_to_get = main_space_requested + extra_pages + + header_pages_reserved; + int blocks_to_get = pages_to_get - block_chain.size; + + /* Only release_storage reduces the size */ + if (blocks_to_get < 1) + return apply_header_reservation(); + + result = populate_block_list(); + + if (result) + return result; + + toi_message(TOI_WRITER, TOI_MEDIUM, 0, + "Finished with block_chain.size == %d.\n", + block_chain.size); + + if (block_chain.size < pages_to_get) { + printk(KERN_INFO "Block chain size (%d) < header pages (%d) + " + "extra pages (%d) + main pages (%d) (=%d " + "pages).\n", + block_chain.size, header_pages_reserved, + extra_pages, main_space_requested, + pages_to_get); + result = -ENOSPC; + } + + main_pages_requested = main_space_requested; + main_pages_allocated = main_space_requested + extra_pages; + return result; +} + +/** + * toi_file_write_header_init - save the header on the image + **/ +static int toi_file_write_header_init(void) +{ + int result; + + toi_bio_ops.rw_init(WRITE, 0); + toi_writer_buffer_posn = 0; + + /* Info needed to bootstrap goes at the start of the header. + * First we save the basic info needed for reading, including the number + * of header pages. Then we save the structs containing data needed + * for reading the header pages back. + * Note that even if header pages take more than one page, when we + * read back the info, we will have restored the location of the + * next header page by the time we go to use it. 
+ */ + + result = toi_bio_ops.rw_header_chunk(WRITE, &toi_fileops, + (char *) &toi_writer_posn_save, + sizeof(toi_writer_posn_save)); + + if (result) + return result; + + result = toi_bio_ops.rw_header_chunk(WRITE, &toi_fileops, + (char *) &devinfo, sizeof(devinfo)); + + if (result) + return result; + + /* Flush the chain */ + toi_serialise_extent_chain(&toi_fileops, &block_chain); + + return 0; +} + +static int toi_file_write_header_cleanup(void) +{ + struct toi_file_header *header; + int result, result2; + unsigned long sig_page = toi_get_zeroed_page(38, TOI_ATOMIC_GFP); + + /* Write any unsaved data */ + result = toi_bio_ops.write_header_chunk_finish(); + + if (result) + goto out; + + toi_extent_state_goto_start(&toi_writer_posn); + toi_bio_ops.forward_one_page(1, 1); + + /* Adjust image header */ + result = toi_bio_ops.bdev_page_io(READ, toi_file_target_bdev, + target_firstblock, + virt_to_page(sig_page)); + if (result) + goto out; + + header = (struct toi_file_header *) sig_page; + + prepare_signature(header, + toi_writer_posn.current_offset << + devinfo.bmap_shift); + + result = toi_bio_ops.bdev_page_io(WRITE, toi_file_target_bdev, + target_firstblock, + virt_to_page(sig_page)); + +out: + result2 = toi_bio_ops.finish_all_io(); + toi_free_page(38, sig_page); + + return result ? result : result2; +} + +/* HEADER READING */ + +/** + * toi_file_read_header_init - check content of signature + * + * Entry point of the resume path. + * 1. Attempt to read the device specified with resume=. + * 2. Check the contents of the header for our signature. + * 3. Warn, ignore, reset and/or continue as appropriate. + * 4. If continuing, read the toi_file configuration section + * of the header and set up block device info so we can read + * the rest of the header & image. + * + * Returns: + * May not return if user choose to reboot at a warning. + * -EINVAL if cannot resume at this time. Booting should continue + * normally. + **/ +static int toi_file_read_header_init(void) +{ + int result; + struct block_device *tmp; + + /* Allocate toi_writer_buffer */ + toi_bio_ops.read_header_init(); + + /* + * Read toi_file configuration (header containing metadata). + * target_header_start is the first sector of the header. It has been + * set when checking if the file was suitable for resuming, see + * do_toi_step(STEP_RESUME_CAN_RESUME). + */ + result = toi_bio_ops.bdev_page_io(READ, toi_file_target_bdev, + target_header_start, + virt_to_page((unsigned long) toi_writer_buffer)); + + if (result) { + printk(KERN_ERR "FileAllocator read header init: Failed to " + "initialise reading the first page of data.\n"); + toi_bio_ops.rw_cleanup(READ); + return result; + } + + /* toi_writer_posn_save[0] contains the header */ + memcpy(&toi_writer_posn_save, toi_writer_buffer, + sizeof(toi_writer_posn_save)); + + /* Save the position in the buffer */ + toi_writer_buffer_posn = sizeof(toi_writer_posn_save); + + tmp = devinfo.bdev; + + /* See tuxonice_block_io.h */ + memcpy(&devinfo, + toi_writer_buffer + toi_writer_buffer_posn, + sizeof(devinfo)); + + devinfo.bdev = tmp; + toi_writer_buffer_posn += sizeof(devinfo); + + /* Reinitialize the extent pointer */ + toi_extent_state_goto_start(&toi_writer_posn); + /* Jump to the next page */ + toi_bio_ops.set_extra_page_forward(); + + /* Bring back the chain from disk: this will read + * all extents. 
+ */ + return toi_load_extent_chain(&block_chain); +} + +static int toi_file_read_header_cleanup(void) +{ + toi_bio_ops.rw_cleanup(READ); + return 0; +} + +/** + * toi_file_signature_op - perform an operation on the file signature + * @op: operation to perform + * + * op is either GET_IMAGE_EXISTS, INVALIDATE, MARK_RESUME_ATTEMPTED or + * UNMARK_RESUME_ATTEMPTED. + * If the signature is changed, an I/O operation is performed. + * The signature exists iff toi_file_signature_op(GET_IMAGE_EXISTS)>-1. + **/ +static int toi_file_signature_op(int op) +{ + char *cur; + int result = 0, result2, changed = 0; + struct toi_file_header *header; + + if (!toi_file_target_bdev || IS_ERR(toi_file_target_bdev)) + return -1; + + cur = (char *) toi_get_zeroed_page(17, TOI_ATOMIC_GFP); + if (!cur) { + printk(KERN_INFO "Unable to allocate a page for reading the " + "image signature.\n"); + return -ENOMEM; + } + + result = toi_bio_ops.bdev_page_io(READ, toi_file_target_bdev, + target_firstblock, + virt_to_page(cur)); + + if (result) + goto out; + + header = (struct toi_file_header *) cur; + result = parse_signature(header); + + switch (op) { + case INVALIDATE: + if (result == -1) + goto out; + + memcpy(header->sig, tuxonice_signature, + sizeof(tuxonice_signature)); + header->resumed_before = 0; + header->have_image = 0; + result = 1; + changed = 1; + break; + case MARK_RESUME_ATTEMPTED: + if (result == 1) { + header->resumed_before = 1; + changed = 1; + } + break; + case UNMARK_RESUME_ATTEMPTED: + if (result == 1) { + header->resumed_before = 0; + changed = 1; + } + break; + } + + if (changed) { + int io_result = toi_bio_ops.bdev_page_io(WRITE, + toi_file_target_bdev, target_firstblock, + virt_to_page(cur)); + if (io_result) + result = io_result; + } + +out: + result2 = toi_bio_ops.finish_all_io(); + toi_free_page(17, (unsigned long) cur); + return result ? result : result2; +} + +/** + * toi_file_print_debug_stats - print debug info + * @buffer: Buffer to data to populate + * @size: Size of the buffer + **/ +static int toi_file_print_debug_stats(char *buffer, int size) +{ + int len = 0; + + if (toiActiveAllocator != &toi_fileops) { + len = scnprintf(buffer, size, + "- FileAllocator inactive.\n"); + return len; + } + + len = scnprintf(buffer, size, "- FileAllocator active.\n"); + + len += scnprintf(buffer+len, size-len, " Storage available for " + "image: %d pages.\n", + toi_file_storage_allocated()); + + return len; +} + +/** + * toi_file_storage_needed - storage needed + * + * Returns amount of space in the image header required + * for the toi_file's data. + * + * We ensure the space is allocated, but actually save the + * data from write_header_init and therefore don't also define a + * save_config_info routine. + **/ +static int toi_file_storage_needed(void) +{ + return strlen(toi_file_target) + 1 + + sizeof(toi_writer_posn_save) + + sizeof(devinfo) + + 2 * sizeof(int) + + (2 * sizeof(unsigned long) * block_chain.num_extents); +} + +/** + * toi_file_remove_image - invalidate the image + **/ +static int toi_file_remove_image(void) +{ + toi_file_release_storage(); + return toi_file_signature_op(INVALIDATE); +} + +/** + * toi_file_image_exists - test if an image exists + * + * Repopulate toi_file_target_bdev if needed. 
+ **/ +static int toi_file_image_exists(int quiet) +{ + if (!toi_file_target_bdev) + reopen_resume_devt(); + return toi_file_signature_op(GET_IMAGE_EXISTS); +} + +/** + * toi_file_mark_resume_attempted - mark resume attempted if so + * @mark: attempted flag + * + * Record that we tried to resume from this image. Resuming + * multiple times from the same image may be dangerous + * (possible filesystem corruption). + **/ +static int toi_file_mark_resume_attempted(int mark) +{ + return toi_file_signature_op(mark ? MARK_RESUME_ATTEMPTED : + UNMARK_RESUME_ATTEMPTED); +} + +/** + * toi_file_set_resume_param - validate the specified resume file + * + * Given a target filename, populate the resume parameter. This is + * meant to be used by the user to populate the kernel command line. + * By setting /sys/power/tuxonice/file/target, the valid resume + * parameter to use is set and accessible through + * /sys/power/tuxonice/resume. + * + * If the file could be located, we check if it contains a valid + * signature. + **/ +static void toi_file_set_resume_param(void) +{ + char *buffer = (char *) toi_get_zeroed_page(18, TOI_ATOMIC_GFP); + char *buffer2 = (char *) toi_get_zeroed_page(19, TOI_ATOMIC_GFP); + unsigned long sector = bmap(target_inode, 0); + int offset = 0; + + if (!buffer || !buffer2) { + if (buffer) + toi_free_page(18, (unsigned long) buffer); + if (buffer2) + toi_free_page(19, (unsigned long) buffer2); + printk(KERN_ERR "TuxOnIce: Failed to allocate memory while " + "setting resume= parameter.\n"); + return; + } + + if (toi_file_target_bdev) { + set_devinfo(toi_file_target_bdev, target_inode->i_blkbits); + + bdevname(toi_file_target_bdev, buffer2); + offset += snprintf(buffer + offset, PAGE_SIZE - offset, + "/dev/%s", buffer2); + + if (sector) + /* The offset is: sector << (inode->i_blkbits - 9) */ + offset += snprintf(buffer + offset, PAGE_SIZE - offset, + ":0x%lx", sector << devinfo.bmap_shift); + } else + offset += snprintf(buffer + offset, PAGE_SIZE - offset, + "%s is not a valid target.", toi_file_target); + + sprintf(resume_file, "file:%s", buffer); + + toi_free_page(18, (unsigned long) buffer); + toi_free_page(19, (unsigned long) buffer2); + + toi_attempt_to_parse_resume_device(1); +} + +/** + * __test_toi_file_target - is the file target valid for hibernating? + * @target: target file + * @resume_param: whether resume= has been specified + * @quiet: quiet flag + * + * Test whether the file target can be used for hibernating: valid target + * and signature. + * The resume parameter is set if needed. + **/ +static int __test_toi_file_target(char *target, int resume_param, int quiet) +{ + toi_file_get_target_info(target, 0, resume_param); + if (toi_file_signature_op(GET_IMAGE_EXISTS) > -1) { + if (!quiet) + printk(KERN_INFO "TuxOnIce: FileAllocator: File " + "signature found.\n"); + if (!resume_param) + toi_file_set_resume_param(); + + toi_bio_ops.set_devinfo(&devinfo); + toi_writer_posn.chains = &block_chain; + toi_writer_posn.num_chains = 1; + + if (!resume_param) + set_toi_state(TOI_CAN_HIBERNATE); + return 0; + } + + /* + * Target unaccessible or no signature found + * Most errors have already been reported + */ + + clear_toi_state(TOI_CAN_HIBERNATE); + + if (quiet) + return 1; + + if (*target) + printk(KERN_INFO "TuxOnIce: FileAllocator: Sorry. No signature " + "found at %s.\n", target); + else + if (!resume_param) + printk(KERN_INFO "TuxOnIce: FileAllocator: Sorry. 
" + "Target is not set for hibernating.\n"); + + return 1; +} + +/** + * test_toi_file_target - sysfs callback for /sys/power/tuxonince/file/target + * + * Test wheter the target file is valid for hibernating. + **/ +static void test_toi_file_target(void) +{ + setting_toi_file_target = 1; + + printk(KERN_INFO "TuxOnIce: Hibernating %sabled.\n", + __test_toi_file_target(toi_file_target, 0, 1) ? + "dis" : "en"); + + setting_toi_file_target = 0; +} + +/** + * toi_file_parse_sig_location - parse image Location + * @commandline: the resume parameter + * @only_writer: ?? + * @quiet: quiet flag + * + * Attempt to parse a resume= parameter. + * File Allocator accepts: + * resume=file:DEVNAME[:FIRSTBLOCK] + * + * Where: + * DEVNAME is convertable to a dev_t by name_to_dev_t + * FIRSTBLOCK is the location of the first block in the file. + * BLOCKSIZE is the logical blocksize >= SECTOR_SIZE & + * <= PAGE_SIZE, + * mod SECTOR_SIZE == 0 of the device. + * + * Data is validated by attempting to read a header from the + * location given. Failure will result in toi_file refusing to + * save an image, and a reboot with correct parameters will be + * necessary. + **/ +static int toi_file_parse_sig_location(char *commandline, + int only_writer, int quiet) +{ + char *thischar, *devstart = NULL, *colon = NULL, *at_symbol = NULL; + int result = -EINVAL, target_blocksize = 0; + + if (strncmp(commandline, "file:", 5)) { + if (!only_writer) + return 1; + } else + commandline += 5; + + /* + * Don't check signature again if we're beginning a cycle. If we already + * did the initialisation successfully, assume we'll be okay when it + * comes to resuming. + */ + if (toi_file_target_bdev) + return 0; + + devstart = commandline; + thischar = commandline; + while ((*thischar != ':') && (*thischar != '@') && + ((thischar - commandline) < 250) && (*thischar)) + thischar++; + + if (*thischar == ':') { + colon = thischar; + *colon = 0; + thischar++; + } + + while ((*thischar != '@') && ((thischar - commandline) < 250) + && (*thischar)) + thischar++; + + if (*thischar == '@') { + at_symbol = thischar; + *at_symbol = 0; + } + + /* + * For the toi_file, you can be able to resume, but not hibernate, + * because the resume= is set correctly, but the toi_file_target + * isn't. + * + * We may have come here as a result of setting resume or + * toi_file_target. We only test the toi_file target in the + * former case (it's already done in the later), and we do it before + * setting the block number ourselves. It will overwrite the values + * given on the command line if we don't. + */ + + if (!setting_toi_file_target) /* Concurrent write via /sys? 
*/
+		__test_toi_file_target(toi_file_target, 1, 0);
+
+	if (colon)
+		target_firstblock = (int) simple_strtoul(colon + 1, NULL, 0);
+	else
+		target_firstblock = 0;
+
+	if (at_symbol) {
+		target_blocksize = (int) simple_strtoul(at_symbol + 1, NULL, 0);
+		if (target_blocksize & (SECTOR_SIZE - 1)) {
+			printk(KERN_INFO "FileAllocator: Blocksizes are "
+					"multiples of %d.\n", SECTOR_SIZE);
+			result = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (!quiet)
+		printk(KERN_INFO "TuxOnIce FileAllocator: Testing whether you "
+				"can resume:\n");
+
+	toi_file_get_target_info(commandline, 0, 1);
+
+	if (!toi_file_target_bdev || IS_ERR(toi_file_target_bdev)) {
+		toi_file_target_bdev = NULL;
+		result = -1;
+		goto out;
+	}
+
+	if (target_blocksize)
+		set_devinfo(toi_file_target_bdev, ffs(target_blocksize));
+
+	result = __test_toi_file_target(commandline, 1, quiet);
+
+out:
+	if (result)
+		clear_toi_state(TOI_CAN_HIBERNATE);
+
+	if (!quiet)
+		printk(KERN_INFO "Resuming %sabled.\n", result ? "dis" : "en");
+
+	if (colon)
+		*colon = ':';
+	if (at_symbol)
+		*at_symbol = '@';
+
+	return result;
+}
+
+/**
+ * toi_file_save_config_info - populate toi_file_target
+ * @buffer: Pointer to a buffer of size PAGE_SIZE.
+ *
+ * Save the target's name, not for resume time, but for
+ * all_settings.
+ * Returns:
+ *	Number of bytes used for saving our data.
+ **/
+static int toi_file_save_config_info(char *buffer)
+{
+	strcpy(buffer, toi_file_target);
+	return strlen(toi_file_target) + 1;
+}
+
+/**
+ * toi_file_load_config_info - reload target's name
+ * @buffer: pointer to the start of the data
+ * @size: number of bytes that were saved
+ *
+ * toi_file_target is set to buffer.
+ **/
+static void toi_file_load_config_info(char *buffer, int size)
+{
+	strlcpy(toi_file_target, buffer, size);
+}
+
+static int toi_file_initialise(int starting_cycle)
+{
+	if (starting_cycle) {
+		if (toiActiveAllocator != &toi_fileops)
+			return 0;
+
+		if (starting_cycle & SYSFS_HIBERNATE && !*toi_file_target) {
+			printk(KERN_INFO "FileAllocator is the active writer, "
+					"but no filename has been set.\n");
+			return 1;
+		}
+	}
+
+	if (*toi_file_target)
+		toi_file_get_target_info(toi_file_target, starting_cycle, 0);
+
+	if (starting_cycle && (toi_file_image_exists(1) == -1)) {
+		printk(KERN_INFO "%s does not have a valid signature for "
+				"hibernating.\n", toi_file_target);
+		return 1;
+	}
+
+	return 0;
+}
+
+static struct toi_sysfs_data sysfs_params[] = {
+
+	SYSFS_STRING("target", SYSFS_RW, toi_file_target, 256,
+			SYSFS_NEEDS_SM_FOR_WRITE, test_toi_file_target),
+	SYSFS_INT("enabled", SYSFS_RW, &toi_fileops.enabled, 0, 1, 0,
+			attempt_to_parse_resume_device2)
+};
+
+static struct toi_module_ops toi_fileops = {
+	.type			= WRITER_MODULE,
+	.name			= "file storage",
+	.directory		= "file",
+	.module			= THIS_MODULE,
+	.print_debug_info	= toi_file_print_debug_stats,
+	.save_config_info	= toi_file_save_config_info,
+	.load_config_info	= toi_file_load_config_info,
+	.storage_needed		= toi_file_storage_needed,
+	.initialise		= toi_file_initialise,
+	.cleanup		= toi_file_cleanup,
+
+	.noresume_reset		= toi_file_noresume_reset,
+	.storage_available	= toi_file_storage_available,
+	.storage_allocated	= toi_file_storage_allocated,
+	.reserve_header_space	= toi_file_reserve_header_space,
+	.allocate_storage	= toi_file_allocate_storage,
+	.image_exists		= toi_file_image_exists,
+	.mark_resume_attempted	= toi_file_mark_resume_attempted,
+	.write_header_init	= toi_file_write_header_init,
+	.write_header_cleanup	= toi_file_write_header_cleanup,
+	.read_header_init	= toi_file_read_header_init,
+	.read_header_cleanup	= toi_file_read_header_cleanup,
+	.remove_image		= toi_file_remove_image,
+	.parse_sig_location	= toi_file_parse_sig_location,
+
+	.sysfs_data		= sysfs_params,
+	.num_sysfs_entries	= sizeof(sysfs_params) /
+		sizeof(struct toi_sysfs_data),
+};
+
+/* ---- Registration ---- */
+static __init int toi_file_load(void)
+{
+	toi_fileops.rw_init = toi_bio_ops.rw_init;
+	toi_fileops.rw_cleanup = toi_bio_ops.rw_cleanup;
+	toi_fileops.read_page = toi_bio_ops.read_page;
+	toi_fileops.write_page = toi_bio_ops.write_page;
+	toi_fileops.rw_header_chunk = toi_bio_ops.rw_header_chunk;
+	toi_fileops.rw_header_chunk_noreadahead =
+		toi_bio_ops.rw_header_chunk_noreadahead;
+	toi_fileops.io_flusher = toi_bio_ops.io_flusher;
+	toi_fileops.update_throughput_throttle =
+		toi_bio_ops.update_throughput_throttle;
+	toi_fileops.finish_all_io = toi_bio_ops.finish_all_io;
+
+	return toi_register_module(&toi_fileops);
+}
+
+#ifdef MODULE
+static __exit void toi_file_unload(void)
+{
+	toi_unregister_module(&toi_fileops);
+}
+
+module_init(toi_file_load);
+module_exit(toi_file_unload);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nigel Cunningham");
+MODULE_DESCRIPTION("TuxOnIce FileAllocator");
+#else
+late_initcall(toi_file_load);
+#endif
diff --git a/kernel/power/tuxonice_highlevel.c b/kernel/power/tuxonice_highlevel.c
new file mode 100644
index 0000000..e56dbed
--- /dev/null
+++ b/kernel/power/tuxonice_highlevel.c
@@ -0,0 +1,1335 @@
+/*
+ * kernel/power/tuxonice_highlevel.c
+ */
+/** \mainpage TuxOnIce.
+ *
+ * TuxOnIce provides support for saving and restoring an image of
+ * system memory to an arbitrary storage device, either on the local computer,
+ * or across some network. The support is entirely OS based, so TuxOnIce
+ * works without requiring BIOS, APM or ACPI support. The vast majority of the
+ * code is also architecture independent, so it should be very easy to port
+ * the code to new architectures. TuxOnIce includes support for SMP, 4G HighMem
+ * and preemption. Initramfses and initrds are also supported.
+ *
+ * TuxOnIce uses a modular design, in which the method of storing the image is
+ * completely abstracted from the core code, as are transformations on the data
+ * such as compression and/or encryption (multiple 'modules' can be used to
+ * provide arbitrary combinations of functionality). The user interface is also
+ * modular, so that arbitrarily simple or complex interfaces can be used to
+ * provide anything from debugging information through to eye candy.
+ *
+ * \section Copyright
+ *
+ * TuxOnIce is released under the GPLv2.
+ *
+ * Copyright (C) 1998-2001 Gabor Kuti
+ * Copyright (C) 1998,2001,2002 Pavel Machek
+ * Copyright (C) 2002-2003 Florent Chabaud
+ * Copyright (C) 2002-2008 Nigel Cunningham (nigel at tuxonice net)
+ * + * \section Credits + * + * Nigel would like to thank the following people for their work: + * + * Bernard Blackham
+ * Web page & Wiki administration, some coding. A person without whom + * TuxOnIce would not be where it is. + * + * Michael Frank
+ * Extensive testing and help with improving stability. I was constantly + * amazed by the quality and quantity of Michael's help. + * + * Pavel Machek
+ * Modifications, defectiveness pointing, being with Gabor at the very
+ * beginning, suspend to swap space, stop all tasks. Port to 2.4.18-ac and
+ * 2.5.17. Even though Pavel and I disagree on the direction suspend to
+ * disk should take, I appreciate the valuable work he did in helping Gabor
+ * get the concept working.
+ *
+ * ...and of course the myriads of TuxOnIce users who have helped diagnose
+ * and fix bugs, made suggestions on how to improve the code, proofread
+ * documentation, and donated time and money.
+ *
+ * Thanks also to corporate sponsors:
+ *
+ * Redhat. Sometime employer from May 2006 (my fault, not Redhat's!).
+ *
+ * Cyclades.com. Nigel's employers from Dec 2004 until May 2006, who
+ * allowed him to work on TuxOnIce and PM related issues on company time.
+ *
+ * LinuxFund.org. Sponsored Nigel's work on TuxOnIce for four months Oct
+ * 2003 to Jan 2004.
+ *
+ * LAC Linux. Donated P4 hardware that enabled development and ongoing
+ * maintenance of SMP and Highmem support.
+ *
+ * OSDL. Provided access to various hardware configurations, and made
+ * occasional small donations to the project.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 	/* for get/set_fs & KERNEL_DS on i386 */
+
+#include "tuxonice.h"
+#include "tuxonice_modules.h"
+#include "tuxonice_sysfs.h"
+#include "tuxonice_prepare_image.h"
+#include "tuxonice_io.h"
+#include "tuxonice_ui.h"
+#include "tuxonice_power_off.h"
+#include "tuxonice_storage.h"
+#include "tuxonice_checksum.h"
+#include "tuxonice_builtin.h"
+#include "tuxonice_atomic_copy.h"
+#include "tuxonice_alloc.h"
+#include "tuxonice_cluster.h"
+
+/*! Pageset metadata. */
+struct pagedir pagedir2 = {2};
+EXPORT_SYMBOL_GPL(pagedir2);
+
+static mm_segment_t oldfs;
+static DEFINE_MUTEX(tuxonice_in_use);
+static int block_dump_save;
+static char pre_hibernate_command[256];
+static char post_hibernate_command[256];
+
+/* Binary signature if an image is present */
+char *tuxonice_signature = "\xed\xc3\x02\xe9\x98\x56\xe5\x0c";
+EXPORT_SYMBOL_GPL(tuxonice_signature);
+
+int do_toi_step(int step);
+
+unsigned long boot_kernel_data_buffer;
+
+static char *result_strings[] = {
+	"Hibernation was aborted",
+	"The user requested that we cancel the hibernation",
+	"No storage was available",
+	"Insufficient storage was available",
+	"Freezing filesystems and/or tasks failed",
+	"A pre-existing image was used",
+	"We would free memory, but image size limit doesn't allow this",
+	"Unable to free enough memory to hibernate",
+	"Unable to obtain the Power Management Semaphore",
+	"A device suspend/resume returned an error",
+	"A system device suspend/resume returned an error",
+	"The extra pages allowance is too small",
+	"We were unable to successfully prepare an image",
+	"TuxOnIce module initialisation failed",
+	"TuxOnIce module cleanup failed",
+	"I/O errors were encountered",
+	"Ran out of memory",
+	"An error was encountered while reading the image",
+	"Platform preparation failed",
+	"CPU Hotplugging failed",
+	"Architecture specific preparation failed",
+	"Pages needed resaving, but we were told to abort if this happens",
+	"We can't hibernate at the moment (invalid resume= or filewriter "
+		"target?)",
+	"A hibernation preparation notifier chain member cancelled the "
+		"hibernation",
+	"Pre-snapshot preparation failed",
+	"Pre-restore preparation failed",
+	"Failed to disable usermode helpers",
+	"Can't resume from alternate image",
+	"Header reservation too small",
+};
+
+/**
+ * toi_finish_anything - cleanup after doing anything
+ * @hibernate_or_resume: Whether finishing a cycle or an attempt at
+ *			resuming.
+ *
+ * This is our basic clean-up routine, matching toi_start_anything below. We
+ * call cleanup routines, drop module references and restore process fs and
+ * cpus allowed masks, together with the global block_dump variable's value.
+ **/
+void toi_finish_anything(int hibernate_or_resume)
+{
+	toi_cleanup_modules(hibernate_or_resume);
+	toi_put_modules();
+	if (hibernate_or_resume) {
+		block_dump = block_dump_save;
+		set_cpus_allowed(current, CPU_MASK_ALL);
+		toi_alloc_print_debug_stats();
+
+		if (hibernate_or_resume == SYSFS_HIBERNATE &&
+				strlen(post_hibernate_command))
+			toi_launch_userspace_program(post_hibernate_command,
+					0, UMH_WAIT_PROC, 0);
+		atomic_inc(&snapshot_device_available);
+		mutex_unlock(&pm_mutex);
+	}
+
+	set_fs(oldfs);
+	mutex_unlock(&tuxonice_in_use);
+}
+
+/**
+ * toi_start_anything - basic initialisation for TuxOnIce
+ * @hibernate_or_resume: Whether starting a cycle or an attempt at resuming.
+ *
+ * Our basic initialisation routine. Take references on modules, use the
+ * kernel segment, recheck resume= if no active allocator is set, initialise
+ * modules, save and reset block_dump and ensure we're running on CPU0.
+ **/
+int toi_start_anything(int hibernate_or_resume)
+{
+	int starting_cycle = (hibernate_or_resume == SYSFS_HIBERNATE);
+
+	mutex_lock(&tuxonice_in_use);
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (hibernate_or_resume) {
+		mutex_lock(&pm_mutex);
+
+		if (!atomic_add_unless(&snapshot_device_available, -1, 0))
+			goto snapshotdevice_unavailable;
+	}
+
+	if (starting_cycle && strlen(pre_hibernate_command)) {
+		int result = toi_launch_userspace_program(pre_hibernate_command,
+				0, UMH_WAIT_PROC, 0);
+		if (result) {
+			printk(KERN_INFO "Pre-hibernate command '%s' returned"
+					" %d. Aborting.\n",
+					pre_hibernate_command, result);
+			goto prehibernate_err;
+		}
+	}
+
+	if (hibernate_or_resume == SYSFS_HIBERNATE)
+		toi_print_modules();
+
+	if (toi_get_modules()) {
+		printk(KERN_INFO "TuxOnIce: Get modules failed!\n");
+		goto prehibernate_err;
+	}
+
+	if (hibernate_or_resume) {
+		block_dump_save = block_dump;
+		block_dump = 0;
+		set_cpus_allowed(current,
+				cpumask_of_cpu(first_cpu(cpu_online_map)));
+	}
+
+	if (toi_initialise_modules_early(hibernate_or_resume))
+		goto early_init_err;
+
+	if (!toiActiveAllocator)
+		toi_attempt_to_parse_resume_device(!hibernate_or_resume);
+
+	if (!toi_initialise_modules_late(hibernate_or_resume))
+		return 0;
+
+	toi_cleanup_modules(hibernate_or_resume);
+early_init_err:
+	if (hibernate_or_resume) {
+		block_dump = block_dump_save;
+		set_cpus_allowed(current, CPU_MASK_ALL);
+	}
+prehibernate_err:
+	if (hibernate_or_resume)
+		atomic_inc(&snapshot_device_available);
+snapshotdevice_unavailable:
+	if (hibernate_or_resume)
+		mutex_unlock(&pm_mutex);
+	set_fs(oldfs);
+	mutex_unlock(&tuxonice_in_use);
+	return -EBUSY;
+}
+
+/*
+ * Nosave page tracking.
+ *
+ * Here rather than in prepare_image because we want to do it once only at the
+ * start of a cycle.
+ */
+
+/**
+ * mark_nosave_pages - set up our Nosave bitmap
+ *
+ * Build a bitmap of Nosave pages from the list. The bitmap allows faster
+ * lookups when preparing the image.
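+ *
+ * For example, if a driver has registered a nosave region covering, say,
+ * pfns 0x100-0x10f, the loop below sets PageNosave on those sixteen pages
+ * once, and image preparation can then test a page flag instead of
+ * rewalking nosave_regions for every pfn.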
+ **/ +static void mark_nosave_pages(void) +{ + struct nosave_region *region; + + list_for_each_entry(region, &nosave_regions, list) { + unsigned long pfn; + + for (pfn = region->start_pfn; pfn < region->end_pfn; pfn++) + if (pfn_valid(pfn)) + SetPageNosave(pfn_to_page(pfn)); + } +} + +static int alloc_a_bitmap(struct memory_bitmap **bm) +{ + int result = 0; + + *bm = kzalloc(sizeof(struct memory_bitmap), GFP_KERNEL); + if (!*bm) { + printk(KERN_ERR "Failed to kzalloc memory for a bitmap.\n"); + return -ENOMEM; + } + + result = memory_bm_create(*bm, GFP_KERNEL, 0); + + if (result) { + printk(KERN_ERR "Failed to create a bitmap.\n"); + kfree(*bm); + } + + return result; +} + +/** + * allocate_bitmaps - allocate bitmaps used to record page states + * + * Allocate the bitmaps we use to record the various TuxOnIce related + * page states. + **/ +static int allocate_bitmaps(void) +{ + if (alloc_a_bitmap(&pageset1_map) || + alloc_a_bitmap(&pageset1_copy_map) || + alloc_a_bitmap(&pageset2_map) || + alloc_a_bitmap(&io_map) || + alloc_a_bitmap(&nosave_map) || + alloc_a_bitmap(&free_map) || + alloc_a_bitmap(&page_resave_map)) + return 1; + + return 0; +} + +static void free_a_bitmap(struct memory_bitmap **bm) +{ + if (!*bm) + return; + + memory_bm_free(*bm, 0); + kfree(*bm); + *bm = NULL; +} + +/** + * free_bitmaps - free the bitmaps used to record page states + * + * Free the bitmaps allocated above. It is not an error to call + * memory_bm_free on a bitmap that isn't currently allocated. + **/ +static void free_bitmaps(void) +{ + free_a_bitmap(&pageset1_map); + free_a_bitmap(&pageset1_copy_map); + free_a_bitmap(&pageset2_map); + free_a_bitmap(&io_map); + free_a_bitmap(&nosave_map); + free_a_bitmap(&free_map); + free_a_bitmap(&page_resave_map); +} + +/** + * io_MB_per_second - return the number of MB/s read or written + * @write: Whether to return the speed at which we wrote. + * + * Calculate the number of megabytes per second that were read or written. + **/ +static int io_MB_per_second(int write) +{ + return (toi_bkd.toi_io_time[write][1]) ? + MB((unsigned long) toi_bkd.toi_io_time[write][0]) * HZ / + toi_bkd.toi_io_time[write][1] : 0; +} + +#define SNPRINTF(a...) do { len += scnprintf(((char *) buffer) + len, \ + count - len - 1, ## a); } while (0) + +/** + * get_debug_info - fill a buffer with debugging information + * @buffer: The buffer to be filled. + * @count: The size of the buffer, in bytes. + * + * Fill a (usually PAGE_SIZEd) buffer with the debugging info that we will + * either printk or return via sysfs. + **/ +static int get_toi_debug_info(const char *buffer, int count) +{ + int len = 0, i, first_result = 1; + + SNPRINTF("TuxOnIce debugging info:\n"); + SNPRINTF("- TuxOnIce core : " TOI_CORE_VERSION "\n"); + SNPRINTF("- Kernel Version : " UTS_RELEASE "\n"); + SNPRINTF("- Compiler vers. 
: %d.%d\n", __GNUC__, __GNUC_MINOR__); + SNPRINTF("- Attempt number : %d\n", nr_hibernates); + SNPRINTF("- Parameters : %ld %ld %ld %d %d %ld\n", + toi_result, + toi_bkd.toi_action, + toi_bkd.toi_debug_state, + toi_bkd.toi_default_console_level, + image_size_limit, + toi_poweroff_method); + SNPRINTF("- Overall expected compression percentage: %d.\n", + 100 - toi_expected_compression_ratio()); + len += toi_print_module_debug_info(((char *) buffer) + len, + count - len - 1); + if (toi_bkd.toi_io_time[0][1]) { + if ((io_MB_per_second(0) < 5) || (io_MB_per_second(1) < 5)) { + SNPRINTF("- I/O speed: Write %ld KB/s", + (KB((unsigned long) toi_bkd.toi_io_time[0][0]) * HZ / + toi_bkd.toi_io_time[0][1])); + if (toi_bkd.toi_io_time[1][1]) + SNPRINTF(", Read %ld KB/s", + (KB((unsigned long) + toi_bkd.toi_io_time[1][0]) * HZ / + toi_bkd.toi_io_time[1][1])); + } else { + SNPRINTF("- I/O speed: Write %ld MB/s", + (MB((unsigned long) toi_bkd.toi_io_time[0][0]) * HZ / + toi_bkd.toi_io_time[0][1])); + if (toi_bkd.toi_io_time[1][1]) + SNPRINTF(", Read %ld MB/s", + (MB((unsigned long) + toi_bkd.toi_io_time[1][0]) * HZ / + toi_bkd.toi_io_time[1][1])); + } + SNPRINTF(".\n"); + } else + SNPRINTF("- No I/O speed stats available.\n"); + SNPRINTF("- Extra pages : %ld used/%ld.\n", + extra_pd1_pages_used, extra_pd1_pages_allowance); + + for (i = 0; i < TOI_NUM_RESULT_STATES; i++) + if (test_result_state(i)) { + SNPRINTF("%s: %s.\n", first_result ? + "- Result " : + " ", + result_strings[i]); + first_result = 0; + } + if (first_result) + SNPRINTF("- Result : %s.\n", nr_hibernates ? + "Succeeded" : + "No hibernation attempts so far"); + return len; +} + +/** + * do_cleanup - cleanup after attempting to hibernate or resume + * @get_debug_info: Whether to allocate and return debugging info. + * + * Cleanup after attempting to hibernate or resume, possibly getting + * debugging info as we do so. + **/ +static void do_cleanup(int get_debug_info, int restarting) +{ + int i = 0; + char *buffer = NULL; + + if (get_debug_info) + toi_prepare_status(DONT_CLEAR_BAR, "Cleaning up..."); + + free_checksum_pages(); + + if (get_debug_info) + buffer = (char *) toi_get_zeroed_page(20, TOI_ATOMIC_GFP); + + if (buffer) + i = get_toi_debug_info(buffer, PAGE_SIZE); + + toi_free_extra_pagedir_memory(); + + pagedir1.size = 0; + pagedir2.size = 0; + set_highmem_size(pagedir1, 0); + set_highmem_size(pagedir2, 0); + + if (boot_kernel_data_buffer) { + if (!test_toi_state(TOI_BOOT_KERNEL)) + toi_free_page(37, boot_kernel_data_buffer); + boot_kernel_data_buffer = 0; + } + + clear_toi_state(TOI_BOOT_KERNEL); + thaw_processes(); + + if (test_action_state(TOI_KEEP_IMAGE) && + !test_result_state(TOI_ABORTED)) { + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, + "TuxOnIce: Not invalidating the image due " + "to Keep Image being enabled.\n"); + set_result_state(TOI_KEPT_IMAGE); + } else + if (toiActiveAllocator) + toiActiveAllocator->remove_image(); + + free_bitmaps(); + usermodehelper_enable(); + + if (test_toi_state(TOI_NOTIFIERS_PREPARE)) { + pm_notifier_call_chain(PM_POST_HIBERNATION); + clear_toi_state(TOI_NOTIFIERS_PREPARE); + } + + if (buffer && i) { + /* Printk can only handle 1023 bytes, including + * its level mangling. 
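+		 * The debug buffer is a full page, so it is emitted below in
+		 * 1023-byte pieces. Note that three pieces cover only ~3KB of
+		 * a 4KB page; any tail beyond that is silently dropped.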
*/ + for (i = 0; i < 3; i++) + printk("%s", buffer + (1023 * i)); + toi_free_page(20, (unsigned long) buffer); + } + + if (!test_action_state(TOI_LATE_CPU_HOTPLUG)) + enable_nonboot_cpus(); + + if (!restarting) + toi_cleanup_console(); + + free_attention_list(); + + if (!restarting) + toi_deactivate_storage(0); + + clear_toi_state(TOI_IGNORE_LOGLEVEL); + clear_toi_state(TOI_TRYING_TO_RESUME); + clear_toi_state(TOI_NOW_RESUMING); +} + +/** + * check_still_keeping_image - we kept an image; check whether to reuse it. + * + * We enter this routine when we have kept an image. If the user has said they + * want to still keep it, all we need to do is powerdown. If powering down + * means hibernating to ram and the power doesn't run out, we'll return 1. + * If we do power off properly or the battery runs out, we'll resume via the + * normal paths. + * + * If the user has said they want to remove the previously kept image, we + * remove it, and return 0. We'll then store a new image. + **/ +static int check_still_keeping_image(void) +{ + if (test_action_state(TOI_KEEP_IMAGE)) { + printk(KERN_INFO "Image already stored: powering down " + "immediately."); + do_toi_step(STEP_HIBERNATE_POWERDOWN); + return 1; /* Just in case we're using S3 */ + } + + printk(KERN_INFO "Invalidating previous image.\n"); + toiActiveAllocator->remove_image(); + + return 0; +} + +/** + * toi_init - prepare to hibernate to disk + * + * Initialise variables & data structures, in preparation for + * hibernating to disk. + **/ +static int toi_init(int restarting) +{ + int result, i, j; + + toi_result = 0; + + printk(KERN_INFO "Initiating a hibernation cycle.\n"); + + nr_hibernates++; + + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + toi_bkd.toi_io_time[i][j] = 0; + + if (!test_toi_state(TOI_CAN_HIBERNATE) || + allocate_bitmaps()) + return 1; + + mark_nosave_pages(); + + if (!restarting) + toi_prepare_console(); + + result = pm_notifier_call_chain(PM_HIBERNATION_PREPARE); + if (result) { + set_result_state(TOI_NOTIFIERS_PREPARE_FAILED); + return 1; + } + set_toi_state(TOI_NOTIFIERS_PREPARE); + + result = usermodehelper_disable(); + if (result) { + printk(KERN_ERR "TuxOnIce: Failed to disable usermode " + "helpers\n"); + set_result_state(TOI_USERMODE_HELPERS_ERR); + return 1; + } + + boot_kernel_data_buffer = toi_get_zeroed_page(37, TOI_ATOMIC_GFP); + if (!boot_kernel_data_buffer) { + printk(KERN_ERR "TuxOnIce: Failed to allocate " + "boot_kernel_data_buffer.\n"); + set_result_state(TOI_OUT_OF_MEMORY); + return 1; + } + + if (test_action_state(TOI_LATE_CPU_HOTPLUG) || + !disable_nonboot_cpus()) + return 1; + + set_abort_result(TOI_CPU_HOTPLUG_FAILED); + return 0; +} + +/** + * can_hibernate - perform basic 'Can we hibernate?' tests + * + * Perform basic tests that must pass if we're going to be able to hibernate: + * Can we get the pm_mutex? Is resume= valid (we need to know where to write + * the image header). + **/ +static int can_hibernate(void) +{ + if (!test_toi_state(TOI_CAN_HIBERNATE)) + toi_attempt_to_parse_resume_device(0); + + if (!test_toi_state(TOI_CAN_HIBERNATE)) { + printk(KERN_INFO "TuxOnIce: Hibernation is disabled.\n" + "This may be because you haven't put something along " + "the lines of\n\nresume=swap:/dev/hda1\n\n" + "in lilo.conf or equivalent. 
(Where /dev/hda1 is your " + "swap partition).\n"); + set_abort_result(TOI_CANT_SUSPEND); + return 0; + } + + if (strlen(alt_resume_param)) { + attempt_to_parse_alt_resume_param(); + + if (!strlen(alt_resume_param)) { + printk(KERN_INFO "Alternate resume parameter now " + "invalid. Aborting.\n"); + set_abort_result(TOI_CANT_USE_ALT_RESUME); + return 0; + } + } + + return 1; +} + +/** + * do_post_image_write - having written an image, figure out what to do next + * + * After writing an image, we might load an alternate image or power down. + * Powering down might involve hibernating to ram, in which case we also + * need to handle reloading pageset2. + **/ +static int do_post_image_write(void) +{ + /* If switching images fails, do normal powerdown */ + if (alt_resume_param[0]) + do_toi_step(STEP_RESUME_ALT_IMAGE); + + toi_power_down(); + + barrier(); + mb(); + return 0; +} + +/** + * __save_image - do the hard work of saving the image + * + * High level routine for getting the image saved. The key assumptions made + * are that processes have been frozen and sufficient memory is available. + * + * We also exit through here at resume time, coming back from toi_hibernate + * after the atomic restore. This is the reason for the toi_in_hibernate + * test. + **/ +static int __save_image(void) +{ + int temp_result, did_copy = 0; + + toi_prepare_status(DONT_CLEAR_BAR, "Starting to save the image.."); + + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, + " - Final values: %d and %d.\n", + pagedir1.size, pagedir2.size); + + toi_cond_pause(1, "About to write pagedir2."); + + temp_result = write_pageset(&pagedir2); + + if (temp_result == -1 || test_result_state(TOI_ABORTED)) + return 1; + + toi_cond_pause(1, "About to copy pageset 1."); + + if (test_result_state(TOI_ABORTED)) + return 1; + + toi_deactivate_storage(1); + + toi_prepare_status(DONT_CLEAR_BAR, "Doing atomic copy/restore."); + + toi_in_hibernate = 1; + + if (toi_go_atomic(PMSG_FREEZE, 1)) + goto Failed; + + temp_result = toi_hibernate(); + if (!temp_result) + did_copy = 1; + + /* We return here at resume time too! */ + toi_end_atomic(ATOMIC_ALL_STEPS, toi_in_hibernate, temp_result); + +Failed: + if (toi_activate_storage(1)) + panic("Failed to reactivate our storage."); + + /* Resume time? */ + if (!toi_in_hibernate) { + copyback_post(); + return 0; + } + + /* Nope. Hibernating. So, see if we can save the image... */ + + if (temp_result || test_result_state(TOI_ABORTED)) { + if (did_copy) + goto abort_reloading_pagedir_two; + else + return 1; + } + + toi_update_status(pagedir2.size, pagedir1.size + pagedir2.size, + NULL); + + if (test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + toi_cond_pause(1, "About to write pageset1."); + + toi_message(TOI_ANY_SECTION, TOI_LOW, 1, "-- Writing pageset1\n"); + + temp_result = write_pageset(&pagedir1); + + /* We didn't overwrite any memory, so no reread needs to be done. */ + if (test_action_state(TOI_TEST_FILTER_SPEED)) + return 1; + + if (temp_result == 1 || test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + toi_cond_pause(1, "About to write header."); + + if (test_result_state(TOI_ABORTED)) + goto abort_reloading_pagedir_two; + + temp_result = write_image_header(); + + if (test_action_state(TOI_TEST_BIO)) + return 1; + + if (!temp_result && !test_result_state(TOI_ABORTED)) + return 0; + +abort_reloading_pagedir_two: + temp_result = read_pageset2(1); + + /* If that failed, we're sunk. Panic! 
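+	 * (Pageset2's pages were reused as destinations for the atomic copy,
+	 * so an aborted hibernate must re-read the overwritten part of
+	 * pageset2 before tasks can safely be thawed; if even that re-read
+	 * fails, memory is inconsistent and panicking is the only safe way
+	 * out.)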
*/ + if (temp_result) + panic("Attempt to reload pagedir 2 while aborting " + "a hibernate failed."); + + return 1; +} + +static void map_ps2_pages(int enable) +{ + unsigned long pfn = 0; + + pfn = memory_bm_next_pfn(pageset2_map); + + while (pfn != BM_END_OF_MAP) { + struct page *page = pfn_to_page(pfn); + kernel_map_pages(page, 1, enable); + pfn = memory_bm_next_pfn(pageset2_map); + } +} + +/** + * do_save_image - save the image and handle the result + * + * Save the prepared image. If we fail or we're in the path returning + * from the atomic restore, cleanup. + **/ +static int do_save_image(void) +{ + int result; + map_ps2_pages(0); + result = __save_image(); + map_ps2_pages(1); + return result; +} + +/** + * do_prepare_image - try to prepare an image + * + * Seek to initialise and prepare an image to be saved. On failure, + * cleanup. + **/ +static int do_prepare_image(void) +{ + int restarting = test_result_state(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL); + + if (!restarting && toi_activate_storage(0)) + return 1; + + /* + * If kept image and still keeping image and hibernating to RAM, we will + * return 1 after hibernating and resuming (provided the power doesn't + * run out. In that case, we skip directly to cleaning up and exiting. + */ + + if (!can_hibernate() || + (test_result_state(TOI_KEPT_IMAGE) && + check_still_keeping_image())) + return 1; + + if (toi_init(restarting) && !toi_prepare_image() && + !test_result_state(TOI_ABORTED)) + return 0; + + return 1; +} + +/** + * do_check_can_resume - find out whether an image has been stored + * + * Read whether an image exists. We use the same routine as the + * image_exists sysfs entry, and just look to see whether the + * first character in the resulting buffer is a '1'. + **/ +int do_check_can_resume(void) +{ + char *buf = (char *) toi_get_zeroed_page(21, TOI_ATOMIC_GFP); + int result = 0; + + if (!buf) + return 0; + + /* Only interested in first byte, so throw away return code. */ + image_exists_read(buf, PAGE_SIZE); + + if (buf[0] == '1') + result = 1; + + toi_free_page(21, (unsigned long) buf); + return result; +} +EXPORT_SYMBOL_GPL(do_check_can_resume); + +/** + * do_load_atomic_copy - load the first part of an image, if it exists + * + * Check whether we have an image. If one exists, do sanity checking + * (possibly invalidating the image or even rebooting if the user + * requests that) before loading it into memory in preparation for the + * atomic restore. + * + * If and only if we have an image loaded and ready to restore, we return 1. + **/ +static int do_load_atomic_copy(void) +{ + int read_image_result = 0; + + if (sizeof(swp_entry_t) != sizeof(long)) { + printk(KERN_WARNING "TuxOnIce: The size of swp_entry_t != size" + " of long. Please report this!\n"); + return 1; + } + + if (!resume_file[0]) + printk(KERN_WARNING "TuxOnIce: " + "You need to use a resume= command line parameter to " + "tell TuxOnIce where to look for an image.\n"); + + toi_activate_storage(0); + + if (!(test_toi_state(TOI_RESUME_DEVICE_OK)) && + !toi_attempt_to_parse_resume_device(0)) { + /* + * Without a usable storage device we can do nothing - + * even if noresume is given + */ + + if (!toiNumAllocators) + printk(KERN_ALERT "TuxOnIce: " + "No storage allocators have been registered.\n"); + else + printk(KERN_ALERT "TuxOnIce: " + "Missing or invalid storage location " + "(resume= parameter). 
Please correct and " + "rerun lilo (or equivalent) before " + "hibernating.\n"); + toi_deactivate_storage(0); + return 1; + } + + if (allocate_bitmaps()) + return 1; + + read_image_result = read_pageset1(); /* non fatal error ignored */ + + if (test_toi_state(TOI_NORESUME_SPECIFIED)) + clear_toi_state(TOI_NORESUME_SPECIFIED); + + toi_deactivate_storage(0); + + if (read_image_result) + return 1; + + return 0; +} + +/** + * prepare_restore_load_alt_image - save & restore alt image variables + * + * Save and restore the pageset1 maps, when loading an alternate image. + **/ +static void prepare_restore_load_alt_image(int prepare) +{ + static struct memory_bitmap *pageset1_map_save, *pageset1_copy_map_save; + + if (prepare) { + pageset1_map_save = pageset1_map; + pageset1_map = NULL; + pageset1_copy_map_save = pageset1_copy_map; + pageset1_copy_map = NULL; + set_toi_state(TOI_LOADING_ALT_IMAGE); + toi_reset_alt_image_pageset2_pfn(); + } else { + memory_bm_free(pageset1_map, 0); + pageset1_map = pageset1_map_save; + memory_bm_free(pageset1_copy_map, 0); + pageset1_copy_map = pageset1_copy_map_save; + clear_toi_state(TOI_NOW_RESUMING); + clear_toi_state(TOI_LOADING_ALT_IMAGE); + } +} + +/** + * do_toi_step - perform a step in hibernating or resuming + * + * Perform a step in hibernating or resuming an image. This abstraction + * is in preparation for implementing cluster support, and perhaps replacing + * uswsusp too (haven't looked whether that's possible yet). + **/ +int do_toi_step(int step) +{ + switch (step) { + case STEP_HIBERNATE_PREPARE_IMAGE: + return do_prepare_image(); + case STEP_HIBERNATE_SAVE_IMAGE: + return do_save_image(); + case STEP_HIBERNATE_POWERDOWN: + return do_post_image_write(); + case STEP_RESUME_CAN_RESUME: + return do_check_can_resume(); + case STEP_RESUME_LOAD_PS1: + return do_load_atomic_copy(); + case STEP_RESUME_DO_RESTORE: + /* + * If we succeed, this doesn't return. + * Instead, we return from do_save_image() in the + * hibernated kernel. + */ + return toi_atomic_restore(); + case STEP_RESUME_ALT_IMAGE: + printk(KERN_INFO "Trying to resume alternate image.\n"); + toi_in_hibernate = 0; + save_restore_alt_param(SAVE, NOQUIET); + prepare_restore_load_alt_image(1); + if (!do_check_can_resume()) { + printk(KERN_INFO "Nothing to resume from.\n"); + goto out; + } + if (!do_load_atomic_copy()) + toi_atomic_restore(); + + printk(KERN_INFO "Failed to load image.\n"); +out: + prepare_restore_load_alt_image(0); + save_restore_alt_param(RESTORE, NOQUIET); + break; + case STEP_CLEANUP: + do_cleanup(1, 0); + break; + case STEP_QUIET_CLEANUP: + do_cleanup(0, 0); + break; + } + + return 0; +} +EXPORT_SYMBOL_GPL(do_toi_step); + +/* -- Functions for kickstarting a hibernate or resume --- */ + +/** + * __toi_try_resume - try to do the steps in resuming + * + * Check if we have an image and if so try to resume. Clear the status + * flags too. + **/ +void __toi_try_resume(void) +{ + set_toi_state(TOI_TRYING_TO_RESUME); + resume_attempted = 1; + + current->flags |= PF_MEMALLOC; + + if (do_toi_step(STEP_RESUME_CAN_RESUME) && + !do_toi_step(STEP_RESUME_LOAD_PS1)) + do_toi_step(STEP_RESUME_DO_RESTORE); + + do_cleanup(0, 0); + + current->flags &= ~PF_MEMALLOC; + + clear_toi_state(TOI_IGNORE_LOGLEVEL); + clear_toi_state(TOI_TRYING_TO_RESUME); + clear_toi_state(TOI_NOW_RESUMING); +} + +/** + * _toi_try_resume - wrapper calling __toi_try_resume from do_mounts + * + * Wrapper for when __toi_try_resume is called from init/do_mounts.c, + * rather than from echo > /sys/power/tuxonice/do_resume. 
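+ *
+ * In other words, this is the boot-time path: we get here when the kernel
+ * was booted with something like resume=swap:/dev/hda1 (syntax as in
+ * can_hibernate() above) and init/do_mounts.c checks for an image before
+ * mounting root.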
+ **/
+static void _toi_try_resume(void)
+{
+	resume_attempted = 1;
+
+	/*
+	 * There's a comment in kernel/power/disk.c that indicates
+	 * we should be able to use mutex_lock_nested below. That
+	 * doesn't seem to cut it, though, so let's just turn lockdep
+	 * off for now.
+	 */
+	lockdep_off();
+
+	if (toi_start_anything(SYSFS_RESUMING))
+		goto out;
+
+	__toi_try_resume();
+
+	/*
+	 * For initramfs, we have to clear the boot time
+	 * flag after trying to resume
+	 */
+	clear_toi_state(TOI_BOOT_TIME);
+
+	toi_finish_anything(SYSFS_RESUMING);
+out:
+	lockdep_on();
+}
+
+/**
+ * _toi_try_hibernate - try to start a hibernation cycle
+ *
+ * Start a hibernation cycle, coming in from either
+ *
+ *	echo > /sys/power/tuxonice/do_suspend
+ *
+ * or
+ *
+ *	echo disk > /sys/power/state
+ *
+ * In the latter case, we come in without pm_mutex taken; in the former,
+ * it has already been taken (see the mutex_is_locked() test on
+ * tuxonice_in_use below).
+ **/
+int _toi_try_hibernate(void)
+{
+	int result = 0, sys_power_disk = 0, retries = 0;
+
+	if (!mutex_is_locked(&tuxonice_in_use)) {
+		/* Came in via /sys/power/disk */
+		if (toi_start_anything(SYSFS_HIBERNATING))
+			return -EBUSY;
+		sys_power_disk = 1;
+	}
+
+	current->flags |= PF_MEMALLOC;
+
+	if (test_toi_state(TOI_CLUSTER_MODE)) {
+		toi_initiate_cluster_hibernate();
+		goto out;
+	}
+
+prepare:
+	result = do_toi_step(STEP_HIBERNATE_PREPARE_IMAGE);
+
+	if (result || test_action_state(TOI_FREEZER_TEST))
+		goto out;
+
+	result = do_toi_step(STEP_HIBERNATE_SAVE_IMAGE);
+
+	if (test_result_state(TOI_EXTRA_PAGES_ALLOW_TOO_SMALL)) {
+		if (retries < 2) {
+			do_cleanup(0, 1);
+			retries++;
+			clear_result_state(TOI_ABORTED);
+			extra_pd1_pages_allowance = extra_pd1_pages_used + 500;
+			printk(KERN_INFO "Automatically adjusting the extra"
+				" pages allowance to %ld and restarting.\n",
+				extra_pd1_pages_allowance);
+			goto prepare;
+		}
+
+		printk(KERN_INFO "Adjusted extra pages allowance twice and "
+			"still couldn't hibernate successfully. Giving up.\n");
+	}
+
+	/* This code runs at resume time too! */
+	if (!result && toi_in_hibernate)
+		result = do_toi_step(STEP_HIBERNATE_POWERDOWN);
+out:
+	do_cleanup(1, 0);
+	current->flags &= ~PF_MEMALLOC;
+
+	if (sys_power_disk)
+		toi_finish_anything(SYSFS_HIBERNATING);
+
+	return result;
+}
+
+/*
+ * channel_no: If !0, -c<channel_no> is added to the arguments (userui).
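+ *
+ * For example (program name illustrative), with channel_no == 1, running
+ * "/sbin/mytoiui --text" through this function execs
+ * argv[] = { "/sbin/mytoiui", "--text", "-c1", NULL }. Up to six
+ * whitespace-separated arguments are parsed from the command string.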
+ */ +int toi_launch_userspace_program(char *command, int channel_no, + enum umh_wait wait, int debug) +{ + int retval; + static char *envp[] = { + "HOME=/", + "TERM=linux", + "PATH=/sbin:/usr/sbin:/bin:/usr/bin", + NULL }; + static char *argv[] = + { NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL }; + char *channel = NULL; + int arg = 0, size; + char test_read[255]; + char *orig_posn = command; + + if (!strlen(orig_posn)) + return 1; + + if (channel_no) { + channel = toi_kzalloc(4, 6, GFP_KERNEL); + if (!channel) { + printk(KERN_INFO "Failed to allocate memory in " + "preparing to launch userspace program.\n"); + return 1; + } + } + + /* Up to 6 args supported */ + while (arg < 6) { + sscanf(orig_posn, "%s", test_read); + size = strlen(test_read); + if (!(size)) + break; + argv[arg] = toi_kzalloc(5, size + 1, TOI_ATOMIC_GFP); + strcpy(argv[arg], test_read); + orig_posn += size + 1; + *test_read = 0; + arg++; + } + + if (channel_no) { + sprintf(channel, "-c%d", channel_no); + argv[arg] = channel; + } else + arg--; + + if (debug) { + argv[++arg] = toi_kzalloc(5, 8, TOI_ATOMIC_GFP); + strcpy(argv[arg], "--debug"); + } + + retval = call_usermodehelper(argv[0], argv, envp, wait); + + /* + * If the program reports an error, retval = 256. Don't complain + * about that here. + */ + if (retval && retval != 256) + printk(KERN_ERR "Failed to launch userspace program '%s': " + "Error %d\n", command, retval); + + { + int i; + for (i = 0; i < arg; i++) + if (argv[i] && argv[i] != channel) + toi_kfree(5, argv[i]); + } + + toi_kfree(4, channel); + + return retval; +} + +/* + * This array contains entries that are automatically registered at + * boot. Modules and the console code register their own entries separately. + */ +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_LONG("extra_pages_allowance", SYSFS_RW, + &extra_pd1_pages_allowance, 0, LONG_MAX, 0), + SYSFS_CUSTOM("image_exists", SYSFS_RW, image_exists_read, + image_exists_write, SYSFS_NEEDS_SM_FOR_BOTH, NULL), + SYSFS_STRING("resume", SYSFS_RW, resume_file, 255, + SYSFS_NEEDS_SM_FOR_WRITE, + attempt_to_parse_resume_device2), + SYSFS_STRING("alt_resume_param", SYSFS_RW, alt_resume_param, 255, + SYSFS_NEEDS_SM_FOR_WRITE, + attempt_to_parse_alt_resume_param), + SYSFS_CUSTOM("debug_info", SYSFS_READONLY, get_toi_debug_info, NULL, 0, + NULL), + SYSFS_BIT("ignore_rootfs", SYSFS_RW, &toi_bkd.toi_action, + TOI_IGNORE_ROOTFS, 0), + SYSFS_INT("image_size_limit", SYSFS_RW, &image_size_limit, -2, + INT_MAX, 0, NULL), + SYSFS_UL("last_result", SYSFS_RW, &toi_result, 0, 0, 0), + SYSFS_BIT("no_multithreaded_io", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_MULTITHREADED_IO, 0), + SYSFS_BIT("no_flusher_thread", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_FLUSHER_THREAD, 0), + SYSFS_BIT("full_pageset2", SYSFS_RW, &toi_bkd.toi_action, + TOI_PAGESET2_FULL, 0), + SYSFS_BIT("reboot", SYSFS_RW, &toi_bkd.toi_action, TOI_REBOOT, 0), + SYSFS_BIT("replace_swsusp", SYSFS_RW, &toi_bkd.toi_action, + TOI_REPLACE_SWSUSP, 0), + SYSFS_STRING("resume_commandline", SYSFS_RW, + toi_bkd.toi_nosave_commandline, COMMAND_LINE_SIZE, 0, + NULL), + SYSFS_STRING("version", SYSFS_READONLY, TOI_CORE_VERSION, 0, 0, NULL), + SYSFS_BIT("no_load_direct", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_DIRECT_LOAD, 0), + SYSFS_BIT("freezer_test", SYSFS_RW, &toi_bkd.toi_action, + TOI_FREEZER_TEST, 0), + SYSFS_BIT("test_bio", SYSFS_RW, &toi_bkd.toi_action, TOI_TEST_BIO, 0), + SYSFS_BIT("test_filter_speed", SYSFS_RW, &toi_bkd.toi_action, + TOI_TEST_FILTER_SPEED, 0), + SYSFS_BIT("no_pageset2", SYSFS_RW, 
&toi_bkd.toi_action, + TOI_NO_PAGESET2, 0), + SYSFS_BIT("no_pageset2_if_unneeded", SYSFS_RW, &toi_bkd.toi_action, + TOI_NO_PS2_IF_UNNEEDED, 0), + SYSFS_BIT("late_cpu_hotplug", SYSFS_RW, &toi_bkd.toi_action, + TOI_LATE_CPU_HOTPLUG, 0), + SYSFS_STRING("pre_hibernate_command", SYSFS_RW, pre_hibernate_command, + 0, 255, NULL), + SYSFS_STRING("post_hibernate_command", SYSFS_RW, post_hibernate_command, + 0, 255, NULL), +#ifdef CONFIG_TOI_KEEP_IMAGE + SYSFS_BIT("keep_image", SYSFS_RW , &toi_bkd.toi_action, TOI_KEEP_IMAGE, + 0), +#endif +}; + +static struct toi_core_fns my_fns = { + .get_nonconflicting_page = __toi_get_nonconflicting_page, + .post_context_save = __toi_post_context_save, + .try_hibernate = _toi_try_hibernate, + .try_resume = _toi_try_resume, +}; + +/** + * core_load - initialisation of TuxOnIce core + * + * Initialise the core, beginning with sysfs. Checksum and so on are part of + * the core, but have their own initialisation routines because they either + * aren't compiled in all the time or have their own subdirectories. + **/ +static __init int core_load(void) +{ + int i, + numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + printk(KERN_INFO "TuxOnIce " TOI_CORE_VERSION + " (http://tuxonice.net)\n"); + strncpy(pre_hibernate_command, CONFIG_TOI_DEFAULT_PRE_HIBERNATE, 255); + strncpy(post_hibernate_command, CONFIG_TOI_DEFAULT_POST_HIBERNATE, 255); + + if (toi_sysfs_init()) + return 1; + + for (i = 0; i < numfiles; i++) + toi_register_sysfs_file(tuxonice_kobj, &sysfs_params[i]); + + toi_core_fns = &my_fns; + + if (toi_alloc_init()) + return 1; + if (toi_checksum_init()) + return 1; + if (toi_usm_init()) + return 1; + if (toi_ui_init()) + return 1; + if (toi_poweroff_init()) + return 1; + if (toi_cluster_init()) + return 1; + + return 0; +} + +#ifdef MODULE +/** + * core_unload: Prepare to unload the core code. + **/ +static __exit void core_unload(void) +{ + int i, + numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + toi_alloc_exit(); + toi_checksum_exit(); + toi_poweroff_exit(); + toi_ui_exit(); + toi_usm_exit(); + toi_cluster_exit(); + + for (i = 0; i < numfiles; i++) + toi_unregister_sysfs_file(tuxonice_kobj, &sysfs_params[i]); + + toi_core_fns = NULL; + + toi_sysfs_exit(); +} +MODULE_LICENSE("GPL"); +module_init(core_load); +module_exit(core_unload); +#else +late_initcall(core_load); +#endif diff --git a/kernel/power/tuxonice_io.c b/kernel/power/tuxonice_io.c new file mode 100644 index 0000000..1207ad3 --- /dev/null +++ b/kernel/power/tuxonice_io.c @@ -0,0 +1,1499 @@ +/* + * kernel/power/tuxonice_io.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2002-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * It contains high level IO routines for hibernating. 
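+ * (reading and writing the pagesets, the image header and the per-module
+ * configuration data). The page I/O itself is delegated to the active
+ * allocator via its read_page/write_page and rw_header_chunk hooks.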
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_pageflags.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_storage.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_extent.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_builtin.h" +#include "tuxonice_checksum.h" +#include "tuxonice_alloc.h" +char alt_resume_param[256]; + +/* Variables shared between threads and updated under the mutex */ +static int io_write, io_finish_at, io_base, io_barmax, io_pageset, io_result; +static int io_index, io_nextupdate, io_pc, io_pc_step; +static DEFINE_MUTEX(io_mutex); +static DEFINE_PER_CPU(struct page *, last_sought); +static DEFINE_PER_CPU(struct page *, last_high_page); +static DEFINE_PER_CPU(char *, checksum_locn); +static DEFINE_PER_CPU(struct pbe *, last_low_page); +static atomic_t io_count; +atomic_t toi_io_workers; +EXPORT_SYMBOL_GPL(toi_io_workers); + +DECLARE_WAIT_QUEUE_HEAD(toi_io_queue_flusher); +EXPORT_SYMBOL_GPL(toi_io_queue_flusher); + +int toi_bio_queue_flusher_should_finish; +EXPORT_SYMBOL_GPL(toi_bio_queue_flusher_should_finish); + +/* Indicates that this thread should be used for checking throughput */ +#define MONITOR ((void *) 1) + +/** + * toi_attempt_to_parse_resume_device - determine if we can hibernate + * + * Can we hibernate, using the current resume= parameter? + **/ +int toi_attempt_to_parse_resume_device(int quiet) +{ + struct list_head *Allocator; + struct toi_module_ops *thisAllocator; + int result, returning = 0; + + if (toi_activate_storage(0)) + return 0; + + toiActiveAllocator = NULL; + clear_toi_state(TOI_RESUME_DEVICE_OK); + clear_toi_state(TOI_CAN_RESUME); + clear_result_state(TOI_ABORTED); + + if (!toiNumAllocators) { + if (!quiet) + printk(KERN_INFO "TuxOnIce: No storage allocators have " + "been registered. Hibernating will be " + "disabled.\n"); + goto cleanup; + } + + if (!resume_file[0]) { + if (!quiet) + printk(KERN_INFO "TuxOnIce: Resume= parameter is empty." + " Hibernating will be disabled.\n"); + goto cleanup; + } + + list_for_each(Allocator, &toiAllocators) { + thisAllocator = list_entry(Allocator, struct toi_module_ops, + type_list); + + /* + * Not sure why you'd want to disable an allocator, but + * we should honour the flag if we're providing it + */ + if (!thisAllocator->enabled) + continue; + + result = thisAllocator->parse_sig_location( + resume_file, (toiNumAllocators == 1), + quiet); + + switch (result) { + case -EINVAL: + /* For this allocator, but not a valid + * configuration. Error already printed. */ + goto cleanup; + + case 0: + /* For this allocator and valid. */ + toiActiveAllocator = thisAllocator; + + set_toi_state(TOI_RESUME_DEVICE_OK); + set_toi_state(TOI_CAN_RESUME); + returning = 1; + goto cleanup; + } + } + if (!quiet) + printk(KERN_INFO "TuxOnIce: No matching enabled allocator " + "found. 
Resuming disabled.\n"); +cleanup: + toi_deactivate_storage(0); + return returning; +} +EXPORT_SYMBOL_GPL(toi_attempt_to_parse_resume_device); + +void attempt_to_parse_resume_device2(void) +{ + toi_prepare_usm(); + toi_attempt_to_parse_resume_device(0); + toi_cleanup_usm(); +} +EXPORT_SYMBOL_GPL(attempt_to_parse_resume_device2); + +void save_restore_alt_param(int replace, int quiet) +{ + static char resume_param_save[255]; + static unsigned long toi_state_save; + + if (replace) { + toi_state_save = toi_state; + strcpy(resume_param_save, resume_file); + strcpy(resume_file, alt_resume_param); + } else { + strcpy(resume_file, resume_param_save); + toi_state = toi_state_save; + } + toi_attempt_to_parse_resume_device(quiet); +} + +void attempt_to_parse_alt_resume_param(void) +{ + int ok = 0; + + /* Temporarily set resume_param to the poweroff value */ + if (!strlen(alt_resume_param)) + return; + + printk(KERN_INFO "=== Trying Poweroff Resume2 ===\n"); + save_restore_alt_param(SAVE, NOQUIET); + if (test_toi_state(TOI_CAN_RESUME)) + ok = 1; + + printk(KERN_INFO "=== Done ===\n"); + save_restore_alt_param(RESTORE, QUIET); + + /* If not ok, clear the string */ + if (ok) + return; + + printk(KERN_INFO "Can't resume from that location; clearing " + "alt_resume_param.\n"); + alt_resume_param[0] = '\0'; +} + +/** + * noresume_reset_modules - reset data structures in case of non resuming + * + * When we read the start of an image, modules (and especially the + * active allocator) might need to reset data structures if we + * decide to remove the image rather than resuming from it. + **/ +static void noresume_reset_modules(void) +{ + struct toi_module_ops *this_filter; + + list_for_each_entry(this_filter, &toi_filters, type_list) + if (this_filter->noresume_reset) + this_filter->noresume_reset(); + + if (toiActiveAllocator && toiActiveAllocator->noresume_reset) + toiActiveAllocator->noresume_reset(); +} + +/** + * fill_toi_header - fill the hibernate header structure + * @struct toi_header: Header data structure to be filled. + **/ +static int fill_toi_header(struct toi_header *sh) +{ + int i, error; + + error = init_header((struct swsusp_info *) sh); + if (error) + return error; + + sh->pagedir = pagedir1; + sh->pageset_2_size = pagedir2.size; + sh->param0 = toi_result; + sh->param1 = toi_bkd.toi_action; + sh->param2 = toi_bkd.toi_debug_state; + sh->param3 = toi_bkd.toi_default_console_level; + sh->root_fs = current->fs->root.mnt->mnt_sb->s_dev; + for (i = 0; i < 4; i++) + sh->io_time[i/2][i%2] = toi_bkd.toi_io_time[i/2][i%2]; + sh->bkd = boot_kernel_data_buffer; + return 0; +} + +/** + * rw_init_modules - initialize modules + * @rw: Whether we are reading of writing an image. + * @which: Section of the image being processed. + * + * Iterate over modules, preparing the ones that will be used to read or write + * data. 
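+ *
+ * Filters are initialised first, then the active allocator, then any
+ * remaining modules. Returns 1 (after flagging an abort) as soon as one
+ * initialisation fails, and 0 on success.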
+ **/ +static int rw_init_modules(int rw, int which) +{ + struct toi_module_ops *this_module; + /* Initialise page transformers */ + list_for_each_entry(this_module, &toi_filters, type_list) { + if (!this_module->enabled) + continue; + if (this_module->rw_init && this_module->rw_init(rw, which)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialize the %s filter.", + this_module->name); + return 1; + } + } + + /* Initialise allocator */ + if (toiActiveAllocator->rw_init(rw, which)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialise the allocator."); + return 1; + } + + /* Initialise other modules */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type == FILTER_MODULE || + this_module->type == WRITER_MODULE) + continue; + if (this_module->rw_init && this_module->rw_init(rw, which)) { + set_abort_result(TOI_FAILED_MODULE_INIT); + printk(KERN_INFO "Setting aborted flag due to module " + "init failure.\n"); + return 1; + } + } + + return 0; +} + +/** + * rw_cleanup_modules - cleanup modules + * @rw: Whether we are reading of writing an image. + * + * Cleanup components after reading or writing a set of pages. + * Only the allocator may fail. + **/ +static int rw_cleanup_modules(int rw) +{ + struct toi_module_ops *this_module; + int result = 0; + + /* Cleanup other modules */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + this_module->type == FILTER_MODULE || + this_module->type == WRITER_MODULE) + continue; + if (this_module->rw_cleanup) + result |= this_module->rw_cleanup(rw); + } + + /* Flush data and cleanup */ + list_for_each_entry(this_module, &toi_filters, type_list) { + if (!this_module->enabled) + continue; + if (this_module->rw_cleanup) + result |= this_module->rw_cleanup(rw); + } + + result |= toiActiveAllocator->rw_cleanup(rw); + + return result; +} + +static struct page *copy_page_from_orig_page(struct page *orig_page) +{ + int is_high = PageHighMem(orig_page), index, min, max; + struct page *high_page = NULL, + **my_last_high_page = &__get_cpu_var(last_high_page), + **my_last_sought = &__get_cpu_var(last_sought); + struct pbe *this, **my_last_low_page = &__get_cpu_var(last_low_page); + void *compare; + + if (is_high) { + if (*my_last_sought && *my_last_high_page && + *my_last_sought < orig_page) + high_page = *my_last_high_page; + else + high_page = (struct page *) restore_highmem_pblist; + this = (struct pbe *) kmap(high_page); + compare = orig_page; + } else { + if (*my_last_sought && *my_last_low_page && + *my_last_sought < orig_page) + this = *my_last_low_page; + else + this = restore_pblist; + compare = page_address(orig_page); + } + + *my_last_sought = orig_page; + + /* Locate page containing pbe */ + while (this[PBES_PER_PAGE - 1].next && + this[PBES_PER_PAGE - 1].orig_address < compare) { + if (is_high) { + struct page *next_high_page = (struct page *) + this[PBES_PER_PAGE - 1].next; + kunmap(high_page); + this = kmap(next_high_page); + high_page = next_high_page; + } else + this = this[PBES_PER_PAGE - 1].next; + } + + /* Do a binary search within the page */ + min = 0; + max = PBES_PER_PAGE; + index = PBES_PER_PAGE / 2; + while (max - min) { + if (!this[index].orig_address || + this[index].orig_address > compare) + max = index; + else if (this[index].orig_address == compare) { + if (is_high) { + struct page *page = this[index].address; + *my_last_high_page = high_page; + kunmap(high_page); + return page; + } + *my_last_low_page = 
this; + return virt_to_page(this[index].address); + } else + min = index; + index = ((max + min) / 2); + }; + + if (is_high) + kunmap(high_page); + + abort_hibernate(TOI_FAILED_IO, "Failed to get destination page for" + " orig page %p. This[min].orig_address=%p.\n", orig_page, + this[index].orig_address); + return NULL; +} + +/** + * worker_rw_loop - main loop to read/write pages + * + * The main I/O loop for reading or writing pages. The io_map bitmap is used to + * track the pages to read/write. + * If we are reading, the pages are loaded to their final (mapped) pfn. + **/ +static int worker_rw_loop(void *data) +{ + unsigned long data_pfn, write_pfn, next_jiffies = jiffies + HZ / 2, + jif_index = 1; + int result = 0, my_io_index = 0, last_worker; + struct toi_module_ops *first_filter = toi_get_next_filter(NULL); + struct page *buffer = toi_alloc_page(28, TOI_ATOMIC_GFP); + + current->flags |= PF_NOFREEZE; + + atomic_inc(&toi_io_workers); + mutex_lock(&io_mutex); + + do { + unsigned int buf_size; + int was_present = 1; + + if (data && jiffies > next_jiffies) { + next_jiffies += HZ / 2; + if (toiActiveAllocator->update_throughput_throttle) + toiActiveAllocator->update_throughput_throttle( + jif_index); + jif_index++; + } + + /* + * What page to use? If reading, don't know yet which page's + * data will be read, so always use the buffer. If writing, + * use the copy (Pageset1) or original page (Pageset2), but + * always write the pfn of the original page. + */ + if (io_write) { + struct page *page; + char **my_checksum_locn = &__get_cpu_var(checksum_locn); + + data_pfn = memory_bm_next_pfn(io_map); + + /* Another thread could have beaten us to it. */ + if (data_pfn == BM_END_OF_MAP) { + if (atomic_read(&io_count)) { + printk(KERN_INFO "Ran out of pfns but " + "io_count is still %d.\n", + atomic_read(&io_count)); + BUG(); + } + break; + } + + my_io_index = io_finish_at - + atomic_sub_return(1, &io_count); + + memory_bm_clear_bit(io_map, data_pfn); + page = pfn_to_page(data_pfn); + + was_present = kernel_page_present(page); + if (!was_present) + kernel_map_pages(page, 1, 1); + + if (io_pageset == 1) + write_pfn = memory_bm_next_pfn(pageset1_map); + else { + write_pfn = data_pfn; + *my_checksum_locn = + tuxonice_get_next_checksum(); + } + + mutex_unlock(&io_mutex); + + if (io_pageset == 2 && + tuxonice_calc_checksum(page, *my_checksum_locn)) + return 1; + + result = first_filter->write_page(write_pfn, page, + PAGE_SIZE); + + if (!was_present) + kernel_map_pages(page, 1, 0); + } else { /* Reading */ + my_io_index = io_finish_at - + atomic_sub_return(1, &io_count); + mutex_unlock(&io_mutex); + + /* + * Are we aborting? If so, don't submit any more I/O as + * resetting the resume_attempted flag (from ui.c) will + * clear the bdev flags, making this thread oops. + */ + if (unlikely(test_toi_state(TOI_STOP_RESUME))) { + atomic_dec(&toi_io_workers); + if (!atomic_read(&toi_io_workers)) + set_toi_state(TOI_IO_STOPPED); + while (1) + schedule(); + } + + /* See toi_bio_read_page in tuxonice_block_io.c: + * read the next page in the image. 
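+			 * The pfn the data belongs to is returned via
+			 * write_pfn, and buf_size lets us verify that the
+			 * filter pipeline produced a whole page.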
+ */ + result = first_filter->read_page(&write_pfn, buffer, + &buf_size); + if (buf_size != PAGE_SIZE) { + abort_hibernate(TOI_FAILED_IO, + "I/O pipeline returned %d bytes instead" + " of %ud.\n", buf_size, PAGE_SIZE); + mutex_lock(&io_mutex); + break; + } + } + + if (result) { + io_result = result; + if (io_write) { + printk(KERN_INFO "Write chunk returned %d.\n", + result); + abort_hibernate(TOI_FAILED_IO, + "Failed to write a chunk of the " + "image."); + mutex_lock(&io_mutex); + break; + } + panic("Read chunk returned (%d)", result); + } + + /* + * Discard reads of resaved pages while reading ps2 + * and unwanted pages while rereading ps2 when aborting. + */ + if (!io_write && !PageResave(pfn_to_page(write_pfn))) { + struct page *final_page = pfn_to_page(write_pfn), + *copy_page = final_page; + char *virt, *buffer_virt; + + if (io_pageset == 1 && !load_direct(final_page)) { + copy_page = + copy_page_from_orig_page(final_page); + BUG_ON(!copy_page); + } + + if (memory_bm_test_bit(io_map, write_pfn)) { + virt = kmap(copy_page); + buffer_virt = kmap(buffer); + was_present = kernel_page_present(copy_page); + if (!was_present) + kernel_map_pages(copy_page, 1, 1); + memcpy(virt, buffer_virt, PAGE_SIZE); + if (!was_present) + kernel_map_pages(copy_page, 1, 0); + kunmap(copy_page); + kunmap(buffer); + memory_bm_clear_bit(io_map, write_pfn); + } else { + mutex_lock(&io_mutex); + atomic_inc(&io_count); + mutex_unlock(&io_mutex); + } + } + + if (my_io_index + io_base == io_nextupdate) + io_nextupdate = toi_update_status(my_io_index + + io_base, io_barmax, " %d/%d MB ", + MB(io_base+my_io_index+1), MB(io_barmax)); + + if (my_io_index == io_pc) { + printk("%s%d%%...", io_pc_step == 1 ? KERN_ERR : "", + 20 * io_pc_step); + io_pc_step++; + io_pc = io_finish_at * io_pc_step / 5; + } + + toi_cond_pause(0, NULL); + + /* + * Subtle: If there's less I/O still to be done than threads + * running, quit. This stops us doing I/O beyond the end of + * the image when reading. + * + * Possible race condition. Two threads could do the test at + * the same time; one should exit and one should continue. + * Therefore we take the mutex before comparing and exiting. + */ + + mutex_lock(&io_mutex); + + } while (atomic_read(&io_count) >= atomic_read(&toi_io_workers) && + !(io_write && test_result_state(TOI_ABORTED))); + + last_worker = atomic_dec_and_test(&toi_io_workers); + mutex_unlock(&io_mutex); + + if (last_worker) { + toi_bio_queue_flusher_should_finish = 1; + wake_up(&toi_io_queue_flusher); + result = toiActiveAllocator->finish_all_io(); + } + + toi__free_page(28, buffer); + + return result; +} + +static int start_other_threads(void) +{ + int cpu, num_started = 0; + struct task_struct *p; + + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + + p = kthread_create(worker_rw_loop, num_started ? NULL : MONITOR, + "ktoi_io/%d", cpu); + if (IS_ERR(p)) { + printk(KERN_ERR "ktoi_io for %i failed\n", cpu); + continue; + } + kthread_bind(p, cpu); + p->flags |= PF_MEMALLOC; + wake_up_process(p); + num_started++; + } + + return num_started; +} + +/** + * do_rw_loop - main highlevel function for reading or writing pages + * + * Create the io_map bitmap and call worker_rw_loop to perform I/O operations. 
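+ *
+ * @write: Whether we are writing (non-zero) or reading.
+ * @finish_at: Number of pages to transfer.
+ * @pageflags: Bitmap identifying the pages to transfer.
+ * @base: Pages already transferred in earlier calls (for progress display).
+ * @barmax: Maximum value for the progress bar.
+ * @pageset: Which pageset is being processed.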
+ **/ +static int do_rw_loop(int write, int finish_at, struct memory_bitmap *pageflags, + int base, int barmax, int pageset) +{ + int index = 0, cpu, num_other_threads = 0, result = 0; + unsigned long pfn; + + if (!finish_at) + return 0; + + io_write = write; + io_finish_at = finish_at; + io_base = base; + io_barmax = barmax; + io_pageset = pageset; + io_index = 0; + io_pc = io_finish_at / 5; + io_pc_step = 1; + io_result = 0; + io_nextupdate = base + 1; + toi_bio_queue_flusher_should_finish = 0; + + for_each_online_cpu(cpu) { + per_cpu(last_sought, cpu) = NULL; + per_cpu(last_low_page, cpu) = NULL; + per_cpu(last_high_page, cpu) = NULL; + } + + /* Ensure all bits clear */ + memory_bm_clear(io_map); + + /* Set the bits for the pages to write */ + memory_bm_position_reset(pageflags); + + pfn = memory_bm_next_pfn(pageflags); + + while (pfn != BM_END_OF_MAP && index < finish_at) { + memory_bm_set_bit(io_map, pfn); + pfn = memory_bm_next_pfn(pageflags); + index++; + } + + BUG_ON(index < finish_at); + + atomic_set(&io_count, finish_at); + + memory_bm_position_reset(pageset1_map); + + clear_toi_state(TOI_IO_STOPPED); + memory_bm_position_reset(io_map); + + if (!test_action_state(TOI_NO_MULTITHREADED_IO)) + num_other_threads = start_other_threads(); + + if (!num_other_threads || !toiActiveAllocator->io_flusher || + test_action_state(TOI_NO_FLUSHER_THREAD)) + worker_rw_loop(num_other_threads ? NULL : MONITOR); + else + result = toiActiveAllocator->io_flusher(write); + + while (atomic_read(&toi_io_workers)) + schedule(); + + set_toi_state(TOI_IO_STOPPED); + if (unlikely(test_toi_state(TOI_STOP_RESUME))) { + while (1) + schedule(); + } + + if (!io_result && !result && !test_result_state(TOI_ABORTED)) { + unsigned long next; + + printk("done.\n"); + + toi_update_status(io_base + io_finish_at, io_barmax, + " %d/%d MB ", + MB(io_base + io_finish_at), MB(io_barmax)); + + memory_bm_position_reset(io_map); + next = memory_bm_next_pfn(io_map); + if (next != BM_END_OF_MAP) { + printk(KERN_INFO "Finished I/O loop but still work to " + "do?\nFinish at = %d. io_count = %d.\n", + finish_at, atomic_read(&io_count)); + printk(KERN_INFO "I/O bitmap still records work to do." + "%ld.\n", next); + BUG(); + } + } + + return io_result ? io_result : result; +} + +/** + * write_pageset - write a pageset to disk. + * @pagedir: Which pagedir to write. + * + * Returns: + * Zero on success or -1 on failure. + **/ +int write_pageset(struct pagedir *pagedir) +{ + int finish_at, base = 0, start_time, end_time; + int barmax = pagedir1.size + pagedir2.size; + long error = 0; + struct memory_bitmap *pageflags; + + /* + * Even if there is nothing to read or write, the allocator + * may need the init/cleanup for it's housekeeping. (eg: + * Pageset1 may start where pageset2 ends when writing). 
+ */ + finish_at = pagedir->size; + + if (pagedir->id == 1) { + toi_prepare_status(DONT_CLEAR_BAR, + "Writing kernel & process data..."); + base = pagedir2.size; + if (test_action_state(TOI_TEST_FILTER_SPEED) || + test_action_state(TOI_TEST_BIO)) + pageflags = pageset1_map; + else + pageflags = pageset1_copy_map; + } else { + toi_prepare_status(DONT_CLEAR_BAR, "Writing caches..."); + pageflags = pageset2_map; + } + + start_time = jiffies; + + if (rw_init_modules(1, pagedir->id)) { + abort_hibernate(TOI_FAILED_MODULE_INIT, + "Failed to initialise modules for writing."); + error = 1; + } + + if (!error) + error = do_rw_loop(1, finish_at, pageflags, base, barmax, + pagedir->id); + + if (rw_cleanup_modules(WRITE) && !error) { + abort_hibernate(TOI_FAILED_MODULE_CLEANUP, + "Failed to cleanup after writing."); + error = 1; + } + + end_time = jiffies; + + if ((end_time - start_time) && (!test_result_state(TOI_ABORTED))) { + toi_bkd.toi_io_time[0][0] += finish_at, + toi_bkd.toi_io_time[0][1] += (end_time - start_time); + } + + return error; +} + +/** + * read_pageset - highlevel function to read a pageset from disk + * @pagedir: pageset to read + * @overwrittenpagesonly: Whether to read the whole pageset or + * only part of it. + * + * Returns: + * Zero on success or -1 on failure. + **/ +static int read_pageset(struct pagedir *pagedir, int overwrittenpagesonly) +{ + int result = 0, base = 0, start_time, end_time; + int finish_at = pagedir->size; + int barmax = pagedir1.size + pagedir2.size; + struct memory_bitmap *pageflags; + + if (pagedir->id == 1) { + toi_prepare_status(DONT_CLEAR_BAR, + "Reading kernel & process data..."); + pageflags = pageset1_map; + } else { + toi_prepare_status(DONT_CLEAR_BAR, "Reading caches..."); + if (overwrittenpagesonly) { + barmax = min(pagedir1.size, pagedir2.size); + finish_at = min(pagedir1.size, pagedir2.size); + } else + base = pagedir1.size; + pageflags = pageset2_map; + } + + start_time = jiffies; + + if (rw_init_modules(0, pagedir->id)) { + toiActiveAllocator->remove_image(); + result = 1; + } else + result = do_rw_loop(0, finish_at, pageflags, base, barmax, + pagedir->id); + + if (rw_cleanup_modules(READ) && !result) { + abort_hibernate(TOI_FAILED_MODULE_CLEANUP, + "Failed to cleanup after reading."); + result = 1; + } + + /* Statistics */ + end_time = jiffies; + + if ((end_time - start_time) && (!test_result_state(TOI_ABORTED))) { + toi_bkd.toi_io_time[1][0] += finish_at, + toi_bkd.toi_io_time[1][1] += (end_time - start_time); + } + + return result; +} + +/** + * write_module_configs - store the modules configuration + * + * The configuration for each module is stored in the image header. + * Returns: Int + * Zero on success, Error value otherwise. + **/ +static int write_module_configs(void) +{ + struct toi_module_ops *this_module; + char *buffer = (char *) toi_get_zeroed_page(22, TOI_ATOMIC_GFP); + int len, index = 1; + struct toi_module_header toi_module_header; + + if (!buffer) { + printk(KERN_INFO "Failed to allocate a buffer for saving " + "module configuration info.\n"); + return -ENOMEM; + } + + /* + * We have to know which data goes with which module, so we at + * least write a length of zero for a module. Note that we are + * also assuming every module's config data takes <= PAGE_SIZE. 
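+	 * The resulting on-disk layout per module is: a struct
+	 * toi_module_header, an int giving the length of the config data,
+	 * then the data itself; a header with an empty name terminates the
+	 * list. read_one_module_config() consumes exactly this sequence.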
+ */ + + /* For each module (in registration order) */ + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || !this_module->storage_needed || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + + /* Get the data from the module */ + len = 0; + if (this_module->save_config_info) + len = this_module->save_config_info(buffer); + + /* Save the details of the module */ + toi_module_header.enabled = this_module->enabled; + toi_module_header.type = this_module->type; + toi_module_header.index = index++; + strncpy(toi_module_header.name, this_module->name, + sizeof(toi_module_header.name)); + toiActiveAllocator->rw_header_chunk(WRITE, + this_module, + (char *) &toi_module_header, + sizeof(toi_module_header)); + + /* Save the size of the data and any data returned */ + toiActiveAllocator->rw_header_chunk(WRITE, + this_module, + (char *) &len, sizeof(int)); + if (len) + toiActiveAllocator->rw_header_chunk( + WRITE, this_module, buffer, len); + } + + /* Write a blank header to terminate the list */ + toi_module_header.name[0] = '\0'; + toiActiveAllocator->rw_header_chunk(WRITE, NULL, + (char *) &toi_module_header, sizeof(toi_module_header)); + + toi_free_page(22, (unsigned long) buffer); + return 0; +} + +/** + * read_one_module_config - read and configure one module + * + * Read the configuration for one module, and configure the module + * to match if it is loaded. + * + * Returns: Int + * Zero on success, Error value otherwise. + **/ +static int read_one_module_config(struct toi_module_header *header) +{ + struct toi_module_ops *this_module; + int result, len; + char *buffer; + + /* Find the module */ + this_module = toi_find_module_given_name(header->name); + + if (!this_module) { + if (header->enabled) { + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "It looks like we need module %s for reading " + "the image but it hasn't been registered.\n", + header->name); + if (!(test_toi_state(TOI_CONTINUE_REQ))) + return -EINVAL; + } else + printk(KERN_INFO "Module %s configuration data found, " + "but the module hasn't registered. Looks like " + "it was disabled, so we're ignoring its data.", + header->name); + } + + /* Get the length of the data (if any) */ + result = toiActiveAllocator->rw_header_chunk(READ, NULL, (char *) &len, + sizeof(int)); + if (result) { + printk(KERN_ERR "Failed to read the length of the module %s's" + " configuration data.\n", + header->name); + return -EINVAL; + } + + /* Read any data and pass to the module (if we found one) */ + if (!len) + return 0; + + buffer = (char *) toi_get_zeroed_page(23, TOI_ATOMIC_GFP); + + if (!buffer) { + printk(KERN_ERR "Failed to allocate a buffer for reloading " + "module configuration info.\n"); + return -ENOMEM; + } + + toiActiveAllocator->rw_header_chunk(READ, NULL, buffer, len); + + if (!this_module) + goto out; + + if (!this_module->save_config_info) + printk(KERN_ERR "Huh? Module %s appears to have a " + "save_config_info, but not a load_config_info " + "function!\n", this_module->name); + else + this_module->load_config_info(buffer, len); + + /* + * Now move this module to the tail of its lists. This will put it in + * order. Any new modules will end up at the top of the lists. They + * should have been set to disabled when loaded (people will + * normally not edit an initrd to load a new module and then hibernate + * without using it!). 
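
Between them, write_module_configs() and read_one_module_config() define the
on-media layout: a stream of [toi_module_header][length][payload] records,
closed by a header whose name is empty. A standalone C sketch of that framing
(struct trimmed to the relevant fields; module name and values invented):

	#include <stdio.h>
	#include <string.h>

	#define NAME_LEN 30

	struct rec_header {		/* mirrors the toi_module_header idea */
		char name[NAME_LEN];
		int data_length;
	};

	static char stream[1024];

	int main(void)
	{
		char *p = stream;
		struct rec_header h = { "example writer", 4 };
		int len;

		/* Write one [header][len][payload] record... */
		memcpy(p, &h, sizeof(h)); p += sizeof(h);
		len = h.data_length;
		memcpy(p, &len, sizeof(int)); p += sizeof(int);
		memcpy(p, "abcd", len); p += len;

		/* ...then a blank-named header terminates the list. */
		memset(&h, 0, sizeof(h));
		memcpy(p, &h, sizeof(h));

		/* Read side: loop until the empty name, as the reader does. */
		p = stream;
		for (;;) {
			struct rec_header rh;
			memcpy(&rh, p, sizeof(rh)); p += sizeof(rh);
			if (!rh.name[0])
				break;
			memcpy(&len, p, sizeof(int)); p += sizeof(int);
			printf("module %s: %d config bytes\n", rh.name, len);
			p += len;
		}
		return 0;
	}
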
+ */
+
+	toi_move_module_tail(this_module);
+
+	this_module->enabled = header->enabled;
+
+out:
+	toi_free_page(23, (unsigned long) buffer);
+	return 0;
+}
+
+/**
+ * read_module_configs - reload module configurations from the image header.
+ *
+ * Returns: Int
+ * Zero on success or an error code.
+ **/
+static int read_module_configs(void)
+{
+	int result = 0;
+	struct toi_module_header toi_module_header;
+	struct toi_module_ops *this_module;
+
+	/* All modules are initially disabled. That way, if we have a module
+	 * loaded now that wasn't loaded when we hibernated, it won't be used
+	 * in trying to read the data.
+	 */
+	list_for_each_entry(this_module, &toi_modules, module_list)
+		this_module->enabled = 0;
+
+	/* Get the first module header */
+	result = toiActiveAllocator->rw_header_chunk(READ, NULL,
+			(char *) &toi_module_header,
+			sizeof(toi_module_header));
+	if (result) {
+		printk(KERN_ERR "Failed to read the next module header.\n");
+		return -EINVAL;
+	}
+
+	/* For each module (in registration order) */
+	while (toi_module_header.name[0]) {
+		result = read_one_module_config(&toi_module_header);
+
+		if (result)
+			return -EINVAL;
+
+		/* Get the next module header */
+		result = toiActiveAllocator->rw_header_chunk(READ, NULL,
+				(char *) &toi_module_header,
+				sizeof(toi_module_header));
+
+		if (result) {
+			printk(KERN_ERR "Failed to read the next module "
+					"header.\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * write_image_header - write the image header after writing the image proper
+ *
+ * Returns: Int
+ * Zero on success, error value otherwise.
+ **/
+int write_image_header(void)
+{
+	int ret;
+	int total = pagedir1.size + pagedir2.size + 2;
+	char *header_buffer = NULL;
+
+	/* Now prepare to write the header */
+	ret = toiActiveAllocator->write_header_init();
+	if (ret) {
+		abort_hibernate(TOI_FAILED_MODULE_INIT,
+				"Active allocator's write_header_init"
+				" function failed.");
+		goto write_image_header_abort;
+	}
+
+	/* Get a buffer */
+	header_buffer = (char *) toi_get_zeroed_page(24, TOI_ATOMIC_GFP);
+	if (!header_buffer) {
+		abort_hibernate(TOI_OUT_OF_MEMORY,
+			"Out of memory when trying to get page for header!");
+		goto write_image_header_abort;
+	}
+
+	/* Write hibernate header */
+	if (fill_toi_header((struct toi_header *) header_buffer)) {
+		abort_hibernate(TOI_OUT_OF_MEMORY,
+			"Failure to fill header information!");
+		goto write_image_header_abort;
+	}
+	toiActiveAllocator->rw_header_chunk(WRITE, NULL,
+			header_buffer, sizeof(struct toi_header));
+
+	toi_free_page(24, (unsigned long) header_buffer);
+
+	/* Write module configurations */
+	ret = write_module_configs();
+	if (ret) {
+		abort_hibernate(TOI_FAILED_IO,
+				"Failed to write module configs.");
+		goto write_image_header_abort;
+	}
+
+	memory_bm_write(pageset1_map, toiActiveAllocator->rw_header_chunk);
+
+	/* Flush data and let allocator cleanup */
+	if (toiActiveAllocator->write_header_cleanup()) {
+		abort_hibernate(TOI_FAILED_IO,
+				"Failed to cleanup writing header.");
+		goto write_image_header_abort_no_cleanup;
+	}
+
+	if (test_result_state(TOI_ABORTED))
+		goto write_image_header_abort_no_cleanup;
+
+	toi_update_status(total, total, NULL);
+
+	return 0;
+
+write_image_header_abort:
+	toiActiveAllocator->write_header_cleanup();
+write_image_header_abort_no_cleanup:
+	return -1;
+}
+
+/**
+ * sanity_check - check the header
+ * @sh: the header which was saved at hibernate time.
+ *
+ * Perform a few checks, seeking to ensure that the kernel being
+ * booted matches the one hibernated.
They need to match so we can
+ * be _sure_ things will work. It is not absolutely impossible for
+ * resuming from a different kernel to work, just not assured.
+ **/
+static char *sanity_check(struct toi_header *sh)
+{
+	char *reason = check_image_kernel((struct swsusp_info *) sh);
+
+	if (reason)
+		return reason;
+
+	if (!test_action_state(TOI_IGNORE_ROOTFS)) {
+		const struct super_block *sb;
+		list_for_each_entry(sb, &super_blocks, s_list) {
+			if ((!(sb->s_flags & MS_RDONLY)) &&
+			    (sb->s_type->fs_flags & FS_REQUIRES_DEV))
+				return "Device backed fs has been mounted "
+					"rw prior to resume or initrd/ramfs "
+					"is mounted rw.";
+		}
+	}
+
+	return NULL;
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(freeze_wait);
+
+#define FREEZE_IN_PROGRESS (~0)
+
+static int freeze_result;
+
+static void do_freeze(struct work_struct *dummy)
+{
+	freeze_result = freeze_processes();
+	wake_up(&freeze_wait);
+}
+
+static DECLARE_WORK(freeze_work, do_freeze);
+
+/**
+ * __read_pageset1 - test for the existence of an image and attempt to load it
+ *
+ * Returns: Int
+ *	Zero if image found and pageset1 successfully loaded.
+ *	Error if no image found or loaded.
+ **/
+static int __read_pageset1(void)
+{
+	int i, result = 0;
+	char *header_buffer = (char *) toi_get_zeroed_page(25, TOI_ATOMIC_GFP),
+		*sanity_error = NULL;
+	struct toi_header *toi_header;
+
+	if (!header_buffer) {
+		printk(KERN_INFO "Unable to allocate a page for reading the "
+				"signature.\n");
+		return -ENOMEM;
+	}
+
+	/* Check for an image */
+	result = toiActiveAllocator->image_exists(1);
+	if (!result) {
+		result = -ENODATA;
+		noresume_reset_modules();
+		printk(KERN_INFO "TuxOnIce: No image found.\n");
+		goto out;
+	}
+
+	/*
+	 * Prepare the active allocator for reading the image header. The
+	 * active allocator might read its own configuration.
+	 *
+	 * NB: This call may never return, because there might be a signature
+	 * for a different image, such that we warn the user and they choose
+	 * to reboot (e.g. if the device ids look erroneous (2.4 vs 2.6), or
+	 * if the image was stored over a network connection that is now
+	 * unavailable).
+	 */
+
+	result = toiActiveAllocator->read_header_init();
+	if (result) {
+		printk(KERN_INFO "TuxOnIce: Failed to initialise reading the "
+				"image header.\n");
+		goto out_remove_image;
+	}
+
+	/* Check for noresume command line option */
+	if (test_toi_state(TOI_NORESUME_SPECIFIED)) {
+		printk(KERN_INFO "TuxOnIce: Noresume on command line. Removed "
+				"image.\n");
+		goto out_remove_image;
+	}
+
+	/* Check whether we've resumed before */
+	if (test_toi_state(TOI_RESUMED_BEFORE)) {
+		toi_early_boot_message(1, 0, NULL);
+		if (!(test_toi_state(TOI_CONTINUE_REQ))) {
+			printk(KERN_INFO "TuxOnIce: Tried to resume before: "
+					"Invalidated image.\n");
+			goto out_remove_image;
+		}
+	}
+
+	clear_toi_state(TOI_CONTINUE_REQ);
+
+	/* Read hibernate header */
+	result = toiActiveAllocator->rw_header_chunk(READ, NULL,
+			header_buffer, sizeof(struct toi_header));
+	if (result < 0) {
+		printk(KERN_ERR "TuxOnIce: Failed to read the image "
+				"signature.\n");
+		goto out_remove_image;
+	}
+
+	toi_header = (struct toi_header *) header_buffer;
+
+	/*
+	 * NB: This call may also result in a reboot rather than returning.
+	 */
+
+	sanity_error = sanity_check(toi_header);
+	if (sanity_error) {
+		toi_early_boot_message(1, TOI_CONTINUE_REQ,
+				sanity_error);
+		printk(KERN_INFO "TuxOnIce: Sanity check failed.\n");
+		goto out_remove_image;
+	}
+
+	/*
+	 * We have an image and it looks like it will load okay.
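
The kernel-match test is conceptually just a comparison of the running
kernel's identification against what was recorded in the header. A userspace
analogy using uname(2) (illustrative only; check_image_kernel() is the real
test, and it also compares page counts and the like):

	#include <stdio.h>
	#include <string.h>
	#include <sys/utsname.h>

	/* Compare a saved utsname against the running kernel, returning a
	 * reason string on mismatch - the shape of sanity_check() above. */
	static const char *kernel_matches(const struct utsname *saved)
	{
		struct utsname now;

		if (uname(&now))
			return "Cannot read current kernel version.";
		if (strcmp(saved->release, now.release))
			return "Kernel release differs from the image.";
		if (strcmp(saved->version, now.version))
			return "Kernel build version differs from the image.";
		return NULL;	/* looks like the same kernel */
	}

	int main(void)
	{
		struct utsname saved;
		const char *reason;

		uname(&saved);		/* pretend this came from the header */
		saved.version[0] = 'X';	/* ...and that it doesn't match */
		reason = kernel_matches(&saved);
		printf("%s\n", reason ? reason : "match");
		return 0;
	}
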
+ * + * Get metadata from header. Don't override commandline parameters. + * + * We don't need to save the image size limit because it's not used + * during resume and will be restored with the image anyway. + */ + + memcpy((char *) &pagedir1, + (char *) &toi_header->pagedir, sizeof(pagedir1)); + toi_result = toi_header->param0; + toi_bkd.toi_action = toi_header->param1; + toi_bkd.toi_debug_state = toi_header->param2; + toi_bkd.toi_default_console_level = toi_header->param3; + clear_toi_state(TOI_IGNORE_LOGLEVEL); + pagedir2.size = toi_header->pageset_2_size; + for (i = 0; i < 4; i++) + toi_bkd.toi_io_time[i/2][i%2] = + toi_header->io_time[i/2][i%2]; + + set_toi_state(TOI_BOOT_KERNEL); + boot_kernel_data_buffer = toi_header->bkd; + + /* Read module configurations */ + result = read_module_configs(); + if (result) { + pagedir1.size = 0; + pagedir2.size = 0; + printk(KERN_INFO "TuxOnIce: Failed to read TuxOnIce module " + "configurations.\n"); + clear_action_state(TOI_KEEP_IMAGE); + goto out_remove_image; + } + + toi_prepare_console(); + + set_toi_state(TOI_NOW_RESUMING); + + if (!test_action_state(TOI_LATE_CPU_HOTPLUG)) { + toi_prepare_status(DONT_CLEAR_BAR, "Disable nonboot cpus."); + if (disable_nonboot_cpus()) { + set_abort_result(TOI_CPU_HOTPLUG_FAILED); + goto out_reset_console; + } + } + + if (usermodehelper_disable()) + goto out_enable_nonboot_cpus; + + current->flags |= PF_NOFREEZE; + freeze_result = FREEZE_IN_PROGRESS; + + schedule_work_on(first_cpu(cpu_online_map), &freeze_work); + + toi_cond_pause(1, "About to read original pageset1 locations."); + + /* + * See _toi_rw_header_chunk in tuxonice_block_io.c: + * Initialize pageset1_map by reading the map from the image. + */ + if (memory_bm_read(pageset1_map, toiActiveAllocator->rw_header_chunk)) + goto out_thaw; + + /* + * See toi_rw_cleanup in tuxonice_block_io.c: + * Clean up after reading the header. + */ + result = toiActiveAllocator->read_header_cleanup(); + if (result) { + printk(KERN_ERR "TuxOnIce: Failed to cleanup after reading the " + "image header.\n"); + goto out_thaw; + } + + toi_cond_pause(1, "About to read pagedir."); + + /* + * Get the addresses of pages into which we will load the kernel to + * be copied back and check if they conflict with the ones we are using. 
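
The freeze above is kicked off asynchronously and collected later: a sentinel
value marks "still running", and the waiter sleeps until the worker
overwrites it. The same rendezvous in portable form (a userspace sketch with
pthreads; the kernel code uses a workqueue and a wait queue instead):

	#include <pthread.h>
	#include <stdio.h>

	#define IN_PROGRESS (~0)

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
	static int freeze_result = IN_PROGRESS;

	static void *do_freeze(void *unused)
	{
		(void) unused;
		pthread_mutex_lock(&lock);
		freeze_result = 0;		/* freeze outcome */
		pthread_cond_signal(&cond);	/* wake_up(&freeze_wait) */
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, do_freeze, NULL); /* schedule_work */

		/* ...read the image header etc. while freezing continues... */

		pthread_mutex_lock(&lock);		/* wait_event() */
		while (freeze_result == IN_PROGRESS)
			pthread_cond_wait(&cond, &lock);
		pthread_mutex_unlock(&lock);

		printf("freeze result: %d\n", freeze_result);
		pthread_join(t, NULL);
		return 0;
	}
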
+ */
+	if (toi_get_pageset1_load_addresses()) {
+		printk(KERN_INFO "TuxOnIce: Failed to get load addresses for "
+				"pageset1.\n");
+		goto out_thaw;
+	}
+
+	/* Read the original kernel back */
+	toi_cond_pause(1, "About to read pageset 1.");
+
+	/* Given the pagemap, read back the data from disk */
+	if (read_pageset(&pagedir1, 0)) {
+		toi_prepare_status(DONT_CLEAR_BAR, "Failed to read pageset 1.");
+		result = -EIO;
+		goto out_thaw;
+	}
+
+	toi_cond_pause(1, "About to restore original kernel.");
+	result = 0;
+
+	if (!test_action_state(TOI_KEEP_IMAGE) &&
+	    toiActiveAllocator->mark_resume_attempted)
+		toiActiveAllocator->mark_resume_attempted(1);
+
+	wait_event(freeze_wait, freeze_result != FREEZE_IN_PROGRESS);
+out:
+	current->flags &= ~PF_NOFREEZE;
+	toi_free_page(25, (unsigned long) header_buffer);
+	return result;
+
+out_thaw:
+	wait_event(freeze_wait, freeze_result != FREEZE_IN_PROGRESS);
+	thaw_processes();
+	usermodehelper_enable();
+out_enable_nonboot_cpus:
+	enable_nonboot_cpus();
+out_reset_console:
+	toi_cleanup_console();
+out_remove_image:
+	result = -EINVAL;
+	if (!test_action_state(TOI_KEEP_IMAGE))
+		toiActiveAllocator->remove_image();
+	toiActiveAllocator->read_header_cleanup();
+	noresume_reset_modules();
+	goto out;
+}
+
+/**
+ * read_pageset1 - highlevel function to read the saved pages
+ *
+ * Attempt to read the header and pageset1 of a hibernate image.
+ * Handle the outcome, complaining where appropriate.
+ **/
+int read_pageset1(void)
+{
+	int error;
+
+	error = __read_pageset1();
+
+	if (error && error != -ENODATA && error != -EINVAL &&
+	    !test_result_state(TOI_ABORTED))
+		abort_hibernate(TOI_IMAGE_ERROR,
+			"TuxOnIce: Error %d resuming\n", error);
+
+	return error;
+}
+
+/**
+ * get_have_image_data - check the image header
+ **/
+static char *get_have_image_data(void)
+{
+	char *output_buffer = (char *) toi_get_zeroed_page(26, TOI_ATOMIC_GFP);
+	struct toi_header *toi_header;
+
+	if (!output_buffer) {
+		printk(KERN_INFO "Output buffer null.\n");
+		return NULL;
+	}
+
+	/* Check for an image */
+	if (!toiActiveAllocator->image_exists(1) ||
+	    toiActiveAllocator->read_header_init() ||
+	    toiActiveAllocator->rw_header_chunk(READ, NULL,
+			output_buffer, sizeof(struct toi_header))) {
+		sprintf(output_buffer, "0\n");
+		/*
+		 * From an initrd/ramfs, catting have_image and
+		 * getting a result of 0 is sufficient.
+		 */
+		clear_toi_state(TOI_BOOT_TIME);
+		goto out;
+	}
+
+	toi_header = (struct toi_header *) output_buffer;
+
+	sprintf(output_buffer, "1\n%s\n%s\n",
+		toi_header->uts.machine,
+		toi_header->uts.version);
+
+	/* Check whether we've resumed before */
+	if (test_toi_state(TOI_RESUMED_BEFORE))
+		strcat(output_buffer, "Resumed before.\n");
+
+out:
+	noresume_reset_modules();
+	return output_buffer;
+}
+
+/**
+ * read_pageset2 - read second part of the image
+ * @overwrittenpagesonly: Read only pages which would have been
+ *			  overwritten by pageset1?
+ *
+ * Read in part or all of pageset2 of an image, depending upon
+ * whether we are hibernating and have only overwritten a portion
+ * with pageset1 pages, or are resuming and need to read them
+ * all.
+ *
+ * Returns: Int
+ * Zero if no error, otherwise the error value.
+ **/
+int read_pageset2(int overwrittenpagesonly)
+{
+	int result = 0;
+
+	if (!pagedir2.size)
+		return 0;
+
+	result = read_pageset(&pagedir2, overwrittenpagesonly);
+
+	toi_cond_pause(1, "Pagedir 2 read.");
+
+	return result;
+}
+
+/**
+ * image_exists_read - has an image been found?
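
Userspace (an initramfs script or helper, say) decides whether to attempt
resume by reading this data back: get_have_image_data() emits either "0\n" or
"1\n<machine>\n<version>\n". A small C reader for that format (the sample
string is invented):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		/* Sample of what catting the have_image file might return. */
		const char *sample = "1\nx86_64\n#1 SMP Thu Mar 26 2009\n";
		char flag[4], machine[65], version[129];

		if (sscanf(sample, "%3s\n%64[^\n]\n%128[^\n]",
			   flag, machine, version) < 1) {
			fprintf(stderr, "unreadable have_image output\n");
			return 1;
		}
		if (strcmp(flag, "1")) {
			printf("no image - boot normally\n");
			return 0;
		}
		printf("image found: machine %s, kernel %s\n",
		       machine, version);
		return 0;
	}
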
+ * @page: Output buffer + * + * Store 0 or 1 in page, depending on whether an image is found. + * Incoming buffer is PAGE_SIZE and result is guaranteed + * to be far less than that, so we don't worry about + * overflow. + **/ +int image_exists_read(const char *page, int count) +{ + int len = 0; + char *result; + + if (toi_activate_storage(0)) + return count; + + if (!test_toi_state(TOI_RESUME_DEVICE_OK)) + toi_attempt_to_parse_resume_device(0); + + if (!toiActiveAllocator) { + len = sprintf((char *) page, "-1\n"); + } else { + result = get_have_image_data(); + if (result) { + len = sprintf((char *) page, "%s", result); + toi_free_page(26, (unsigned long) result); + } + } + + toi_deactivate_storage(0); + + return len; +} + +/** + * image_exists_write - invalidate an image if one exists + **/ +int image_exists_write(const char *buffer, int count) +{ + if (toi_activate_storage(0)) + return count; + + if (toiActiveAllocator && toiActiveAllocator->image_exists(1)) + toiActiveAllocator->remove_image(); + + toi_deactivate_storage(0); + + clear_result_state(TOI_KEPT_IMAGE); + + return count; +} diff --git a/kernel/power/tuxonice_io.h b/kernel/power/tuxonice_io.h new file mode 100644 index 0000000..86e8996 --- /dev/null +++ b/kernel/power/tuxonice_io.h @@ -0,0 +1,71 @@ +/* + * kernel/power/tuxonice_io.h + * + * Copyright (C) 2005-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * It contains high level IO routines for hibernating. + * + */ + +#include +#include "tuxonice_pagedir.h" +#include "power.h" + +/* Non-module data saved in our image header */ +struct toi_header { + /* + * Mirror struct swsusp_info, but without + * the page aligned attribute + */ + struct new_utsname uts; + u32 version_code; + unsigned long num_physpages; + int cpus; + unsigned long image_pages; + unsigned long pages; + unsigned long size; + + /* Our own data */ + unsigned long orig_mem_free; + int page_size; + int pageset_2_size; + int param0; + int param1; + int param2; + int param3; + int progress0; + int progress1; + int progress2; + int progress3; + int io_time[2][2]; + struct pagedir pagedir; + dev_t root_fs; + unsigned long bkd; /* Boot kernel data locn */ +}; + +extern int write_pageset(struct pagedir *pagedir); +extern int write_image_header(void); +extern int read_pageset1(void); +extern int read_pageset2(int overwrittenpagesonly); + +extern int toi_attempt_to_parse_resume_device(int quiet); +extern void attempt_to_parse_resume_device2(void); +extern void attempt_to_parse_alt_resume_param(void); +int image_exists_read(const char *page, int count); +int image_exists_write(const char *buffer, int count); +extern void save_restore_alt_param(int replace, int quiet); +extern atomic_t toi_io_workers; + +/* Args to save_restore_alt_param */ +#define RESTORE 0 +#define SAVE 1 + +#define NOQUIET 0 +#define QUIET 1 + +extern dev_t name_to_dev_t(char *line); + +extern wait_queue_head_t toi_io_queue_flusher; +extern int toi_bio_queue_flusher_should_finish; diff --git a/kernel/power/tuxonice_modules.c b/kernel/power/tuxonice_modules.c new file mode 100644 index 0000000..a0cd9eb --- /dev/null +++ b/kernel/power/tuxonice_modules.c @@ -0,0 +1,494 @@ +/* + * kernel/power/tuxonice_modules.c + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + */ + +#include +#include +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_ui.h" + +LIST_HEAD(toi_filters); +LIST_HEAD(toiAllocators); +LIST_HEAD(toi_modules); + 
+struct toi_module_ops *toiActiveAllocator; +EXPORT_SYMBOL_GPL(toiActiveAllocator); + +static int toi_num_filters; +int toiNumAllocators, toi_num_modules; + +/* + * toi_header_storage_for_modules + * + * Returns the amount of space needed to store configuration + * data needed by the modules prior to copying back the original + * kernel. We can exclude data for pageset2 because it will be + * available anyway once the kernel is copied back. + */ +long toi_header_storage_for_modules(void) +{ + struct toi_module_ops *this_module; + int bytes = 0; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + if (this_module->storage_needed) { + int this = this_module->storage_needed() + + sizeof(struct toi_module_header) + + sizeof(int); + this_module->header_requested = this; + bytes += this; + } + } + + /* One more for the empty terminator */ + return bytes + sizeof(struct toi_module_header); +} + +void print_toi_header_storage_for_modules(void) +{ + struct toi_module_ops *this_module; + int bytes = 0; + + printk(KERN_DEBUG "Header storage:\n"); + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || + (this_module->type == WRITER_MODULE && + toiActiveAllocator != this_module)) + continue; + if (this_module->storage_needed) { + int this = this_module->storage_needed() + + sizeof(struct toi_module_header) + + sizeof(int); + this_module->header_requested = this; + bytes += this; + printk(KERN_DEBUG "+ %16s : %-4d/%d.\n", + this_module->name, + this_module->header_used, this); + } + } + + printk(KERN_DEBUG "+ empty terminator : %ld.\n", + sizeof(struct toi_module_header)); + printk(KERN_DEBUG " ====\n"); + printk(KERN_DEBUG " %ld\n", + bytes + sizeof(struct toi_module_header)); +} + +/* + * toi_memory_for_modules + * + * Returns the amount of memory requested by modules for + * doing their work during the cycle. + */ + +long toi_memory_for_modules(int print_parts) +{ + long bytes = 0, result; + struct toi_module_ops *this_module; + + if (print_parts) + printk(KERN_INFO "Memory for modules:\n===================\n"); + list_for_each_entry(this_module, &toi_modules, module_list) { + int this; + if (!this_module->enabled) + continue; + if (this_module->memory_needed) { + this = this_module->memory_needed(); + if (print_parts) + printk(KERN_INFO "%10d bytes (%5ld pages) for " + "module '%s'.\n", this, + DIV_ROUND_UP(this, PAGE_SIZE), + this_module->name); + bytes += this; + } + } + + result = DIV_ROUND_UP(bytes, PAGE_SIZE); + if (print_parts) + printk(KERN_INFO " => %ld bytes, %ld pages.\n", bytes, result); + + return result; +} + +/* + * toi_expected_compression_ratio + * + * Returns the compression ratio expected when saving the image. 
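
Each enabled module's header reservation is its own storage_needed() result
plus the fixed per-record overhead (one toi_module_header and one int for the
length), with one trailing empty header for the terminator. A toy version of
toi_header_storage_for_modules()'s arithmetic (all numbers invented):

	#include <stdio.h>

	/* Field layout mirrors struct toi_module_header. */
	struct toi_module_header {
		char name[30];
		int enabled;
		int type;
		int index;
		int data_length;
		unsigned long signature;
	};

	int main(void)
	{
		/* Invented storage_needed() results for three modules. */
		int needs[] = { 16, 0, 128 };
		long bytes = 0;
		unsigned int i;

		for (i = 0; i < sizeof(needs) / sizeof(needs[0]); i++)
			bytes += needs[i] + sizeof(struct toi_module_header)
				+ sizeof(int);

		/* One more header for the empty terminator. */
		bytes += sizeof(struct toi_module_header);

		printf("header space for module configs: %ld bytes\n", bytes);
		return 0;
	}
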
+ */
+
+int toi_expected_compression_ratio(void)
+{
+	int ratio = 100;
+	struct toi_module_ops *this_module;
+
+	list_for_each_entry(this_module, &toi_modules, module_list) {
+		if (!this_module->enabled)
+			continue;
+		if (this_module->expected_compression)
+			ratio = ratio * this_module->expected_compression()
+				/ 100;
+	}
+
+	return ratio;
+}
+
+/* toi_find_module_given_dir
+ * Functionality :	Return a module (if found), given a pointer
+ *			to its directory name
+ */
+
+static struct toi_module_ops *toi_find_module_given_dir(char *name)
+{
+	struct toi_module_ops *this_module, *found_module = NULL;
+
+	list_for_each_entry(this_module, &toi_modules, module_list) {
+		if (!strcmp(name, this_module->directory)) {
+			found_module = this_module;
+			break;
+		}
+	}
+
+	return found_module;
+}
+
+/* toi_find_module_given_name
+ * Functionality :	Return a module (if found), given a pointer
+ *			to its name
+ */
+
+struct toi_module_ops *toi_find_module_given_name(char *name)
+{
+	struct toi_module_ops *this_module, *found_module = NULL;
+
+	list_for_each_entry(this_module, &toi_modules, module_list) {
+		if (!strcmp(name, this_module->name)) {
+			found_module = this_module;
+			break;
+		}
+	}
+
+	return found_module;
+}
+
+/*
+ * toi_print_module_debug_info
+ * Functionality   : Get debugging info from modules into a buffer.
+ */
+int toi_print_module_debug_info(char *buffer, int buffer_size)
+{
+	struct toi_module_ops *this_module;
+	int len = 0;
+
+	list_for_each_entry(this_module, &toi_modules, module_list) {
+		if (!this_module->enabled)
+			continue;
+		if (this_module->print_debug_info) {
+			int result;
+			result = this_module->print_debug_info(buffer + len,
+					buffer_size - len);
+			len += result;
+		}
+	}
+
+	/* Ensure null terminated without writing past the buffer's end */
+	buffer[len < buffer_size ? len : buffer_size - 1] = '\0';
+
+	return len;
+}
+
+/*
+ * toi_register_module
+ *
+ * Register a module.
+ */
+int toi_register_module(struct toi_module_ops *module)
+{
+	int i;
+	struct kobject *kobj;
+
+	module->enabled = 1;
+
+	if (toi_find_module_given_name(module->name)) {
+		printk(KERN_INFO "TuxOnIce: Trying to load module %s,"
+				" which is already registered.\n",
+				module->name);
+		return -EBUSY;
+	}
+
+	switch (module->type) {
+	case FILTER_MODULE:
+		list_add_tail(&module->type_list, &toi_filters);
+		toi_num_filters++;
+		break;
+	case WRITER_MODULE:
+		list_add_tail(&module->type_list, &toiAllocators);
+		toiNumAllocators++;
+		break;
+	case MISC_MODULE:
+	case MISC_HIDDEN_MODULE:
+		break;
+	default:
+		printk(KERN_ERR "Hmmm. Module '%s' has an invalid type."
+				" It has been ignored.\n", module->name);
+		return -EINVAL;
+	}
+	list_add_tail(&module->module_list, &toi_modules);
+	toi_num_modules++;
+
+	if ((!module->directory && !module->shared_directory) ||
+	    !module->sysfs_data || !module->num_sysfs_entries)
+		return 0;
+
+	/*
+	 * Modules may share a directory, but those with shared_dir
+	 * set must be loaded (via symbol dependencies) after parents
+	 * and unloaded beforehand.
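
Filters multiply their expected ratios together, so two filters expecting 60%
and 90% of their input to survive compression yield an overall 54%. In
miniature (the per-filter percentages are invented):

	#include <stdio.h>

	int main(void)
	{
		/* Invented per-filter expectations, in percent. */
		int expected[] = { 60, 90 };
		int ratio = 100;
		unsigned int i;

		/* Same chaining as toi_expected_compression_ratio(). */
		for (i = 0; i < sizeof(expected) / sizeof(expected[0]); i++)
			ratio = ratio * expected[i] / 100;

		printf("expected image size: %d%% of input\n", ratio); /* 54 */
		return 0;
	}
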
+ */ + if (module->shared_directory) { + struct toi_module_ops *shared = + toi_find_module_given_dir(module->shared_directory); + if (!shared) { + printk(KERN_ERR "TuxOnIce: Module %s wants to share " + "%s's directory but %s isn't loaded.\n", + module->name, module->shared_directory, + module->shared_directory); + toi_unregister_module(module); + return -ENODEV; + } + kobj = shared->dir_kobj; + } else { + if (!strncmp(module->directory, "[ROOT]", 6)) + kobj = tuxonice_kobj; + else + kobj = make_toi_sysdir(module->directory); + } + module->dir_kobj = kobj; + for (i = 0; i < module->num_sysfs_entries; i++) { + int result = toi_register_sysfs_file(kobj, + &module->sysfs_data[i]); + if (result) + return result; + } + return 0; +} +EXPORT_SYMBOL_GPL(toi_register_module); + +/* + * toi_unregister_module + * + * Remove a module. + */ +void toi_unregister_module(struct toi_module_ops *module) +{ + int i; + + if (module->dir_kobj) + for (i = 0; i < module->num_sysfs_entries; i++) + toi_unregister_sysfs_file(module->dir_kobj, + &module->sysfs_data[i]); + + if (!module->shared_directory && module->directory && + strncmp(module->directory, "[ROOT]", 6)) + remove_toi_sysdir(module->dir_kobj); + + switch (module->type) { + case FILTER_MODULE: + list_del(&module->type_list); + toi_num_filters--; + break; + case WRITER_MODULE: + list_del(&module->type_list); + toiNumAllocators--; + if (toiActiveAllocator == module) { + toiActiveAllocator = NULL; + clear_toi_state(TOI_CAN_RESUME); + clear_toi_state(TOI_CAN_HIBERNATE); + } + break; + case MISC_MODULE: + case MISC_HIDDEN_MODULE: + break; + default: + printk(KERN_ERR "Module '%s' has an invalid type." + " It has been ignored.\n", module->name); + return; + } + list_del(&module->module_list); + toi_num_modules--; +} +EXPORT_SYMBOL_GPL(toi_unregister_module); + +/* + * toi_move_module_tail + * + * Rearrange modules when reloading the config. + */ +void toi_move_module_tail(struct toi_module_ops *module) +{ + switch (module->type) { + case FILTER_MODULE: + if (toi_num_filters > 1) + list_move_tail(&module->type_list, &toi_filters); + break; + case WRITER_MODULE: + if (toiNumAllocators > 1) + list_move_tail(&module->type_list, &toiAllocators); + break; + case MISC_MODULE: + case MISC_HIDDEN_MODULE: + break; + default: + printk(KERN_ERR "Module '%s' has an invalid type." + " It has been ignored.\n", module->name); + return; + } + if ((toi_num_filters + toiNumAllocators) > 1) + list_move_tail(&module->module_list, &toi_modules); +} + +/* + * toi_initialise_modules + * + * Get ready to do some work! + */ +int toi_initialise_modules(int starting_cycle, int early) +{ + struct toi_module_ops *this_module; + int result; + + list_for_each_entry(this_module, &toi_modules, module_list) { + this_module->header_requested = 0; + this_module->header_used = 0; + if (!this_module->enabled) + continue; + if (this_module->early != early) + continue; + if (this_module->initialise) { + toi_message(TOI_MEMORY, TOI_MEDIUM, 1, + "Initialising module %s.\n", + this_module->name); + result = this_module->initialise(starting_cycle); + if (result) { + toi_cleanup_modules(starting_cycle); + return result; + } + this_module->initialised = 1; + } + } + + return 0; +} + +/* + * toi_cleanup_modules + * + * Tell modules the work is done. 
+ */ +void toi_cleanup_modules(int finishing_cycle) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (!this_module->enabled || !this_module->initialised) + continue; + if (this_module->cleanup) { + toi_message(TOI_MEMORY, TOI_MEDIUM, 1, + "Cleaning up module %s.\n", + this_module->name); + this_module->cleanup(finishing_cycle); + } + this_module->initialised = 0; + } +} + +/* + * toi_get_next_filter + * + * Get the next filter in the pipeline. + */ +struct toi_module_ops *toi_get_next_filter(struct toi_module_ops *filter_sought) +{ + struct toi_module_ops *last_filter = NULL, *this_filter = NULL; + + list_for_each_entry(this_filter, &toi_filters, type_list) { + if (!this_filter->enabled) + continue; + if ((last_filter == filter_sought) || (!filter_sought)) + return this_filter; + last_filter = this_filter; + } + + return toiActiveAllocator; +} +EXPORT_SYMBOL_GPL(toi_get_next_filter); + +/** + * toi_show_modules: Printk what support is loaded. + */ +void toi_print_modules(void) +{ + struct toi_module_ops *this_module; + int prev = 0; + + printk(KERN_INFO "TuxOnIce " TOI_CORE_VERSION ", with support for"); + + list_for_each_entry(this_module, &toi_modules, module_list) { + if (this_module->type == MISC_HIDDEN_MODULE) + continue; + printk("%s %s%s%s", prev ? "," : "", + this_module->enabled ? "" : "[", + this_module->name, + this_module->enabled ? "" : "]"); + prev = 1; + } + + printk(".\n"); +} + +/* toi_get_modules + * + * Take a reference to modules so they can't go away under us. + */ + +int toi_get_modules(void) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) { + struct toi_module_ops *this_module2; + + if (try_module_get(this_module->module)) + continue; + + /* Failed! Reverse gets and return error */ + list_for_each_entry(this_module2, &toi_modules, + module_list) { + if (this_module == this_module2) + return -EINVAL; + module_put(this_module2->module); + } + } + return 0; +} + +/* toi_put_modules + * + * Release our references to modules we used. + */ + +void toi_put_modules(void) +{ + struct toi_module_ops *this_module; + + list_for_each_entry(this_module, &toi_modules, module_list) + module_put(this_module->module); +} diff --git a/kernel/power/tuxonice_modules.h b/kernel/power/tuxonice_modules.h new file mode 100644 index 0000000..79494e2 --- /dev/null +++ b/kernel/power/tuxonice_modules.h @@ -0,0 +1,181 @@ +/* + * kernel/power/tuxonice_modules.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * It contains declarations for modules. Modules are additions to + * TuxOnIce that provide facilities such as image compression or + * encryption, backends for storage of the image and user interfaces. + * + */ + +#ifndef TOI_MODULES_H +#define TOI_MODULES_H + +/* This is the maximum size we store in the image header for a module name */ +#define TOI_MAX_MODULE_NAME_LENGTH 30 + +/* Per-module metadata */ +struct toi_module_header { + char name[TOI_MAX_MODULE_NAME_LENGTH]; + int enabled; + int type; + int index; + int data_length; + unsigned long signature; +}; + +enum { + FILTER_MODULE, + WRITER_MODULE, + MISC_MODULE, /* Block writer, eg. 
*/ + MISC_HIDDEN_MODULE, +}; + +enum { + TOI_ASYNC, + TOI_SYNC +}; + +struct toi_module_ops { + /* Functions common to all modules */ + int type; + char *name; + char *directory; + char *shared_directory; + struct kobject *dir_kobj; + struct module *module; + int enabled, early, initialised; + struct list_head module_list; + + /* List of filters or allocators */ + struct list_head list, type_list; + + /* + * Requirements for memory and storage in + * the image header.. + */ + int (*memory_needed) (void); + int (*storage_needed) (void); + + int header_requested, header_used; + + int (*expected_compression) (void); + + /* + * Debug info + */ + int (*print_debug_info) (char *buffer, int size); + int (*save_config_info) (char *buffer); + void (*load_config_info) (char *buffer, int len); + + /* + * Initialise & cleanup - general routines called + * at the start and end of a cycle. + */ + int (*initialise) (int starting_cycle); + void (*cleanup) (int finishing_cycle); + + /* + * Calls for allocating storage (allocators only). + * + * Header space is requested separately and cannot fail, but the + * reservation is only applied when main storage is allocated. + * The header space reservation is thus always set prior to + * requesting the allocation of storage - and prior to querying + * how much storage is available. + */ + + int (*storage_available) (void); + void (*reserve_header_space) (int space_requested); + int (*allocate_storage) (int space_requested); + int (*storage_allocated) (void); + + /* + * Routines used in image I/O. + */ + int (*rw_init) (int rw, int stream_number); + int (*rw_cleanup) (int rw); + int (*write_page) (unsigned long index, struct page *buffer_page, + unsigned int buf_size); + int (*read_page) (unsigned long *index, struct page *buffer_page, + unsigned int *buf_size); + int (*io_flusher) (int rw); + + /* Reset module if image exists but reading aborted */ + void (*noresume_reset) (void); + + /* Read and write the metadata */ + int (*write_header_init) (void); + int (*write_header_cleanup) (void); + + int (*read_header_init) (void); + int (*read_header_cleanup) (void); + + int (*rw_header_chunk) (int rw, struct toi_module_ops *owner, + char *buffer_start, int buffer_size); + + int (*rw_header_chunk_noreadahead) (int rw, + struct toi_module_ops *owner, char *buffer_start, + int buffer_size); + + /* Attempt to parse an image location */ + int (*parse_sig_location) (char *buffer, int only_writer, int quiet); + + /* Throttle I/O according to throughput */ + void (*update_throughput_throttle) (int jif_index); + + /* Flush outstanding I/O */ + int (*finish_all_io) (void); + + /* Determine whether image exists that we can restore */ + int (*image_exists) (int quiet); + + /* Mark the image as having tried to resume */ + int (*mark_resume_attempted) (int); + + /* Destroy image if one exists */ + int (*remove_image) (void); + + /* Sysfs Data */ + struct toi_sysfs_data *sysfs_data; + int num_sysfs_entries; +}; + +extern int toi_num_modules, toiNumAllocators; + +extern struct toi_module_ops *toiActiveAllocator; +extern struct list_head toi_filters, toiAllocators, toi_modules; + +extern void toi_prepare_console_modules(void); +extern void toi_cleanup_console_modules(void); + +extern struct toi_module_ops *toi_find_module_given_name(char *name); +extern struct toi_module_ops *toi_get_next_filter(struct toi_module_ops *); + +extern int toi_register_module(struct toi_module_ops *module); +extern void toi_move_module_tail(struct toi_module_ops *module); + +extern long 
toi_header_storage_for_modules(void); +extern long toi_memory_for_modules(int print_parts); +extern void print_toi_header_storage_for_modules(void); +extern int toi_expected_compression_ratio(void); + +extern int toi_print_module_debug_info(char *buffer, int buffer_size); +extern int toi_register_module(struct toi_module_ops *module); +extern void toi_unregister_module(struct toi_module_ops *module); + +extern int toi_initialise_modules(int starting_cycle, int early); +#define toi_initialise_modules_early(starting) \ + toi_initialise_modules(starting, 1) +#define toi_initialise_modules_late(starting) \ + toi_initialise_modules(starting, 0) +extern void toi_cleanup_modules(int finishing_cycle); + +extern void toi_print_modules(void); + +int toi_get_modules(void); +void toi_put_modules(void); +#endif diff --git a/kernel/power/tuxonice_netlink.c b/kernel/power/tuxonice_netlink.c new file mode 100644 index 0000000..bb027a7 --- /dev/null +++ b/kernel/power/tuxonice_netlink.c @@ -0,0 +1,343 @@ +/* + * kernel/power/tuxonice_netlink.c + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Functions for communicating with a userspace helper via netlink. + */ + + +#include +#include "tuxonice_netlink.h" +#include "tuxonice.h" +#include "tuxonice_modules.h" +#include "tuxonice_alloc.h" + +static struct user_helper_data *uhd_list; + +/* + * Refill our pool of SKBs for use in emergencies (eg, when eating memory and + * none can be allocated). + */ +static void toi_fill_skb_pool(struct user_helper_data *uhd) +{ + while (uhd->pool_level < uhd->pool_limit) { + struct sk_buff *new_skb = + alloc_skb(NLMSG_SPACE(uhd->skb_size), TOI_ATOMIC_GFP); + + if (!new_skb) + break; + + new_skb->next = uhd->emerg_skbs; + uhd->emerg_skbs = new_skb; + uhd->pool_level++; + } +} + +/* + * Try to allocate a single skb. If we can't get one, try to use one from + * our pool. + */ +static struct sk_buff *toi_get_skb(struct user_helper_data *uhd) +{ + struct sk_buff *skb = + alloc_skb(NLMSG_SPACE(uhd->skb_size), TOI_ATOMIC_GFP); + + if (skb) + return skb; + + skb = uhd->emerg_skbs; + if (skb) { + uhd->pool_level--; + uhd->emerg_skbs = skb->next; + skb->next = NULL; + } + + return skb; +} + +static void put_skb(struct user_helper_data *uhd, struct sk_buff *skb) +{ + if (uhd->pool_level < uhd->pool_limit) { + skb->next = uhd->emerg_skbs; + uhd->emerg_skbs = skb; + } else + kfree_skb(skb); +} + +void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len) +{ + struct sk_buff *skb; + struct nlmsghdr *nlh; + void *dest; + struct task_struct *t; + + if (uhd->pid == -1) + return; + + if (uhd->debug) + printk(KERN_ERR "toi_send_netlink_message: Send " + "message type %d.\n", type); + + skb = toi_get_skb(uhd); + if (!skb) { + printk(KERN_INFO "toi_netlink: Can't allocate skb!\n"); + return; + } + + /* NLMSG_PUT contains a hidden goto nlmsg_failure */ + nlh = NLMSG_PUT(skb, 0, uhd->sock_seq, type, len); + uhd->sock_seq++; + + dest = NLMSG_DATA(nlh); + if (params && len > 0) + memcpy(dest, params, len); + + netlink_unicast(uhd->nl, skb, uhd->pid, 0); + + read_lock(&tasklist_lock); + t = find_task_by_pid_type_ns(PIDTYPE_PID, uhd->pid, &init_pid_ns); + if (!t) { + read_unlock(&tasklist_lock); + if (uhd->pid > -1) + printk(KERN_INFO "Hmm. 
Can't find the userspace task" + " %d.\n", uhd->pid); + return; + } + wake_up_process(t); + read_unlock(&tasklist_lock); + + yield(); + + return; + +nlmsg_failure: + if (skb) + put_skb(uhd, skb); + + if (uhd->debug) + printk(KERN_ERR "toi_send_netlink_message: Failed to send " + "message type %d.\n", type); +} +EXPORT_SYMBOL_GPL(toi_send_netlink_message); + +static void send_whether_debugging(struct user_helper_data *uhd) +{ + static u8 is_debugging = 1; + + toi_send_netlink_message(uhd, NETLINK_MSG_IS_DEBUGGING, + &is_debugging, sizeof(u8)); +} + +/* + * Set the PF_NOFREEZE flag on the given process to ensure it can run whilst we + * are hibernating. + */ +static int nl_set_nofreeze(struct user_helper_data *uhd, __u32 pid) +{ + struct task_struct *t; + + if (uhd->debug) + printk(KERN_ERR "nl_set_nofreeze for pid %d.\n", pid); + + read_lock(&tasklist_lock); + t = find_task_by_pid_type_ns(PIDTYPE_PID, pid, &init_pid_ns); + if (!t) { + read_unlock(&tasklist_lock); + printk(KERN_INFO "Strange. Can't find the userspace task %d.\n", + pid); + return -EINVAL; + } + + t->flags |= PF_NOFREEZE; + + read_unlock(&tasklist_lock); + uhd->pid = pid; + + toi_send_netlink_message(uhd, NETLINK_MSG_NOFREEZE_ACK, NULL, 0); + + return 0; +} + +/* + * Called when the userspace process has informed us that it's ready to roll. + */ +static int nl_ready(struct user_helper_data *uhd, u32 version) +{ + if (version != uhd->interface_version) { + printk(KERN_INFO "%s userspace process using invalid interface" + " version (%d - kernel wants %d). Trying to " + "continue without it.\n", + uhd->name, version, uhd->interface_version); + if (uhd->not_ready) + uhd->not_ready(); + return -EINVAL; + } + + complete(&uhd->wait_for_process); + + return 0; +} + +void toi_netlink_close_complete(struct user_helper_data *uhd) +{ + if (uhd->nl) { + netlink_kernel_release(uhd->nl); + uhd->nl = NULL; + } + + while (uhd->emerg_skbs) { + struct sk_buff *next = uhd->emerg_skbs->next; + kfree_skb(uhd->emerg_skbs); + uhd->emerg_skbs = next; + } + + uhd->pid = -1; +} +EXPORT_SYMBOL_GPL(toi_netlink_close_complete); + +static int toi_nl_gen_rcv_msg(struct user_helper_data *uhd, + struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type = nlh->nlmsg_type; + int *data; + int err; + + if (uhd->debug) + printk(KERN_ERR "toi_user_rcv_skb: Received message %d.\n", + type); + + /* Let the more specific handler go first. It returns + * 1 for valid messages that it doesn't know. 
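
The skb handling earlier in this file (toi_fill_skb_pool(), toi_get_skb(),
put_skb()) is a generic "preallocate for emergencies" pattern: allocate
normally, and only dip into, or refill, a small reserve list when the normal
path fails. A malloc-based sketch of the same shape:

	#include <stdio.h>
	#include <stdlib.h>

	struct buf { struct buf *next; char data[256]; };

	static struct buf *reserve;	/* like uhd->emerg_skbs */
	static int pool_level, pool_limit = 4;

	static void fill_pool(void)
	{
		while (pool_level < pool_limit) {
			struct buf *b = malloc(sizeof(*b));
			if (!b)
				break;
			b->next = reserve;
			reserve = b;
			pool_level++;
		}
	}

	static struct buf *get_buf(void)
	{
		struct buf *b = malloc(sizeof(*b));

		if (b)
			return b;
		b = reserve;		/* fall back to the reserve */
		if (b) {
			reserve = b->next;
			pool_level--;
		}
		return b;
	}

	int main(void)
	{
		fill_pool();
		printf("reserve primed with %d buffers\n", pool_level);
		free(get_buf());
		return 0;
	}
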
*/ + err = uhd->rcv_msg(skb, nlh); + if (err != 1) + return err; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && uhd->pid != -1) { + printk(KERN_INFO "Received extra nofreeze me requests.\n"); + return -EBUSY; + } + + data = NLMSG_DATA(nlh); + + switch (type) { + case NETLINK_MSG_NOFREEZE_ME: + return nl_set_nofreeze(uhd, nlh->nlmsg_pid); + case NETLINK_MSG_GET_DEBUGGING: + send_whether_debugging(uhd); + return 0; + case NETLINK_MSG_READY: + if (nlh->nlmsg_len != NLMSG_LENGTH(sizeof(u32))) { + printk(KERN_INFO "Invalid ready mesage.\n"); + if (uhd->not_ready) + uhd->not_ready(); + return -EINVAL; + } + return nl_ready(uhd, (u32) *data); + case NETLINK_MSG_CLEANUP: + toi_netlink_close_complete(uhd); + return 0; + } + + return -EINVAL; +} + +static void toi_user_rcv_skb(struct sk_buff *skb) +{ + int err; + struct nlmsghdr *nlh; + struct user_helper_data *uhd = uhd_list; + + while (uhd && uhd->netlink_id != skb->sk->sk_protocol) + uhd = uhd->next; + + if (!uhd) + return; + + while (skb->len >= NLMSG_SPACE(0)) { + u32 rlen; + + nlh = (struct nlmsghdr *) skb->data; + if (nlh->nlmsg_len < sizeof(*nlh) || skb->len < nlh->nlmsg_len) + return; + + rlen = NLMSG_ALIGN(nlh->nlmsg_len); + if (rlen > skb->len) + rlen = skb->len; + + err = toi_nl_gen_rcv_msg(uhd, skb, nlh); + if (err) + netlink_ack(skb, nlh, err); + else if (nlh->nlmsg_flags & NLM_F_ACK) + netlink_ack(skb, nlh, 0); + skb_pull(skb, rlen); + } +} + +static int netlink_prepare(struct user_helper_data *uhd) +{ + uhd->next = uhd_list; + uhd_list = uhd; + + uhd->sock_seq = 0x42c0ffee; + uhd->nl = netlink_kernel_create(&init_net, uhd->netlink_id, 0, + toi_user_rcv_skb, NULL, THIS_MODULE); + if (!uhd->nl) { + printk(KERN_INFO "Failed to allocate netlink socket for %s.\n", + uhd->name); + return -ENOMEM; + } + + toi_fill_skb_pool(uhd); + + return 0; +} + +void toi_netlink_close(struct user_helper_data *uhd) +{ + struct task_struct *t; + + read_lock(&tasklist_lock); + t = find_task_by_pid_type_ns(PIDTYPE_PID, uhd->pid, &init_pid_ns); + if (t) + t->flags &= ~PF_NOFREEZE; + read_unlock(&tasklist_lock); + + toi_send_netlink_message(uhd, NETLINK_MSG_CLEANUP, NULL, 0); +} +EXPORT_SYMBOL_GPL(toi_netlink_close); + +int toi_netlink_setup(struct user_helper_data *uhd) +{ + /* In case userui didn't cleanup properly on us */ + toi_netlink_close_complete(uhd); + + if (netlink_prepare(uhd) < 0) { + printk(KERN_INFO "Netlink prepare failed.\n"); + return 1; + } + + if (toi_launch_userspace_program(uhd->program, uhd->netlink_id, + UMH_WAIT_EXEC, uhd->debug) < 0) { + printk(KERN_INFO "Launch userspace program failed.\n"); + toi_netlink_close_complete(uhd); + return 1; + } + + /* Wait 2 seconds for the userspace process to make contact */ + wait_for_completion_timeout(&uhd->wait_for_process, 2*HZ); + + if (uhd->pid == -1) { + printk(KERN_INFO "%s: Failed to contact userspace process.\n", + uhd->name); + toi_netlink_close_complete(uhd); + return 1; + } + + return 0; +} +EXPORT_SYMBOL_GPL(toi_netlink_setup); diff --git a/kernel/power/tuxonice_netlink.h b/kernel/power/tuxonice_netlink.h new file mode 100644 index 0000000..37e174b --- /dev/null +++ b/kernel/power/tuxonice_netlink.h @@ -0,0 +1,62 @@ +/* + * kernel/power/tuxonice_netlink.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Declarations for functions for communicating with a userspace helper + * via netlink. 
+ */ + +#include +#include + +#define NETLINK_MSG_BASE 0x10 + +#define NETLINK_MSG_READY 0x10 +#define NETLINK_MSG_NOFREEZE_ME 0x16 +#define NETLINK_MSG_GET_DEBUGGING 0x19 +#define NETLINK_MSG_CLEANUP 0x24 +#define NETLINK_MSG_NOFREEZE_ACK 0x27 +#define NETLINK_MSG_IS_DEBUGGING 0x28 + +struct user_helper_data { + int (*rcv_msg) (struct sk_buff *skb, struct nlmsghdr *nlh); + void (*not_ready) (void); + struct sock *nl; + u32 sock_seq; + pid_t pid; + char *comm; + char program[256]; + int pool_level; + int pool_limit; + struct sk_buff *emerg_skbs; + int skb_size; + int netlink_id; + char *name; + struct user_helper_data *next; + struct completion wait_for_process; + u32 interface_version; + int must_init; + int debug; +}; + +#ifdef CONFIG_NET +int toi_netlink_setup(struct user_helper_data *uhd); +void toi_netlink_close(struct user_helper_data *uhd); +void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len); +void toi_netlink_close_complete(struct user_helper_data *uhd); +#else +static inline int toi_netlink_setup(struct user_helper_data *uhd) +{ + return 0; +} + +static inline void toi_netlink_close(struct user_helper_data *uhd) { }; +static inline void toi_send_netlink_message(struct user_helper_data *uhd, + int type, void *params, size_t len) { }; +static inline void toi_netlink_close_complete(struct user_helper_data *uhd) + { }; +#endif diff --git a/kernel/power/tuxonice_pagedir.c b/kernel/power/tuxonice_pagedir.c new file mode 100644 index 0000000..0c2f7fc --- /dev/null +++ b/kernel/power/tuxonice_pagedir.c @@ -0,0 +1,381 @@ +/* + * kernel/power/tuxonice_pagedir.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Routines for handling pagesets. + * Note that pbes aren't actually stored as such. They're stored as + * bitmaps and extents. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice_pageflags.h" +#include "tuxonice_ui.h" +#include "tuxonice_pagedir.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice.h" +#include "power.h" +#include "tuxonice_builtin.h" +#include "tuxonice_alloc.h" + +static int ptoi_pfn; +static struct pbe *this_low_pbe; +static struct pbe **last_low_pbe_ptr; +static struct memory_bitmap dup_map1, dup_map2; + +void toi_reset_alt_image_pageset2_pfn(void) +{ + memory_bm_position_reset(pageset2_map); +} + +static struct page *first_conflicting_page; + +/* + * free_conflicting_pages + */ + +static void free_conflicting_pages(void) +{ + while (first_conflicting_page) { + struct page *next = + *((struct page **) kmap(first_conflicting_page)); + kunmap(first_conflicting_page); + toi__free_page(29, first_conflicting_page); + first_conflicting_page = next; + } +} + +/* __toi_get_nonconflicting_page + * + * Description: Gets order zero pages that won't be overwritten + * while copying the original pages. 
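
The conflicting-page chain built here and torn down by
free_conflicting_pages() works because each unusable page stores the pointer
to the next one in its own first bytes, so no extra memory is needed to track
them. The same intrusive trick with malloc:

	#include <stdio.h>
	#include <stdlib.h>

	/* Chain rejected blocks through their own storage: the first
	 * sizeof(void *) bytes of each block hold the next pointer. */
	static void *rejected;

	static void reject(void *block)
	{
		*(void **) block = rejected;
		rejected = block;
	}

	static void free_rejected(void)
	{
		while (rejected) {
			void *next = *(void **) rejected;
			free(rejected);
			rejected = next;
		}
	}

	int main(void)
	{
		int i;

		for (i = 0; i < 3; i++) {
			void *p = malloc(4096);
			if (p)
				reject(p); /* pretend these conflicted */
		}
		free_rejected();
		printf("all conflicting blocks returned\n");
		return 0;
	}
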
+ */ + +struct page *___toi_get_nonconflicting_page(int can_be_highmem) +{ + struct page *page; + gfp_t flags = TOI_ATOMIC_GFP; + if (can_be_highmem) + flags |= __GFP_HIGHMEM; + + + if (test_toi_state(TOI_LOADING_ALT_IMAGE) && + pageset2_map && + (ptoi_pfn != BM_END_OF_MAP)) { + do { + ptoi_pfn = memory_bm_next_pfn(pageset2_map); + if (ptoi_pfn != BM_END_OF_MAP) { + page = pfn_to_page(ptoi_pfn); + if (!PagePageset1(page) && + (can_be_highmem || !PageHighMem(page))) + return page; + } + } while (ptoi_pfn != BM_END_OF_MAP); + } + + do { + page = toi_alloc_page(29, flags); + if (!page) { + printk(KERN_INFO "Failed to get nonconflicting " + "page.\n"); + return NULL; + } + if (PagePageset1(page)) { + struct page **next = (struct page **) kmap(page); + *next = first_conflicting_page; + first_conflicting_page = page; + kunmap(page); + } + } while (PagePageset1(page)); + + return page; +} + +unsigned long __toi_get_nonconflicting_page(void) +{ + struct page *page = ___toi_get_nonconflicting_page(0); + return page ? (unsigned long) page_address(page) : 0; +} + +static struct pbe *get_next_pbe(struct page **page_ptr, struct pbe *this_pbe, + int highmem) +{ + if (((((unsigned long) this_pbe) & (PAGE_SIZE - 1)) + + 2 * sizeof(struct pbe)) > PAGE_SIZE) { + struct page *new_page = + ___toi_get_nonconflicting_page(highmem); + if (!new_page) + return ERR_PTR(-ENOMEM); + this_pbe = (struct pbe *) kmap(new_page); + memset(this_pbe, 0, PAGE_SIZE); + *page_ptr = new_page; + } else + this_pbe++; + + return this_pbe; +} + +/** + * get_pageset1_load_addresses - generate pbes for conflicting pages + * + * We check here that pagedir & pages it points to won't collide + * with pages where we're going to restore from the loaded pages + * later. + * + * Returns: + * Zero on success, one if couldn't find enough pages (shouldn't + * happen). + **/ +int toi_get_pageset1_load_addresses(void) +{ + int pfn, highallocd = 0, lowallocd = 0; + int low_needed = pagedir1.size - get_highmem_size(pagedir1); + int high_needed = get_highmem_size(pagedir1); + int low_pages_for_highmem = 0; + gfp_t flags = GFP_ATOMIC | __GFP_NOWARN | __GFP_HIGHMEM; + struct page *page, *high_pbe_page = NULL, *last_high_pbe_page = NULL, + *low_pbe_page; + struct pbe **last_high_pbe_ptr = &restore_highmem_pblist, + *this_high_pbe = NULL; + int orig_low_pfn, orig_high_pfn; + int high_pbes_done = 0, low_pbes_done = 0; + int low_direct = 0, high_direct = 0; + int high_to_free, low_to_free, result = 0; + + /* + * We are about to allocate all available memory, and processes + * might not have finished freezing yet. To avoid potential OOMs, + * disable non boot cpus and do this with IRQs disabled + */ + + disable_nonboot_cpus(); + local_irq_disable(); + + /* + * We need to duplicate pageset1's map because memory_bm_next_pfn's + * state gets stomped on by the PagePageset1() test in setup_pbes. + */ + memory_bm_create(&dup_map1, GFP_ATOMIC, 0); + memory_bm_dup(pageset1_map, &dup_map1); + + memory_bm_create(&dup_map2, GFP_ATOMIC, 0); + memory_bm_dup(pageset1_map, &dup_map2); + + memory_bm_position_reset(pageset1_map); + memory_bm_position_reset(&dup_map1); + memory_bm_position_reset(&dup_map2); + + last_low_pbe_ptr = &restore_pblist; + + /* First, allocate pages for the start of our pbe lists. 
*/ + if (high_needed) { + high_pbe_page = ___toi_get_nonconflicting_page(1); + if (!high_pbe_page) { + result = -ENOMEM; + goto out; + } + this_high_pbe = (struct pbe *) kmap(high_pbe_page); + memset(this_high_pbe, 0, PAGE_SIZE); + } + + low_pbe_page = ___toi_get_nonconflicting_page(0); + if (!low_pbe_page) { + result = -ENOMEM; + goto out; + } + this_low_pbe = (struct pbe *) page_address(low_pbe_page); + + /* + * Next, allocate all possible memory to find where we can + * load data directly into destination pages. I'd like to do + * this in bigger chunks, but then we can't free pages + * individually later. + */ + + do { + page = toi_alloc_page(30, flags); + if (page) + SetPagePageset1Copy(page); + } while (page); + + /* + * Find out how many high- and lowmem pages we allocated above, + * and how many pages we can reload directly to their original + * location. + */ + memory_bm_position_reset(pageset1_copy_map); + for (pfn = memory_bm_next_pfn(pageset1_copy_map); pfn != BM_END_OF_MAP; + pfn = memory_bm_next_pfn(pageset1_copy_map)) { + int is_high; + page = pfn_to_page(pfn); + is_high = PageHighMem(page); + + if (PagePageset1(page)) { + if (test_action_state(TOI_NO_DIRECT_LOAD)) { + ClearPagePageset1Copy(page); + toi__free_page(30, page); + continue; + } else { + if (is_high) + high_direct++; + else + low_direct++; + } + } else { + if (is_high) + highallocd++; + else + lowallocd++; + } + } + + high_needed -= high_direct; + low_needed -= low_direct; + + /* + * Do we need to use some lowmem pages for the copies of highmem + * pages? + */ + if (high_needed > highallocd) { + low_pages_for_highmem = high_needed - highallocd; + high_needed -= low_pages_for_highmem; + low_needed += low_pages_for_highmem; + } + + high_to_free = highallocd - high_needed; + low_to_free = lowallocd - low_needed; + + /* + * Now generate our pbes (which will be used for the atomic restore), + * and free unneeded pages. + */ + memory_bm_position_reset(pageset1_copy_map); + for (pfn = memory_bm_next_pfn(pageset1_copy_map); pfn != BM_END_OF_MAP; + pfn = memory_bm_next_pfn(pageset1_copy_map)) { + int is_high; + page = pfn_to_page(pfn); + is_high = PageHighMem(page); + + if (PagePageset1(page)) + continue; + + /* Free the page? */ + if ((is_high && high_to_free) || + (!is_high && low_to_free)) { + ClearPagePageset1Copy(page); + toi__free_page(30, page); + if (is_high) + high_to_free--; + else + low_to_free--; + continue; + } + + /* Nope. We're going to use this page. Add a pbe. 
*/ + if (is_high || low_pages_for_highmem) { + struct page *orig_page; + high_pbes_done++; + if (!is_high) + low_pages_for_highmem--; + do { + orig_high_pfn = memory_bm_next_pfn(&dup_map1); + BUG_ON(orig_high_pfn == BM_END_OF_MAP); + orig_page = pfn_to_page(orig_high_pfn); + } while (!PageHighMem(orig_page) || + load_direct(orig_page)); + + this_high_pbe->orig_address = orig_page; + this_high_pbe->address = page; + this_high_pbe->next = NULL; + if (last_high_pbe_page != high_pbe_page) { + *last_high_pbe_ptr = + (struct pbe *) high_pbe_page; + if (!last_high_pbe_page) + last_high_pbe_page = high_pbe_page; + } else + *last_high_pbe_ptr = this_high_pbe; + last_high_pbe_ptr = &this_high_pbe->next; + if (last_high_pbe_page != high_pbe_page) { + kunmap(last_high_pbe_page); + last_high_pbe_page = high_pbe_page; + } + this_high_pbe = get_next_pbe(&high_pbe_page, + this_high_pbe, 1); + if (IS_ERR(this_high_pbe)) { + printk(KERN_INFO + "This high pbe is an error.\n"); + return -ENOMEM; + } + } else { + struct page *orig_page; + low_pbes_done++; + do { + orig_low_pfn = memory_bm_next_pfn(&dup_map2); + BUG_ON(orig_low_pfn == BM_END_OF_MAP); + orig_page = pfn_to_page(orig_low_pfn); + } while (PageHighMem(orig_page) || + load_direct(orig_page)); + + this_low_pbe->orig_address = page_address(orig_page); + this_low_pbe->address = page_address(page); + this_low_pbe->next = NULL; + *last_low_pbe_ptr = this_low_pbe; + last_low_pbe_ptr = &this_low_pbe->next; + this_low_pbe = get_next_pbe(&low_pbe_page, + this_low_pbe, 0); + if (IS_ERR(this_low_pbe)) { + printk(KERN_INFO "this_low_pbe is an error.\n"); + return -ENOMEM; + } + } + } + + if (high_pbe_page) + kunmap(high_pbe_page); + + if (last_high_pbe_page != high_pbe_page) { + if (last_high_pbe_page) + kunmap(last_high_pbe_page); + toi__free_page(29, high_pbe_page); + } + + free_conflicting_pages(); + +out: + memory_bm_free(&dup_map1, 0); + memory_bm_free(&dup_map2, 0); + + local_irq_enable(); + enable_nonboot_cpus(); + + return result; +} + +int add_boot_kernel_data_pbe(void) +{ + this_low_pbe->address = (char *) __toi_get_nonconflicting_page(); + if (!this_low_pbe->address) { + printk(KERN_INFO "Failed to get bkd atomic restore buffer."); + return -ENOMEM; + } + + toi_bkd.size = sizeof(toi_bkd); + memcpy(this_low_pbe->address, &toi_bkd, sizeof(toi_bkd)); + + *last_low_pbe_ptr = this_low_pbe; + this_low_pbe->orig_address = (char *) boot_kernel_data_buffer; + this_low_pbe->next = NULL; + return 0; +} diff --git a/kernel/power/tuxonice_pagedir.h b/kernel/power/tuxonice_pagedir.h new file mode 100644 index 0000000..9d0d929 --- /dev/null +++ b/kernel/power/tuxonice_pagedir.h @@ -0,0 +1,50 @@ +/* + * kernel/power/tuxonice_pagedir.h + * + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Declarations for routines for handling pagesets. + */ + +#ifndef KERNEL_POWER_PAGEDIR_H +#define KERNEL_POWER_PAGEDIR_H + +/* Pagedir + * + * Contains the metadata for a set of pages saved in the image. 
+ */ + +struct pagedir { + int id; + long size; +#ifdef CONFIG_HIGHMEM + long size_high; +#endif +}; + +#ifdef CONFIG_HIGHMEM +#define get_highmem_size(pagedir) (pagedir.size_high) +#define set_highmem_size(pagedir, sz) do { pagedir.size_high = sz; } while (0) +#define inc_highmem_size(pagedir) do { pagedir.size_high++; } while (0) +#define get_lowmem_size(pagedir) (pagedir.size - pagedir.size_high) +#else +#define get_highmem_size(pagedir) (0) +#define set_highmem_size(pagedir, sz) do { } while (0) +#define inc_highmem_size(pagedir) do { } while (0) +#define get_lowmem_size(pagedir) (pagedir.size) +#endif + +extern struct pagedir pagedir1, pagedir2; + +extern void toi_copy_pageset1(void); + +extern int toi_get_pageset1_load_addresses(void); + +extern unsigned long __toi_get_nonconflicting_page(void); +struct page *___toi_get_nonconflicting_page(int can_be_highmem); + +extern void toi_reset_alt_image_pageset2_pfn(void); +extern int add_boot_kernel_data_pbe(void); +#endif diff --git a/kernel/power/tuxonice_pageflags.c b/kernel/power/tuxonice_pageflags.c new file mode 100644 index 0000000..5310faa --- /dev/null +++ b/kernel/power/tuxonice_pageflags.c @@ -0,0 +1,27 @@ +/* + * kernel/power/tuxonice_pageflags.c + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Routines for serialising and relocating pageflags in which we + * store our image metadata. + */ + +#include +#include "tuxonice_pageflags.h" +#include "power.h" + +int toi_pageflags_space_needed(void) +{ + int total = 0; + struct bm_block *bb; + + total = sizeof(unsigned int); + + list_for_each_entry(bb, &pageset1_map->blocks, hook) + total += 2 * sizeof(unsigned long) + PAGE_SIZE; + + return total; +} diff --git a/kernel/power/tuxonice_pageflags.h b/kernel/power/tuxonice_pageflags.h new file mode 100644 index 0000000..14dc7c5 --- /dev/null +++ b/kernel/power/tuxonice_pageflags.h @@ -0,0 +1,74 @@ +/* + * kernel/power/tuxonice_pageflags.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. 
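+ *
+ * Macros for the memory bitmaps in which we track per-page state
+ * (pageset membership, nosave, free and so on).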
+ */ + +#ifndef KERNEL_POWER_TUXONICE_PAGEFLAGS_H +#define KERNEL_POWER_TUXONICE_PAGEFLAGS_H + +#include "power.h" + +extern struct memory_bitmap *pageset1_map; +extern struct memory_bitmap *pageset1_copy_map; +extern struct memory_bitmap *pageset2_map; +extern struct memory_bitmap *page_resave_map; +extern struct memory_bitmap *io_map; +extern struct memory_bitmap *nosave_map; +extern struct memory_bitmap *free_map; + +#define PagePageset1(page) \ + (memory_bm_test_bit(pageset1_map, page_to_pfn(page))) +#define SetPagePageset1(page) \ + (memory_bm_set_bit(pageset1_map, page_to_pfn(page))) +#define ClearPagePageset1(page) \ + (memory_bm_clear_bit(pageset1_map, page_to_pfn(page))) + +#define PagePageset1Copy(page) \ + (memory_bm_test_bit(pageset1_copy_map, page_to_pfn(page))) +#define SetPagePageset1Copy(page) \ + (memory_bm_set_bit(pageset1_copy_map, page_to_pfn(page))) +#define ClearPagePageset1Copy(page) \ + (memory_bm_clear_bit(pageset1_copy_map, page_to_pfn(page))) + +#define PagePageset2(page) \ + (memory_bm_test_bit(pageset2_map, page_to_pfn(page))) +#define SetPagePageset2(page) \ + (memory_bm_set_bit(pageset2_map, page_to_pfn(page))) +#define ClearPagePageset2(page) \ + (memory_bm_clear_bit(pageset2_map, page_to_pfn(page))) + +#define PageWasRW(page) \ + (memory_bm_test_bit(pageset2_map, page_to_pfn(page))) +#define SetPageWasRW(page) \ + (memory_bm_set_bit(pageset2_map, page_to_pfn(page))) +#define ClearPageWasRW(page) \ + (memory_bm_clear_bit(pageset2_map, page_to_pfn(page))) + +#define PageResave(page) (page_resave_map ? \ + memory_bm_test_bit(page_resave_map, page_to_pfn(page)) : 0) +#define SetPageResave(page) \ + (memory_bm_set_bit(page_resave_map, page_to_pfn(page))) +#define ClearPageResave(page) \ + (memory_bm_clear_bit(page_resave_map, page_to_pfn(page))) + +#define PageNosave(page) (nosave_map ? \ + memory_bm_test_bit(nosave_map, page_to_pfn(page)) : 0) +#define SetPageNosave(page) \ + (memory_bm_set_bit(nosave_map, page_to_pfn(page))) +#define ClearPageNosave(page) \ + (memory_bm_clear_bit(nosave_map, page_to_pfn(page))) + +#define PageNosaveFree(page) (free_map ? \ + memory_bm_test_bit(free_map, page_to_pfn(page)) : 0) +#define SetPageNosaveFree(page) \ + (memory_bm_set_bit(free_map, page_to_pfn(page))) +#define ClearPageNosaveFree(page) \ + (memory_bm_clear_bit(free_map, page_to_pfn(page))) + +extern void save_pageflags(struct memory_bitmap *pagemap); +extern int load_pageflags(struct memory_bitmap *pagemap); +extern int toi_pageflags_space_needed(void); +#endif diff --git a/kernel/power/tuxonice_power_off.c b/kernel/power/tuxonice_power_off.c new file mode 100644 index 0000000..2edd7fc --- /dev/null +++ b/kernel/power/tuxonice_power_off.c @@ -0,0 +1,282 @@ +/* + * kernel/power/tuxonice_power_off.c + * + * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Support for powering down. 
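+ *
+ * Method 0 is a plain kernel power off, 3 suspends to RAM instead (so a
+ * later wakeup resumes quickly), 4 uses the platform (ACPI) method and 5
+ * is retained for historical reasons only; any other value falls back to
+ * a normal power off. See __toi_power_down() below.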
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_power_off.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" + +unsigned long toi_poweroff_method; /* 0 - Kernel power off */ +EXPORT_SYMBOL_GPL(toi_poweroff_method); + +static int wake_delay; +static char lid_state_file[256], wake_alarm_dir[256]; +static struct file *lid_file, *alarm_file, *epoch_file; +static int post_wake_state = -1; + +static int did_suspend_to_both; + +/* + * __toi_power_down + * Functionality : Powers down or reboots the computer once the image + * has been written to disk. + * Key Assumptions : Able to reboot/power down via code called or that + * the warning emitted if the calls fail will be visible + * to the user (ie printk resumes devices). + */ + +static void __toi_power_down(int method) +{ + int error; + + toi_cond_pause(1, test_action_state(TOI_REBOOT) ? "Ready to reboot." : + "Powering down."); + + if (test_result_state(TOI_ABORTED)) + goto out; + + if (test_action_state(TOI_REBOOT)) + kernel_restart(NULL); + + switch (method) { + case 0: + break; + case 3: + /* + * Re-read the overwritten part of pageset2 to make post-resume + * faster. + */ + if (read_pageset2(1)) + panic("Attempt to reload pagedir 2 failed. " + "Try rebooting."); + + error = pm_notifier_call_chain(PM_SUSPEND_PREPARE); + if (!error) { + error = suspend_devices_and_enter(PM_SUSPEND_MEM); + if (!error) + did_suspend_to_both = 1; + } + pm_notifier_call_chain(PM_POST_SUSPEND); + + /* Success - we're now post-resume-from-ram */ + if (did_suspend_to_both) + return; + + /* Failed to suspend to ram - do normal power off */ + break; + case 4: + /* + * If succeeds, doesn't return. If fails, do a simple + * powerdown. + */ + hibernation_platform_enter(); + break; + case 5: + /* Historic entry only now */ + break; + } + + if (method && method != 5) + toi_cond_pause(1, + "Falling back to alternate power off method."); + + if (test_result_state(TOI_ABORTED)) + goto out; + + kernel_power_off(); + kernel_halt(); + toi_cond_pause(1, "Powerdown failed."); + while (1) + cpu_relax(); + +out: + if (read_pageset2(1)) + panic("Attempt to reload pagedir 2 failed. 
Try rebooting."); + return; +} + +#define CLOSE_FILE(file) \ + if (file) { \ + filp_close(file, NULL); file = NULL; \ + } + +static void powerdown_cleanup(int toi_or_resume) +{ + if (!toi_or_resume) + return; + + CLOSE_FILE(lid_file); + CLOSE_FILE(alarm_file); + CLOSE_FILE(epoch_file); +} + +static void open_file(char *format, char *arg, struct file **var, int mode, + char *desc) +{ + char buf[256]; + + if (strlen(arg)) { + sprintf(buf, format, arg); + *var = filp_open(buf, mode, 0); + if (IS_ERR(*var) || !*var) { + printk(KERN_INFO "Failed to open %s file '%s' (%p).\n", + desc, buf, *var); + *var = NULL; + } + } +} + +static int powerdown_init(int toi_or_resume) +{ + if (!toi_or_resume) + return 0; + + did_suspend_to_both = 0; + + open_file("/proc/acpi/button/%s/state", lid_state_file, &lid_file, + O_RDONLY, "lid"); + + if (strlen(wake_alarm_dir)) { + open_file("/sys/class/rtc/%s/wakealarm", wake_alarm_dir, + &alarm_file, O_WRONLY, "alarm"); + + open_file("/sys/class/rtc/%s/since_epoch", wake_alarm_dir, + &epoch_file, O_RDONLY, "epoch"); + } + + return 0; +} + +static int lid_closed(void) +{ + char array[25]; + ssize_t size; + loff_t pos = 0; + + if (!lid_file) + return 0; + + size = vfs_read(lid_file, (char __user *) array, 25, &pos); + if ((int) size < 1) { + printk(KERN_INFO "Failed to read lid state file (%d).\n", + (int) size); + return 0; + } + + if (!strcmp(array, "state: closed\n")) + return 1; + + return 0; +} + +static void write_alarm_file(int value) +{ + ssize_t size; + char buf[40]; + loff_t pos = 0; + + if (!alarm_file) + return; + + sprintf(buf, "%d\n", value); + + size = vfs_write(alarm_file, (char __user *)buf, strlen(buf), &pos); + + if (size < 0) + printk(KERN_INFO "Error %d writing alarm value %s.\n", + (int) size, buf); +} + +/** + * toi_check_resleep: See whether to powerdown again after waking. + * + * After waking, check whether we should powerdown again in a (usually + * different) way. We only do this if the lid switch is still closed. + */ +void toi_check_resleep(void) +{ + /* We only return if we suspended to ram and woke. */ + if (lid_closed() && post_wake_state >= 0) + __toi_power_down(post_wake_state); +} + +void toi_power_down(void) +{ + if (alarm_file && wake_delay) { + char array[25]; + loff_t pos = 0; + size_t size = vfs_read(epoch_file, (char __user *) array, 25, + &pos); + + if (((int) size) < 1) + printk(KERN_INFO "Failed to read epoch file (%d).\n", + (int) size); + else { + unsigned long since_epoch = + simple_strtol(array, NULL, 0); + + /* Clear any wakeup time. */ + write_alarm_file(0); + + /* Set new wakeup time. 
*/
+			write_alarm_file(since_epoch + wake_delay);
+		}
+	}
+
+	__toi_power_down(toi_poweroff_method);
+
+	toi_check_resleep();
+}
+EXPORT_SYMBOL_GPL(toi_power_down);
+
+static struct toi_sysfs_data sysfs_params[] = {
+#if defined(CONFIG_ACPI)
+	SYSFS_STRING("lid_file", SYSFS_RW, lid_state_file, 256, 0, NULL),
+	SYSFS_INT("wake_delay", SYSFS_RW, &wake_delay, 0, INT_MAX, 0, NULL),
+	SYSFS_STRING("wake_alarm_dir", SYSFS_RW, wake_alarm_dir, 256, 0, NULL),
+	SYSFS_INT("post_wake_state", SYSFS_RW, &post_wake_state, -1, 5, 0,
+			NULL),
+	SYSFS_UL("powerdown_method", SYSFS_RW, &toi_poweroff_method, 0, 5, 0),
+	SYSFS_INT("did_suspend_to_both", SYSFS_READONLY, &did_suspend_to_both,
+			0, 0, 0, NULL)
+#endif
+};
+
+static struct toi_module_ops powerdown_ops = {
+	.type = MISC_HIDDEN_MODULE,
+	.name = "poweroff",
+	.initialise = powerdown_init,
+	.cleanup = powerdown_cleanup,
+	.directory = "[ROOT]",
+	.module = THIS_MODULE,
+	.sysfs_data = sysfs_params,
+	.num_sysfs_entries = sizeof(sysfs_params) /
+		sizeof(struct toi_sysfs_data),
+};
+
+int toi_poweroff_init(void)
+{
+	return toi_register_module(&powerdown_ops);
+}
+
+void toi_poweroff_exit(void)
+{
+	toi_unregister_module(&powerdown_ops);
+}
diff --git a/kernel/power/tuxonice_power_off.h b/kernel/power/tuxonice_power_off.h
new file mode 100644
index 0000000..a85633a
--- /dev/null
+++ b/kernel/power/tuxonice_power_off.h
@@ -0,0 +1,24 @@
+/*
+ * kernel/power/tuxonice_power_off.h
+ *
+ * Copyright (C) 2006-2008 Nigel Cunningham (nigel at tuxonice net)
+ *
+ * This file is released under the GPLv2.
+ *
+ * Support for powering down.
+ */
+
+int toi_pm_state_finish(void);
+void toi_power_down(void);
+extern unsigned long toi_poweroff_method;
+int toi_poweroff_init(void);
+void toi_poweroff_exit(void);
+void toi_check_resleep(void);
+
+extern int platform_begin(int platform_mode);
+extern int platform_pre_snapshot(int platform_mode);
+extern void platform_leave(int platform_mode);
+extern void platform_end(int platform_mode);
+extern void platform_finish(int platform_mode);
+extern int platform_pre_restore(int platform_mode);
+extern void platform_restore_cleanup(int platform_mode);
diff --git a/kernel/power/tuxonice_prepare_image.c b/kernel/power/tuxonice_prepare_image.c
new file mode 100644
index 0000000..1ee506a
--- /dev/null
+++ b/kernel/power/tuxonice_prepare_image.c
@@ -0,0 +1,1056 @@
+/*
+ * kernel/power/tuxonice_prepare_image.c
+ *
+ * Copyright (C) 2003-2008 Nigel Cunningham (nigel at tuxonice net)
+ *
+ * This file is released under the GPLv2.
+ *
+ * We need to eat memory until we can:
+ * 1. Perform the save without changing anything (RAM_NEEDED < #pages)
+ * 2. Fit it all in available space (toiActiveAllocator->available_space() >=
+ *    main_storage_needed())
+ * 3. Reload the pagedir and pageset1 to places that don't collide with their
+ *    final destinations, not knowing to what extent the resumed kernel will
+ *    overlap with the one loaded at boot time. I think the resumed kernel
+ *    should overlap completely, but I don't want to rely on this as it is
+ *    an unproven assumption. We therefore assume there will be no overlap at
+ *    all (worst case).
+ * 4. Meet the user's requested limit (if any) on the size of the image.
+ *    The limit is in MB, so pages/256 (assuming 4K pages).
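+ *    For example, a 500MB limit corresponds to 500 << 8 = 128000 4K pages;
+ *    see the image_size_limit handling in any_to_free().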
+ * + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice_pageflags.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_extent.h" +#include "tuxonice_prepare_image.h" +#include "tuxonice_block_io.h" +#include "tuxonice.h" +#include "tuxonice_checksum.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_alloc.h" +#include "tuxonice_atomic_copy.h" + +static long num_nosave, main_storage_allocated, storage_available, + header_storage_needed; +long extra_pd1_pages_allowance = CONFIG_TOI_DEFAULT_EXTRA_PAGES_ALLOWANCE; +int image_size_limit; +static int no_ps2_needed; + +struct attention_list { + struct task_struct *task; + struct attention_list *next; +}; + +static struct attention_list *attention_list; + +#define PAGESET1 0 +#define PAGESET2 1 + +void free_attention_list(void) +{ + struct attention_list *last = NULL; + + while (attention_list) { + last = attention_list; + attention_list = attention_list->next; + toi_kfree(6, last); + } +} + +static int build_attention_list(void) +{ + int i, task_count = 0; + struct task_struct *p; + struct attention_list *next; + + /* + * Count all userspace process (with task->mm) marked PF_NOFREEZE. + */ + read_lock(&tasklist_lock); + for_each_process(p) + if ((p->flags & PF_NOFREEZE) || p == current) + task_count++; + read_unlock(&tasklist_lock); + + /* + * Allocate attention list structs. + */ + for (i = 0; i < task_count; i++) { + struct attention_list *this = + toi_kzalloc(6, sizeof(struct attention_list), + TOI_WAIT_GFP); + if (!this) { + printk(KERN_INFO "Failed to allocate slab for " + "attention list.\n"); + free_attention_list(); + return 1; + } + this->next = NULL; + if (attention_list) + this->next = attention_list; + attention_list = this; + } + + next = attention_list; + read_lock(&tasklist_lock); + for_each_process(p) + if ((p->flags & PF_NOFREEZE) || p == current) { + next->task = p; + next = next->next; + } + read_unlock(&tasklist_lock); + return 0; +} + +static void pageset2_full(void) +{ + struct zone *zone; + struct page *page; + unsigned long flags; + int i; + + for_each_zone(zone) { + spin_lock_irqsave(&zone->lru_lock, flags); + for_each_lru(i) { + if (!zone_page_state(zone, NR_LRU_BASE + i)) + continue; + + list_for_each_entry(page, &zone->lru[i].list, lru) { + struct address_space *mapping; + + mapping = page_mapping(page); + if (!mapping || !mapping->host || + !(mapping->host->i_flags & S_ATOMIC_COPY)) + SetPagePageset2(page); + } + } + spin_unlock_irqrestore(&zone->lru_lock, flags); + } +} + +/* + * toi_mark_task_as_pageset + * Functionality : Marks all the saveable pages belonging to a given process + * as belonging to a particular pageset. 
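+ *                  We walk the task's mm, skipping VM_SPECIAL vmas and
+ *                  any page whose mapping is flagged S_ATOMIC_COPY.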
+ */ + +static void toi_mark_task_as_pageset(struct task_struct *t, int pageset2) +{ + struct vm_area_struct *vma; + struct mm_struct *mm; + + mm = t->active_mm; + + if (!mm || !mm->mmap) + return; + + if (!irqs_disabled()) + down_read(&mm->mmap_sem); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long posn; + + if (!vma->vm_start || vma->vm_flags & VM_SPECIAL) + continue; + + for (posn = vma->vm_start; posn < vma->vm_end; + posn += PAGE_SIZE) { + struct page *page = follow_page(vma, posn, 0); + struct address_space *mapping; + + if (!page || !pfn_valid(page_to_pfn(page))) + continue; + + mapping = page_mapping(page); + if (mapping && mapping->host && + mapping->host->i_flags & S_ATOMIC_COPY) + continue; + + if (pageset2) + SetPagePageset2(page); + else { + ClearPagePageset2(page); + SetPagePageset1(page); + } + } + } + + if (!irqs_disabled()) + up_read(&mm->mmap_sem); +} + +static void mark_tasks(int pageset) +{ + struct task_struct *p; + + read_lock(&tasklist_lock); + for_each_process(p) { + if (!p->mm) + continue; + + if (p->flags & PF_KTHREAD) + continue; + + toi_mark_task_as_pageset(p, pageset); + } + read_unlock(&tasklist_lock); + +} + +/* mark_pages_for_pageset2 + * + * Description: Mark unshared pages in processes not needed for hibernate as + * being able to be written out in a separate pagedir. + * HighMem pages are simply marked as pageset2. They won't be + * needed during hibernate. + */ + +static void toi_mark_pages_for_pageset2(void) +{ + struct attention_list *this = attention_list; + + memory_bm_clear(pageset2_map); + + if (test_action_state(TOI_NO_PAGESET2) || no_ps2_needed) + return; + + if (test_action_state(TOI_PAGESET2_FULL)) + pageset2_full(); + else + mark_tasks(PAGESET2); + + /* + * Because the tasks in attention_list are ones related to hibernating, + * we know that they won't go away under us. + */ + + while (this) { + if (!test_result_state(TOI_ABORTED)) + toi_mark_task_as_pageset(this->task, PAGESET1); + this = this->next; + } +} + +/* + * The atomic copy of pageset1 is stored in pageset2 pages. + * But if pageset1 is larger (normally only just after boot), + * we need to allocate extra pages to store the atomic copy. + * The following data struct and functions are used to handle + * the allocation and freeing of that memory. + */ + +static long extra_pages_allocated; + +struct extras { + struct page *page; + int order; + struct extras *next; +}; + +static struct extras *extras_list; + +/* toi_free_extra_pagedir_memory + * + * Description: Free previously allocated extra pagedir memory. + */ +void toi_free_extra_pagedir_memory(void) +{ + /* Free allocated pages */ + while (extras_list) { + struct extras *this = extras_list; + int i; + + extras_list = this->next; + + for (i = 0; i < (1 << this->order); i++) + ClearPageNosave(this->page + i); + + toi_free_pages(9, this->page, this->order); + toi_kfree(7, this); + } + + extra_pages_allocated = 0; +} + +/* toi_allocate_extra_pagedir_memory + * + * Description: Allocate memory for making the atomic copy of pagedir1 in the + * case where it is bigger than pagedir2. + * Arguments: int num_to_alloc: Number of extra pages needed. + * Result: int. Number of extra pages we now have allocated. 
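+ *
+ * Allocations are attempted in the largest power-of-two chunks that do
+ * not exceed what is still needed, falling back to smaller orders when
+ * an allocation fails.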
+ */
+static int toi_allocate_extra_pagedir_memory(int extra_pages_needed)
+{
+	int j, order, num_to_alloc = extra_pages_needed - extra_pages_allocated;
+	gfp_t flags = TOI_ATOMIC_GFP;
+
+	if (num_to_alloc < 1)
+		return 0;
+
+	order = fls(num_to_alloc);
+	if (order >= MAX_ORDER)
+		order = MAX_ORDER - 1;
+
+	while (num_to_alloc) {
+		struct page *newpage;
+		unsigned long virt;
+		struct extras *extras_entry;
+
+		while ((1 << order) > num_to_alloc)
+			order--;
+
+		extras_entry = (struct extras *) toi_kzalloc(7,
+			sizeof(struct extras), TOI_ATOMIC_GFP);
+
+		if (!extras_entry)
+			return extra_pages_allocated;
+
+		virt = toi_get_free_pages(9, flags, order);
+		while (!virt && order) {
+			order--;
+			virt = toi_get_free_pages(9, flags, order);
+		}
+
+		if (!virt) {
+			toi_kfree(7, extras_entry);
+			return extra_pages_allocated;
+		}
+
+		newpage = virt_to_page(virt);
+
+		extras_entry->page = newpage;
+		extras_entry->order = order;
+		extras_entry->next = NULL;
+
+		if (extras_list)
+			extras_entry->next = extras_list;
+
+		extras_list = extras_entry;
+
+		for (j = 0; j < (1 << order); j++) {
+			SetPageNosave(newpage + j);
+			SetPagePageset1Copy(newpage + j);
+		}
+
+		extra_pages_allocated += (1 << order);
+		num_to_alloc -= (1 << order);
+	}
+
+	return extra_pages_allocated;
+}
+
+/*
+ * real_nr_free_pages: Count free pages, including those on the per-cpu
+ * (pcp) lists, in the zones selected by zone_idx_mask (all_zones_mask
+ * selects every zone).
+ */
+long real_nr_free_pages(unsigned long zone_idx_mask)
+{
+	struct zone *zone;
+	int result = 0, cpu;
+
+	/* PCP lists */
+	for_each_zone(zone) {
+		if (!populated_zone(zone))
+			continue;
+
+		if (!(zone_idx_mask & (1 << zone_idx(zone))))
+			continue;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+			struct per_cpu_pages *pcp = &pset->pcp;
+			result += pcp->count;
+		}
+
+		result += zone_page_state(zone, NR_FREE_PAGES);
+	}
+	return result;
+}
+EXPORT_SYMBOL_GPL(real_nr_free_pages);
+
+/*
+ * Discover how much extra memory will be required by the drivers
+ * when they're asked to hibernate. We can then ensure that amount
+ * of memory is available when we really want it.
+ */
+static void get_extra_pd1_allowance(void)
+{
+	long orig_num_free = real_nr_free_pages(all_zones_mask), final;
+
+	toi_prepare_status(CLEAR_BAR, "Finding allowance for drivers.");
+
+	if (!toi_go_atomic(PMSG_FREEZE, 1)) {
+		final = real_nr_free_pages(all_zones_mask);
+		toi_end_atomic(ATOMIC_ALL_STEPS, 1, 0);
+
+		extra_pd1_pages_allowance = max(
+			orig_num_free - final + MIN_EXTRA_PAGES_ALLOWANCE,
+			(long) MIN_EXTRA_PAGES_ALLOWANCE);
+	}
+}
+
+/*
+ * Amount of storage needed, possibly taking into account the
+ * expected compression ratio and possibly also ignoring our
+ * allowance for extra pages.
+ */
+static long main_storage_needed(int use_ecr,
+		int ignore_extra_pd1_allow)
+{
+	return (pagedir1.size + pagedir2.size +
+		(ignore_extra_pd1_allow ? 0 : extra_pd1_pages_allowance)) *
+		(use_ecr ? toi_expected_compression_ratio() : 100) / 100;
+}
+
+/*
+ * Storage needed for the image header. The total is computed in bytes
+ * and converted to pages for the return.
+ */
+long get_header_storage_needed(void)
+{
+	long bytes = (int) sizeof(struct toi_header) +
+			toi_header_storage_for_modules() +
+			toi_pageflags_space_needed();
+
+	return DIV_ROUND_UP(bytes, PAGE_SIZE);
+}
+
+/*
+ * When freeing memory, pages from either pageset might be freed.
+ *
+ * When seeking to free memory to be able to hibernate, for every ps1 page
+ * freed, we need two fewer pages for the atomic copy because there is one
+ * less page to copy and one more page into which data can be copied.
+ *
+ * Freeing ps2 pages saves us nothing directly. No more memory is available
+ * for the atomic copy. Indirectly, a ps1 page might be freed (slab?), but
+ * that's too much work to figure out.
+ *
+ * => ps1_to_free functions
+ *
+ * Of course if we just want to reduce the image size, because of storage
+ * limitations or an image size limit, either pageset will do.
+ *
+ * => any_to_free function
+ */
+
+static long highpages_ps1_to_free(void)
+{
+	return max_t(long, 0, DIV_ROUND_UP(get_highmem_size(pagedir1) -
+		get_highmem_size(pagedir2), 2) - real_nr_free_high_pages());
+}
+
+static long lowpages_ps1_to_free(void)
+{
+	return max_t(long, 0, DIV_ROUND_UP(get_lowmem_size(pagedir1) +
+		extra_pd1_pages_allowance + MIN_FREE_RAM +
+		toi_memory_for_modules(0) - get_lowmem_size(pagedir2) -
+		real_nr_free_low_pages() - extra_pages_allocated, 2));
+}
+
+static long current_image_size(void)
+{
+	return pagedir1.size + pagedir2.size + header_storage_needed;
+}
+
+static long storage_still_required(void)
+{
+	return max_t(long, 0, main_storage_needed(1, 1) - storage_available);
+}
+
+static long ram_still_required(void)
+{
+	return max_t(long, 0, MIN_FREE_RAM + toi_memory_for_modules(0) -
+		real_nr_free_low_pages() + 2 * extra_pd1_pages_allowance);
+}
+
+static long any_to_free(int use_image_size_limit)
+{
+	long user_limit = (use_image_size_limit && image_size_limit > 0) ?
+			max_t(long, 0, current_image_size() -
+					(image_size_limit << 8)) : 0,
+		storage_limit = storage_still_required(),
+		ram_limit = ram_still_required(),
+		first_max = max(user_limit, storage_limit);
+
+	return max(first_max, ram_limit);
+}
+
+static int need_pageset2(void)
+{
+	return (real_nr_free_low_pages() + extra_pages_allocated -
+		2 * extra_pd1_pages_allowance - MIN_FREE_RAM -
+		toi_memory_for_modules(0) - pagedir1.size) < pagedir2.size;
+}
+
+/* amount_needed
+ *
+ * Calculates the amount by which the image size needs to be reduced to
+ * meet our constraints.
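+ *
+ * For example (illustrative numbers only): if 300 highmem and 700 lowmem
+ * pageset1 pages must be freed and any_to_free() reports 800, we return
+ * max(300 + 700, 800) = 1000.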
+ */ +static long amount_needed(int use_image_size_limit) +{ + return max(highpages_ps1_to_free() + lowpages_ps1_to_free(), + any_to_free(use_image_size_limit)); +} + +static long image_not_ready(int use_image_size_limit) +{ + toi_message(TOI_EAT_MEMORY, TOI_LOW, 1, + "Amount still needed (%ld) > 0:%d," + " Storage allocd: %ld < %ld: %d.\n", + amount_needed(use_image_size_limit), + (amount_needed(use_image_size_limit) > 0), + main_storage_allocated, + main_storage_needed(1, 1), + main_storage_allocated < main_storage_needed(1, 1)); + + toi_cond_pause(0, NULL); + + return (amount_needed(use_image_size_limit) > 0) || + main_storage_allocated < main_storage_needed(1, 1); +} + +static void display_failure_reason(int tries_exceeded) +{ + long storage_required = storage_still_required(), + ram_required = ram_still_required(), + high_ps1 = highpages_ps1_to_free(), + low_ps1 = lowpages_ps1_to_free(); + + printk(KERN_INFO "Failed to prepare the image because...\n"); + + if (!storage_available) { + printk(KERN_INFO "- You need some storage available to be " + "able to hibernate.\n"); + return; + } + + if (tries_exceeded) + printk(KERN_INFO "- The maximum number of iterations was " + "reached without successfully preparing the " + "image.\n"); + + if (storage_required) { + printk(KERN_INFO " - We need at least %ld pages of storage " + "(ignoring the header), but only have %ld.\n", + main_storage_needed(1, 1), + main_storage_allocated); + set_abort_result(TOI_INSUFFICIENT_STORAGE); + } + + if (ram_required) { + printk(KERN_INFO " - We need %ld more free pages of low " + "memory.\n", ram_required); + printk(KERN_INFO " Minimum free : %8d\n", MIN_FREE_RAM); + printk(KERN_INFO " + Reqd. by modules : %8ld\n", + toi_memory_for_modules(0)); + printk(KERN_INFO " + 2 * extra allow : %8ld\n", + 2 * extra_pd1_pages_allowance); + printk(KERN_INFO " - Currently free : %8ld\n", + real_nr_free_low_pages()); + printk(KERN_INFO " : ========\n"); + printk(KERN_INFO " Still needed : %8ld\n", + ram_required); + + /* Print breakdown of memory needed for modules */ + toi_memory_for_modules(1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } + + if (high_ps1) { + printk(KERN_INFO "- We need to free %ld highmem pageset 1 " + "pages.\n", high_ps1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } + + if (low_ps1) { + printk(KERN_INFO " - We need to free %ld lowmem pageset 1 " + "pages.\n", low_ps1); + set_abort_result(TOI_UNABLE_TO_FREE_ENOUGH_MEMORY); + } +} + +static void display_stats(int always, int sub_extra_pd1_allow) +{ + char buffer[255]; + snprintf(buffer, 254, + "Free:%ld(%ld). Sets:%ld(%ld),%ld(%ld). " + "Nosave:%ld-%ld=%ld. Storage:%lu/%lu(%lu=>%lu). " + "Needed:%ld,%ld,%ld(%d,%ld,%ld,%ld) (PS2:%s)\n", + + /* Free */ + real_nr_free_pages(all_zones_mask), + real_nr_free_low_pages(), + + /* Sets */ + pagedir1.size, pagedir1.size - get_highmem_size(pagedir1), + pagedir2.size, pagedir2.size - get_highmem_size(pagedir2), + + /* Nosave */ + num_nosave, extra_pages_allocated, + num_nosave - extra_pages_allocated, + + /* Storage */ + main_storage_allocated, + storage_available, + main_storage_needed(1, sub_extra_pd1_allow), + main_storage_needed(1, 1), + + /* Needed */ + lowpages_ps1_to_free(), highpages_ps1_to_free(), + any_to_free(1), + MIN_FREE_RAM, toi_memory_for_modules(0), + extra_pd1_pages_allowance, ((long) image_size_limit) << 8, + + need_pageset2() ? 
"yes" : "no"); + + if (always) + printk("%s", buffer); + else + toi_message(TOI_EAT_MEMORY, TOI_MEDIUM, 1, buffer); +} + +/* generate_free_page_map + * + * Description: This routine generates a bitmap of free pages from the + * lists used by the memory manager. We then use the bitmap + * to quickly calculate which pages to save and in which + * pagesets. + */ +static void generate_free_page_map(void) +{ + int order, pfn, cpu, t; + unsigned long flags, i; + struct zone *zone; + struct list_head *curr; + + for_each_zone(zone) { + if (!populated_zone(zone)) + continue; + + spin_lock_irqsave(&zone->lock, flags); + + for (i = 0; i < zone->spanned_pages; i++) + ClearPageNosaveFree(pfn_to_page( + ZONE_START(zone) + i)); + + for_each_migratetype_order(order, t) { + list_for_each(curr, + &zone->free_area[order].free_list[t]) { + unsigned long j; + + pfn = page_to_pfn(list_entry(curr, struct page, + lru)); + for (j = 0; j < (1UL << order); j++) + SetPageNosaveFree(pfn_to_page(pfn + j)); + } + } + + for_each_online_cpu(cpu) { + struct per_cpu_pageset *pset = zone_pcp(zone, cpu); + struct per_cpu_pages *pcp = &pset->pcp; + struct page *page; + + list_for_each_entry(page, &pcp->list, lru) + SetPageNosaveFree(page); + } + + spin_unlock_irqrestore(&zone->lock, flags); + } +} + +/* size_of_free_region + * + * Description: Return the number of pages that are free, beginning with and + * including this one. + */ +static int size_of_free_region(struct zone *zone, unsigned long start_pfn) +{ + unsigned long this_pfn = start_pfn, + end_pfn = ZONE_START(zone) + zone->spanned_pages - 1; + + while (this_pfn <= end_pfn && PageNosaveFree(pfn_to_page(this_pfn))) + this_pfn++; + + return this_pfn - start_pfn; +} + +/* flag_image_pages + * + * This routine generates our lists of pages to be stored in each + * pageset. Since we store the data using extents, and adding new + * extents might allocate a new extent page, this routine may well + * be called more than once. + */ +static void flag_image_pages(int atomic_copy) +{ + int num_free = 0; + unsigned long loop; + struct zone *zone; + + pagedir1.size = 0; + pagedir2.size = 0; + + set_highmem_size(pagedir1, 0); + set_highmem_size(pagedir2, 0); + + num_nosave = 0; + + memory_bm_clear(pageset1_map); + + generate_free_page_map(); + + /* + * Pages not to be saved are marked Nosave irrespective of being + * reserved. + */ + for_each_zone(zone) { + int highmem = is_highmem(zone); + + if (!populated_zone(zone)) + continue; + + for (loop = 0; loop < zone->spanned_pages; loop++) { + unsigned long pfn = ZONE_START(zone) + loop; + struct page *page; + int chunk_size; + + if (!pfn_valid(pfn)) + continue; + + chunk_size = size_of_free_region(zone, pfn); + if (chunk_size) { + num_free += chunk_size; + loop += chunk_size - 1; + continue; + } + + page = pfn_to_page(pfn); + + if (PageNosave(page)) { + num_nosave++; + continue; + } + + page = highmem ? 
saveable_highmem_page(zone, pfn) :
+				saveable_page(zone, pfn);
+
+			if (!page) {
+				num_nosave++;
+				continue;
+			}
+
+			if (PagePageset2(page)) {
+				pagedir2.size++;
+				if (PageHighMem(page))
+					inc_highmem_size(pagedir2);
+				else
+					SetPagePageset1Copy(page);
+				if (PageResave(page)) {
+					SetPagePageset1(page);
+					ClearPagePageset1Copy(page);
+					pagedir1.size++;
+					if (PageHighMem(page))
+						inc_highmem_size(pagedir1);
+				}
+			} else {
+				pagedir1.size++;
+				SetPagePageset1(page);
+				if (PageHighMem(page))
+					inc_highmem_size(pagedir1);
+			}
+		}
+	}
+
+	if (!atomic_copy)
+		toi_message(TOI_EAT_MEMORY, TOI_MEDIUM, 0,
+			"Count data pages: Set1 (%d) + Set2 (%d) + Nosave (%ld)"
+			" + NumFree (%d) = %d.\n",
+			pagedir1.size, pagedir2.size, num_nosave, num_free,
+			pagedir1.size + pagedir2.size + num_nosave + num_free);
+}
+
+void toi_recalculate_image_contents(int atomic_copy)
+{
+	memory_bm_clear(pageset1_map);
+	if (!atomic_copy) {
+		unsigned long pfn;
+		memory_bm_position_reset(pageset2_map);
+		for (pfn = memory_bm_next_pfn(pageset2_map);
+				pfn != BM_END_OF_MAP;
+				pfn = memory_bm_next_pfn(pageset2_map))
+			ClearPagePageset1Copy(pfn_to_page(pfn));
+		/* Need to call this before getting pageset1_size! */
+		toi_mark_pages_for_pageset2();
+	}
+	flag_image_pages(atomic_copy);
+
+	if (!atomic_copy) {
+		storage_available = toiActiveAllocator->storage_available();
+		display_stats(0, 0);
+	}
+}
+
+/* update_image
+ *
+ * Allocate [more] memory and storage for the image.
+ */
+static void update_image(int ps2_recalc)
+{
+	int wanted, got, old_header_req;
+	long seek;
+
+	toi_recalculate_image_contents(0);
+
+	/* Include allowance for growth in pagedir1 while writing pagedir 2 */
+	wanted = pagedir1.size + extra_pd1_pages_allowance -
+		get_lowmem_size(pagedir2);
+	if (wanted > extra_pages_allocated) {
+		got = toi_allocate_extra_pagedir_memory(wanted);
+		if (got < wanted) {
+			toi_message(TOI_EAT_MEMORY, TOI_LOW, 1,
+				"Want %d extra pages for pageset1, got %d.\n",
+				wanted, got);
+			return;
+		}
+	}
+
+	if (ps2_recalc)
+		goto recalc;
+
+	thaw_kernel_threads();
+
+	/*
+	 * Allocate remaining storage space, if possible, up to the
+	 * maximum we know we'll need. It's okay to allocate the
+	 * maximum if the writer is the swapwriter, but
+	 * we don't want to grab all available space on an NFS share.
+	 * We therefore ignore the expected compression ratio here,
+	 * thereby trying to allocate the maximum image size we could
+	 * need (assuming compression doesn't expand the image), but
+	 * don't complain if we can't get the full amount we're after.
+	 */
+
+	do {
+		old_header_req = header_storage_needed;
+		toiActiveAllocator->reserve_header_space(header_storage_needed);
+
+		/* How much storage is free with the reservation applied? */
+		storage_available = toiActiveAllocator->storage_available();
+		seek = min(storage_available, main_storage_needed(0, 0));
+
+		toiActiveAllocator->allocate_storage(seek);
+
+		main_storage_allocated =
+			toiActiveAllocator->storage_allocated();
+
+		/* Need more header because more storage allocated? */
+		header_storage_needed = get_header_storage_needed();
+
+	} while (header_storage_needed > old_header_req);
+
+	if (freeze_processes())
+		set_abort_result(TOI_FREEZING_FAILED);
+
+recalc:
+	toi_recalculate_image_contents(0);
+}
+
+/* attempt_to_freeze
+ *
+ * Try to freeze processes.
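+ *
+ * Everything is thawed first so that the freeze below starts from a
+ * clean, known state.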
+ */
+
+static int attempt_to_freeze(void)
+{
+	int result;
+
+	/* Stop processes before checking again */
+	thaw_processes();
+	toi_prepare_status(CLEAR_BAR, "Freezing processes & syncing "
+			"filesystems.");
+	result = freeze_processes();
+
+	if (result)
+		set_abort_result(TOI_FREEZING_FAILED);
+
+	return result;
+}
+
+/* eat_memory
+ *
+ * Try to free some memory, either to meet hard or soft constraints on
+ * the image characteristics.
+ *
+ * Hard constraints:
+ * - Pageset1 must be < half of memory;
+ * - We must have enough memory free at resume time to have pageset1
+ *   be able to be loaded in pages that don't conflict with where it has to
+ *   be restored.
+ * Soft constraints:
+ * - User specified image size limit.
+ */
+static void eat_memory(void)
+{
+	long amount_wanted = 0;
+	int did_eat_memory = 0;
+
+	/*
+	 * Note that if we have enough storage space and enough free memory, we
+	 * may exit without eating anything. We give up when the last 10
+	 * iterations ate no extra pages because we're not going to get much
+	 * more anyway, but the few pages we get will take a lot of time.
+	 *
+	 * We freeze processes before beginning, and then unfreeze them if we
+	 * need to eat memory until we think we have enough. If our attempts
+	 * to freeze fail, we give up and abort.
+	 */
+
+	toi_recalculate_image_contents(0);
+	amount_wanted = amount_needed(1);
+
+	switch (image_size_limit) {
+	case -1: /* Don't eat any memory */
+		if (amount_wanted > 0) {
+			set_abort_result(TOI_WOULD_EAT_MEMORY);
+			return;
+		}
+		break;
+	case -2: /* Free caches only */
+		drop_pagecache();
+		toi_recalculate_image_contents(0);
+		amount_wanted = amount_needed(1);
+		did_eat_memory = 1;
+		break;
+	default:
+		break;
+	}
+
+	if (amount_wanted > 0 && !test_result_state(TOI_ABORTED) &&
+			image_size_limit != -1) {
+		long request = amount_wanted;
+
+		toi_prepare_status(CLEAR_BAR,
+				"Seeking to free %ldMB of memory.",
+				MB(amount_wanted));
+
+		thaw_kernel_threads();
+
+		/*
+		 * Ask for too many because shrink_all_memory doesn't
+		 * currently return enough most of the time.
+		 */
+		shrink_all_memory(amount_wanted + 50);
+
+		did_eat_memory = 1;
+
+		toi_recalculate_image_contents(0);
+
+		amount_wanted = amount_needed(1);
+
+		printk("Asked shrink_all_memory for %ld pages, got %ld.\n",
+				request, request - amount_wanted);
+
+		toi_cond_pause(0, NULL);
+
+		if (freeze_processes())
+			set_abort_result(TOI_FREEZING_FAILED);
+	}
+
+	if (did_eat_memory)
+		toi_recalculate_image_contents(0);
+}
+
+/* toi_prepare_image
+ *
+ * Entry point to the whole image preparation section.
+ *
+ * We do four things:
+ * - Freeze processes;
+ * - Ensure image size constraints are met;
+ * - Complete all the preparation for saving the image,
+ *   including allocation of storage. The only memory
+ *   that should be needed when we're finished is that
+ *   for actually storing the image (and we know how
+ *   much is needed for that because the modules tell
+ *   us).
+ * - Make sure that all dirty buffers are written out.
+ */
+#define MAX_TRIES 2
+int toi_prepare_image(void)
+{
+	int result = 1, tries = 1;
+
+	main_storage_allocated = 0;
+	no_ps2_needed = 0;
+
+	if (attempt_to_freeze())
+		return 1;
+
+	if (!extra_pd1_pages_allowance)
+		get_extra_pd1_allowance();
+
+	storage_available = toiActiveAllocator->storage_available();
+
+	if (!storage_available) {
+		printk(KERN_INFO "No storage available. 
Didn't try to prepare " + "an image.\n"); + display_failure_reason(0); + set_abort_result(TOI_NOSTORAGE_AVAILABLE); + return 1; + } + + if (build_attention_list()) { + abort_hibernate(TOI_UNABLE_TO_PREPARE_IMAGE, + "Unable to successfully prepare the image.\n"); + return 1; + } + + do { + toi_prepare_status(CLEAR_BAR, + "Preparing Image. Try %d.", tries); + + eat_memory(); + + if (test_result_state(TOI_ABORTED)) + break; + + update_image(0); + + tries++; + + } while (image_not_ready(1) && tries <= MAX_TRIES && + !test_result_state(TOI_ABORTED)); + + result = image_not_ready(0); + + if (!test_result_state(TOI_ABORTED)) { + if (result) { + display_stats(1, 0); + display_failure_reason(tries > MAX_TRIES); + abort_hibernate(TOI_UNABLE_TO_PREPARE_IMAGE, + "Unable to successfully prepare the image.\n"); + } else { + /* Pageset 2 needed? */ + if (!need_pageset2() && + test_action_state(TOI_NO_PS2_IF_UNNEEDED)) { + no_ps2_needed = 1; + update_image(1); + } + + toi_cond_pause(1, "Image preparation complete."); + } + } + + return result ? result : allocate_checksum_pages(); +} diff --git a/kernel/power/tuxonice_prepare_image.h b/kernel/power/tuxonice_prepare_image.h new file mode 100644 index 0000000..9a1de79 --- /dev/null +++ b/kernel/power/tuxonice_prepare_image.h @@ -0,0 +1,36 @@ +/* + * kernel/power/tuxonice_prepare_image.h + * + * Copyright (C) 2003-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + */ + +#include + +extern int toi_prepare_image(void); +extern void toi_recalculate_image_contents(int storage_available); +extern long real_nr_free_pages(unsigned long zone_idx_mask); +extern int image_size_limit; +extern void toi_free_extra_pagedir_memory(void); +extern long extra_pd1_pages_allowance; +extern void free_attention_list(void); + +#define MIN_FREE_RAM 100 +#define MIN_EXTRA_PAGES_ALLOWANCE 500 + +#define all_zones_mask ((unsigned long) ((1 << MAX_NR_ZONES) - 1)) +#ifdef CONFIG_HIGHMEM +#define real_nr_free_high_pages() (real_nr_free_pages(1 << ZONE_HIGHMEM)) +#define real_nr_free_low_pages() (real_nr_free_pages(all_zones_mask - \ + (1 << ZONE_HIGHMEM))) +#else +#define real_nr_free_high_pages() (0) +#define real_nr_free_low_pages() (real_nr_free_pages(all_zones_mask)) + +/* For eat_memory function */ +#define ZONE_HIGHMEM (MAX_NR_ZONES + 1) +#endif + +long get_header_storage_needed(void); diff --git a/kernel/power/tuxonice_storage.c b/kernel/power/tuxonice_storage.c new file mode 100644 index 0000000..5dafc95 --- /dev/null +++ b/kernel/power/tuxonice_storage.c @@ -0,0 +1,282 @@ +/* + * kernel/power/tuxonice_storage.c + * + * Copyright (C) 2005-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Routines for talking to a userspace program that manages storage. 
+ * + * The kernel side: + * - starts the userspace program; + * - sends messages telling it when to open and close the connection; + * - tells it when to quit; + * + * The user space side: + * - passes messages regarding status; + * + */ + +#include +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_netlink.h" +#include "tuxonice_storage.h" +#include "tuxonice_ui.h" + +static struct user_helper_data usm_helper_data; +static struct toi_module_ops usm_ops; +static int message_received, usm_prepare_count; +static int storage_manager_last_action, storage_manager_action; + +static int usm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type; + int *data; + + type = nlh->nlmsg_type; + + /* A control message: ignore them */ + if (type < NETLINK_MSG_BASE) + return 0; + + /* Unknown message: reply with EINVAL */ + if (type >= USM_MSG_MAX) + return -EINVAL; + + /* All operations require privileges, even GET */ + if (security_netlink_recv(skb, CAP_NET_ADMIN)) + return -EPERM; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && usm_helper_data.pid != -1) + return -EBUSY; + + data = (int *) NLMSG_DATA(nlh); + + switch (type) { + case USM_MSG_SUCCESS: + case USM_MSG_FAILED: + message_received = type; + complete(&usm_helper_data.wait_for_process); + break; + default: + printk(KERN_INFO "Storage manager doesn't recognise " + "message %d.\n", type); + } + + return 1; +} + +#ifdef CONFIG_NET +static int activations; + +int toi_activate_storage(int force) +{ + int tries = 1; + + if (usm_helper_data.pid == -1 || !usm_ops.enabled) + return 0; + + message_received = 0; + activations++; + + if (activations > 1 && !force) + return 0; + + while ((!message_received || message_received == USM_MSG_FAILED) && + tries < 2) { + toi_prepare_status(DONT_CLEAR_BAR, "Activate storage attempt " + "%d.\n", tries); + + init_completion(&usm_helper_data.wait_for_process); + + toi_send_netlink_message(&usm_helper_data, + USM_MSG_CONNECT, + NULL, 0); + + /* Wait 2 seconds for the userspace process to make contact */ + wait_for_completion_timeout(&usm_helper_data.wait_for_process, + 2*HZ); + + tries++; + } + + return 0; +} + +int toi_deactivate_storage(int force) +{ + if (usm_helper_data.pid == -1 || !usm_ops.enabled) + return 0; + + message_received = 0; + activations--; + + if (activations && !force) + return 0; + + init_completion(&usm_helper_data.wait_for_process); + + toi_send_netlink_message(&usm_helper_data, + USM_MSG_DISCONNECT, + NULL, 0); + + wait_for_completion_timeout(&usm_helper_data.wait_for_process, 2*HZ); + + if (!message_received || message_received == USM_MSG_FAILED) { + printk(KERN_INFO "Returning failure disconnecting storage.\n"); + return 1; + } + + return 0; +} +#endif + +static void storage_manager_simulate(void) +{ + printk(KERN_INFO "--- Storage manager simulate ---\n"); + toi_prepare_usm(); + schedule(); + printk(KERN_INFO "--- Activate storage 1 ---\n"); + toi_activate_storage(1); + schedule(); + printk(KERN_INFO "--- Deactivate storage 1 ---\n"); + toi_deactivate_storage(1); + schedule(); + printk(KERN_INFO "--- Cleanup usm ---\n"); + toi_cleanup_usm(); + schedule(); + printk(KERN_INFO "--- Storage manager simulate ends ---\n"); +} + +static int usm_storage_needed(void) +{ + return strlen(usm_helper_data.program); +} + +static int usm_save_config_info(char *buf) +{ + int len = strlen(usm_helper_data.program); + memcpy(buf, usm_helper_data.program, len); + return len; +} + +static void 
usm_load_config_info(char *buf, int size) +{ + /* Don't load the saved path if one has already been set */ + if (usm_helper_data.program[0]) + return; + + memcpy(usm_helper_data.program, buf, size); +} + +static int usm_memory_needed(void) +{ + /* ball park figure of 32 pages */ + return 32 * PAGE_SIZE; +} + +/* toi_prepare_usm + */ +int toi_prepare_usm(void) +{ + usm_prepare_count++; + + if (usm_prepare_count > 1 || !usm_ops.enabled) + return 0; + + usm_helper_data.pid = -1; + + if (!*usm_helper_data.program) + return 0; + + toi_netlink_setup(&usm_helper_data); + + if (usm_helper_data.pid == -1) + printk(KERN_INFO "TuxOnIce Storage Manager wanted, but couldn't" + " start it.\n"); + + toi_activate_storage(0); + + return usm_helper_data.pid != -1; +} + +void toi_cleanup_usm(void) +{ + usm_prepare_count--; + + if (usm_helper_data.pid > -1 && !usm_prepare_count) { + toi_deactivate_storage(0); + toi_netlink_close(&usm_helper_data); + } +} + +static void storage_manager_activate(void) +{ + if (storage_manager_action == storage_manager_last_action) + return; + + if (storage_manager_action) + toi_prepare_usm(); + else + toi_cleanup_usm(); + + storage_manager_last_action = storage_manager_action; +} + +/* + * User interface specific /sys/power/tuxonice entries. + */ + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_NONE("simulate_atomic_copy", storage_manager_simulate), + SYSFS_INT("enabled", SYSFS_RW, &usm_ops.enabled, 0, 1, 0, NULL), + SYSFS_STRING("program", SYSFS_RW, usm_helper_data.program, 254, 0, + NULL), + SYSFS_INT("activate_storage", SYSFS_RW , &storage_manager_action, 0, 1, + 0, storage_manager_activate) +}; + +static struct toi_module_ops usm_ops = { + .type = MISC_MODULE, + .name = "usm", + .directory = "storage_manager", + .module = THIS_MODULE, + .storage_needed = usm_storage_needed, + .save_config_info = usm_save_config_info, + .load_config_info = usm_load_config_info, + .memory_needed = usm_memory_needed, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* toi_usm_sysfs_init + * Description: Boot time initialisation for user interface. + */ +int toi_usm_init(void) +{ + usm_helper_data.nl = NULL; + usm_helper_data.program[0] = '\0'; + usm_helper_data.pid = -1; + usm_helper_data.skb_size = 0; + usm_helper_data.pool_limit = 6; + usm_helper_data.netlink_id = NETLINK_TOI_USM; + usm_helper_data.name = "userspace storage manager"; + usm_helper_data.rcv_msg = usm_user_rcv_msg; + usm_helper_data.interface_version = 2; + usm_helper_data.must_init = 0; + init_completion(&usm_helper_data.wait_for_process); + + return toi_register_module(&usm_ops); +} + +void toi_usm_exit(void) +{ + toi_netlink_close_complete(&usm_helper_data); + toi_unregister_module(&usm_ops); +} diff --git a/kernel/power/tuxonice_storage.h b/kernel/power/tuxonice_storage.h new file mode 100644 index 0000000..24f8e8a --- /dev/null +++ b/kernel/power/tuxonice_storage.h @@ -0,0 +1,45 @@ +/* + * kernel/power/tuxonice_storage.h + * + * Copyright (C) 2005-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. 
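+ *
+ * Declarations for the userspace storage manager (see tuxonice_storage.c).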
+ */ + +#ifdef CONFIG_NET +int toi_prepare_usm(void); +void toi_cleanup_usm(void); + +int toi_activate_storage(int force); +int toi_deactivate_storage(int force); +extern int toi_usm_init(void); +extern void toi_usm_exit(void); +#else +static inline int toi_usm_init(void) { return 0; } +static inline void toi_usm_exit(void) { } + +static inline int toi_activate_storage(int force) +{ + return 0; +} + +static inline int toi_deactivate_storage(int force) +{ + return 0; +} + +static inline int toi_prepare_usm(void) { return 0; } +static inline void toi_cleanup_usm(void) { } +#endif + +enum { + USM_MSG_BASE = 0x10, + + /* Kernel -> Userspace */ + USM_MSG_CONNECT = 0x30, + USM_MSG_DISCONNECT = 0x31, + USM_MSG_SUCCESS = 0x40, + USM_MSG_FAILED = 0x41, + + USM_MSG_MAX, +}; diff --git a/kernel/power/tuxonice_swap.c b/kernel/power/tuxonice_swap.c new file mode 100644 index 0000000..7c0effe --- /dev/null +++ b/kernel/power/tuxonice_swap.c @@ -0,0 +1,1313 @@ +/* + * kernel/power/tuxonice_swap.c + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * Distributed under GPLv2. + * + * This file encapsulates functions for usage of swap space as a + * backing store. + */ + +#include +#include +#include +#include +#include +#include + +#include "tuxonice.h" +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice_io.h" +#include "tuxonice_ui.h" +#include "tuxonice_extent.h" +#include "tuxonice_block_io.h" +#include "tuxonice_alloc.h" +#include "tuxonice_builtin.h" + +static struct toi_module_ops toi_swapops; + +/* --- Struct of pages stored on disk */ + +struct sig_data { + dev_t device; + unsigned long sector; + int resume_attempted; + int orig_sig_type; +}; + +union diskpage { + union swap_header swh; /* swh.magic is the only member used */ + struct sig_data sig_data; +}; + +union p_diskpage { + union diskpage *pointer; + char *ptr; + unsigned long address; +}; + +enum { + IMAGE_SIGNATURE, + NO_IMAGE_SIGNATURE, + TRIED_RESUME, + NO_TRIED_RESUME, +}; + +/* + * Both of these point to versions of the swap header page. original_sig points + * to the data we read from disk at the start of hibernating or checking whether + * to resume. no_image is the page stored in the image header, showing what the + * swap header page looked like at the start of hibernating. + */ +static char *current_signature_page; +static char no_image_signature_contents[sizeof(struct sig_data)]; + +/* Devices used for swap */ +static struct toi_bdev_info devinfo[MAX_SWAPFILES]; + +/* Extent chains for swap & blocks */ +static struct hibernate_extent_chain swapextents; +static struct hibernate_extent_chain block_chain[MAX_SWAPFILES]; + +static dev_t header_dev_t; +static struct block_device *header_block_device; +static unsigned long headerblock; + +/* For swapfile automatically swapon/off'd. */ +static char swapfilename[32] = ""; +static int toi_swapon_status; + +/* Header Page Information */ +static long header_pages_reserved; + +/* Swap Pages */ +static long swap_pages_allocated; + +/* User Specified Parameters. */ + +static unsigned long resume_firstblock; +static dev_t resume_swap_dev_t; +static struct block_device *resume_block_device; + +static struct sysinfo swapinfo; + +/* Block devices open. */ +struct bdev_opened { + dev_t device; + struct block_device *bdev; +}; + +/* + * Entry MAX_SWAPFILES is the resume block device, which may + * be a swap device not enabled when we hibernate. 
* Entry MAX_SWAPFILES + 1 is the header block device, which
+ * is needed before we find out which slot it occupies.
+ *
+ * We use a separate struct from devinfo so that we can track
+ * the bdevs we open, because if we need to abort resuming
+ * prior to the atomic restore, they need to be closed, but
+ * closing them after successfully resuming would be wrong.
+ */
+static struct bdev_opened *bdevs_opened[MAX_SWAPFILES + 2];
+
+/**
+ * close_bdev: Close a swap bdev.
+ *
+ * int: The swap entry number to close.
+ */
+static void close_bdev(int i)
+{
+	struct bdev_opened *this = bdevs_opened[i];
+
+	if (!this)
+		return;
+
+	blkdev_put(this->bdev, FMODE_READ | FMODE_NDELAY);
+	toi_kfree(8, this);
+	bdevs_opened[i] = NULL;
+}
+
+/**
+ * close_bdevs: Close all bdevs we opened.
+ *
+ * Close all bdevs that we opened and reset the related vars.
+ */
+static void close_bdevs(void)
+{
+	int i;
+
+	for (i = 0; i < MAX_SWAPFILES + 2; i++)
+		close_bdev(i);
+
+	resume_block_device = NULL;
+	header_block_device = NULL;
+}
+
+/**
+ * open_bdev: Open a bdev at resume time.
+ *
+ * index: The swap index. May be MAX_SWAPFILES for the resume_dev_t
+ * (the user can have resume= pointing at a swap partition/file that isn't
+ * swapon'd when they hibernate), or MAX_SWAPFILES + 1 for the first page
+ * of the header. The latter will be from a swap partition that was enabled
+ * when we hibernated, but we don't know its real index until we read that
+ * first page.
+ * dev_t: The device major/minor.
+ * display_errs: Whether to try to do this quietly.
+ *
+ * We stored a dev_t in the image header. Open the matching device without
+ * requiring /dev/ in most cases and record the details needed
+ * to close it later and avoid duplicating work.
+ */
+static struct block_device *open_bdev(int index, dev_t device, int display_errs)
+{
+	struct bdev_opened *this;
+	struct block_device *bdev;
+
+	if (bdevs_opened[index]) {
+		if (bdevs_opened[index]->device == device)
+			return bdevs_opened[index]->bdev;
+
+		close_bdev(index);
+	}
+
+	bdev = toi_open_by_devnum(device, FMODE_READ | FMODE_NDELAY);
+
+	if (IS_ERR(bdev) || !bdev) {
+		if (display_errs)
+			toi_early_boot_message(1, TOI_CONTINUE_REQ,
+				"Failed to get access to block device "
+				"\"%x\" (error %d).\n Maybe you need "
+				"to run mknod and/or lvmsetup in an "
+				"initrd/ramfs?", device, bdev);
+		return ERR_PTR(-EINVAL);
+	}
+
+	this = toi_kzalloc(8, sizeof(struct bdev_opened), GFP_KERNEL);
+	if (!this) {
+		printk(KERN_WARNING "TuxOnIce: Failed to allocate memory for "
+				"opening a bdev.");
+		blkdev_put(bdev, FMODE_READ | FMODE_NDELAY);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	bdevs_opened[index] = this;
+	this->device = device;
+	this->bdev = bdev;
+
+	return bdev;
+}
+
+/**
+ * enable_swapfile: Swapon the user specified swapfile prior to hibernating.
+ *
+ * Activate the given swapfile if it wasn't already enabled. Remember whether
+ * we really did swapon it for swapoffing later.
+ */
+static void enable_swapfile(void)
+{
+	int activateswapresult = -EINVAL;
+
+	if (swapfilename[0]) {
+		/* Attempt to swap on with maximum priority */
+		activateswapresult = sys_swapon(swapfilename, 0xFFFF);
+		if (activateswapresult && activateswapresult != -EBUSY)
+			printk("TuxOnIce: The swapfile/partition specified by "
+				"/sys/power/tuxonice/swap/swapfile "
+				"(%s) could not be turned on (error %d). 
" + "Attempting to continue.\n", + swapfilename, activateswapresult); + if (!activateswapresult) + toi_swapon_status = 1; + } +} + +/** + * disable_swapfile: Swapoff any file swaponed at the start of the cycle. + * + * If we did successfully swapon a file at the start of the cycle, swapoff + * it now (finishing up). + */ +static void disable_swapfile(void) +{ + if (!toi_swapon_status) + return; + + sys_swapoff(swapfilename); + toi_swapon_status = 0; +} + +/** + * try_to_parse_resume_device: Try to parse resume= + * + * Any "swap:" has been stripped away and we just have the path to deal with. + * We attempt to do name_to_dev_t, open and stat the file. Having opened the + * file, get the struct block_device * to match. + */ +static int try_to_parse_resume_device(char *commandline, int quiet) +{ + struct kstat stat; + int error = 0; + + wait_for_device_probe(); + resume_swap_dev_t = name_to_dev_t(commandline); + + if (!resume_swap_dev_t) { + struct file *file = filp_open(commandline, + O_RDONLY|O_LARGEFILE, 0); + + if (!IS_ERR(file) && file) { + vfs_getattr(file->f_vfsmnt, file->f_dentry, &stat); + filp_close(file, NULL); + } else + error = vfs_stat(commandline, &stat); + if (!error) + resume_swap_dev_t = stat.rdev; + } + + if (!resume_swap_dev_t) { + if (quiet) + return 1; + + if (test_toi_state(TOI_TRYING_TO_RESUME)) + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "Failed to translate \"%s\" into a device id.\n", + commandline); + else + printk("TuxOnIce: Can't translate \"%s\" into a device " + "id yet.\n", commandline); + return 1; + } + + resume_block_device = open_bdev(MAX_SWAPFILES, resume_swap_dev_t, 0); + if (IS_ERR(resume_block_device)) { + if (!quiet) + toi_early_boot_message(1, TOI_CONTINUE_REQ, + "Failed to get access to \"%s\", where" + " the swap header should be found.", + commandline); + return 1; + } + + return 0; +} + +/* + * If we have read part of the image, we might have filled memory with + * data that should be zeroed out. 
+ */ +static void toi_swap_noresume_reset(void) +{ + toi_bio_ops.rw_cleanup(READ); + memset((char *) &devinfo, 0, sizeof(devinfo)); +} + +static int get_current_signature(void) +{ + if (!current_signature_page) { + current_signature_page = (char *) toi_get_zeroed_page(38, + TOI_ATOMIC_GFP); + if (!current_signature_page) + return -ENOMEM; + } + + return toi_bio_ops.bdev_page_io(READ, resume_block_device, + resume_firstblock, virt_to_page(current_signature_page)); +} + +static int parse_signature(void) +{ + union p_diskpage swap_header_page; + struct sig_data *sig; + int type; + char *swap_header; + const char *sigs[] = { + "SWAP-SPACE", "SWAPSPACE2", "S1SUSP", "S2SUSP", "S1SUSPEND" + }; + + int result = get_current_signature(); + if (result) + return result; + + swap_header_page = (union p_diskpage) current_signature_page; + sig = (struct sig_data *) current_signature_page; + swap_header = swap_header_page.pointer->swh.magic.magic; + + for (type = 0; type < 5; type++) + if (!memcmp(sigs[type], swap_header, strlen(sigs[type]))) + return type; + + if (memcmp(tuxonice_signature, swap_header, sizeof(tuxonice_signature))) + return -1; + + header_dev_t = sig->device; + clear_toi_state(TOI_RESUMED_BEFORE); + if (sig->resume_attempted) + set_toi_state(TOI_RESUMED_BEFORE); + headerblock = sig->sector; + + return 10; +} + +static void forget_signatures(void) +{ + if (current_signature_page) { + toi_free_page(38, (unsigned long) current_signature_page); + current_signature_page = NULL; + } +} + +/* + * write_modified_signature + * + * Write a (potentially) modified signature page without forgetting the + * original contents. + */ +static int write_modified_signature(int modification) +{ + union p_diskpage swap_header_page; + struct swap_info_struct *si; + int result; + char *orig_sig; + + /* In case we haven't already */ + result = get_current_signature(); + + if (result) + return result; + + swap_header_page.address = toi_get_zeroed_page(38, TOI_ATOMIC_GFP); + + if (!swap_header_page.address) + return -ENOMEM; + + memcpy(swap_header_page.ptr, current_signature_page, PAGE_SIZE); + + switch (modification) { + case IMAGE_SIGNATURE: + + memcpy(no_image_signature_contents, swap_header_page.ptr, + sizeof(no_image_signature_contents)); + + /* Get the details of the header first page. 
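+		 * Stepping one page forward from the start of the extent
+		 * chain gives the device and sector of the first header
+		 * page, which we record in the signature below.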
*/ + toi_extent_state_goto_start(&toi_writer_posn); + toi_bio_ops.forward_one_page(1, 1); + + si = get_swap_info_struct(toi_writer_posn.current_chain); + + /* Prepare the signature */ + swap_header_page.pointer->sig_data.device = si->bdev->bd_dev; + swap_header_page.pointer->sig_data.sector = + toi_writer_posn.current_offset; + swap_header_page.pointer->sig_data.resume_attempted = 0; + swap_header_page.pointer->sig_data.orig_sig_type = + parse_signature(); + + memcpy(swap_header_page.pointer->swh.magic.magic, + tuxonice_signature, sizeof(tuxonice_signature)); + + break; + case NO_IMAGE_SIGNATURE: + if (!swap_header_page.pointer->sig_data.orig_sig_type) + orig_sig = "SWAP-SPACE"; + else + orig_sig = "SWAPSPACE2"; + + memcpy(swap_header_page.pointer->swh.magic.magic, orig_sig, 10); + memcpy(swap_header_page.ptr, no_image_signature_contents, + sizeof(no_image_signature_contents)); + break; + case TRIED_RESUME: + swap_header_page.pointer->sig_data.resume_attempted = 1; + break; + case NO_TRIED_RESUME: + swap_header_page.pointer->sig_data.resume_attempted = 0; + break; + } + + result = toi_bio_ops.bdev_page_io(WRITE, resume_block_device, + resume_firstblock, virt_to_page(swap_header_page.address)); + + memcpy(current_signature_page, swap_header_page.ptr, PAGE_SIZE); + + toi_free_page(38, swap_header_page.address); + + return result; +} + +/* + * apply_header_reservation + */ +static int apply_header_reservation(void) +{ + int i; + + toi_extent_state_goto_start(&toi_writer_posn); + + for (i = 0; i < header_pages_reserved; i++) + if (toi_bio_ops.forward_one_page(1, 0)) + return -ENOSPC; + + /* The end of header pages will be the start of pageset 2; + * we are now sitting on the first pageset2 page. */ + toi_extent_state_save(&toi_writer_posn, &toi_writer_posn_save[2]); + return 0; +} + +static void toi_swap_reserve_header_space(int request) +{ + header_pages_reserved = (long) request; +} + +static void free_block_chains(void) +{ + int i; + + for (i = 0; i < MAX_SWAPFILES; i++) + if (block_chain[i].first) + toi_put_extent_chain(&block_chain[i]); +} + +static int add_blocks_to_extent_chain(int chain, int start, int end) +{ + if (test_action_state(TOI_TEST_BIO)) + printk(KERN_INFO "Adding extent chain %d %d-%d.\n", chain, + start << devinfo[chain].bmap_shift, + end << devinfo[chain].bmap_shift); + + if (toi_add_to_extent_chain(&block_chain[chain], start, end)) { + free_block_chains(); + return -ENOMEM; + } + + return 0; +} + + +static int get_main_pool_phys_params(void) +{ + struct hibernate_extent *extentpointer = NULL; + unsigned long address; + int extent_min = -1, extent_max = -1, last_chain = -1; + + free_block_chains(); + + toi_extent_for_each(&swapextents, extentpointer, address) { + swp_entry_t swap_address = (swp_entry_t) { address }; + pgoff_t offset = swp_offset(swap_address); + unsigned swapfilenum = swp_type(swap_address); + struct swap_info_struct *sis = + get_swap_info_struct(swapfilenum); + sector_t new_sector = map_swap_page(sis, offset); + + if (devinfo[swapfilenum].ignored) + continue; + + if ((new_sector == extent_max + 1) && + (last_chain == swapfilenum)) { + extent_max++; + continue; + } + + if (extent_min > -1 && add_blocks_to_extent_chain(last_chain, + extent_min, extent_max)) { + printk("Out of memory while making block chains.\n"); + return -ENOMEM; + } + + extent_min = new_sector; + extent_max = new_sector; + last_chain = swapfilenum; + } + + if (extent_min > -1 && add_blocks_to_extent_chain(last_chain, + extent_min, extent_max)) { + printk("Out of memory while making 
block chains.\n"); + return -ENOMEM; + } + + return apply_header_reservation(); +} + +static long raw_to_real(long raw) +{ + long result; + + result = raw - (raw * (sizeof(unsigned long) + sizeof(int)) + + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int) + 1)) / + (PAGE_SIZE + sizeof(unsigned long) + sizeof(int)); + + return result < 0 ? 0 : result; +} + +static int toi_swap_storage_allocated(void) +{ + return (int) raw_to_real(swap_pages_allocated - header_pages_reserved); +} + +/* + * Like si_swapinfo, except that we don't include ram backed swap (compcache!) + * and don't need to use the spinlocks (userspace is stopped when this + * function is called). + */ +void si_swapinfo_no_compcache(struct sysinfo *val) +{ + unsigned int i; + + si_swapinfo(&swapinfo); + val->freeswap = 0; + val->totalswap = 0; + + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + if ((si->flags & SWP_USED) && + (si->flags & SWP_WRITEOK) && + (strncmp(si->bdev->bd_disk->disk_name, "ram", 3))) { + val->totalswap += si->inuse_pages; + val->freeswap += si->pages - si->inuse_pages; + } + } +} +/* + * We can't just remember the value from allocation time, because other + * processes might have allocated swap in the mean time. + */ +static int toi_swap_storage_available(void) +{ + si_swapinfo_no_compcache(&swapinfo); + return (int) raw_to_real((long) swapinfo.freeswap + + swap_pages_allocated - header_pages_reserved); +} + +static int toi_swap_initialise(int starting_cycle) +{ + int result = 0; + + if (!starting_cycle) + return 0; + + enable_swapfile(); + + if (resume_swap_dev_t && !resume_block_device) { + resume_block_device = open_bdev(MAX_SWAPFILES, + resume_swap_dev_t, 1); + if (IS_ERR(resume_block_device)) + result = 1; + } + + return result; +} + +static void toi_swap_cleanup(int ending_cycle) +{ + if (ending_cycle) + disable_swapfile(); + + close_bdevs(); + + forget_signatures(); +} + +static int toi_swap_release_storage(void) +{ + header_pages_reserved = 0; + swap_pages_allocated = 0; + + if (swapextents.first) { + /* Free swap entries */ + struct hibernate_extent *extentpointer; + unsigned long extentvalue; + toi_extent_for_each(&swapextents, extentpointer, + extentvalue) + swap_free((swp_entry_t) { extentvalue }); + + toi_put_extent_chain(&swapextents); + + free_block_chains(); + } + + return 0; +} + +static void free_swap_range(unsigned long min, unsigned long max) +{ + int j; + + for (j = min; j <= max; j++) + swap_free((swp_entry_t) { j }); +} + +/* + * Round robin allocation (where swap storage has the same priority). + * could make this very inefficient, so we track extents allocated on + * a per-swapfile basis. 
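+ *
+ * The allocation loop below folds runs of consecutive swap entries
+ * from the same device into extents as it goes. A minimal sketch of
+ * that folding, independent of the kernel types used here
+ * (sketch_fold is hypothetical, for illustration only):
+ */
+#if 0 /* illustrative sketch, not built */
+/* Fold the sorted values 5, 6, 7, 9 into the extents 5-7 and 9-9. */
+static void sketch_fold(const unsigned long *vals, int n)
+{
+ unsigned long min = vals[0], max = vals[0];
+ int i;
+
+ for (i = 1; i < n; i++) {
+ if (vals[i] == max + 1) { /* contiguous: extend */
+ max++;
+ continue;
+ }
+ printk(KERN_INFO "extent %lu-%lu\n", min, max); /* flush */
+ min = max = vals[i]; /* start a new extent */
+ }
+ printk(KERN_INFO "extent %lu-%lu\n", min, max);
+}
+#endif
+/*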
+ */ +static int toi_swap_allocate_storage(int request) +{ + int i, result = 0, to_add[MAX_SWAPFILES], pages_to_get, extra_pages, + gotten = 0, result2; + unsigned long extent_min[MAX_SWAPFILES], extent_max[MAX_SWAPFILES]; + + extra_pages = DIV_ROUND_UP(request * (sizeof(unsigned long) + + sizeof(int)), PAGE_SIZE); + pages_to_get = request + extra_pages - swapextents.size + + header_pages_reserved; + + if (pages_to_get < 1) + return apply_header_reservation(); + + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + to_add[i] = 0; + if (!si->bdev) + continue; + if (!strncmp(si->bdev->bd_disk->disk_name, "ram", 3)) { + devinfo[i].ignored = 1; + continue; + } + devinfo[i].ignored = 0; + devinfo[i].bdev = si->bdev; + devinfo[i].dev_t = si->bdev->bd_dev; + devinfo[i].bmap_shift = 3; + devinfo[i].blocks_per_page = 1; + } + + while (gotten < pages_to_get) { + swp_entry_t entry; + unsigned long new_value; + unsigned swapfilenum; + + entry = get_swap_page(); + if (!entry.val) + break; + + swapfilenum = swp_type(entry); + new_value = entry.val; + + if (!to_add[swapfilenum]) { + to_add[swapfilenum] = 1; + extent_min[swapfilenum] = new_value; + extent_max[swapfilenum] = new_value; + if (!devinfo[swapfilenum].ignored) + gotten++; + continue; + } + + if (new_value == extent_max[swapfilenum] + 1) { + extent_max[swapfilenum]++; + if (!devinfo[swapfilenum].ignored) + gotten++; + continue; + } + + if (toi_add_to_extent_chain(&swapextents, + extent_min[swapfilenum], + extent_max[swapfilenum])) { + printk(KERN_INFO "Failed to allocate extent for " + "%lu-%lu.\n", extent_min[swapfilenum], + extent_max[swapfilenum]); + free_swap_range(extent_min[swapfilenum], + extent_max[swapfilenum]); + swap_free(entry); + if (!devinfo[swapfilenum].ignored) + gotten -= (extent_max[swapfilenum] - + extent_min[swapfilenum] + 1); + /* Don't try to add again below */ + to_add[swapfilenum] = 0; + break; + } else { + extent_min[swapfilenum] = new_value; + extent_max[swapfilenum] = new_value; + if (!devinfo[swapfilenum].ignored) + gotten++; + } + } + + for (i = 0; i < MAX_SWAPFILES; i++) { + if (!to_add[i] || !toi_add_to_extent_chain(&swapextents, + extent_min[i], extent_max[i])) + continue; + + free_swap_range(extent_min[i], extent_max[i]); + if (!devinfo[i].ignored) + gotten -= (extent_max[i] - extent_min[i] + 1); + break; + } + + if (gotten < pages_to_get) { + printk("Got fewer pages than required " + "(%d wanted, %d gotten).\n", + pages_to_get, gotten); + result = -ENOSPC; + } + + swap_pages_allocated += (long) gotten; + + result2 = get_main_pool_phys_params(); + + return result ? result : result2; +} + +static int toi_swap_write_header_init(void) +{ + int i, result; + struct swap_info_struct *si; + + toi_bio_ops.rw_init(WRITE, 0); + toi_writer_buffer_posn = 0; + + /* Info needed to bootstrap goes at the start of the header. + * First we save the positions and devinfo, including the number + * of header pages. Then we save the structs containing data needed + * for reading the header pages back. + * Note that even if header pages take more than one page, when we + * read back the info, we will have restored the location of the + * next header page by the time we go to use it. 
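+ * The resulting layout is: the sig_data contents, then
+ * toi_writer_posn_save, then devinfo, then one serialised extent
+ * chain per swap device.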
+ */
+
+ result = toi_bio_ops.rw_header_chunk(WRITE, &toi_swapops,
+ (char *) &no_image_signature_contents,
+ sizeof(struct sig_data));
+
+ if (result)
+ return result;
+
+ /* Forward one page will be done prior to the read */
+ for (i = 0; i < MAX_SWAPFILES; i++) {
+ si = get_swap_info_struct(i);
+ if (si->swap_file)
+ devinfo[i].dev_t = si->bdev->bd_dev;
+ else
+ devinfo[i].dev_t = (dev_t) 0;
+ }
+
+ result = toi_bio_ops.rw_header_chunk(WRITE, &toi_swapops,
+ (char *) &toi_writer_posn_save,
+ sizeof(toi_writer_posn_save));
+
+ if (result)
+ return result;
+
+ result = toi_bio_ops.rw_header_chunk(WRITE, &toi_swapops,
+ (char *) &devinfo, sizeof(devinfo));
+
+ if (result)
+ return result;
+
+ for (i = 0; i < MAX_SWAPFILES; i++)
+ toi_serialise_extent_chain(&toi_swapops, &block_chain[i]);
+
+ return 0;
+}
+
+static int toi_swap_write_header_cleanup(void)
+{
+ int result = toi_bio_ops.write_header_chunk_finish();
+
+ /* Set the signature to say we have an image */
+ if (!result)
+ result = write_modified_signature(IMAGE_SIGNATURE);
+
+ return result;
+}
+
+/* ------------------------- HEADER READING ------------------------- */
+
+/*
+ * read_header_init()
+ *
+ * Description:
+ * 1. Attempt to read the device specified with resume=.
+ * 2. Check the contents of the swap header for our signature.
+ * 3. Warn, ignore, reset and/or continue as appropriate.
+ * 4. If continuing, read the toi_swap configuration section
+ * of the header and set up block device info so we can read
+ * the rest of the header & image.
+ *
+ * Returns:
+ * May not return if the user chooses to reboot at a warning.
+ * -EINVAL if we cannot resume at this time. Booting should continue
+ * normally.
+ */
+
+static int toi_swap_read_header_init(void)
+{
+ int i, result = 0;
+ toi_writer_buffer_posn = 0;
+
+ if (!header_dev_t) {
+ printk(KERN_INFO "read_header_init called when we haven't "
+ "verified there is an image!\n");
+ return -EINVAL;
+ }
+
+ /*
+ * If the header is not on the resume_swap_dev_t, get the resume device
+ * first.
+ */
+ if (header_dev_t != resume_swap_dev_t) {
+ header_block_device = open_bdev(MAX_SWAPFILES + 1,
+ header_dev_t, 1);
+
+ if (IS_ERR(header_block_device))
+ return PTR_ERR(header_block_device);
+ } else
+ header_block_device = resume_block_device;
+
+ toi_bio_ops.read_header_init();
+
+ /*
+ * Read toi_swap configuration.
+ * Headerblock size taken into account already.
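+ * The << 3 below converts the header page offset into 512-byte
+ * sectors, matching the bmap_shift of 3 recorded in devinfo.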
+ */ + result = toi_bio_ops.bdev_page_io(READ, header_block_device, + headerblock << 3, + virt_to_page((unsigned long) toi_writer_buffer)); + if (result) + return result; + + memcpy(&no_image_signature_contents, toi_writer_buffer, + sizeof(no_image_signature_contents)); + + toi_writer_buffer_posn = sizeof(no_image_signature_contents); + + memcpy(&toi_writer_posn_save, toi_writer_buffer + + toi_writer_buffer_posn, sizeof(toi_writer_posn_save)); + + toi_writer_buffer_posn += sizeof(toi_writer_posn_save); + + memcpy(&devinfo, toi_writer_buffer + toi_writer_buffer_posn, + sizeof(devinfo)); + + toi_writer_buffer_posn += sizeof(devinfo); + + /* Restore device info */ + for (i = 0; i < MAX_SWAPFILES; i++) { + dev_t thisdevice = devinfo[i].dev_t; + struct block_device *bdev_result; + + devinfo[i].bdev = NULL; + + if (!thisdevice || devinfo[i].ignored) + continue; + + if (thisdevice == resume_swap_dev_t) { + devinfo[i].bdev = resume_block_device; + continue; + } + + if (thisdevice == header_dev_t) { + devinfo[i].bdev = header_block_device; + continue; + } + + bdev_result = open_bdev(i, thisdevice, 1); + if (IS_ERR(bdev_result)) + return PTR_ERR(bdev_result); + devinfo[i].bdev = bdevs_opened[i]->bdev; + } + + toi_extent_state_goto_start(&toi_writer_posn); + toi_bio_ops.set_extra_page_forward(); + + for (i = 0; i < MAX_SWAPFILES && !result; i++) + result = toi_load_extent_chain(&block_chain[i]); + + return result; +} + +static int toi_swap_read_header_cleanup(void) +{ + toi_bio_ops.rw_cleanup(READ); + return 0; +} + +/* + * workspace_size + * + * Description: + * Returns the number of bytes of RAM needed for this + * code to do its work. (Used when calculating whether + * we have enough memory to be able to hibernate & resume). + * + */ +static int toi_swap_memory_needed(void) +{ + return 1; +} + +/* + * Print debug info + * + * Description: + */ +static int toi_swap_print_debug_stats(char *buffer, int size) +{ + int len = 0; + struct sysinfo sysinfo; + + if (toiActiveAllocator != &toi_swapops) { + len = scnprintf(buffer, size, + "- SwapAllocator inactive.\n"); + return len; + } + + len = scnprintf(buffer, size, "- SwapAllocator active.\n"); + if (swapfilename[0]) + len += scnprintf(buffer+len, size-len, + " Attempting to automatically swapon: %s.\n", + swapfilename); + + si_swapinfo_no_compcache(&sysinfo); + + len += scnprintf(buffer+len, size-len, + " Swap available for image: %d pages.\n", + (int) sysinfo.freeswap + toi_swap_storage_allocated()); + + return len; +} + +/* + * Storage needed + * + * Returns amount of space in the swap header required + * for the toi_swap's data. This ignores the links between + * pages, which we factor in when allocating the space. + * + * We ensure the space is allocated, but actually save the + * data from write_header_init and therefore don't also define a + * save_config_info routine. + */ +static int toi_swap_storage_needed(void) +{ + int i, result; + result = sizeof(struct sig_data) + sizeof(toi_writer_posn_save) + + sizeof(devinfo); + + for (i = 0; i < MAX_SWAPFILES; i++) { + result += 2 * sizeof(int); + result += (2 * sizeof(unsigned long) * + block_chain[i].num_extents); + } + + return result; +} + +/* + * Image_exists + * + * Returns -1 if don't know, otherwise 0 (no) or 1 (yes). 
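+ *
+ * parse_signature() codes: 0 and 1 are ordinary swap signatures, 2-4
+ * belong to other suspend implementations, 10 is our own binary
+ * signature and -1 is unrecognised.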
+ */ +static int toi_swap_image_exists(int quiet) +{ + int signature_found; + + if (!resume_swap_dev_t) { + if (!quiet) + printk(KERN_INFO "Not even trying to read header " + "because resume_swap_dev_t is not set.\n"); + return -1; + } + + if (!resume_block_device) { + resume_block_device = open_bdev(MAX_SWAPFILES, resume_swap_dev_t, + 1); + if (IS_ERR(resume_block_device)) { + if (!quiet) + printk(KERN_INFO "Failed to open resume dev_t (%x).\n", + resume_swap_dev_t); + return -1; + } + } + + signature_found = parse_signature(); + + switch (signature_found) { + case -ENOMEM: + return -1; + case -1: + if (!quiet) + printk(KERN_ERR "TuxOnIce: Unable to find a signature." + " Could you have moved a swap file?\n"); + return -1; + case 0: + case 1: + if (!quiet) + printk(KERN_INFO "TuxOnIce: Normal swapspace found.\n"); + return 0; + case 2: + case 3: + case 4: + if (!quiet) + printk(KERN_INFO "TuxOnIce: Detected another " + "implementation's signature.\n"); + return 0; + case 10: + if (!quiet) + printk(KERN_INFO "TuxOnIce: Detected TuxOnIce binary " + "signature.\n"); + return 1; + } + + printk("Unrecognised parse_signature result (%d).\n", signature_found); + return 0; +} + +/* toi_swap_remove_image + * + */ +static int toi_swap_remove_image(void) +{ + /* + * If nr_hibernates == 0, we must be booting, so no swap pages + * will be recorded as used yet. + */ + + if (nr_hibernates) + toi_swap_release_storage(); + + /* + * We don't do a sanity check here: we want to restore the swap + * whatever version of kernel made the hibernate image. + * + * We need to write swap, but swap may not be enabled so + * we write the device directly + * + * If we don't have an current_signature_page, we didn't + * read an image header, so don't change anything. + */ + + return toi_swap_image_exists(1) ? + write_modified_signature(NO_IMAGE_SIGNATURE) : 0; +} + +/* + * Mark resume attempted. + * + * Record that we tried to resume from this image. We have already read the + * signature in. We just need to write the modified version. + */ +static int toi_swap_mark_resume_attempted(int mark) +{ + if (!resume_swap_dev_t) { + printk(KERN_INFO "Not even trying to record attempt at resuming" + " because resume_swap_dev_t is not set.\n"); + return -ENODEV; + } + + return write_modified_signature(mark ? TRIED_RESUME : NO_TRIED_RESUME); +} + +/* + * Parse Image Location + * + * Attempt to parse a resume= parameter. + * Swap Writer accepts: + * resume=swap:DEVNAME[:FIRSTBLOCK][@BLOCKSIZE] + * + * Where: + * DEVNAME is convertable to a dev_t by name_to_dev_t + * FIRSTBLOCK is the location of the first block in the swap file + * (specifying for a swap partition is nonsensical but not prohibited). + * Data is validated by attempting to read a swap header from the + * location given. Failure will result in toi_swap refusing to + * save an image, and a reboot with correct parameters will be + * necessary. + */ +static int toi_swap_parse_sig_location(char *commandline, + int only_allocator, int quiet) +{ + char *thischar, *devstart, *colon = NULL; + int signature_found, result = -EINVAL, temp_result; + + if (strncmp(commandline, "swap:", 5)) { + /* + * Failing swap:, we'll take a simple + * resume=/dev/hda2, but fall through to + * other allocators if /dev/ isn't matched. 
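+ * Accepted forms thus include, e.g., resume=swap:/dev/sda2,
+ * resume=/dev/sda2 and resume=swap:/dev/sda1:0x12d5 (the last
+ * being the form printed by the headerlocations file).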
+ */ + if (strncmp(commandline, "/dev/", 5)) + return 1; + } else + commandline += 5; + + devstart = commandline; + thischar = commandline; + while ((*thischar != ':') && (*thischar != '@') && + ((thischar - commandline) < 250) && (*thischar)) + thischar++; + + if (*thischar == ':') { + colon = thischar; + *colon = 0; + thischar++; + } + + while ((thischar - commandline) < 250 && *thischar) + thischar++; + + if (colon) + resume_firstblock = (int) simple_strtoul(colon + 1, NULL, 0); + else + resume_firstblock = 0; + + clear_toi_state(TOI_CAN_HIBERNATE); + clear_toi_state(TOI_CAN_RESUME); + + temp_result = try_to_parse_resume_device(devstart, quiet); + + if (colon) + *colon = ':'; + + if (temp_result) + return -EINVAL; + + signature_found = toi_swap_image_exists(quiet); + + if (signature_found != -1) { + result = 0; + + toi_bio_ops.set_devinfo(devinfo); + toi_writer_posn.chains = &block_chain[0]; + toi_writer_posn.num_chains = MAX_SWAPFILES; + set_toi_state(TOI_CAN_HIBERNATE); + set_toi_state(TOI_CAN_RESUME); + } else + if (!quiet) + printk(KERN_ERR "TuxOnIce: SwapAllocator: No swap " + "signature found at %s.\n", devstart); + return result; +} + +static int header_locations_read_sysfs(const char *page, int count) +{ + int i, printedpartitionsmessage = 0, len = 0, haveswap = 0; + struct inode *swapf = NULL; + int zone; + char *path_page = (char *) toi_get_free_page(10, GFP_KERNEL); + char *path, *output = (char *) page; + int path_len; + + if (!page) + return 0; + + for (i = 0; i < MAX_SWAPFILES; i++) { + struct swap_info_struct *si = get_swap_info_struct(i); + + if (!si->swap_file) + continue; + + if (S_ISBLK(si->swap_file->f_mapping->host->i_mode)) { + haveswap = 1; + if (!printedpartitionsmessage) { + len += sprintf(output + len, + "For swap partitions, simply use the " + "format: resume=swap:/dev/hda1.\n"); + printedpartitionsmessage = 1; + } + } else { + path_len = 0; + + path = d_path(&si->swap_file->f_path, path_page, + PAGE_SIZE); + path_len = snprintf(path_page, 31, "%s", path); + + haveswap = 1; + swapf = si->swap_file->f_mapping->host; + zone = bmap(swapf, 0); + if (!zone) { + len += sprintf(output + len, + "Swapfile %s has been corrupted. 
Reuse" + " mkswap on it and try again.\n", + path_page); + } else { + char name_buffer[255]; + len += sprintf(output + len, + "For swapfile `%s`," + " use resume=swap:/dev/%s:0x%x.\n", + path_page, + bdevname(si->bdev, name_buffer), + zone << (swapf->i_blkbits - 9)); + } + } + } + + if (!haveswap) + len = sprintf(output, "You need to turn on swap partitions " + "before examining this file.\n"); + + toi_free_page(10, (unsigned long) path_page); + return len; +} + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_STRING("swapfilename", SYSFS_RW, swapfilename, 255, 0, NULL), + SYSFS_CUSTOM("headerlocations", SYSFS_READONLY, + header_locations_read_sysfs, NULL, 0, NULL), + SYSFS_INT("enabled", SYSFS_RW, &toi_swapops.enabled, 0, 1, 0, + attempt_to_parse_resume_device2), +}; + +static struct toi_module_ops toi_swapops = { + .type = WRITER_MODULE, + .name = "swap storage", + .directory = "swap", + .module = THIS_MODULE, + .memory_needed = toi_swap_memory_needed, + .print_debug_info = toi_swap_print_debug_stats, + .storage_needed = toi_swap_storage_needed, + .initialise = toi_swap_initialise, + .cleanup = toi_swap_cleanup, + + .noresume_reset = toi_swap_noresume_reset, + .storage_available = toi_swap_storage_available, + .storage_allocated = toi_swap_storage_allocated, + .reserve_header_space = toi_swap_reserve_header_space, + .allocate_storage = toi_swap_allocate_storage, + .image_exists = toi_swap_image_exists, + .mark_resume_attempted = toi_swap_mark_resume_attempted, + .write_header_init = toi_swap_write_header_init, + .write_header_cleanup = toi_swap_write_header_cleanup, + .read_header_init = toi_swap_read_header_init, + .read_header_cleanup = toi_swap_read_header_cleanup, + .remove_image = toi_swap_remove_image, + .parse_sig_location = toi_swap_parse_sig_location, + + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +/* ---- Registration ---- */ +static __init int toi_swap_load(void) +{ + toi_swapops.rw_init = toi_bio_ops.rw_init; + toi_swapops.rw_cleanup = toi_bio_ops.rw_cleanup; + toi_swapops.read_page = toi_bio_ops.read_page; + toi_swapops.write_page = toi_bio_ops.write_page; + toi_swapops.rw_header_chunk = toi_bio_ops.rw_header_chunk; + toi_swapops.rw_header_chunk_noreadahead = + toi_bio_ops.rw_header_chunk_noreadahead; + toi_swapops.io_flusher = toi_bio_ops.io_flusher; + toi_swapops.update_throughput_throttle = + toi_bio_ops.update_throughput_throttle; + toi_swapops.finish_all_io = toi_bio_ops.finish_all_io; + + return toi_register_module(&toi_swapops); +} + +#ifdef MODULE +static __exit void toi_swap_unload(void) +{ + toi_unregister_module(&toi_swapops); +} + +module_init(toi_swap_load); +module_exit(toi_swap_unload); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Nigel Cunningham"); +MODULE_DESCRIPTION("TuxOnIce SwapAllocator"); +#else +late_initcall(toi_swap_load); +#endif diff --git a/kernel/power/tuxonice_sysfs.c b/kernel/power/tuxonice_sysfs.c new file mode 100644 index 0000000..4f64dc7 --- /dev/null +++ b/kernel/power/tuxonice_sysfs.c @@ -0,0 +1,324 @@ +/* + * kernel/power/tuxonice_sysfs.c + * + * Copyright (C) 2002-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * This file contains support for sysfs entries for tuning TuxOnIce. + * + * We have a generic handler that deals with the most common cases, and + * hooks for special handlers to use. 
+ */ + +#include +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice.h" +#include "tuxonice_storage.h" +#include "tuxonice_alloc.h" + +static int toi_sysfs_initialised; + +static void toi_initialise_sysfs(void); + +static struct toi_sysfs_data sysfs_params[]; + +#define to_sysfs_data(_attr) container_of(_attr, struct toi_sysfs_data, attr) + +static void toi_main_wrapper(void) +{ + _toi_try_hibernate(); +} + +static ssize_t toi_attr_show(struct kobject *kobj, struct attribute *attr, + char *page) +{ + struct toi_sysfs_data *sysfs_data = to_sysfs_data(attr); + int len = 0; + int full_prep = sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ; + + if (full_prep && toi_start_anything(0)) + return -EBUSY; + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ) + toi_prepare_usm(); + + switch (sysfs_data->type) { + case TOI_SYSFS_DATA_CUSTOM: + len = (sysfs_data->data.special.read_sysfs) ? + (sysfs_data->data.special.read_sysfs)(page, PAGE_SIZE) + : 0; + break; + case TOI_SYSFS_DATA_BIT: + len = sprintf(page, "%d\n", + -test_bit(sysfs_data->data.bit.bit, + sysfs_data->data.bit.bit_vector)); + break; + case TOI_SYSFS_DATA_INTEGER: + len = sprintf(page, "%d\n", + *(sysfs_data->data.integer.variable)); + break; + case TOI_SYSFS_DATA_LONG: + len = sprintf(page, "%ld\n", + *(sysfs_data->data.a_long.variable)); + break; + case TOI_SYSFS_DATA_UL: + len = sprintf(page, "%lu\n", + *(sysfs_data->data.ul.variable)); + break; + case TOI_SYSFS_DATA_STRING: + len = sprintf(page, "%s\n", + sysfs_data->data.string.variable); + break; + } + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_READ) + toi_cleanup_usm(); + + if (full_prep) + toi_finish_anything(0); + + return len; +} + +#define BOUND(_variable, _type) do { \ + if (*_variable < sysfs_data->data._type.minimum) \ + *_variable = sysfs_data->data._type.minimum; \ + else if (*_variable > sysfs_data->data._type.maximum) \ + *_variable = sysfs_data->data._type.maximum; \ +} while (0) + +static ssize_t toi_attr_store(struct kobject *kobj, struct attribute *attr, + const char *my_buf, size_t count) +{ + int assigned_temp_buffer = 0, result = count; + struct toi_sysfs_data *sysfs_data = to_sysfs_data(attr); + + if (toi_start_anything((sysfs_data->flags & SYSFS_HIBERNATE_OR_RESUME))) + return -EBUSY; + + ((char *) my_buf)[count] = 0; + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_WRITE) + toi_prepare_usm(); + + switch (sysfs_data->type) { + case TOI_SYSFS_DATA_CUSTOM: + if (sysfs_data->data.special.write_sysfs) + result = (sysfs_data->data.special.write_sysfs)(my_buf, + count); + break; + case TOI_SYSFS_DATA_BIT: + { + int value = simple_strtoul(my_buf, NULL, 0); + if (value) + set_bit(sysfs_data->data.bit.bit, + (sysfs_data->data.bit.bit_vector)); + else + clear_bit(sysfs_data->data.bit.bit, + (sysfs_data->data.bit.bit_vector)); + } + break; + case TOI_SYSFS_DATA_INTEGER: + { + int *variable = + sysfs_data->data.integer.variable; + *variable = simple_strtol(my_buf, NULL, 0); + BOUND(variable, integer); + break; + } + case TOI_SYSFS_DATA_LONG: + { + long *variable = + sysfs_data->data.a_long.variable; + *variable = simple_strtol(my_buf, NULL, 0); + BOUND(variable, a_long); + break; + } + case TOI_SYSFS_DATA_UL: + { + unsigned long *variable = + sysfs_data->data.ul.variable; + *variable = simple_strtoul(my_buf, NULL, 0); + BOUND(variable, ul); + break; + } + break; + case TOI_SYSFS_DATA_STRING: + { + int copy_len = count; + char *variable = + sysfs_data->data.string.variable; + + if (sysfs_data->data.string.max_length && + (copy_len > 
sysfs_data->data.string.max_length)) + copy_len = sysfs_data->data.string.max_length; + + if (!variable) { + variable = (char *) toi_get_zeroed_page(31, + TOI_ATOMIC_GFP); + sysfs_data->data.string.variable = variable; + assigned_temp_buffer = 1; + } + strncpy(variable, my_buf, copy_len); + if (copy_len && my_buf[copy_len - 1] == '\n') + variable[count - 1] = 0; + variable[count] = 0; + } + break; + } + + /* Side effect routine? */ + if (sysfs_data->write_side_effect) + sysfs_data->write_side_effect(); + + /* Free temporary buffers */ + if (assigned_temp_buffer) { + toi_free_page(31, + (unsigned long) sysfs_data->data.string.variable); + sysfs_data->data.string.variable = NULL; + } + + if (sysfs_data->flags & SYSFS_NEEDS_SM_FOR_WRITE) + toi_cleanup_usm(); + + toi_finish_anything(sysfs_data->flags & SYSFS_HIBERNATE_OR_RESUME); + + return result; +} + +static struct sysfs_ops toi_sysfs_ops = { + .show = &toi_attr_show, + .store = &toi_attr_store, +}; + +static struct kobj_type toi_ktype = { + .sysfs_ops = &toi_sysfs_ops, +}; + +struct kobject *tuxonice_kobj; + +/* Non-module sysfs entries. + * + * This array contains entries that are automatically registered at + * boot. Modules and the console code register their own entries separately. + */ + +static struct toi_sysfs_data sysfs_params[] = { + SYSFS_CUSTOM("do_hibernate", SYSFS_WRITEONLY, NULL, NULL, + SYSFS_HIBERNATING, toi_main_wrapper), + SYSFS_CUSTOM("do_resume", SYSFS_WRITEONLY, NULL, NULL, + SYSFS_RESUMING, __toi_try_resume) +}; + +void remove_toi_sysdir(struct kobject *kobj) +{ + if (!kobj) + return; + + kobject_put(kobj); +} + +struct kobject *make_toi_sysdir(char *name) +{ + struct kobject *kobj = kobject_create_and_add(name, tuxonice_kobj); + + if (!kobj) { + printk(KERN_INFO "TuxOnIce: Can't allocate kobject for sysfs " + "dir!\n"); + return NULL; + } + + kobj->ktype = &toi_ktype; + + return kobj; +} + +/* toi_register_sysfs_file + * + * Helper for registering a new /sysfs/tuxonice entry. + */ + +int toi_register_sysfs_file( + struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data) +{ + int result; + + if (!toi_sysfs_initialised) + toi_initialise_sysfs(); + + result = sysfs_create_file(kobj, &toi_sysfs_data->attr); + if (result) + printk(KERN_INFO "TuxOnIce: sysfs_create_file for %s " + "returned %d.\n", + toi_sysfs_data->attr.name, result); + kobj->ktype = &toi_ktype; + + return result; +} +EXPORT_SYMBOL_GPL(toi_register_sysfs_file); + +/* toi_unregister_sysfs_file + * + * Helper for removing unwanted /sys/power/tuxonice entries. + * + */ +void toi_unregister_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data) +{ + sysfs_remove_file(kobj, &toi_sysfs_data->attr); +} +EXPORT_SYMBOL_GPL(toi_unregister_sysfs_file); + +void toi_cleanup_sysfs(void) +{ + int i, + numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + if (!toi_sysfs_initialised) + return; + + for (i = 0; i < numfiles; i++) + toi_unregister_sysfs_file(tuxonice_kobj, &sysfs_params[i]); + + kobject_put(tuxonice_kobj); + toi_sysfs_initialised = 0; +} + +/* toi_initialise_sysfs + * + * Initialise the /sysfs/tuxonice directory. 
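+ * (The directory is created as a child of power_kobj, so it appears
+ * as /sys/power/tuxonice.)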
+ */ + +static void toi_initialise_sysfs(void) +{ + int i; + int numfiles = sizeof(sysfs_params) / sizeof(struct toi_sysfs_data); + + if (toi_sysfs_initialised) + return; + + /* Make our TuxOnIce directory a child of /sys/power */ + tuxonice_kobj = kobject_create_and_add("tuxonice", power_kobj); + if (!tuxonice_kobj) + return; + + toi_sysfs_initialised = 1; + + for (i = 0; i < numfiles; i++) + toi_register_sysfs_file(tuxonice_kobj, &sysfs_params[i]); +} + +int toi_sysfs_init(void) +{ + toi_initialise_sysfs(); + return 0; +} + +void toi_sysfs_exit(void) +{ + toi_cleanup_sysfs(); +} diff --git a/kernel/power/tuxonice_sysfs.h b/kernel/power/tuxonice_sysfs.h new file mode 100644 index 0000000..2fea1cc --- /dev/null +++ b/kernel/power/tuxonice_sysfs.h @@ -0,0 +1,138 @@ +/* + * kernel/power/tuxonice_sysfs.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + */ + +#include +#include "power.h" + +struct toi_sysfs_data { + struct attribute attr; + int type; + int flags; + union { + struct { + unsigned long *bit_vector; + int bit; + } bit; + struct { + int *variable; + int minimum; + int maximum; + } integer; + struct { + long *variable; + long minimum; + long maximum; + } a_long; + struct { + unsigned long *variable; + unsigned long minimum; + unsigned long maximum; + } ul; + struct { + char *variable; + int max_length; + } string; + struct { + int (*read_sysfs) (const char *buffer, int count); + int (*write_sysfs) (const char *buffer, int count); + void *data; + } special; + } data; + + /* Side effects routine. Used, eg, for reparsing the + * resume= entry when it changes */ + void (*write_side_effect) (void); + struct list_head sysfs_data_list; +}; + +enum { + TOI_SYSFS_DATA_NONE = 1, + TOI_SYSFS_DATA_CUSTOM, + TOI_SYSFS_DATA_BIT, + TOI_SYSFS_DATA_INTEGER, + TOI_SYSFS_DATA_UL, + TOI_SYSFS_DATA_LONG, + TOI_SYSFS_DATA_STRING +}; + +#define SYSFS_WRITEONLY 0200 +#define SYSFS_READONLY 0444 +#define SYSFS_RW 0644 + +#define SYSFS_BIT(_name, _mode, _ul, _bit, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_BIT, \ + .flags = _flags, \ + .data = { .bit = { .bit_vector = _ul, .bit = _bit } } } + +#define SYSFS_INT(_name, _mode, _int, _min, _max, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_INTEGER, \ + .flags = _flags, \ + .data = { .integer = { .variable = _int, .minimum = _min, \ + .maximum = _max } }, \ + .write_side_effect = _wse } + +#define SYSFS_UL(_name, _mode, _ul, _min, _max, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_UL, \ + .flags = _flags, \ + .data = { .ul = { .variable = _ul, .minimum = _min, \ + .maximum = _max } } } + +#define SYSFS_LONG(_name, _mode, _long, _min, _max, _flags) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_LONG, \ + .flags = _flags, \ + .data = { .a_long = { .variable = _long, .minimum = _min, \ + .maximum = _max } } } + +#define SYSFS_STRING(_name, _mode, _string, _max_len, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_STRING, \ + .flags = _flags, \ + .data = { .string = { .variable = _string, .max_length = _max_len } }, \ + .write_side_effect = _wse } + +#define SYSFS_CUSTOM(_name, _mode, _read, _write, _flags, _wse) { \ + .attr = {.name = _name , .mode = _mode }, \ + .type = TOI_SYSFS_DATA_CUSTOM, \ + .flags = _flags, \ + .data = { .special = { .read_sysfs = _read, .write_sysfs = _write } }, \ + 
.write_side_effect = _wse } + +#define SYSFS_NONE(_name, _wse) { \ + .attr = {.name = _name , .mode = SYSFS_WRITEONLY }, \ + .type = TOI_SYSFS_DATA_NONE, \ + .write_side_effect = _wse, \ +} + +/* Flags */ +#define SYSFS_NEEDS_SM_FOR_READ 1 +#define SYSFS_NEEDS_SM_FOR_WRITE 2 +#define SYSFS_HIBERNATE 4 +#define SYSFS_RESUME 8 +#define SYSFS_HIBERNATE_OR_RESUME (SYSFS_HIBERNATE | SYSFS_RESUME) +#define SYSFS_HIBERNATING (SYSFS_HIBERNATE | SYSFS_NEEDS_SM_FOR_WRITE) +#define SYSFS_RESUMING (SYSFS_RESUME | SYSFS_NEEDS_SM_FOR_WRITE) +#define SYSFS_NEEDS_SM_FOR_BOTH \ + (SYSFS_NEEDS_SM_FOR_READ | SYSFS_NEEDS_SM_FOR_WRITE) + +int toi_register_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data); +void toi_unregister_sysfs_file(struct kobject *kobj, + struct toi_sysfs_data *toi_sysfs_data); + +extern struct kobject *tuxonice_kobj; + +struct kobject *make_toi_sysdir(char *name); +void remove_toi_sysdir(struct kobject *obj); +extern void toi_cleanup_sysfs(void); + +extern int toi_sysfs_init(void); +extern void toi_sysfs_exit(void); diff --git a/kernel/power/tuxonice_ui.c b/kernel/power/tuxonice_ui.c new file mode 100644 index 0000000..4da4afd --- /dev/null +++ b/kernel/power/tuxonice_ui.c @@ -0,0 +1,250 @@ +/* + * kernel/power/tuxonice_ui.c + * + * Copyright (C) 1998-2001 Gabor Kuti + * Copyright (C) 1998,2001,2002 Pavel Machek + * Copyright (C) 2002-2003 Florent Chabaud + * Copyright (C) 2002-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Routines for TuxOnIce's user interface. + * + * The user interface code talks to a userspace program via a + * netlink socket. + * + * The kernel side: + * - starts the userui program; + * - sends text messages and progress bar status; + * + * The user space side: + * - passes messages regarding user requests (abort, toggle reboot etc) + * + */ + +#define __KERNEL_SYSCALLS__ + +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_netlink.h" +#include "tuxonice_power_off.h" +#include "tuxonice_builtin.h" + +static char local_printf_buf[1024]; /* Same as printk - should be safe */ +struct ui_ops *toi_current_ui; +EXPORT_SYMBOL_GPL(toi_current_ui); + +/** + * toi_wait_for_keypress - Wait for keypress via userui or /dev/console. + * + * @timeout: Maximum time to wait. + * + * Wait for a keypress, either from userui or /dev/console if userui isn't + * available. The non-userui path is particularly for at boot-time, prior + * to userui being started, when we have an important warning to give to + * the user. + */ +static char toi_wait_for_keypress(int timeout) +{ + if (toi_current_ui && toi_current_ui->wait_for_key(timeout)) + return ' '; + + return toi_wait_for_keypress_dev_console(timeout); +} + +/* toi_early_boot_message() + * Description: Handle errors early in the process of booting. + * The user may press C to continue booting, perhaps + * invalidating the image, or space to reboot. + * This works from either the serial console or normally + * attached keyboard. + * + * Note that we come in here from init, while the kernel is + * locked. If we want to get events from the serial console, + * we need to temporarily unlock the kernel. + * + * toi_early_boot_message may also be called post-boot. + * In this case, it simply printks the message and returns. + * + * Arguments: int Whether we are able to erase the image. + * int default_answer. What to do when we timeout. 
This + * will normally be continue, but the user might + * provide command line options (__setup) to override + * particular cases. + * Char *. Pointer to a string explaining why we're moaning. + */ + +#define say(message, a...) printk(KERN_EMERG message, ##a) + +void toi_early_boot_message(int message_detail, int default_answer, + char *warning_reason, ...) +{ +#if defined(CONFIG_VT) || defined(CONFIG_SERIAL_CONSOLE) + unsigned long orig_state = get_toi_state(), continue_req = 0; + unsigned long orig_loglevel = console_loglevel; + int can_ask = 1; +#else + int can_ask = 0; +#endif + + va_list args; + int printed_len; + + if (!toi_wait) { + set_toi_state(TOI_CONTINUE_REQ); + can_ask = 0; + } + + if (warning_reason) { + va_start(args, warning_reason); + printed_len = vsnprintf(local_printf_buf, + sizeof(local_printf_buf), + warning_reason, + args); + va_end(args); + } + + if (!test_toi_state(TOI_BOOT_TIME)) { + printk("TuxOnIce: %s\n", local_printf_buf); + return; + } + + if (!can_ask) { + continue_req = !!default_answer; + goto post_ask; + } + +#if defined(CONFIG_VT) || defined(CONFIG_SERIAL_CONSOLE) + console_loglevel = 7; + + say("=== TuxOnIce ===\n\n"); + if (warning_reason) { + say("BIG FAT WARNING!! %s\n\n", local_printf_buf); + switch (message_detail) { + case 0: + say("If you continue booting, note that any image WILL" + "NOT BE REMOVED.\nTuxOnIce is unable to do so " + "because the appropriate modules aren't\n" + "loaded. You should manually remove the image " + "to avoid any\npossibility of corrupting your " + "filesystem(s) later.\n"); + break; + case 1: + say("If you want to use the current TuxOnIce image, " + "reboot and try\nagain with the same kernel " + "that you hibernated from. If you want\n" + "to forget that image, continue and the image " + "will be erased.\n"); + break; + } + say("Press SPACE to reboot or C to continue booting with " + "this kernel\n\n"); + if (toi_wait > 0) + say("Default action if you don't select one in %d " + "seconds is: %s.\n", + toi_wait, + default_answer == TOI_CONTINUE_REQ ? + "continue booting" : "reboot"); + } else { + say("BIG FAT WARNING!!\n\n" + "You have tried to resume from this image before.\n" + "If it failed once, it may well fail again.\n" + "Would you like to remove the image and boot " + "normally?\nThis will be equivalent to entering " + "noresume on the\nkernel command line.\n\n" + "Press SPACE to remove the image or C to continue " + "resuming.\n\n"); + if (toi_wait > 0) + say("Default action if you don't select one in %d " + "seconds is: %s.\n", toi_wait, + !!default_answer ? + "continue resuming" : "remove the image"); + } + console_loglevel = orig_loglevel; + + set_toi_state(TOI_SANITY_CHECK_PROMPT); + clear_toi_state(TOI_CONTINUE_REQ); + + if (toi_wait_for_keypress(toi_wait) == 0) /* We timed out */ + continue_req = !!default_answer; + else + continue_req = test_toi_state(TOI_CONTINUE_REQ); + +#endif /* CONFIG_VT or CONFIG_SERIAL_CONSOLE */ + +post_ask: + if ((warning_reason) && (!continue_req)) + machine_restart(NULL); + + restore_toi_state(orig_state); + if (continue_req) + set_toi_state(TOI_CONTINUE_REQ); +} +EXPORT_SYMBOL_GPL(toi_early_boot_message); +#undef say + +/* + * User interface specific /sys/power/tuxonice entries. 
+ */ + +static struct toi_sysfs_data sysfs_params[] = { +#if defined(CONFIG_NET) && defined(CONFIG_SYSFS) + SYSFS_INT("default_console_level", SYSFS_RW, + &toi_bkd.toi_default_console_level, 0, 7, 0, NULL), + SYSFS_UL("debug_sections", SYSFS_RW, &toi_bkd.toi_debug_state, 0, + 1 << 30, 0), + SYSFS_BIT("log_everything", SYSFS_RW, &toi_bkd.toi_action, TOI_LOGALL, + 0) +#endif +}; + +static struct toi_module_ops userui_ops = { + .type = MISC_HIDDEN_MODULE, + .name = "printk ui", + .directory = "user_interface", + .module = THIS_MODULE, + .sysfs_data = sysfs_params, + .num_sysfs_entries = sizeof(sysfs_params) / + sizeof(struct toi_sysfs_data), +}; + +int toi_register_ui_ops(struct ui_ops *this_ui) +{ + if (toi_current_ui) { + printk(KERN_INFO "Only one TuxOnIce user interface module can " + "be loaded at a time."); + return -EBUSY; + } + + toi_current_ui = this_ui; + + return 0; +} +EXPORT_SYMBOL_GPL(toi_register_ui_ops); + +void toi_remove_ui_ops(struct ui_ops *this_ui) +{ + if (toi_current_ui != this_ui) + return; + + toi_current_ui = NULL; +} +EXPORT_SYMBOL_GPL(toi_remove_ui_ops); + +/* toi_console_sysfs_init + * Description: Boot time initialisation for user interface. + */ + +int toi_ui_init(void) +{ + return toi_register_module(&userui_ops); +} + +void toi_ui_exit(void) +{ + toi_unregister_module(&userui_ops); +} diff --git a/kernel/power/tuxonice_ui.h b/kernel/power/tuxonice_ui.h new file mode 100644 index 0000000..dc45741 --- /dev/null +++ b/kernel/power/tuxonice_ui.h @@ -0,0 +1,103 @@ +/* + * kernel/power/tuxonice_ui.h + * + * Copyright (C) 2004-2008 Nigel Cunningham (nigel at tuxonice net) + */ + +enum { + DONT_CLEAR_BAR, + CLEAR_BAR +}; + +enum { + /* Userspace -> Kernel */ + USERUI_MSG_ABORT = 0x11, + USERUI_MSG_SET_STATE = 0x12, + USERUI_MSG_GET_STATE = 0x13, + USERUI_MSG_GET_DEBUG_STATE = 0x14, + USERUI_MSG_SET_DEBUG_STATE = 0x15, + USERUI_MSG_SPACE = 0x18, + USERUI_MSG_GET_POWERDOWN_METHOD = 0x1A, + USERUI_MSG_SET_POWERDOWN_METHOD = 0x1B, + USERUI_MSG_GET_LOGLEVEL = 0x1C, + USERUI_MSG_SET_LOGLEVEL = 0x1D, + USERUI_MSG_PRINTK = 0x1E, + + /* Kernel -> Userspace */ + USERUI_MSG_MESSAGE = 0x21, + USERUI_MSG_PROGRESS = 0x22, + USERUI_MSG_POST_ATOMIC_RESTORE = 0x25, + + USERUI_MSG_MAX, +}; + +struct userui_msg_params { + u32 a, b, c, d; + char text[255]; +}; + +struct ui_ops { + char (*wait_for_key) (int timeout); + u32 (*update_status) (u32 value, u32 maximum, const char *fmt, ...); + void (*prepare_status) (int clearbar, const char *fmt, ...); + void (*cond_pause) (int pause, char *message); + void (*abort)(int result_code, const char *fmt, ...); + void (*prepare)(void); + void (*cleanup)(void); + void (*post_atomic_restore)(void); + void (*message)(u32 section, u32 level, u32 normally_logged, + const char *fmt, ...); +}; + +extern struct ui_ops *toi_current_ui; + +#define toi_update_status(val, max, fmt, args...) \ + (toi_current_ui ? (toi_current_ui->update_status) (val, max, fmt, ##args) : \ + max) + +#define toi_ui_post_atomic_restore(void) \ + do { if (toi_current_ui) \ + (toi_current_ui->post_atomic_restore)(); \ + } while (0) + +#define toi_prepare_console(void) \ + do { if (toi_current_ui) \ + (toi_current_ui->prepare)(); \ + } while (0) + +#define toi_cleanup_console(void) \ + do { if (toi_current_ui) \ + (toi_current_ui->cleanup)(); \ + } while (0) + +#define abort_hibernate(result, fmt, args...) 
\ + do { if (toi_current_ui) \ + (toi_current_ui->abort)(result, fmt, ##args); \ + else { \ + set_abort_result(result); \ + } \ + } while (0) + +#define toi_cond_pause(pause, message) \ + do { if (toi_current_ui) \ + (toi_current_ui->cond_pause)(pause, message); \ + } while (0) + +#define toi_prepare_status(clear, fmt, args...) \ + do { if (toi_current_ui) \ + (toi_current_ui->prepare_status)(clear, fmt, ##args); \ + else \ + printk(KERN_ERR fmt "%s", ##args, "\n"); \ + } while (0) + +#define toi_message(sn, lev, log, fmt, a...) \ +do { \ + if (toi_current_ui && (!sn || test_debug_state(sn))) \ + toi_current_ui->message(sn, lev, log, fmt, ##a); \ +} while (0) + +__exit void toi_ui_cleanup(void); +extern int toi_ui_init(void); +extern void toi_ui_exit(void); +extern int toi_register_ui_ops(struct ui_ops *this_ui); +extern void toi_remove_ui_ops(struct ui_ops *this_ui); diff --git a/kernel/power/tuxonice_userui.c b/kernel/power/tuxonice_userui.c new file mode 100644 index 0000000..de685b6 --- /dev/null +++ b/kernel/power/tuxonice_userui.c @@ -0,0 +1,663 @@ +/* + * kernel/power/user_ui.c + * + * Copyright (C) 2005-2007 Bernard Blackham + * Copyright (C) 2002-2008 Nigel Cunningham (nigel at tuxonice net) + * + * This file is released under the GPLv2. + * + * Routines for TuxOnIce's user interface. + * + * The user interface code talks to a userspace program via a + * netlink socket. + * + * The kernel side: + * - starts the userui program; + * - sends text messages and progress bar status; + * + * The user space side: + * - passes messages regarding user requests (abort, toggle reboot etc) + * + */ + +#define __KERNEL_SYSCALLS__ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tuxonice_sysfs.h" +#include "tuxonice_modules.h" +#include "tuxonice.h" +#include "tuxonice_ui.h" +#include "tuxonice_netlink.h" +#include "tuxonice_power_off.h" + +static char local_printf_buf[1024]; /* Same as printk - should be safe */ + +static struct user_helper_data ui_helper_data; +static struct toi_module_ops userui_ops; +static int orig_kmsg; + +static char lastheader[512]; +static int lastheader_message_len; +static int ui_helper_changed; /* Used at resume-time so don't overwrite value + set from initrd/ramfs. */ + +/* Number of distinct progress amounts that userspace can display */ +static int progress_granularity = 30; + +static DECLARE_WAIT_QUEUE_HEAD(userui_wait_for_key); + +/** + * ui_nl_set_state - Update toi_action based on a message from userui. + * + * @n: The bit (1 << bit) to set. + */ +static void ui_nl_set_state(int n) +{ + /* Only let them change certain settings */ + static const u32 toi_action_mask = + (1 << TOI_REBOOT) | (1 << TOI_PAUSE) | + (1 << TOI_LOGALL) | + (1 << TOI_SINGLESTEP) | + (1 << TOI_PAUSE_NEAR_PAGESET_END); + + toi_bkd.toi_action = (toi_bkd.toi_action & (~toi_action_mask)) | + (n & toi_action_mask); + + if (!test_action_state(TOI_PAUSE) && + !test_action_state(TOI_SINGLESTEP)) + wake_up_interruptible(&userui_wait_for_key); +} + +/** + * userui_post_atomic_restore - Tell userui that atomic restore just happened. + * + * Tell userui that atomic restore just occured, so that it can do things like + * redrawing the screen, re-getting settings and so on. + */ +static void userui_post_atomic_restore(void) +{ + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_POST_ATOMIC_RESTORE, NULL, 0); +} + +/** + * userui_storage_needed - Report how much memory in image header is needed. 
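+ *
+ * We save the helper program's path and the progress granularity in
+ * the image header, so we need room for the program buffer, an int
+ * and a byte of padding.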
+ */ +static int userui_storage_needed(void) +{ + return sizeof(ui_helper_data.program) + 1 + sizeof(int); +} + +/** + * userui_save_config_info - Fill buffer with config info for image header. + * + * @buf: Buffer into which to put the config info we want to save. + */ +static int userui_save_config_info(char *buf) +{ + *((int *) buf) = progress_granularity; + memcpy(buf + sizeof(int), ui_helper_data.program, + sizeof(ui_helper_data.program)); + return sizeof(ui_helper_data.program) + sizeof(int) + 1; +} + +/** + * userui_load_config_info - Restore config info from buffer. + * + * @buf: Buffer containing header info loaded. + * @size: Size of data loaded for this module. + */ +static void userui_load_config_info(char *buf, int size) +{ + progress_granularity = *((int *) buf); + size -= sizeof(int); + + /* Don't load the saved path if one has already been set */ + if (ui_helper_changed) + return; + + if (size > sizeof(ui_helper_data.program)) + size = sizeof(ui_helper_data.program); + + memcpy(ui_helper_data.program, buf + sizeof(int), size); + ui_helper_data.program[sizeof(ui_helper_data.program)-1] = '\0'; +} + +/** + * set_ui_program_set: Record that userui program was changed. + * + * Side effect routine for when the userui program is set. In an initrd or + * ramfs, the user may set a location for the userui program. If this happens, + * we don't want to reload the value that was saved in the image header. This + * routine allows us to flag that we shouldn't restore the program name from + * the image header. + */ +static void set_ui_program_set(void) +{ + ui_helper_changed = 1; +} + +/** + * userui_memory_needed - Tell core how much memory to reserve for us. + */ +static int userui_memory_needed(void) +{ + /* ball park figure of 128 pages */ + return 128 * PAGE_SIZE; +} + +/** + * userui_update_status - Update the progress bar and (if on) in-bar message. + * + * @value: Current progress percentage numerator. + * @maximum: Current progress percentage denominator. + * @fmt: Message to be displayed in the middle of the progress bar. + * + * Note that a NULL message does not mean that any previous message is erased! + * For that, you need toi_prepare_status with clearbar on. + * + * Returns an unsigned long, being the next numerator (as determined by the + * maximum and progress granularity) where status needs to be updated. + * This is to reduce unnecessary calls to update_status. + */ +static u32 userui_update_status(u32 value, u32 maximum, const char *fmt, ...) +{ + static u32 last_step = 9999; + struct userui_msg_params msg; + u32 this_step, next_update; + int bitshift; + + if (ui_helper_data.pid == -1) + return 0; + + if ((!maximum) || (!progress_granularity)) + return maximum; + + if (value < 0) + value = 0; + + if (value > maximum) + value = maximum; + + /* Try to avoid math problems - we can't do 64 bit math here + * (and shouldn't need it - anyone got screen resolution + * of 65536 pixels or more?) 
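+ * For example, with maximum 0x30000, fls() returns 18, so value and
+ * maximum are both shifted right by two bits first, keeping
+ * value * progress_granularity within 32 bits.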
*/ + bitshift = fls(maximum) - 16; + if (bitshift > 0) { + u32 temp_maximum = maximum >> bitshift; + u32 temp_value = value >> bitshift; + this_step = (u32) + (temp_value * progress_granularity / temp_maximum); + next_update = (((this_step + 1) * temp_maximum / + progress_granularity) + 1) << bitshift; + } else { + this_step = (u32) (value * progress_granularity / maximum); + next_update = ((this_step + 1) * maximum / + progress_granularity) + 1; + } + + if (this_step == last_step) + return next_update; + + memset(&msg, 0, sizeof(msg)); + + msg.a = this_step; + msg.b = progress_granularity; + + if (fmt) { + va_list args; + va_start(args, fmt); + vsnprintf(msg.text, sizeof(msg.text), fmt, args); + va_end(args); + msg.text[sizeof(msg.text)-1] = '\0'; + } + + toi_send_netlink_message(&ui_helper_data, USERUI_MSG_PROGRESS, + &msg, sizeof(msg)); + last_step = this_step; + + return next_update; +} + +/** + * userui_message - Display a message without necessarily logging it. + * + * @section: Type of message. Messages can be filtered by type. + * @level: Degree of importance of the message. Lower values = higher priority. + * @normally_logged: Whether logged even if log_everything is off. + * @fmt: Message (and parameters). + * + * This function is intended to do the same job as printk, but without normally + * logging what is printed. The point is to be able to get debugging info on + * screen without filling the logs with "1/534. ^M 2/534^M. 3/534^M" + * + * It may be called from an interrupt context - can't sleep! + */ +static void userui_message(u32 section, u32 level, u32 normally_logged, + const char *fmt, ...) +{ + struct userui_msg_params msg; + + if ((level) && (level > console_loglevel)) + return; + + memset(&msg, 0, sizeof(msg)); + + msg.a = section; + msg.b = level; + msg.c = normally_logged; + + if (fmt) { + va_list args; + va_start(args, fmt); + vsnprintf(msg.text, sizeof(msg.text), fmt, args); + va_end(args); + msg.text[sizeof(msg.text)-1] = '\0'; + } + + if (test_action_state(TOI_LOGALL)) + printk(KERN_INFO "%s\n", msg.text); + + toi_send_netlink_message(&ui_helper_data, USERUI_MSG_MESSAGE, + &msg, sizeof(msg)); +} + +/** + * wait_for_key_via_userui - Wait for userui to receive a keypress. + */ +static void wait_for_key_via_userui(void) +{ + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(&userui_wait_for_key, &wait); + set_current_state(TASK_INTERRUPTIBLE); + + interruptible_sleep_on(&userui_wait_for_key); + + set_current_state(TASK_RUNNING); + remove_wait_queue(&userui_wait_for_key, &wait); +} + +/** + * userui_prepare_status - Display high level messages. + * + * @clearbar: Whether to clear the progress bar. + * @fmt...: New message for the title. + * + * Prepare the 'nice display', drawing the header and version, along with the + * current action and perhaps also resetting the progress bar. + */ +static void userui_prepare_status(int clearbar, const char *fmt, ...) +{ + va_list args; + + if (fmt) { + va_start(args, fmt); + lastheader_message_len = vsnprintf(lastheader, 512, fmt, args); + va_end(args); + } + + if (clearbar) + toi_update_status(0, 1, NULL); + + if (ui_helper_data.pid == -1) + printk(KERN_EMERG "%s\n", lastheader); + else + toi_message(0, TOI_STATUS, 1, lastheader, NULL); +} + +/** + * toi_wait_for_keypress - Wait for keypress via userui. + * + * @timeout: Maximum time to wait. + * + * Wait for a keypress from userui. + * + * FIXME: Implement timeout? 
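+ * (At present the timeout argument is ignored; the call blocks until
+ * userui reports a keypress via USERUI_MSG_SPACE.)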
+ */ +static char userui_wait_for_keypress(int timeout) +{ + char key = '\0'; + + if (ui_helper_data.pid != -1) { + wait_for_key_via_userui(); + key = ' '; + } + + return key; +} + +/** + * userui_abort_hibernate - Abort a cycle & tell user if they didn't request it. + * + * @result_code: Reason why we're aborting (1 << bit). + * @fmt: Message to display if telling the user what's going on. + * + * Abort a cycle. If this wasn't at the user's request (and we're displaying + * output), tell the user why and wait for them to acknowledge the message. + */ +static void userui_abort_hibernate(int result_code, const char *fmt, ...) +{ + va_list args; + int printed_len = 0; + + set_result_state(result_code); + + if (test_result_state(TOI_ABORTED)) + return; + + set_result_state(TOI_ABORTED); + + if (test_result_state(TOI_ABORT_REQUESTED)) + return; + + va_start(args, fmt); + printed_len = vsnprintf(local_printf_buf, sizeof(local_printf_buf), + fmt, args); + va_end(args); + if (ui_helper_data.pid != -1) + printed_len = sprintf(local_printf_buf + printed_len, + " (Press SPACE to continue)"); + + toi_prepare_status(CLEAR_BAR, "%s", local_printf_buf); + + if (ui_helper_data.pid != -1) + userui_wait_for_keypress(0); +} + +/** + * request_abort_hibernate - Abort hibernating or resuming at user request. + * + * Handle the user requesting the cancellation of a hibernation or resume by + * pressing escape. + */ +static void request_abort_hibernate(void) +{ + if (test_result_state(TOI_ABORT_REQUESTED)) + return; + + if (test_toi_state(TOI_NOW_RESUMING)) { + toi_prepare_status(CLEAR_BAR, "Escape pressed. " + "Powering down again."); + set_toi_state(TOI_STOP_RESUME); + while (!test_toi_state(TOI_IO_STOPPED)) + schedule(); + if (toiActiveAllocator->mark_resume_attempted) + toiActiveAllocator->mark_resume_attempted(0); + toi_power_down(); + } + + toi_prepare_status(CLEAR_BAR, "--- ESCAPE PRESSED :" + " ABORTING HIBERNATION ---"); + set_abort_result(TOI_ABORT_REQUESTED); + wake_up_interruptible(&userui_wait_for_key); +} + +/** + * userui_user_rcv_msg - Receive a netlink message from userui. + * + * @skb: skb received. + * @nlh: Netlink header received. 
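+ *
+ * Returns 0 when the message is handled here, a negative error code
+ * on failure, or 1 to let the generic netlink code process it.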
+ */ +static int userui_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int type; + int *data; + + type = nlh->nlmsg_type; + + /* A control message: ignore them */ + if (type < NETLINK_MSG_BASE) + return 0; + + /* Unknown message: reply with EINVAL */ + if (type >= USERUI_MSG_MAX) + return -EINVAL; + + /* All operations require privileges, even GET */ + if (security_netlink_recv(skb, CAP_NET_ADMIN)) + return -EPERM; + + /* Only allow one task to receive NOFREEZE privileges */ + if (type == NETLINK_MSG_NOFREEZE_ME && ui_helper_data.pid != -1) { + printk(KERN_INFO "Got NOFREEZE_ME request when " + "ui_helper_data.pid is %d.\n", ui_helper_data.pid); + return -EBUSY; + } + + data = (int *) NLMSG_DATA(nlh); + + switch (type) { + case USERUI_MSG_ABORT: + request_abort_hibernate(); + return 0; + case USERUI_MSG_GET_STATE: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_STATE, &toi_bkd.toi_action, + sizeof(toi_bkd.toi_action)); + return 0; + case USERUI_MSG_GET_DEBUG_STATE: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_DEBUG_STATE, + &toi_bkd.toi_debug_state, + sizeof(toi_bkd.toi_debug_state)); + return 0; + case USERUI_MSG_SET_STATE: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + ui_nl_set_state(*data); + return 0; + case USERUI_MSG_SET_DEBUG_STATE: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + toi_bkd.toi_debug_state = (*data); + return 0; + case USERUI_MSG_SPACE: + wake_up_interruptible(&userui_wait_for_key); + return 0; + case USERUI_MSG_GET_POWERDOWN_METHOD: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_POWERDOWN_METHOD, + &toi_poweroff_method, + sizeof(toi_poweroff_method)); + return 0; + case USERUI_MSG_SET_POWERDOWN_METHOD: + if (nlh->nlmsg_len != NLMSG_LENGTH(sizeof(char))) + return -EINVAL; + toi_poweroff_method = (unsigned long)(*data); + return 0; + case USERUI_MSG_GET_LOGLEVEL: + toi_send_netlink_message(&ui_helper_data, + USERUI_MSG_GET_LOGLEVEL, + &toi_bkd.toi_default_console_level, + sizeof(toi_bkd.toi_default_console_level)); + return 0; + case USERUI_MSG_SET_LOGLEVEL: + if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(int))) + return -EINVAL; + toi_bkd.toi_default_console_level = (*data); + return 0; + case USERUI_MSG_PRINTK: + printk(KERN_INFO "%s", (char *) data); + return 0; + } + + /* Unhandled here */ + return 1; +} + +/** + * userui_cond_pause - Possibly pause at user request. + * + * @pause: Whether to pause or just display the message. + * @message: Message to display at the start of pausing. + * + * Potentially pause and wait for the user to tell us to continue. We normally + * only pause when @pause is set. While paused, the user can do things like + * changing the loglevel, toggling the display of debugging sections and such + * like. + */ +static void userui_cond_pause(int pause, char *message) +{ + int displayed_message = 0, last_key = 0; + + while (last_key != 32 && + ui_helper_data.pid != -1 && + ((test_action_state(TOI_PAUSE) && pause) || + (test_action_state(TOI_SINGLESTEP)))) { + if (!displayed_message) { + toi_prepare_status(DONT_CLEAR_BAR, + "%s Press SPACE to continue.%s", + message ? message : "", + (test_action_state(TOI_SINGLESTEP)) ? + " Single step on." : ""); + displayed_message = 1; + } + last_key = userui_wait_for_keypress(0); + } + schedule(); +} + +/** + * userui_prepare_console - Prepare the console for use. + * + * Prepare a console for use, saving current kmsg settings and attempting to + * start userui. Console loglevel changes are handled by userui. 
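+ *
+ * kmsg_redirect is pointed at the foreground console so that kernel
+ * messages appear on the display userui is drawing; the original
+ * value is restored by userui_cleanup_console().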
+ */
+static void userui_prepare_console(void)
+{
+	orig_kmsg = kmsg_redirect;
+	kmsg_redirect = fg_console + 1;
+
+	ui_helper_data.pid = -1;
+
+	if (!userui_ops.enabled) {
+		printk(KERN_INFO "TuxOnIce: Userui disabled.\n");
+		return;
+	}
+
+	if (*ui_helper_data.program)
+		toi_netlink_setup(&ui_helper_data);
+	else
+		printk(KERN_INFO "TuxOnIce: Userui program not configured.\n");
+}
+
+/**
+ * userui_cleanup_console - Clean up after a cycle.
+ *
+ * Tell userui to clean up, and restore kmsg_redirect to its original value.
+ */
+static void userui_cleanup_console(void)
+{
+	if (ui_helper_data.pid > -1)
+		toi_netlink_close(&ui_helper_data);
+
+	kmsg_redirect = orig_kmsg;
+}
+
+/*
+ * User interface specific /sys/power/tuxonice entries.
+ */
+
+static struct toi_sysfs_data sysfs_params[] = {
+#if defined(CONFIG_NET) && defined(CONFIG_SYSFS)
+	SYSFS_BIT("enable_escape", SYSFS_RW, &toi_bkd.toi_action,
+			TOI_CAN_CANCEL, 0),
+	SYSFS_BIT("pause_between_steps", SYSFS_RW, &toi_bkd.toi_action,
+			TOI_PAUSE, 0),
+	SYSFS_INT("enabled", SYSFS_RW, &userui_ops.enabled, 0, 1, 0, NULL),
+	SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
+			2048, 0, NULL),
+	SYSFS_STRING("program", SYSFS_RW, ui_helper_data.program, 255, 0,
+			set_ui_program_set),
+	SYSFS_INT("debug", SYSFS_RW, &ui_helper_data.debug, 0, 1, 0, NULL)
+#endif
+};
+
+static struct toi_module_ops userui_ops = {
+	.type = MISC_MODULE,
+	.name = "userui",
+	.shared_directory = "user_interface",
+	.module = THIS_MODULE,
+	.storage_needed = userui_storage_needed,
+	.save_config_info = userui_save_config_info,
+	.load_config_info = userui_load_config_info,
+	.memory_needed = userui_memory_needed,
+	.sysfs_data = sysfs_params,
+	.num_sysfs_entries = ARRAY_SIZE(sysfs_params),
+};
+
+static struct ui_ops my_ui_ops = {
+	.post_atomic_restore = userui_post_atomic_restore,
+	.update_status = userui_update_status,
+	.message = userui_message,
+	.prepare_status = userui_prepare_status,
+	.abort = userui_abort_hibernate,
+	.cond_pause = userui_cond_pause,
+	.prepare = userui_prepare_console,
+	.cleanup = userui_cleanup_console,
+	.wait_for_key = userui_wait_for_keypress,
+};
+
+/**
+ * toi_user_ui_init - Boot time initialisation for user interface.
+ *
+ * Invoked from the core init routine.
+ */
+static __init int toi_user_ui_init(void)
+{
+	int result;
+
+	ui_helper_data.nl = NULL;
+	strncpy(ui_helper_data.program, CONFIG_TOI_USERUI_DEFAULT_PATH, 255);
+	ui_helper_data.pid = -1;
+	ui_helper_data.skb_size = sizeof(struct userui_msg_params);
+	ui_helper_data.pool_limit = 6;
+	ui_helper_data.netlink_id = NETLINK_TOI_USERUI;
+	ui_helper_data.name = "userspace ui";
+	ui_helper_data.rcv_msg = userui_user_rcv_msg;
+	ui_helper_data.interface_version = 8;
+	ui_helper_data.must_init = 0;
+	ui_helper_data.not_ready = userui_cleanup_console;
+	init_completion(&ui_helper_data.wait_for_process);
+	result = toi_register_module(&userui_ops);
+	if (!result)
+		result = toi_register_ui_ops(&my_ui_ops);
+	if (result)
+		toi_unregister_module(&userui_ops);
+
+	return result;
+}
+
+#ifdef MODULE
+/**
+ * toi_user_ui_exit - Clean up if the core is unloaded.
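+ *
+ * Close the netlink connection to any running helper, then unregister the
+ * ui_ops and the module itself.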
+ */
+static __exit void toi_user_ui_exit(void)
+{
+	toi_netlink_close_complete(&ui_helper_data);
+	toi_remove_ui_ops(&my_ui_ops);
+	toi_unregister_module(&userui_ops);
+}
+
+module_init(toi_user_ui_init);
+module_exit(toi_user_ui_exit);
+MODULE_AUTHOR("Nigel Cunningham");
+MODULE_DESCRIPTION("TuxOnIce Userui Support");
+MODULE_LICENSE("GPL");
+#else
+late_initcall(toi_user_ui_init);
+#endif
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 6c85359..5249c79 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -64,6 +64,7 @@ static struct snapshot_data {
 } snapshot_state;
 
 atomic_t snapshot_device_available = ATOMIC_INIT(1);
+EXPORT_SYMBOL_GPL(snapshot_device_available);
 
 static int snapshot_open(struct inode *inode, struct file *filp)
 {
diff --git a/kernel/printk.c b/kernel/printk.c
index e3602d0..a239499 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -59,6 +60,7 @@ int console_printk[4] = {
 	MINIMUM_CONSOLE_LOGLEVEL,	/* minimum_console_loglevel */
 	DEFAULT_CONSOLE_LOGLEVEL,	/* default_console_loglevel */
 };
+EXPORT_SYMBOL_GPL(console_printk);
 
 /*
  * Low level drivers may need that to know if they can schedule in
@@ -892,6 +894,7 @@ void suspend_console(void)
 	console_suspended = 1;
 	up(&console_sem);
 }
+EXPORT_SYMBOL_GPL(suspend_console);
 
 void resume_console(void)
 {
@@ -901,6 +904,7 @@ void resume_console(void)
 	console_suspended = 0;
 	release_console_sem();
 }
+EXPORT_SYMBOL_GPL(resume_console);
 
 /**
  * acquire_console_sem - lock the console system for exclusive use.
diff --git a/kernel/timer.c b/kernel/timer.c
index 13dd64f..7075c2d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -37,6 +37,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
@@ -1057,6 +1059,59 @@ unsigned long avenrun[3];
 
 EXPORT_SYMBOL(avenrun);
 
+#ifdef CONFIG_PM
+static unsigned long avenrun_save[3];
+
+/*
+ * save_avenrun - Record the load average values prior to starting a
+ * hibernation cycle. We do this to make the work done in hibernation
+ * invisible to userspace after resume. Some programs, including some MTAs,
+ * watch the load average and defer work until it drops. Without this, they
+ * would needlessly stop working for a while after resuming.
+ */
+static void save_avenrun(void)
+{
+	avenrun_save[0] = avenrun[0];
+	avenrun_save[1] = avenrun[1];
+	avenrun_save[2] = avenrun[2];
+}
+
+static void restore_avenrun(void)
+{
+	/* avenrun_save[0] == 0 means no values have been saved */
+	if (!avenrun_save[0])
+		return;
+
+	avenrun[0] = avenrun_save[0];
+	avenrun[1] = avenrun_save[1];
+	avenrun[2] = avenrun_save[2];
+
+	avenrun_save[0] = 0;
+}
+
+static int avenrun_pm_callback(struct notifier_block *nfb,
+		unsigned long action, void *ignored)
+{
+	switch (action) {
+	case PM_HIBERNATION_PREPARE:
+		save_avenrun();
+		return NOTIFY_OK;
+	case PM_POST_HIBERNATION:
+		restore_avenrun();
+		return NOTIFY_OK;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static void register_pm_notifier_callback(void)
+{
+	pm_notifier(avenrun_pm_callback, 0);
+}
+#else
+static inline void register_pm_notifier_callback(void) { }
+#endif
+
 /*
  * calc_load - given tick count, update the avenrun load estimates.
  * This is called while holding a write_lock on xtime_lock.
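The avenrun hooks above use the generic PM notifier chain rather than
anything TuxOnIce-specific. For reference, a minimal sketch of the same
pattern for an arbitrary subsystem; my_save() and my_restore() are
hypothetical placeholders, while pm_notifier(), PM_HIBERNATION_PREPARE and
PM_POST_HIBERNATION are the standard helpers declared in <linux/suspend.h>:

	#include <linux/notifier.h>
	#include <linux/suspend.h>

	static int my_pm_callback(struct notifier_block *nb,
			unsigned long action, void *unused)
	{
		switch (action) {
		case PM_HIBERNATION_PREPARE:
			my_save();	/* stash state before the cycle */
			return NOTIFY_OK;
		case PM_POST_HIBERNATION:
			my_restore();	/* put it back afterwards */
			return NOTIFY_OK;
		}
		return NOTIFY_DONE;
	}

	/* At init time; priority 0 matches the avenrun registration above */
	pm_notifier(my_pm_callback, 0);

The same pair of events brackets the whole cycle, so state saved at
PM_HIBERNATION_PREPARE is restored both after a successful resume and after
an aborted cycle.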
@@ -1551,6 +1606,7 @@ void __init init_timers(void)
 	BUG_ON(err == NOTIFY_BAD);
 	register_cpu_notifier(&timers_nb);
 	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
+	register_pm_notifier_callback();
 }
 
 /**
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 51a0ccf..48374aa 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -22,6 +22,7 @@
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+EXPORT_SYMBOL_GPL(max_pfn);
 
 #ifdef CONFIG_CRASH_DUMP
 /*
diff --git a/mm/highmem.c b/mm/highmem.c
index b36b83b..c46ede9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -58,6 +58,7 @@ unsigned int nr_free_highpages (void)
 
 	return pages;
 }
+EXPORT_SYMBOL_GPL(nr_free_highpages);
 
 static int pkmap_count[LAST_PKMAP];
 static unsigned int last_pkmap_nr;
diff --git a/mm/memory.c b/mm/memory.c
index d7df5ba..ef53e66 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1180,6 +1180,7 @@ no_page_table:
 	}
 	return page;
 }
+EXPORT_SYMBOL_GPL(follow_page);
 
 /* Can we do the FOLL_ANON optimization? */
 static inline int use_zero_page(struct vm_area_struct *vma)
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 16ce8b9..8dd207b 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
 {
 	return NODE_DATA(first_online_node);
 }
+EXPORT_SYMBOL_GPL(first_online_pgdat);
 
 struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
 {
@@ -22,6 +23,7 @@ struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
 		return NULL;
 	return NODE_DATA(nid);
 }
+EXPORT_SYMBOL_GPL(next_online_pgdat);
 
 /*
  * next_zone - helper magic for for_each_zone()
@@ -41,6 +43,7 @@ struct zone *next_zone(struct zone *zone)
 	}
 	return zone;
 }
+EXPORT_SYMBOL_GPL(next_zone);
 
 static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 74dc57c..16df11b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -105,6 +105,7 @@ int dirty_expire_interval = 30 * HZ;
  * Flag that makes the machine dump writes/reads and block dirtyings.
  */
 int block_dump;
+EXPORT_SYMBOL_GPL(block_dump);
 
 /*
  * Flag that puts the machine in "laptop mode".
  * Doubles as a timeout in jiffies:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c44ed4..e9cddc3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1807,6 +1807,26 @@ static unsigned int nr_free_zone_pages(int offset)
 	return sum;
 }
 
+static unsigned int nr_unallocated_zone_pages(int offset)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	/* Just pick one node, since fallback list is circular */
+	unsigned int sum = 0;
+
+	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+
+	for_each_zone_zonelist(zone, z, zonelist, offset) {
+		unsigned long high = zone->pages_high;
+		unsigned long left = zone_page_state(zone, NR_FREE_PAGES);
+
+		if (left > high)
+			sum += left - high;
+	}
+
+	return sum;
+}
+
 /*
  * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL
  */
@@ -1817,6 +1837,15 @@ unsigned int nr_free_buffer_pages(void)
 EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
 
 /*
+ * Amount of free RAM within ZONE_DMA and ZONE_NORMAL that is above the
+ * zones' high watermarks
+ */
+unsigned int nr_unallocated_buffer_pages(void)
+{
+	return nr_unallocated_zone_pages(gfp_zone(GFP_USER));
+}
+EXPORT_SYMBOL_GPL(nr_unallocated_buffer_pages);
+
+/*
  * Amount of free RAM allocatable within all zones
  */
 unsigned int nr_free_pagecache_pages(void)
diff --git a/mm/shmem.c b/mm/shmem.c
index 4103a23..caeba07 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1538,6 +1538,8 @@ static struct inode *shmem_get_inode(struct super_block *sb, int mode,
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
 		info->flags = flags & VM_NORESERVE;
+		if (flags & VM_ATOMIC_COPY)
+			inode->i_flags |= S_ATOMIC_COPY;
 		INIT_LIST_HEAD(&info->swaplist);
 		switch (mode & S_IFMT) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..be135e6 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -45,6 +45,7 @@ struct address_space swapper_space = {
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
 };
+EXPORT_SYMBOL_GPL(swapper_space);
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..894fcb5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -414,6 +414,7 @@ noswap:
 	spin_unlock(&swap_lock);
 	return (swp_entry_t) {0};
 }
+EXPORT_SYMBOL_GPL(get_swap_page);
 
 swp_entry_t get_swap_page_of_type(int type)
 {
@@ -508,6 +509,7 @@ void swap_free(swp_entry_t entry)
 		spin_unlock(&swap_lock);
 	}
 }
+EXPORT_SYMBOL_GPL(swap_free);
 
 /*
  * How many references to page are currently swapped out?
@@ -1178,6 +1180,7 @@ sector_t map_swap_page(struct swap_info_struct *sis, pgoff_t offset)
 		BUG_ON(se == start_se);		/* It *must* be present */
 	}
 }
+EXPORT_SYMBOL_GPL(map_swap_page);
 
 #ifdef CONFIG_HIBERNATION
 /*
@@ -1521,6 +1524,7 @@ out_dput:
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(sys_swapoff);
 
 #ifdef CONFIG_PROC_FS
 /* iterator */
@@ -1919,6 +1923,7 @@ out:
 	}
 	return error;
 }
+EXPORT_SYMBOL_GPL(sys_swapon);
 
 void si_swapinfo(struct sysinfo *val)
 {
@@ -1936,6 +1941,7 @@ void si_swapinfo(struct sysinfo *val)
 	val->totalswap = total_swap_pages + nr_to_be_unused;
 	spin_unlock(&swap_lock);
 }
+EXPORT_SYMBOL_GPL(si_swapinfo);
 
 /*
  * Verify that a swap entry is valid and increment its swap map count.
@@ -1984,6 +1990,7 @@ get_swap_info_struct(unsigned type)
 {
 	return &swap_info[type];
 }
+EXPORT_SYMBOL_GPL(get_swap_info_struct);
 
 /*
  * swap_lock prevents swap_map being freed.
  * Don't grab an extra
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56ddf41..f8df64a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2025,6 +2025,9 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
+	/* Don't wake kswapd while processes are frozen for hibernation */
+	if (freezer_is_on())
+		return;
+
 	pgdat = zone->zone_pgdat;
 	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
 		return;
@@ -2184,6 +2187,7 @@ out:
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(shrink_all_memory);
 #endif
 
 /* It's optimal to keep kswapds on the same CPUs as their memory, but