The translation is temporarily closed for contributions due to maintenance, please come back later.

Source string Read only

(itstool) path: listitem/para
Context English State
FreeBSD solves the deep layering problem with a special optimization called the <quote>All Shadowed Case</quote>. This case occurs if either C1 or C2 take sufficient COW faults to completely shadow all pages in B. Lets say that C1 achieves this. C1 can now bypass B entirely, so rather then have C1-&gt;B-&gt;A and C2-&gt;B-&gt;A we now have C1-&gt;A and C2-&gt;B-&gt;A. But look what also happened—now B has only one reference (C2), so we can collapse B and C2 together. The end result is that B is deleted entirely and we have C1-&gt;A and C2-&gt;A. It is often the case that B will contain a large number of pages and neither C1 nor C2 will be able to completely overshadow it. If we fork again and create a set of D layers, however, it is much more likely that one of the D layers will eventually be able to completely overshadow the much smaller dataset represented by C1 or C2. The same optimization will work at any point in the graph and the grand result of this is that even on a heavily forked machine VM Object stacks tend to not get much deeper then 4. This is true of both the parent and the children and true whether the parent is doing the forking or whether the children cascade forks.
The dead page problem still exists in the case where C1 or C2 do not completely overshadow B. Due to our other optimizations this case does not represent much of a problem and we simply allow the pages to be dead. If the system runs low on memory it will swap them out, eating a little swap, but that is it.
The advantage to the VM Object model is that <function>fork()</function> is extremely fast, since no real data copying need take place. The disadvantage is that you can build a relatively complex VM Object layering that slows page fault handling down a little, and you spend memory managing the VM Object structures. The optimizations FreeBSD makes proves to reduce the problems enough that they can be ignored, leaving no real disadvantage.
SWAP Layers
Private data pages are initially either copy-on-write or zero-fill pages. When a change, and therefore a copy, is made, the original backing object (usually a file) can no longer be used to save a copy of the page when the VM system needs to reuse it for other purposes. This is where SWAP comes in. SWAP is allocated to create backing store for memory that does not otherwise have it. FreeBSD allocates the swap management structure for a VM Object only when it is actually needed. However, the swap management structure has had problems historically:
Under FreeBSD 3.X the swap management structure preallocates an array that encompasses the entire object requiring swap backing store—even if only a few pages of that object are swap-backed. This creates a kernel memory fragmentation problem when large objects are mapped, or processes with large runsizes (RSS) fork.
Also, in order to keep track of swap space, a <quote>list of holes</quote> is kept in kernel memory, and this tends to get severely fragmented as well. Since the <quote>list of holes</quote> is a linear list, the swap allocation and freeing performance is a non-optimal O(n)-per-page.
It requires kernel memory allocations to take place during the swap freeing process, and that creates low memory deadlock problems.
The problem is further exacerbated by holes created due to the interleaving algorithm.
Also, the swap block map can become fragmented fairly easily resulting in non-contiguous allocations.
Kernel memory must also be allocated on the fly for additional swap management structures when a swapout occurs.
It is evident from that list that there was plenty of room for improvement. For FreeBSD 4.X, I completely rewrote the swap subsystem:
Swap management structures are allocated through a hash table rather than a linear array giving them a fixed allocation size and much finer granularity.
Rather then using a linearly linked list to keep track of swap space reservations, it now uses a bitmap of swap blocks arranged in a radix tree structure with free-space hinting in the radix node structures. This effectively makes swap allocation and freeing an O(1) operation.
The entire radix tree bitmap is also preallocated in order to avoid having to allocate kernel memory during critical low memory swapping operations. After all, the system tends to swap when it is low on memory so we should avoid allocating kernel memory at such times in order to avoid potential deadlocks.
To reduce fragmentation the radix tree is capable of allocating large contiguous chunks at once, skipping over smaller fragmented chunks.
I did not take the final step of having an <quote>allocating hint pointer</quote> that would trundle through a portion of swap as allocations were made in order to further guarantee contiguous allocations or at least locality of reference, but I ensured that such an addition could be made.
When to free a page
Since the VM system uses all available memory for disk caching, there are usually very few truly-free pages. The VM system depends on being able to properly choose pages which are not in use to reuse for new allocations. Selecting the optimal pages to free is possibly the single-most important function any VM system can perform because if it makes a poor selection, the VM system may be forced to unnecessarily retrieve pages from disk, seriously degrading system performance.
How much overhead are we willing to suffer in the critical path to avoid freeing the wrong page? Each wrong choice we make will cost us hundreds of thousands of CPU cycles and a noticeable stall of the affected processes, so we are willing to endure a significant amount of overhead in order to be sure that the right page is chosen. This is why FreeBSD tends to outperform other systems when memory resources become stressed.
The free page determination algorithm is built upon a history of the use of memory pages. To acquire this history, the system takes advantage of a page-used bit feature that most hardware page tables have.
In any case, the page-used bit is cleared and at some later point the VM system comes across the page again and sees that the page-used bit has been set. This indicates that the page is still being actively used. If the bit is still clear it is an indication that the page is not being actively used. By testing this bit periodically, a use history (in the form of a counter) for the physical page is developed. When the VM system later needs to free up some pages, checking this history becomes the cornerstone of determining the best candidate page to reuse.
What if the hardware has no page-used bit?
For those platforms that do not have this feature, the system actually emulates a page-used bit. It unmaps or protects a page, forcing a page fault if the page is accessed again. When the page fault is taken, the system simply marks the page as having been used and unprotects the page so that it may be used. While taking such page faults just to determine if a page is being used appears to be an expensive proposition, it is much less expensive than reusing the page for some other purpose only to find that a process needs it back and then have to go to disk.
FreeBSD makes use of several page queues to further refine the selection of pages to reuse as well as to determine when dirty pages must be flushed to their backing store. Since page tables are dynamic entities under FreeBSD, it costs virtually nothing to unmap a page from the address space of any processes using it. When a page candidate has been chosen based on the page-use counter, this is precisely what is done. The system must make a distinction between clean pages which can theoretically be freed up at any time, and dirty pages which must first be written to their backing store before being reusable. When a page candidate has been found it is moved to the inactive queue if it is dirty, or the cache queue if it is clean. A separate algorithm based on the dirty-to-clean page ratio determines when dirty pages in the inactive queue must be flushed to disk. Once this is accomplished, the flushed pages are moved from the inactive queue to the cache queue. At this point, pages in the cache queue can still be reactivated by a VM fault at relatively low cost. However, pages in the cache queue are considered to be <quote>immediately freeable</quote> and will be reused in an LRU (least-recently used) fashion when the system needs to allocate new memory.
It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a <command>systat -vm</command> command. As the VM system becomes more stressed, it makes a greater effort to maintain the various page queues at the levels determined to be the most effective.
An urban myth has circulated for years that Linux did a better job avoiding swapouts than FreeBSD, but this in fact is not true. What was actually occurring was that FreeBSD was proactively paging out unused pages in order to make room for more disk cache while Linux was keeping unused pages in core and leaving less memory available for cache and process pages. I do not know whether this is still true today.
Pre-Faulting and Zeroing Optimizations
Taking a VM fault is not expensive if the underlying page is already in core and can simply be mapped into the process, but it can become expensive if you take a whole lot of them on a regular basis. A good example of this is running a program such as <citerefentry><refentrytitle>ls</refentrytitle><manvolnum>1</manvolnum></citerefentry> or <citerefentry><refentrytitle>ps</refentrytitle><manvolnum>1</manvolnum></citerefentry> over and over again. If the program binary is mapped into memory but not mapped into the page table, then all the pages that will be accessed by the program will have to be faulted in every time the program is run. This is unnecessary when the pages in question are already in the VM Cache, so FreeBSD will attempt to pre-populate a process's page tables with those pages that are already in the VM Cache. One thing that FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For example, if you run the <citerefentry><refentrytitle>ls</refentrytitle><manvolnum>1</manvolnum></citerefentry> program while running <command>vmstat 1</command> you will notice that it always takes a certain number of page faults, even when you run it over and over again. These are zero-fill faults, not program code faults (which were pre-faulted in already). Pre-copying pages on exec or fork is an area that could use more study.
A large percentage of page faults that occur are zero-fill faults. You can usually see this by observing the <command>vmstat -s</command> output. These occur when a process accesses pages in its BSS area. The BSS area is expected to be initially zero but the VM system does not bother to allocate any memory at all until the process actually accesses it. When a fault occurs the VM system must not only allocate a new page, it must zero it as well. To optimize the zeroing operation the VM system has the ability to pre-zero pages and mark them as such, and to request pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the number of pages the system pre-zeros is limited in order to avoid blowing away the memory caches. This is an excellent example of adding complexity to the VM system in order to optimize the critical path.
Page Table Optimizations

Loading…

No matching activity found.

Browse all component changes

Source information

Source string comment
(itstool) path: listitem/para
Flags
read-only
Source string location
article.translate.xml:410
String age
a year ago
Source string age
a year ago
Translation file
articles/vm-design.pot, string 53