The translation is temporarily closed for contributions due to maintenance, please come back later.

Source string Read only

(itstool) path: question/para
Context English State
Both Linux and FreeBSD need work in this area. FreeBSD is trying to maximize the advantage of a potentially sparse active-mapping model (not all processes need to map all pages of a shared library, for example), whereas Linux is trying to simplify its algorithms. FreeBSD generally has the performance advantage here at the cost of wasting a little extra memory, but FreeBSD breaks down in the case where a large file is massively shared across hundreds of processes. Linux, on the other hand, breaks down in the case where many processes are sparsely-mapping the same shared library and also runs non-optimally when trying to determine whether a page can be reused or not.
Page Coloring
We will end with the page coloring optimizations. Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of the processor cache. In ancient times (i.e. 10+ years ago) processor caches tended to map virtual memory rather than physical memory. This led to a huge number of problems including having to clear the cache on every context switch in some cases, and problems with data aliasing in the cache. Modern processor caches map physical memory precisely to solve those problems. This means that two side-by-side pages in a processes address space may not correspond to two side-by-side pages in the cache. In fact, if you are not careful side-by-side pages in virtual memory could wind up using the same page in the processor cache—leading to cacheable data being thrown away prematurely and reducing CPU performance. This is true even with multi-way set-associative caches (though the effect is mitigated somewhat).
FreeBSD's memory allocation code implements page coloring optimizations, which means that the memory allocation code will attempt to locate free pages that are contiguous from the point of view of the cache. For example, if page 16 of physical memory is assigned to page 0 of a process's virtual memory and the cache can hold 4 pages, the page coloring code will not assign page 20 of physical memory to page 1 of a process's virtual memory. It would, instead, assign page 21 of physical memory. The page coloring code attempts to avoid assigning page 20 because this maps over the same cache memory as page 16 and would result in non-optimal caching. This code adds a significant amount of complexity to the VM memory allocation subsystem as you can well imagine, but the result is well worth the effort. Page Coloring makes VM memory as deterministic as physical memory in regards to cache performance.
Virtual memory in modern operating systems must address a number of different issues efficiently and for many different usage patterns. The modular and algorithmic approach that BSD has historically taken allows us to study and understand the current implementation as well as relatively cleanly replace large sections of the code. There have been a number of improvements to the FreeBSD VM system in the last several years, and work is ongoing.
Bonus QA session by Allen Briggs <email></email>
What is <quote>the interleaving algorithm</quote> that you refer to in your listing of the ills of the FreeBSD 3.X swap arrangements?
FreeBSD uses a fixed swap interleave which defaults to 4. This means that FreeBSD reserves space for four swap areas even if you only have one, two, or three. Since swap is interleaved the linear address space representing the <quote>four swap areas</quote> will be fragmented if you do not actually have four swap areas. For example, if you have two swap areas A and B FreeBSD's address space representation for that swap area will be interleaved in blocks of 16 pages:
FreeBSD 3.X uses a <quote>sequential list of free regions</quote> approach to accounting for the free swap areas. The idea is that large blocks of free linear space can be represented with a single list node (<filename>kern/subr_rlist.c</filename>). But due to the fragmentation the sequential list winds up being insanely fragmented. In the above example, completely unused swap will have A and B shown as <quote>free</quote> and C and D shown as <quote>all allocated</quote>. Each A-B sequence requires a list node to account for because C and D are holes, so the list node cannot be combined with the next A-B sequence.
Why do we interleave our swap space instead of just tack swap areas onto the end and do something fancier? It is a whole lot easier to allocate linear swaths of an address space and have the result automatically be interleaved across multiple disks than it is to try to put that sophistication elsewhere.
The fragmentation causes other problems. Being a linear list under 3.X, and having such a huge amount of inherent fragmentation, allocating and freeing swap winds up being an O(N) algorithm instead of an O(1) algorithm. Combined with other factors (heavy swapping) and you start getting into O(N^2) and O(N^3) levels of overhead, which is bad. The 3.X system may also need to allocate KVM during a swap operation to create a new list node which can lead to a deadlock if the system is trying to pageout pages in a low-memory situation.
Under 4.X we do not use a sequential list. Instead we use a radix tree and bitmaps of swap blocks rather than ranged list nodes. We take the hit of preallocating all the bitmaps required for the entire swap area up front but it winds up wasting less memory due to the use of a bitmap (one bit per block) instead of a linked list of nodes. The use of a radix tree instead of a sequential list gives us nearly O(1) performance no matter how fragmented the tree becomes.
How is the separation of clean and dirty (inactive) pages related to the situation where you see low cache queue counts and high active queue counts in <command>systat -vm</command>? Do the systat stats roll the active and dirty pages together for the active queue count?
I do not get the following:
It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a <command>systat -vm</command> command.
Yes, that is confusing. The relationship is <quote>goal</quote> verses <quote>reality</quote>. Our goal is to separate the pages but the reality is that if we are not in a memory crunch, we do not really have to.
What this means is that FreeBSD will not try very hard to separate out dirty pages (inactive queue) from clean pages (cache queue) when the system is not being stressed, nor will it try to deactivate pages (active queue -&gt; inactive queue) when the system is not being stressed, even if they are not being used.
In the <citerefentry><refentrytitle>ls</refentrytitle><manvolnum>1</manvolnum></citerefentry> / <command>vmstat 1</command> example, would not some of the page faults be data page faults (COW from executable file to private page)? I.e., I would expect the page faults to be some zero-fill and some program data. Or are you implying that FreeBSD does do pre-COW for the program data?
A COW fault can be either zero-fill or program-data. The mechanism is the same either way because the backing program-data is almost certainly already in the cache. I am indeed lumping the two together. FreeBSD does not pre-COW program data or zero-fill, but it <emphasis>does</emphasis> pre-map pages that exist in its cache.
In your section on page table optimizations, can you give a little more detail about <literal>pv_entry</literal> and <literal>vm_page</literal> (or should vm_page be <literal>vm_pmap</literal>—as in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karel, Quarterman)? Specifically, what kind of operation/reaction would require scanning the mappings?
How does Linux do in the case where FreeBSD breaks down (sharing a large file mapping over many processes)?
A <literal>vm_page</literal> represents an (object,index#) tuple. A <literal>pv_entry</literal> represents a hardware page table entry (pte). If you have five processes sharing the same physical page, and three of those processes's page tables actually map the page, that page will be represented by a single <literal>vm_page</literal> structure and three <literal>pv_entry</literal> structures.
<literal>pv_entry</literal> structures only represent pages mapped by the MMU (one <literal>pv_entry</literal> represents one pte). This means that when we need to remove all hardware references to a <literal>vm_page</literal> (in order to reuse the page for something else, page it out, clear it, dirty it, and so forth) we can simply scan the linked list of <literal>pv_entry</literal>'s associated with that <literal>vm_page</literal> to remove or modify the pte's from their page tables.
Under Linux there is no such linked list. In order to remove all the hardware page table mappings for a <literal>vm_page</literal> linux must index into every VM object that <emphasis>might</emphasis> have mapped the page. For example, if you have 50 processes all mapping the same shared library and want to get rid of page X in that library, you need to index into the page table for each of those 50 processes even if only 10 of them have actually mapped the page. So Linux is trading off the simplicity of its design against performance. Many VM algorithms which are O(1) or (small N) under FreeBSD wind up being O(N), O(N^2), or worse under Linux. Since the pte's representing a particular page in an object tend to be at the same offset in all the page tables they are mapped in, reducing the number of accesses into the page tables at the same pte offset will often avoid blowing away the L1 cache line for that offset, which can lead to better performance.
FreeBSD has added complexity (the <literal>pv_entry</literal> scheme) in order to increase performance (to limit page table accesses to <emphasis>only</emphasis> those pte's that need to be modified).
But FreeBSD has a scaling problem that Linux does not in that there are a limited number of <literal>pv_entry</literal> structures and this causes problems when you have massive sharing of data. In this case you may run out of <literal>pv_entry</literal> structures even though there is plenty of free memory available. This can be fixed easily enough by bumping up the number of <literal>pv_entry</literal> structures in the kernel config, but we really need to find a better way to do it.
In regards to the memory overhead of a page table verses the <literal>pv_entry</literal> scheme: Linux uses <quote>permanent</quote> page tables that are not throw away, but does not need a <literal>pv_entry</literal> for each potentially mapped pte. FreeBSD uses <quote>throw away</quote> page tables but adds in a <literal>pv_entry</literal> structure for each actually-mapped pte. I think memory utilization winds up being about the same, giving FreeBSD an algorithmic advantage with its ability to throw away page tables at will with very low overhead.
Finally, in the page coloring section, it might help to have a little more description of what you mean here. I did not quite follow it.
Do you know how an L1 hardware memory cache works? I will explain: Consider a machine with 16MB of main memory but only 128K of L1 cache. Generally the way this cache works is that each 128K block of main memory uses the <emphasis>same</emphasis> 128K of cache. If you access offset 0 in main memory and then offset 128K in main memory you can wind up throwing away the cached data you read from offset 0!


No matching activity found.

Browse all component changes

Source information

Source string comment
(itstool) path: question/para
Source string location
String age
a year ago
Source string age
a year ago
Translation file
articles/vm-design.pot, string 85