*at family of syscalls
During development of the <trademark class="registered">Linux</trademark> 2.6.16 kernel, the *at syscalls were added. Those syscalls (<function>openat</function>, for example) work exactly like their at-less counterparts, with the slight exception of the <varname>dirfd</varname> parameter. This parameter changes the location relative to which the file named in the syscall is looked up. When the <varname>filename</varname> parameter is absolute, <varname>dirfd</varname> is ignored, but when the path to the file is relative, it comes into play. The <varname>dirfd</varname> parameter is either a file descriptor of some directory, relative to which the relative pathname is resolved, or <literal>AT_FDCWD</literal>, which selects the current working directory. So, for example, the <function>openat</function> syscall can be used like this:
file descriptor 123 = /tmp/foo/, current working directory = /tmp/

openat(123, "/tmp/bah", flags, mode) /* opens /tmp/bah */
openat(123, "bah", flags, mode) /* opens /tmp/foo/bah */
openat(AT_FDCWD, "bah", flags, mode) /* opens /tmp/bah */
openat(stdio, "bah", flags, mode) /* returns error because stdio is not a directory */
This infrastructure is necessary to avoid races when opening files outside the working directory. Imagine that a process consists of two threads, thread A and thread B. Thread A issues <literal>open("/tmp/foo/bah", flags, mode)</literal> and, before returning, it gets preempted and thread B runs. Thread B does not care about the needs of thread A and renames or removes <filename>/tmp/foo/</filename>. The result is a race. To avoid it, we can open <filename>/tmp/foo</filename> and use it as <varname>dirfd</varname> for the <function>openat</function> syscall. This also enables the user to implement per-thread working directories.
The <trademark class="registered">Linux</trademark> family of *at syscalls contains: <function>linux_openat</function>, <function>linux_mkdirat</function>, <function>linux_mknodat</function>, <function>linux_fchownat</function>, <function>linux_futimesat</function>, <function>linux_fstatat64</function>, <function>linux_unlinkat</function>, <function>linux_renameat</function>, <function>linux_linkat</function>, <function>linux_symlinkat</function>, <function>linux_readlinkat</function>, <function>linux_fchmodat</function> and <function>linux_faccessat</function>. All of these are implemented using the modified <citerefentry><refentrytitle>namei</refentrytitle><manvolnum>9</manvolnum></citerefentry> routine and a simple wrapping layer.
The implementation is done by altering the <citerefentry><refentrytitle>namei</refentrytitle><manvolnum>9</manvolnum></citerefentry> routine (described above) to take an additional parameter, <varname>dirfd</varname>, in its <literal>nameidata</literal> structure. This parameter specifies the starting point of the pathname lookup, instead of the current working directory being used every time. The resolution of <varname>dirfd</varname> from a file descriptor number to a vnode is done in the native *at syscalls. When <varname>dirfd</varname> is <literal>AT_FDCWD</literal>, the <varname>dvp</varname> entry in the <literal>nameidata</literal> structure is <literal>NULL</literal>. When <varname>dirfd</varname> is a different number, we obtain the file for this file descriptor, check whether the file is valid and, if there is a vnode attached to it, get that vnode. Then we check that this vnode is a directory. In the actual <citerefentry><refentrytitle>namei</refentrytitle><manvolnum>9</manvolnum></citerefentry> routine we simply substitute the <varname>dvp</varname> vnode for the <varname>dp</varname> variable, which determines the starting point. The <citerefentry><refentrytitle>namei</refentrytitle><manvolnum>9</manvolnum></citerefentry> routine is not used directly but through a chain of functions on various levels. For example, <function>openat</function> goes like this:
openat() --&gt; kern_openat() --&gt; vn_open() --&gt; namei()
For this reason <function>kern_open</function> and <function>vn_open</function> must be altered to incorporate the additional <varname>dirfd</varname> parameter. No compat layer is created for them because they have few users and those users can easily be converted. This general implementation enables FreeBSD to implement its own *at syscalls. This is being discussed right now.
The ioctl interface is quite fragile due to its generality. We have to bear in mind that devices differ between <trademark class="registered">Linux</trademark> and FreeBSD, so some care must be taken to get ioctl emulation right. The ioctl handling is implemented in <filename>linux_ioctl.c</filename>, where the <function>linux_ioctl</function> function is defined. This function simply iterates over sets of ioctl handlers to find a handler that implements a given command. The ioctl syscall has three parameters: the file descriptor, the command and an argument. The command is a 16-bit number, which in theory is divided into the high 8 bits, determining the class of the ioctl command, and the low 8 bits, which are the actual command within the given set. The emulation takes advantage of this division. We implement handlers for each set, like <function>sound_handler</function> or <function>disk_handler</function>. Each handler has a maximum command and a minimum command defined, which are used to determine which handler is used. There are slight problems with this approach because <trademark class="registered">Linux</trademark> does not use the set division consistently, so sometimes ioctls for a different set are inside a set they should not belong to (SCSI generic ioctls inside the cdrom set, etc.). FreeBSD currently does not implement many <trademark class="registered">Linux</trademark> ioctls (compared to NetBSD, for example) but the plan is to port those from NetBSD. The trend is to use <trademark class="registered">Linux</trademark> ioctls even in the native FreeBSD drivers because that makes porting applications easy.
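The range-based handler lookup described above can be sketched as follows. This is a simplified model, not the contents of <filename>linux_ioctl.c</filename>: the structure layout, the handler signature, the <literal>demo_</literal> names and the command ranges are all assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical handler signature; real handlers take more context. */
typedef int demo_ioctl_fn(unsigned int cmd);

struct demo_ioctl_handler {
	demo_ioctl_fn	*func;
	unsigned int	 low;	/* minimum command handled */
	unsigned int	 high;	/* maximum command handled */
};

static int demo_sound_handler(unsigned int cmd) { (void)cmd; return (1); }
static int demo_disk_handler(unsigned int cmd) { (void)cmd; return (2); }

/* Illustrative ranges only: one class byte per set. */
static const struct demo_ioctl_handler demo_handlers[] = {
	{ demo_sound_handler, 0x5000, 0x50ff },
	{ demo_disk_handler,  0x0300, 0x03ff },
};

/* High 8 bits: class of the command. */
unsigned int
demo_ioctl_class(unsigned int cmd)
{
	return ((cmd >> 8) & 0xff);
}

/* Low 8 bits: command number within the class. */
unsigned int
demo_ioctl_number(unsigned int cmd)
{
	return (cmd & 0xff);
}

/* Walk the handler table; return -1 when no range covers cmd. */
int
demo_dispatch_ioctl(unsigned int cmd)
{
	size_t i;

	for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++) {
		if (cmd >= demo_handlers[i].low && cmd <= demo_handlers[i].high)
			return (demo_handlers[i].func(cmd));
	}
	return (-1);
}
```

Matching on a minimum and maximum rather than on the class byte alone lets a handler cover commands that sit in a neighbouring set, which is one way to cope with the inconsistent set division mentioned above.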
Every syscall should be debuggable. For this purpose we introduce a small infrastructure. We have the ldebug facility, which tells whether a given syscall should be debugged (settable via a sysctl). For printing we have the LMSG and ARGS macros. They are used to build the printable string so that debugging messages are uniform.
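A userland model of this kind of debugging infrastructure might look like the sketch below. The <literal>demo_</literal> and <literal>DEMO_</literal> names, the message prefix and the boolean map are assumptions made for illustration; they are not the actual ldebug, LMSG and ARGS definitions.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical syscall numbers and per-syscall debug switches. */
enum { DEMO_SYS_openat = 3, DEMO_NSYSCALLS = 8 };
static bool demo_debug_map[DEMO_NSYSCALLS];

/* Should the named syscall be traced? */
#define	demo_ldebug(name)	(demo_debug_map[DEMO_SYS_##name])

/* Uniform prefix for free-form messages. */
#define	DEMO_LMSG(fmt)		"linux(%ld): " fmt "\n"

/* Uniform "name(args)" line; expands to a format plus the pid argument. */
#define	DEMO_ARGS(name, fmt)	"linux(%ld): " #name "(" fmt ")\n", (long)getpid()

/* Format a trace line for openat, or return 0 if tracing is disabled. */
int
demo_trace_openat(char *buf, size_t len, int dirfd, const char *path)
{
	if (!demo_ldebug(openat))
		return (0);
	return (snprintf(buf, len, DEMO_ARGS(openat, "%d, %s"), dirfd, path));
}

/* Format a free-form message with the same prefix. */
int
demo_trace_msg(char *buf, size_t len, const char *what)
{
	return (snprintf(buf, len, DEMO_LMSG("%s"), (long)getpid(), what));
}
```

Keeping the prefix inside the macros means every syscall wrapper produces lines of the same shape without repeating the format string.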
As of April 2007 the <trademark class="registered">Linux</trademark> emulation layer is capable of emulating the <trademark class="registered">Linux</trademark> 2.6.16 kernel quite well. The remaining problems concern futexes, the unfinished *at family of syscalls, problematic signal delivery, missing <function>epoll</function> and <function>inotify</function>, and probably some bugs we have not discovered yet. Despite this, we are capable of running basically all the <trademark class="registered">Linux</trademark> programs included in the FreeBSD Ports Collection with Fedora Core 4 at 2.6.16, and there are some rudimentary reports of success with Fedora Core 6 at 2.6.16. The Fedora Core 6 linux_base was recently committed, enabling further testing of the emulation layer and giving us more hints about where to put our effort in implementing the missing pieces.
We are able to run the most used applications like <package>www/linux-firefox</package>, <package>net-im/skype</package> and some games from the Ports Collection. Some of the programs exhibit bad behavior under 2.6 emulation, but this is currently under investigation and hopefully will be fixed soon. The only big application that is known not to work is the <trademark class="registered">Linux</trademark> <trademark>Java</trademark> Development Kit, and this is because it requires the <function>epoll</function> facility, which is not directly related to the <trademark class="registered">Linux</trademark> kernel 2.6.
We hope to enable 2.6.16 emulation by default some time after FreeBSD 7.0 is released at least to expose the 2.6 emulation parts for some wider testing. Once this is done we can switch to Fedora Core 6 linux_base, which is the ultimate plan.
Future work
Future work should focus on fixing the remaining issues with futexes, implementing the rest of the *at family of syscalls, fixing the signal delivery and possibly implementing the <function>epoll</function> and <function>inotify</function> facilities.
We hope to be able to run the most important programs flawlessly soon, so that we can switch to the 2.6 emulation by default and make Fedora Core 6 the default linux_base, because the Fedora Core 4 we currently use is not supported any more.
The other possible goal is to share our code with NetBSD and DragonFly BSD. NetBSD has some support for 2.6 emulation but it is far from finished and not really tested. DragonFly BSD has expressed some interest in porting the 2.6 improvements.
Generally, as <trademark class="registered">Linux</trademark> develops we would like to keep up with its development, implementing newly added syscalls. Splice comes to mind first. Some already implemented syscalls are also heavily crippled, for example <function>mremap</function> and others. Some performance improvements can also be made, such as finer-grained locking.
I cooperated on this project with (in alphabetical order):
John Baldwin <email></email>
Konstantin Belousov <email></email>
Emmanuel Dreyfus
Scot Hetzel
Jung-uk Kim <email></email>
Alexander Leidinger <email></email>

