Team LiB

System Call Implementation

The actual implementation of a system call in Linux does not need to concern itself with the behavior of the system call handler. Thus, adding a new system call to Linux is relatively easy. The hard work lies in designing and implementing the system call; registering it with the kernel is simple. Let's look at the steps involved in writing a new system call for Linux.

The first step in implementing a system call is defining its purpose. What will it do? The syscall should have exactly one purpose. Multiplexing syscalls (a single system call that does wildly different things depending on a flag argument) is discouraged in Linux. Look at ioctl() as an example of what not to do.
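
The contrast between a multiplexed entry point and single-purpose calls can be sketched in ordinary user-space C. This is only an illustration of the design advice above, not kernel code; every name here (demo_multiplexed(), demo_get_value(), demo_set_value()) is hypothetical.

```c
#include <errno.h>

/* Shared state for the illustration only. */
static long demo_value;

/* Discouraged shape: one entry point whose meaning depends on 'cmd'. */
enum demo_cmd { DEMO_GET = 0, DEMO_SET = 1 };

long demo_multiplexed(int cmd, long arg)
{
        switch (cmd) {
        case DEMO_GET:
                return demo_value;
        case DEMO_SET:
                demo_value = arg;
                return 0;
        default:
                /* every new command widens this switch and its argument rules */
                return -EINVAL;
        }
}

/* Preferred shape: one function per purpose, each with its own signature. */
long demo_get_value(void)
{
        return demo_value;
}

long demo_set_value(long v)
{
        demo_value = v;
        return 0;
}
```

With separate functions, the compiler checks each call's arguments, and neither call has to reject commands that do not apply to it.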

What are the new system call's arguments, return value, and error codes? The system call should have a clean and simple interface with the smallest number of arguments possible. The semantics and behavior of a system call are important; they must not change, because existing applications will come to rely on them.

Designing the interface with an eye toward the future is important. Are you needlessly limiting the function? Design the system call to be as general as possible. Do not assume its use today will be the same as its use tomorrow. The purpose of the system call will remain constant but its uses may change. Is the system call portable? Do not make assumptions about an architecture's word size or endianness. Chapter 19, "Portability," discusses these issues. Make sure you are not making poor assumptions that will break the system call in the future. Remember the Unix motto: "provide mechanism, not policy."
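
As a rough illustration of the portability advice, the following user-space sketch shows an argument block built from fixed-width types, plus helpers that store a value in a fixed byte order regardless of the host. The structure demo_args and the helpers put_le32() and get_le32() are hypothetical names chosen for this example; they are not taken from the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical argument block for an imagined system call. Fixed-width
 * types give it the same size whether the caller is built for a 32-bit
 * or a 64-bit word size.
 */
struct demo_args {
        uint64_t addr;   /* user address carried as a 64-bit value */
        uint32_t len;    /* length in bytes */
        uint32_t flags;  /* option bits */
};

/* Store a 32-bit value in little-endian order, whatever the host order. */
void put_le32(uint8_t out[4], uint32_t v)
{
        out[0] = (uint8_t)(v & 0xff);
        out[1] = (uint8_t)((v >> 8) & 0xff);
        out[2] = (uint8_t)((v >> 16) & 0xff);
        out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read back a 32-bit value that was stored in little-endian order. */
uint32_t get_le32(const uint8_t in[4])
{
        return (uint32_t)in[0] |
               ((uint32_t)in[1] << 8) |
               ((uint32_t)in[2] << 16) |
               ((uint32_t)in[3] << 24);
}
```

Because the helpers shift and mask instead of casting pointers, they behave identically on big-endian and little-endian machines.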

When you write a system call, it is important to realize the need for portability and robustness, not just today but in the future. The basic Unix system calls have survived this test of time; most of them are just as useful and applicable today as they were thirty years ago!

Verifying the Parameters

System calls must carefully verify all their parameters to ensure that they are valid and legal. The system call runs in kernel-space, and if the user is able to pass invalid input into the kernel without restraint, the system's security and stability can suffer.

For example, file I/O syscalls must check whether the file descriptor is valid. Process-related functions must check whether the provided PID is valid. Every parameter must be checked to ensure it is not just valid and legal, but correct.

One of the most important checks is the validity of any pointers that the user provides. Imagine if a process could pass any pointer into the kernel unchecked, warts and all, even a pointer to memory it had no permission to read! Processes could then trick the kernel into copying data they were not permitted to access, such as data belonging to another process. Before following a pointer into user-space, the kernel must ensure that

  • The pointer points to a region of memory in user-space. Processes must not be able to trick the kernel into reading data in kernel-space on their behalf.

  • The pointer points to a region of memory in the process's address space. The process must not be able to trick the kernel into reading someone else's data.

  • If reading, the memory is marked readable. If writing, the memory is marked writable. The process must not be able to bypass memory access restrictions.
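
The first of these checks can be modeled in user-space as a range test. The sketch below is a simplified model, not the kernel's implementation: the boundary DEMO_USER_LIMIT and the function demo_range_is_user() are hypothetical, and the second and third checks depend on the process's memory mappings, which are not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boundary: addresses below this value count as user-space. */
#define DEMO_USER_LIMIT UINT64_C(0x0000800000000000)

/*
 * Return true if the region [addr, addr + len) lies entirely below
 * DEMO_USER_LIMIT. The comparison is arranged so that addr + len is never
 * computed, which avoids wrapping around on overflow.
 */
bool demo_range_is_user(uint64_t addr, uint64_t len)
{
        if (addr > DEMO_USER_LIMIT)
                return false;
        return len <= DEMO_USER_LIMIT - addr;
}
```

The overflow-safe comparison matters: a naive test of addr + len against the limit would accept a huge len that wraps past zero.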

The kernel provides two methods for performing the requisite checks and the desired copy to and from user-space. Note that kernel code must never blindly follow a pointer into user-space! One of these two methods must always be used.

For writing into user-space, the method copy_to_user() is provided. It takes three parameters. The first is the destination memory address in the process's address space. The second is the source pointer in kernel-space. Finally, the third argument is the size in bytes of the data to copy.

For reading from user-space, the method copy_from_user() is analogous to copy_to_user(). The function reads from the second parameter into the first parameter the number of bytes specified in the third parameter.

Both of these functions return the number of bytes they failed to copy on error. On success, they return zero. It is standard for the syscall to return -EFAULT in the case of such an error.
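
The return convention can be illustrated with a user-space stand-in. In this sketch, demo_copy_from_user() is a hypothetical mock that imitates only the return value described above (bytes left uncopied, zero on success); its extra 'readable' parameter is an assumption added to simulate a fault partway through the copy. The hypothetical handler demo_read_word() shows the usual mapping of any shortfall to -EFAULT.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_from_user(). It copies at most 'readable'
 * bytes and returns the number of bytes it could NOT copy, so a return
 * value of zero means complete success.
 */
unsigned long demo_copy_from_user(void *to, const void *from,
                                  unsigned long n, unsigned long readable)
{
        unsigned long ok = n < readable ? n : readable;

        memcpy(to, from, ok);
        return n - ok;
}

/* Hypothetical handler: any nonzero shortfall is reported as -EFAULT. */
long demo_read_word(const unsigned long *src, unsigned long readable,
                    unsigned long *out)
{
        unsigned long buf;

        if (demo_copy_from_user(&buf, src, sizeof(buf), readable))
                return -EFAULT;

        *out = buf;
        return 0;
}
```

Note that the handler tests the result for nonzero instead of comparing it with a negative value, because the copy functions report a byte count and not an error code.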

Let's consider an example system call that uses both copy_from_user() and copy_to_user(). This syscall, silly_copy(), is utterly worthless; it copies data from its first parameter into its second. This is highly suboptimal in that it involves the intermediate extraneous copy into kernel-space for absolutely no reason. But it helps illustrate the point.

/*
 * silly_copy - utterly worthless syscall that copies the len bytes from
 * 'src' to 'dst' using the kernel as an intermediary in the copy for no
 * good reason.  But it makes for a good example!
 */
asmlinkage long sys_silly_copy(unsigned long *src,
                               unsigned long *dst,
                               unsigned long len)
{
        unsigned long buf;

        /* fail if the kernel wordsize and user wordsize do not match */
        if (len != sizeof(buf))
                return -EINVAL;

        /* copy src, which is in the user's address space, into buf */
        if (copy_from_user(&buf, src, len))
                return -EFAULT;

        /* copy buf into dst, which is in the user's address space */
        if (copy_to_user(dst, &buf, len))
                return -EFAULT;

        /* return amount of data copied */
        return len;
}

Both copy_to_user() and copy_from_user() may block. This occurs, for example, if the page containing the user data is not in physical memory but swapped to disk. In that case, the process sleeps until the page fault handler can bring the page from the swap file on disk into physical memory.

A final possible check is for valid permission. In older versions of Linux, it was standard for syscalls that require root privilege to use suser(). This function merely checked whether a user was root; it has since been removed in favor of a finer-grained "capabilities" system. The new system allows specific access checks on specific resources. A call to capable() with a valid capabilities flag returns nonzero if the caller holds the specified capability and zero otherwise. For example, capable(CAP_SYS_NICE) checks whether the caller has the ability to modify nice values of other processes. By default, the superuser possesses all capabilities and a non-root user possesses none. Here is another worthless system call, this one demonstrating capabilities:

asmlinkage long sys_am_i_popular (void)
{
        /* check whether the user possesses the CAP_SYS_NICE capability */
        if (!capable(CAP_SYS_NICE))
                return -EPERM;

        /* return zero for success */
        return 0;
}

See <linux/capability.h> for a list of all capabilities and what rights they entail.
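
To make the difference from a plain root check concrete, here is a small user-space model of a capability test. It is a sketch under stated assumptions, not the kernel's mechanism: struct demo_task, demo_capable(), and the capability numbers DEMO_CAP_NICE and DEMO_CAP_ADMIN are all hypothetical, and the real values are the ones defined in <linux/capability.h>.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical capability numbers, valid for this model only. */
enum { DEMO_CAP_NICE = 0, DEMO_CAP_ADMIN = 1 };

/* Hypothetical per-task record: one bit per held capability. */
struct demo_task {
        uint32_t caps;
};

/* Return nonzero if the task holds 'cap', zero otherwise. */
int demo_capable(const struct demo_task *t, int cap)
{
        return (int)((t->caps >> cap) & 1u);
}

/* Model of a permission-checked call: refuse without DEMO_CAP_NICE. */
long demo_am_i_popular(const struct demo_task *t)
{
        if (!demo_capable(t, DEMO_CAP_NICE))
                return -EPERM;

        return 0;
}
```

In this model a task can hold DEMO_CAP_NICE without holding DEMO_CAP_ADMIN, which is the kind of fine-grained grant that an all-or-nothing root check cannot express.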
