I spent some time this week analyzing the recently disclosed vulnerability in the Linux kernel's vmsplice system call. Several PoCs have been released and I was curious as to how they exploited the kernel.
Background on the vulnerability: vmsplice is a system call that allows a programmer to map an I/O vector (basically, an array of buffers) to a pipe. From the man page:
"The vmsplice() system call maps nr_segs ranges of user memory described by iov into a pipe. The file descriptor fd must refer to a pipe."
The kernel adjudicates the whole transaction, dutifully mapping or copying the user-specified memory to the pipe's buffers or vice versa.
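To make the intended, benign use concrete, here is a minimal sketch that feeds two user buffers into a pipe with a single vmsplice call and reads them back. The helper name vmsplice_roundtrip and the buffer contents are made up for illustration, and the call goes through the generic syscall() wrapper so the sketch does not depend on the C library exposing vmsplice() directly.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>   /* SYS_vmsplice */
#include <sys/types.h>
#include <sys/uio.h>       /* struct iovec */
#include <unistd.h>

/*
 * Hypothetical helper: map two user buffers into a pipe with one
 * vmsplice call, then read the bytes back out of the pipe.
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t vmsplice_roundtrip(char *out, size_t outlen)
{
    static char part1[] = "hello, ";
    static char part2[] = "pipe";
    struct iovec iov[2];
    int pi[2];
    long spliced;
    ssize_t got = -1;

    if (pipe(pi) < 0)
        return -1;

    /* Describe the user memory ranges: this array is the "iov". */
    iov[0].iov_base = part1;
    iov[0].iov_len  = strlen(part1);
    iov[1].iov_base = part2;
    iov[1].iov_len  = strlen(part2);

    /* User memory -> pipe; the fd must be the write end of a pipe. */
    spliced = syscall(SYS_vmsplice, pi[1], iov, 2UL, 0U);
    if (spliced > 0)
        got = read(pi[0], out, outlen);

    close(pi[0]);
    close(pi[1]);
    return got;
}
```

Calling vmsplice_roundtrip(buf, sizeof buf) with a 32-byte buf should leave the two pieces joined in buf. The exploit discussed below uses the opposite direction, passing the read end of the pipe so that pipe data is copied out to the addresses named in the iov.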
The trouble is that the sys_vmsplice routine didn't follow best practices for kernel programming: it failed to check the pointers passed from userspace for validity. In at least three places in fs/splice.c, data was copied to or from the user-specified iov array without verifying its validity via access_ok().
The exploit I examined only worked on kernel versions 2.6.23 to 2.6.24.1. Rafal Wojtczuk has an excellent write-up on the other exploit, which works on 2.6.17 and up. You should check it out.
In 2.6.23, code was added to handle copying from the pipe to the user iov.
Unfortunately, there was no check that this destination address was a valid mapping for the user process:
linux/fs/splice.c:
    /*
     * For lack of a better implementation, implement vmsplice() to userspace
     * as a simple copy of the pipes pages to the user iov.
     */
    static long vmsplice_to_user(struct file *file, const struct iovec __user *iov,
                                 unsigned long nr_segs, unsigned int flags)
    {
            struct pipe_inode_info *pipe;
            struct splice_desc sd;
            ssize_t size;
            int error;
            long ret;
    [...]
                    /*
                     * Get user address base and length for this iovec.
                     */
                    error = get_user(base, &iov->iov_base);
                    if (unlikely(error))
                            break;
                    error = get_user(len, &iov->iov_len);
                    if (unlikely(error))
                            break;

                    /*
                     * Sanity check this iovec. 0 read succeeds.
                     */
                    if (unlikely(!len))
                            break;
                    if (unlikely(!base)) {
                            error = -EFAULT;
                            break;
                    }
Note that base and len are only checked for being non-zero, rather than the more detailed check performed by access_ok(). Thus, we can pass in values that are unmapped (less useful for exploitation) or are mapped but unwritable.
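To illustrate the difference between a non-zero test and a range check, here is a rough userspace model of the kind of test access_ok() performs. This is a sketch and not the kernel's implementation: the name user_range_ok is made up, and USER_LIMIT assumes the common 3 GB user / 1 GB kernel split on 32-bit x86.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 3 GB user / 1 GB kernel split, common on 32-bit x86. */
#define USER_LIMIT UINT32_C(0xC0000000)

/*
 * Hypothetical stand-in for an access_ok()-style check: the range
 * [base, base + len) must not wrap around and must end at or below
 * USER_LIMIT. It says nothing about whether the pages are mapped.
 */
bool user_range_ok(uint32_t base, uint32_t len)
{
    if (len > USER_LIMIT)
        return false;
    /* Written this way so that base + len can never overflow. */
    return base <= USER_LIMIT - len;
}
```

Under these assumptions an ordinary user buffer passes, while any base above the split, such as the address of a kernel function, is rejected before a single byte is copied. The non-zero test in the listing above lets such an address straight through.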
This latter case is what qaaz's exploit utilizes. By specifying the entry point of another system call (in this case, the rarely used sys_vm86old) as the "base" for copying, qaaz tricks the kernel into overwriting the code of one of its own system calls:
    addr = get_target();
    printf("[+] addr: 0x%lx\n", addr);
    if (pipe(pi) < 0)
        die("pipe", errno);
    iov.iov_base = (void *) addr;
    iov.iov_len = TRAMP_SIZE;
    write(pi[1], TRAMP_CODE, TRAMP_SIZE);
    _vmsplice(pi[0], &iov, 1, 0);
    gimmeroot();
Here, get_target() finds the address of the target system call. TRAMP_CODE is a static buffer containing the code that replaces it, which is what performs the privilege escalation. gimmeroot() is a macro that invokes the newly overwritten syscall, passing it a function (kernel_code) that sets our process's UID/GID to root's:
    #define gimmeroot() syscall(TARGET_SYSCALL, 31337, kernel_code, 1, 2, 3, 4)
You would expect the user process to be terminated for an access violation. After all, it is asking the kernel to write to what should be a protected area of kernel memory, and a direct write to that address from user mode would fault and earn the process a SIGSEGV.
However, that is not what happens here. On 32-bit x86, kernel code such as the system call handlers is mapped into every process's address space; it is merely marked inaccessible to user mode. This exploit takes advantage of this, as the copy is done with the current process's memory mappings but with the elevated permissions (and reduced access checks) of running in kernel mode.
There are a number of steps that could be taken to prevent this exploit from working:
1. Obvious: check user addresses for validity before copying anything to or from them.
2. Protect certain chunks of kernel memory from being overwritten after their initial load. Unfortunately, the syscall table isn't a very good candidate for this tactic: it needs to be modified at runtime, at the very least for registering new system calls from loadable modules.
3. Audit system calls and flag any with unusual parameters. In this case, the user process was passing a pointer into what should have been (from the process's perspective) an unwritable page. This tactic would be very expensive, since you'd essentially be double-checking most of what the kernel already verifies, or at least should verify.
4. Careful code auditing to make sure #1 always happens.
Note that the tactic of preventing user processes from mapping very low memory (used in the other vmsplice exploit), as suggested elsewhere, would not have prevented this particular exploit from working. It uses a different path through sys_vmsplice to achieve its code execution and needs no zero page mappings or any such funkiness.
By the way, the vulnerability has been patched in 2.6.24.2. Though it is a local-only exploit, it is still a significant risk.