Often, a kernel developer will try to reduce the size of an attack surface against Linux, even if it can't be closed entirely. It's generally a toss-up whether such a patch makes it into the kernel. Linus Torvalds always prefers security patches that really close a hole, rather than just give attackers a slightly harder time of it.
Matthew Garrett recognized that userspace applications might have secret data that might be sitting in RAM at any given time, and that those applications might want to wipe that data clean so no one could look at it.
There were various ways to do this already in the kernel, as Matthew pointed out. An
application could use
mlock() to prevent its memory contents from being pushed into
swap, where it might be read more easily by attackers. An application also could use
atexit() to cause its memory to be thoroughly overwritten when the application exited,
thus leaving no secret data in the general pool of available RAM.
The problem, Matthew pointed out, came if an attacker was able to reboot the system at a critical moment—say, before the user's data could be safely overwritten. If attackers then booted into a different OS, they might be able to examine the data still stored in RAM, left over from the previously running Linux system.
As Matthew also noted, the existing way to prevent even that was to tell the UEFI firmware to wipe system memory before booting to another OS, but this would dramatically increase the amount of time it took to reboot. And if the good guys had won out over the attackers, forcing them to wait a long time for a reboot could be considered a denial of service attack—or at least downright annoying.
Ideally, Matthew said, if the attackers were only able to induce a clean shutdown—not simply a cold boot—then there needed to be a way to tell Linux to scrub all data out of RAM, so there would be no further need for UEFI to handle it, and thus no need for a very long delay during reboot.
Matthew explained the reasoning behind his patch. He said:
Unfortunately, if an application exits uncleanly, its secrets may still be present in RAM. This can't be easily fixed in userland (eg, if the OOM killer decides to kill a process holding secrets, we're not going to be able to avoid that), so this patch adds a new flag to madvise() to allow userland to request that the kernel clear the covered pages whenever the page reference count hits zero. Since vm_flags is already full on 32-bit, it will only work on 64-bit systems.
Matthew Wilcox liked this plan and offered some technical suggestions for Matthew G's patch, and Matthew G posted an updated version in response.