1. Intel(R) MPX Overview
========================

Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
introduced into Intel Architecture. Intel MPX provides hardware features
that can be used in conjunction with compiler changes to check memory
references whose intended compile-time bounds are violated at runtime by
a buffer overflow or underflow.

You can tell if your CPU supports MPX by looking in /proc/cpuinfo:

	cat /proc/cpuinfo | grep ' mpx '
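
The same check can be made from C. The following is only a sketch: the
helper names cpu_flags_contain() and cpu_has_mpx() are made up for
illustration, and it assumes the usual Linux layout in which each "flags"
line of /proc/cpuinfo is a whitespace-separated list of feature names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if `flag` appears as a whole word in `flags_line`.
 * Whole-word matching keeps a longer name that merely contains
 * "mpx" from being mistaken for the flag itself.
 */
bool cpu_flags_contain(const char *flags_line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags_line;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts_word = p == flags_line || p[-1] == ' ' || p[-1] == '\t';
		char next = p[len];
		bool ends_word = next == '\0' || next == ' ' ||
				 next == '\t' || next == '\n';

		if (starts_word && ends_word)
			return true;
		p++;
	}
	return false;
}

/* Scan /proc/cpuinfo and report whether any "flags" line lists mpx. */
bool cpu_has_mpx(void)
{
	char line[8192];
	bool found = false;
	FILE *f = fopen("/proc/cpuinfo", "r");

	if (!f)
		return false;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "flags", 5) == 0 &&
		    cpu_flags_contain(line, "mpx")) {
			found = true;
			break;
		}
	}
	fclose(f);
	return found;
}
```

A caller would simply test cpu_has_mpx() before relying on MPX. Note that
this only reports what the kernel exposes in /proc/cpuinfo; it does not
query the CPU directly.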

For more information, please refer to Intel(R) Architecture Instruction
Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
Extensions.

Note: As of December 2014, no hardware with MPX is available but it is
possible to use SDE (Intel(R) Software Development Emulator) instead, which
can be downloaded from
http://software.intel.com/en-us/articles/intel-software-development-emulator


2. How to get the advantage of MPX
==================================

For MPX to work, changes are required in the kernel, binutils and compiler.
No source changes are required for applications, just a recompile.

Many moving parts have to cooperate for this to work. The following
is how we expect the compiler, application and kernel to work together.

1) Application developer compiles with -fmpx. The compiler will add the
   instrumentation as well as some setup code called early after the app
   starts. New instruction prefixes are no-ops on old CPUs.
2) That setup code allocates (virtual) space for the "bounds directory",
   points the "bndcfgu" register to the directory (must also set the valid
   bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
   that the app will be using MPX.  The app must be careful not to access
   the bounds tables between the time when it populates "bndcfgu" and
   when it calls the prctl().  This might be hard to guarantee if the app
   is compiled with MPX.  You can add "__attribute__((bnd_legacy))" to
   the function to disable MPX instrumentation to help guarantee this.
   Also be careful not to call out to any other code which might be
   MPX-instrumented.
3) The kernel detects that the CPU has MPX, allows the new prctl() to
   succeed, and notes the location of the bounds directory. Userspace is
   expected to keep the bounds directory at that location. The kernel
   records it instead of reading it each time because the 'xsave'
   operation needed to access the bounds directory register is expensive.
4) If the application needs to spill bounds out of the 4 registers, it
   issues a bndstx instruction. Since the bounds directory is empty at
   this point, a bounds fault (#BR) is raised, the kernel allocates a
   bounds table (in the user address space) and makes the relevant entry
   in the bounds directory point to the new table.
5) If the application violates the bounds specified in the bounds registers,
   a separate kind of #BR is raised which will deliver a signal with
   information about the violation in the 'struct siginfo'.
6) Whenever memory is freed, we know that it can no longer contain valid
   pointers, and we attempt to free the associated space in the bounds
   tables. If an entire table becomes unused, we will attempt to free
   the table and remove the entry in the directory.

To summarize, there are essentially three things interacting here:

GCC with -fmpx:
 * Enables annotation of code with MPX instructions and prefixes
 * Inserts code early in the application to call into the "gcc runtime"
GCC MPX Runtime:
 * Checks for hardware MPX support in cpuid leaf
 * Allocates virtual space for the bounds directory (malloc() essentially)
 * Points the hardware BNDCFGU register at the directory
 * Calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
   start managing the bounds directories
Kernel MPX Code:
 * Checks for hardware MPX support in cpuid leaf
 * Handles #BR exceptions and sends SIGSEGV to the app when it violates
   bounds, like during a buffer overflow.
 * When bounds are spilled into an unallocated bounds table, the kernel
   notices in the #BR exception, allocates the virtual space, then
   updates the bounds directory to point to the new table. It keeps
   special track of the memory with a VM_MPX flag.
 * Frees unused bounds tables at the time that the memory they described
   is unmapped.


3. How does MPX kernel code work
================================

Handling #BR faults caused by MPX
---------------------------------

When MPX is enabled, there are 2 new situations that can generate
#BR faults:
  * new bounds tables (BT) need to be allocated to save bounds.
  * bounds violation caused by MPX instructions.

We hook the #BR handler to handle these two new situations.

On-demand kernel allocation of bounds tables
--------------------------------------------

MPX only has 4 hardware registers for storing bounds information. If
MPX-enabled code needs more than these 4 registers, it needs to spill
them somewhere. It has two special instructions for this which allow
the bounds to be moved between the bounds registers and some new "bounds
tables".

MPX reuses the #BR exception for its own faults. These faults are
similar conceptually to a page fault and are raised by the MPX
hardware both on bounds violations and when the tables are not
present. The kernel handles those #BR exceptions for not-present tables
by carving the space out of the process's normal address space and then
pointing the bounds directory over to it.
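
The on-demand behaviour can be pictured with a toy userspace model. This
is not kernel code and all names in it (toy_directory, toy_store_bounds()
and so on) are invented for illustration. The index split (directory
index from address bits 47:20, table index from bits 19:3) is the 64-bit
layout as we understand it from the Intel reference and is an assumption
here, as this document does not state it. The directory is shrunk to 16
entries and a table entry holds only the two bounds, to keep the model
small.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed 64-bit MPX layout (from the Intel reference, not this text). */
#define BD_INDEX_SHIFT	20			/* one directory entry per 1 MB */
#define BT_INDEX_SHIFT	3			/* one table entry per 8-byte slot */
#define BT_ENTRIES	((size_t)1 << 17)	/* entries per bounds table */

/* Toy simplification: only the first 16 MB of address space is modelled. */
#define TOY_BD_ENTRIES	16

struct toy_bounds {
	uintptr_t lower;
	uintptr_t upper;
};

struct toy_directory {
	struct toy_bounds *tables[TOY_BD_ENTRIES];	/* NULL means "invalid" */
	unsigned int faults;				/* simulated #BR count */
};

static size_t bd_index(uintptr_t addr)
{
	return addr >> BD_INDEX_SHIFT;
}

static size_t bt_index(uintptr_t addr)
{
	return (addr >> BT_INDEX_SHIFT) & (BT_ENTRIES - 1);
}

/*
 * Model of a bndstx-style store for the pointer kept at `ptr_location`.
 * If the directory entry is invalid, take the "fault" path: allocate a
 * zeroed table, hook it into the directory, then complete the store.
 * Returns 0 on success, -1 if out of range or out of memory.
 */
static int toy_store_bounds(struct toy_directory *bd, uintptr_t ptr_location,
			    uintptr_t lower, uintptr_t upper)
{
	size_t di = bd_index(ptr_location);
	struct toy_bounds *entry;

	if (di >= TOY_BD_ENTRIES)
		return -1;

	if (!bd->tables[di]) {
		bd->faults++;
		bd->tables[di] = calloc(BT_ENTRIES, sizeof(*bd->tables[di]));
		if (!bd->tables[di])
			return -1;
	}

	entry = &bd->tables[di][bt_index(ptr_location)];
	entry->lower = lower;
	entry->upper = upper;
	return 0;
}
```

In this model, two stores for pointers that live in the same 1 MB region
cause a single simulated fault, while a store for a pointer in a different
region causes another one. That mirrors why the kernel only needs to get
involved the first time a region is used.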

The tables need to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register points to
memory. Any direct kernel involvement (like a syscall) to access the
tables would obviously destroy performance.

Why not do this in userspace? MPX does not strictly require anything in
the kernel. It can theoretically be done completely from userspace. Here
are a few ways this could be done. We don't think any of them are practical
in the real world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so that we
   never have to allocate them?
A: An MPX-enabled application may create a lot of bounds tables in its
   address space to save bounds information. These tables can take
   up huge swaths of memory (as much as 80% of the memory on the system)
   even if we clean them up aggressively. In the worst-case scenario, the
   tables can be 4x the size of the data structure being tracked. IOW, a
   1-page structure can require 4 bounds-table pages. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
   If we were to preallocate them for the 128TB of user virtual address
   space, we would need to reserve 512TB+2GB, which is larger than the
   entire virtual address space today. This means they cannot be reserved
   ahead of time. Also, a single process's pre-populated bounds directory
   consumes 2GB of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.
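
The arithmetic in the answer above can be written out as a small sketch.
The 4x factor and the 2GB directory size are the figures quoted in the
text. The 256TB total (48-bit virtual addresses) is our assumption about
what "the entire virtual address space today" refers to, and the function
names are hypothetical.

```c
#include <stdint.h>

#define GB	(1ULL << 30)
#define TB	(1ULL << 40)

/* Figures taken from the text above. */
#define BT_OVERHEAD_FACTOR	4ULL		/* worst case: tables are 4x the data */
#define BD_SIZE			(2 * GB)	/* size of the bounds directory */

/* Assumption: 48-bit virtual addresses, i.e. 256TB in total. */
#define TOTAL_VA_BYTES		(1ULL << 48)

/* Virtual space needed to pre-reserve tables covering `tracked_bytes`. */
static uint64_t mpx_prealloc_bytes(uint64_t tracked_bytes)
{
	return BT_OVERHEAD_FACTOR * tracked_bytes + BD_SIZE;
}

/* Non-zero if such a reservation would fit in the address space at all. */
static int mpx_prealloc_fits(uint64_t tracked_bytes)
{
	return mpx_prealloc_bytes(tracked_bytes) <= TOTAL_VA_BYTES;
}
```

For 128TB of user address space, mpx_prealloc_bytes() gives 512TB+2GB,
which mpx_prealloc_fits() rejects under the 48-bit assumption.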

Q: Can we preallocate bounds table space at the same time memory is
   allocated which might contain pointers that might eventually need
   bounds tables?
A: This would work if we could hook the site of each and every memory
   allocation syscall. This can be done for small, constrained applications.
   But, it isn't practical at a larger scale since a given app has no
   way of controlling how all the parts of the app might allocate memory
   (think libraries). The kernel is really the only place to intercept
   these calls.

Q: Could a bounds fault be handed to userspace and the tables allocated
   there in a signal handler instead of in the kernel?
A: mmap() is not on the list of async-signal-safe functions, and even
   if mmap() would work it still requires locking or nasty tricks to
   keep track of the allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Decoding MPX instructions
-------------------------

If a #BR is generated due to a bounds violation caused by MPX, we
need to decode the MPX instruction to get the violation address and
store this address in the extended struct siginfo.

The _sigfault field of struct siginfo is extended as follows:

	/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
	struct {
		void __user *_addr; /* faulting insn/memory ref. */
#ifdef __ARCH_SI_TRAPNO
		int _trapno;	/* TRAP # which caused the signal */
#endif
		short _addr_lsb; /* LSB of the reported address */
		struct {
			void __user *_lower;
			void __user *_upper;
		} _addr_bnd;
	} _sigfault;

The '_addr' field refers to the violation address, and the new '_addr_bnd'
field refers to the upper/lower bounds in effect when the #BR was raised.
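
From userspace, these fields are normally reached through the si_addr,
si_lower and si_upper accessors of siginfo_t. The sketch below assumes a
C library that provides those accessors and the SEGV_BNDERR si_code (the
fallback value 3 is the one we believe Linux uses). The struct and
function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>

#ifndef SEGV_BNDERR
#define SEGV_BNDERR 3	/* assumed Linux value for failed bounds checks */
#endif

struct bounds_violation {
	void *addr;	/* address that violated the bounds */
	void *lower;	/* lower bound in effect */
	void *upper;	/* upper bound in effect */
};

/*
 * Copy the bounds information out of `si`.  Returns 0 on success, or -1
 * if `si` does not describe a SIGSEGV with si_code SEGV_BNDERR.  Only
 * plain loads and stores are used, so this is usable from a handler.
 */
static int get_bounds_violation(const siginfo_t *si,
				struct bounds_violation *out)
{
	if (si->si_signo != SIGSEGV || si->si_code != SEGV_BNDERR)
		return -1;

	out->addr = si->si_addr;
	out->lower = si->si_lower;
	out->upper = si->si_upper;
	return 0;
}
```

A handler installed with sigaction() and SA_SIGINFO could call
get_bounds_violation() on the siginfo_t it receives and record the result
for later reporting.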

Glibc will also be updated to support this new siginfo, so users
can get the violation address and bounds when bounds violations occur.

Cleanup unused bounds tables
----------------------------

When a BNDSTX instruction attempts to save bounds to a bounds directory
entry marked as invalid, a #BR is generated. This is an indication that
no bounds table exists for this entry. In this case the fault handler
will allocate a new bounds table on demand.

Since the kernel allocated those tables on-demand without userspace
knowledge, it is also responsible for freeing them when the associated
mappings go away.

The solution is to hook do_munmap() to check whether the process is
MPX-enabled. If so, the bounds tables that describe the virtual address
region being unmapped are freed as well.
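
As a rough illustration of the bookkeeping involved, the sketch below
works out which bounds directory entries have tables that describe only
memory inside an unmapped range. It is not the kernel's implementation.
It assumes that one bounds table covers a 1 MB aligned region (the 64-bit
layout as we understand it from the Intel reference), and the names
bd_range and fully_covered_tables() are hypothetical.

```c
#include <stdint.h>

/* Assumption: each bounds table describes one 1 MB aligned region. */
#define BT_COVERED_BYTES	(1ULL << 20)

struct bd_range {
	uint64_t first;	/* first fully covered directory index */
	uint64_t count;	/* number of fully covered entries */
};

/*
 * Directory entries whose tables describe only memory inside
 * [start, start + len).  Tables straddling either edge are left out,
 * since they may still hold bounds for neighbouring mappings.
 * Overflow of start + len is not handled in this sketch.
 */
static struct bd_range fully_covered_tables(uint64_t start, uint64_t len)
{
	struct bd_range r = { 0, 0 };
	uint64_t end = start + len;
	/* Round the start up and the end down to table boundaries. */
	uint64_t first = (start + BT_COVERED_BYTES - 1) / BT_COVERED_BYTES;
	uint64_t last = end / BT_COVERED_BYTES;	/* exclusive */

	if (last > first) {
		r.first = first;
		r.count = last - first;
	}
	return r;
}
```

Unmapping 2 MB starting at the 1 MB boundary fully covers directory
entries 1 and 2. Unmapping 1 MB starting at 1.5 MB covers no entry
completely, so in this model no whole table could be freed.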

Adding new prctl commands
-------------------------

Two new prctl commands are added to enable and disable MPX bounds table
management in the kernel.

	#define PR_MPX_ENABLE_MANAGEMENT	43
	#define PR_MPX_DISABLE_MANAGEMENT	44

The runtime library in userspace is responsible for allocating the bounds
directory, so the kernel has to use the XSAVE instruction to get the base
of the bounds directory from the BNDCFGU register.

But XSAVE is expected to be very expensive. As an optimization, the
kernel reads the base of the bounds directory once, during
PR_MPX_ENABLE_MANAGEMENT command execution, and saves it in struct
mm_struct for future use.
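
A runtime library might wrap the two commands roughly as follows. This
is a sketch: the wrapper names are hypothetical, the fallback values are
the ones listed above, and passing zero for the unused arguments is our
assumption about what the kernel expects. On a kernel or CPU without MPX
support the calls are expected to fail, and the wrappers then return a
negative errno value.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values given in the text above, in case the headers lack them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT	43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT	44
#endif

/*
 * Ask the kernel to start managing bounds tables.  To be called only
 * after BNDCFGU already points at the bounds directory.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mpx_enable_management(void)
{
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Ask the kernel to stop managing bounds tables.  Same return rule. */
static int mpx_disable_management(void)
{
	if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

Because the kernel records the directory base at enable time, the caller
should not move the bounds directory after mpx_enable_management()
succeeds.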


4. Special rules
================

1) If userspace is requesting help from the kernel to do the management
of bounds tables, it may not create or modify entries in the bounds directory.

Users can certainly allocate bounds tables themselves, forcibly point the
bounds directory at them through the XSAVE instruction, and then set the
valid bit of the bounds directory entry to make it valid.  But the kernel
will decline to assist in managing these tables.

2) Userspace may not take multiple bounds directory entries and point
them at the same bounds table.

This is allowed architecturally.  For more information, see "Intel(R)
Architecture Instruction Set Extensions Programming Reference" (9.3.4).

However, if users did this, the kernel might be fooled into unmapping an
in-use bounds table since it does not recognize sharing.
