# Xtensa MMU Operation

As with other elements of the architecture, paged virtual memory
management on Xtensa is somewhat unusual. And there is similarly a
lack of introductory material available. This document is an attempt
to introduce the architecture at an overview/tutorial level, and to
describe Zephyr's specific implementation choices.

## General TLB Operation

The Xtensa MMU operates on top of a fairly conventional TLB cache.
The TLB stores virtual to physical translation for individual pages of
memory. It is partitioned into an automatically managed
4-way-set-associative bank of entries mapping 4k pages, and 3-6
"special" ways storing mappings under OS control. Some of these are
for mapping pages larger than 4k, which Zephyr does not directly
support. A few are for bootstrap and initialization, and will be
discussed below.

Like the L1 cache, the TLB is split into separate instruction and data
entries. Zephyr manages both as needed, but symmetrically. The
architecture technically supports separately-virtualized instruction
and data spaces, but the hardware page table refill mechanism (see
below) does not, and Zephyr's memory spaces are unified regardless.

The TLB may be loaded with permissions and attributes controlling
cacheability, access control based on ring (i.e. the contents of the
RING field of the PS register) and togglable write and execute access.
Memory access, even with a matching TLB entry, may therefore create
Kernel/User exceptions as desired to enforce permissions choices on
userspace code.

Live TLB entries are tagged with an 8-bit "ASID" value derived from
the ring field of the PTE that loaded them, via a simple translation
specified in the RASID special register. The intent is that each
non-kernel address space will get a separate ring 3 ASID set in RASID,
such that you can switch between them without a TLB flush. The ASID
value of ring zero is fixed at 1; it may not be changed. (An ASID
value of zero is used to tag an invalid/unmapped TLB entry at
initialization, but this mechanism isn't accessible to OS code except
in special circumstances, and in any case there is already an invalid
attribute value that can be used in a PTE).

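The ring-to-ASID translation can be illustrated with a small sketch.
The register layout used here (one 8-bit ASID per ring, with ring 0 in
the low byte) is an assumption based on the Xtensa ISA documentation,
not something stated in this document, and the helper names are
hypothetical rather than Zephyr APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the ASID for ring N sits in bits [8N+7 : 8N] of RASID. */
static inline uint32_t rasid_pack(uint8_t ring1, uint8_t ring2, uint8_t ring3)
{
	/* The ring 0 ASID is fixed at 1 and cannot be changed. */
	return ((uint32_t)ring3 << 24) | ((uint32_t)ring2 << 16) |
	       ((uint32_t)ring1 << 8) | UINT32_C(1);
}

/* ASID that a TLB entry loaded from a PTE with the given ring gets tagged with. */
static inline uint8_t rasid_asid_for_ring(uint32_t rasid, unsigned int ring)
{
	assert(ring < 4);
	return (uint8_t)(rasid >> (8 * ring));
}
```
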
## Virtually-mapped Page Tables

Xtensa has a unique (and, to someone exposed for the first time,
extremely confusing) "page table" format. The simplest way to begin
to explain this is just to describe the (quite simple) hardware
behavior:

On a TLB miss, the hardware immediately does a single fetch (at ring 0
privilege) from RAM by adding the "desired address right shifted by
10 bits with the bottom two bits set to zero" (i.e. the page frame
number in units of 4 bytes) to the value in the PTEVADDR special
register. If this load succeeds, then the word is treated as a PTE
with which to fill the TLB and use for a (restarted) memory access.
This is extremely simple (just one extra hardware state that does just
one thing the hardware can already do), and quite fast (only one
memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
x86).

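The address arithmetic of that refill fetch is short enough to write
out. This is only a sketch of the computation described above; the
function name is hypothetical, and the 4M alignment of PTEVADDR is an
assumption made so that the page table region is a single aligned
window.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address the hardware loads the PTE for `vaddr` from on a TLB miss. */
static inline uint32_t pte_fetch_vaddr(uint32_t ptevaddr, uint32_t vaddr)
{
	/* Assumption: PTEVADDR names a 4M-aligned region. */
	assert((ptevaddr & UINT32_C(0x3fffff)) == 0);

	/* (vaddr >> 10) with the low two bits cleared equals (vaddr >> 12) * 4,
	 * i.e. the page frame number scaled by the 4-byte PTE size.
	 */
	return ptevaddr + ((vaddr >> 10) & ~UINT32_C(3));
}
```
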
This special "refill" fetch is otherwise identical to any other memory
access, meaning it too uses the TLB to translate from a virtual to
physical address. This means that the page tables occupy a 4M region
of virtual, not physical, address space, in the same memory space
occupied by the running code. The 1024 pages in that range (not all
of which might be mapped in physical memory) are a linear array of
1048576 4-byte PTE entries, each describing a mapping for 4k of
virtual memory. Note especially that exactly one of those pages
contains the 1024 PTE entries for the 4M page table itself, pointed to
by PTEVADDR.

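To make the self-referential page concrete, the sketch below computes
where it lives inside the 4M region. Each page of PTEs covers 4M of
virtual space, so the page covering the region itself is selected by
the top 10 bits of PTEVADDR. The macro and function names are
hypothetical, and PTEVADDR is again assumed to be 4M-aligned.

```c
#include <assert.h>
#include <stdint.h>

#define XT_PAGE_SIZE UINT32_C(4096)

/* Virtual address of the one page whose 1024 PTEs map the page table itself. */
static inline uint32_t l1_page_vaddr(uint32_t ptevaddr)
{
	assert((ptevaddr & UINT32_C(0x3fffff)) == 0);

	/* Index (0..1023) of the page-table page that covers the 4M at ptevaddr. */
	uint32_t l1_index = ptevaddr >> 22;

	return ptevaddr + l1_index * XT_PAGE_SIZE;
}
```
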
Obviously, the page table memory being virtual means that the fetch
can fail: there are 1024 possible pages in a complete page table
covering all of memory, and the ~16 entry TLB clearly won't contain
entries mapping all of them. If we are missing a TLB entry for the
page translation we want (NOT for the original requested address, we
already know we're missing that TLB entry), the hardware has exactly
one more special trick: it throws a TLB Miss exception (there are two,
one each for instruction/data TLBs, but in Zephyr they operate
identically).

The job of that exception handler is simply to ensure that the TLB has
an entry for the page table page we want. And the simplest way to do
that is to just load the faulting PTE as an address, which will then
go through the same refill process above. This second TLB fetch in
the exception handler may result in an invalid/inapplicable mapping
within the 4M page table region. This is a typical/expected runtime
fault, and simply indicates unmapped memory. The result is a TLB miss
exception from within the TLB miss exception handler (i.e. while the
EXCM bit is set). This will produce a Double Exception fault, which
is handled by the OS identically to a general Kernel/User data access
prohibited exception.

After the TLB refill exception, the original faulting instruction is
restarted, which retries the refill process, which succeeds in
fetching a new TLB entry, which is then used to service the original
memory access. (And may then result in yet another exception if it
turns out that the TLB entry doesn't permit the access requested, of
course.)

## Special Cases

The page-tables-specified-in-virtual-memory trick works very well in
practice. But it does have a chicken/egg problem with the initial
state. Because everything depends on state in the TLB, something
needs to tell the hardware how to find a physical address using the
TLB to begin the process. Here we exploit the separate
non-automatically-refilled TLB ways to store bootstrap records.

First, note that the refill process to load a PTE requires that the 4M
space of PTE entries be resolvable by the TLB directly, without
requiring another refill. This 4M mapping is provided by a single
page of PTE entries (which itself lives in the 4M page table region!).
This page must always be in the TLB.

Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
ways (ways 7-9) with at least one 4k page mapping each. We can use
one of these to "pin" the top-level page table entry in place,
ensuring that a refill access will be able to find a PTE address.

But now note that the load from that PTE address for the refill is
done in an exception handler. And running an exception handler
requires doing a fetch via the instruction TLB. And that obviously
means that the page(s) containing the exception handler must never
require a refill exception of their own.

Ideally we would just pin the vector/handler page in the ITLB in the
same way we do for data, but somewhat inexplicably, Xtensa does not
provide 4k "pinnable" ways in the instruction TLB (frankly this seems
like a design flaw).

Instead, we load ITLB entries for vector handlers via the refill
mechanism using the data TLB, and so need the refill mechanism for the
vector page to succeed always. The way to do this is to similarly pin
the page table page containing the (single) PTE for the vector page in
the data TLB, such that instruction fetches always find their TLB
mapping via refill, without requiring an exception.

## Initialization

Unlike most other architectures, Xtensa does not have a "disable" mode
for the MMU. Virtual address translation through the TLB is active at
all times. There therefore needs to be a mechanism for the CPU to
execute code before the OS is able to initialize a refillable page
table.

The way Xtensa resolves this (on the hardware Zephyr supports, see the
note below) is to have an 8-entry set ("way 6") of 512M pages able to
cover all of memory. These 8 entries are initialized as valid, with
attributes specifying that they are accessible only to an ASID of 1
(i.e. the fixed ring zero / kernel ASID), writable, executable, and
uncached. So at boot the CPU relies on these TLB entries to provide a
clean view of hardware memory.

But that means that enabling page-level translation requires some
care, as the CPU will throw an exception ("multi hit") if a memory
access matches more than one live entry in the TLB. The
initialization algorithm is therefore:

0. Start with a fully-initialized page table layout, including the
   top-level "L1" page containing the mappings for the page table
   itself.

1. Ensure that the initialization routine does not cross a page
   boundary (to prevent stray TLB refill exceptions), that it occupies
   a different 4k page from the exception vectors (which we must
   temporarily double-map), and that it operates entirely in registers
   (to avoid doing memory access at inopportune moments).

2. Pin the L1 page table PTE into the data TLB. This creates a double
   mapping condition, but it is safe as nothing will use it until we
   start refilling.

3. Pin the page table page containing the PTE for the TLB miss
   exception handler into the data TLB. This will likewise not be
   accessed until the double map condition is resolved.

4. Set PTEVADDR appropriately. The CPU state to handle refill
   exceptions is now complete, but cannot be used until we resolve the
   double mappings.

5. Disable the initial/way6 data TLB entries first, by setting them to
   an ASID of zero. This is safe as the code being executed is not
   doing data accesses yet (including refills), and will resolve the
   double mapping conditions we created above.

6. Disable the initial/way6 instruction TLBs second. The very next
   instruction following the invalidation of the currently-executing
   code page will then cause a TLB refill exception, which will work
   normally because we just resolved the final double-map condition.
   (Pedantic note: if the vector page and the currently-executing page
   are in different 512M way6 pages, disable the mapping for the
   exception handlers first so the trap from our current code can be
   handled. Currently Zephyr doesn't handle this condition as in all
   reasonable hardware these regions will be near each other.)

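The double-mapping hazard in the steps above can be modeled in a few
lines. This is a toy software model, not hardware-accurate code and
not Zephyr's implementation: the `tlb_entry` structure and both
functions are hypothetical, and an ASID of zero stands in for an
invalidated entry as described in step 5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAY6_PAGE_SIZE (UINT32_C(1) << 29) /* 512M */

struct tlb_entry {
	uint32_t vbase; /* virtual base address of the mapping */
	uint32_t size;  /* mapping size in bytes, a power of two */
	uint8_t asid;   /* 0 marks the entry as invalid */
};

/* Count live entries matching vaddr. A result above 1 models "multi hit". */
static inline int tlb_hits(const struct tlb_entry *tlb, size_t n, uint32_t vaddr)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++) {
		assert(tlb[i].size != 0 && (tlb[i].size & (tlb[i].size - 1)) == 0);
		if (tlb[i].asid != 0 &&
		    (vaddr & ~(tlb[i].size - 1)) == tlb[i].vbase) {
			hits++;
		}
	}
	return hits;
}

/* Steps 5 and 6: invalidate the boot-time 512M entries by zeroing their ASID. */
static inline void disable_way6(struct tlb_entry *tlb, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tlb[i].size == WAY6_PAGE_SIZE) {
			tlb[i].asid = 0;
		}
	}
}
```
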
Note: there is a different variant of the Xtensa MMU architecture
where the way 5/6 pages are immutable, and specify a set of
unchangeable mappings from the final 384M of memory to the bottom and
top of physical memory. The intent here would (presumably) be that
these would be used by the kernel for all physical memory and that the
remaining memory space would be used for virtual mappings. This
doesn't match Zephyr's architecture well, as we tend to assume
page-level control over physical memory (e.g. .text/.rodata is cached
but .data is not on SMP, etc...). And in any case we don't have any
such hardware to experiment with. But with a little address
translation we could support this.

## ASID vs. Virtual Mapping

The ASID mechanism in Xtensa works like other architectures, and is
intended to be used similarly. The intent of the design is that at
context switch time, you can simply change RASID and the page table
data, and leave any existing mappings in place in the TLB using the
old ASID value(s). So in the common case where you switch back,
nothing needs to be flushed.

Unfortunately this runs afoul of the virtual mapping of the page
refill: data TLB entries storing the 4M page table mapping space are
stored at ASID 1 (ring 0); they can't change when the page tables
change! So this region naively would have to be flushed, which is
tantamount to flushing the entire TLB regardless (the TLB is much
smaller than the 1024-page PTE array).

The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
in virtual space, such that the page tables don't overlap. This is
expensive in virtual address space: assigning 4M of space to each of
the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
would take a full gigabyte of address space. Zephyr optimizes this a
bit by deriving a unique sequential ASID from the hardware address of
the statically allocated array of L1 page table pages.

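One way such a scheme could look is sketched below. This is purely
illustrative and is not Zephyr's actual code: the function names, the
`region_base` parameter, and the choice to start user ASIDs at 2 are
all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_REGION_SIZE (UINT32_C(4) << 20) /* 4M of PTEs per address space */
#define FIRST_USER_ASID 2u                  /* ASIDs 0 and 1 are reserved */

/* Hypothetical: sequential ASID from an L1 page's slot in a static array of 4k pages. */
static inline uint32_t asid_from_l1_page(uintptr_t l1_array_base, uintptr_t l1_page)
{
	assert(l1_page >= l1_array_base);
	assert(((l1_page - l1_array_base) % 4096u) == 0);

	uint32_t asid = FIRST_USER_ASID +
			(uint32_t)((l1_page - l1_array_base) / 4096u);

	assert(asid <= 255u);
	return asid;
}

/* Hypothetical: a distinct, non-overlapping 4M page table window per ASID. */
static inline uint32_t ptevaddr_for_asid(uint32_t region_base, uint32_t asid)
{
	return region_base + asid * PTE_REGION_SIZE;
}
```
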
Note, obviously, that any change of the mappings within an ASID
(e.g. to re-use it for another memory domain, or just for any runtime
mapping change other than mapping previously-unmapped pages) still
requires a TLB flush, and always will.

## SMP/Cache Interaction

A final important note is that the hardware PTE refill fetch works
like any other CPU memory access, and in particular it is governed by
the cacheability attributes of the TLB entry through which it was
loaded. This means that if the page table entries are marked
cacheable, then the hardware TLB refill process will be downstream of
the L1 data cache on the CPU. If the physical memory storing page
tables has been accessed recently by the CPU (for a refill of another
page mapped within the same cache line, or to change the tables) then
the refill will be served from the data cache and not main memory.

This may or may not be desirable depending on access patterns. It
lets the L1 data cache act as an "L2 TLB" for applications with a lot
of access variability. But it also means that the TLB entries end up
being stored twice in the same CPU, wasting transistors that could
presumably store other useful data.

But it is also important to note that the L1 data cache on Xtensa is
incoherent! The cache being used for refill reflects the last access
on the current CPU only, and not of the underlying memory being
mapped. Page table changes in the data cache of one CPU will be
invisible to the data cache of another. There is no simple way of
notifying another CPU of changes to page mappings beyond doing
system-wide flushes on all CPUs every time a memory domain is
modified.

The result is that, when SMP is enabled, Zephyr must ensure that all
page table mappings in the system are set uncached. The OS makes no
attempt to bolt on a software coherence layer.

# How Xtensa register windows work

There is a paucity of introductory material on this subject, and
Zephyr plays some tricks here that require understanding the base
layer.

## Hardware

When register windows are configured in the CPU, there are either 32
or 64 "real" registers in hardware, with 16 visible at one time.
Registers are grouped and rotated in units of 4, so there are 8 or 16
such "quads" (my term, not Tensilica's) in hardware of which 4 are
visible as A0-A15.

The first quad (A0-A3) is pointed to by a special register called
WINDOWBASE. The register file is cyclic, so for example if NREGS==64
and WINDOWBASE is 15, quads 15, 0, 1, and 2 will be visible as
(respectively) A0-A3, A4-A7, A8-A11, and A12-A15.

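The cyclic mapping from a visible register name to a hardware register
can be written as a one-liner. This is a sketch of the arithmetic
only; `phys_reg` is a hypothetical name, not a Zephyr or toolchain
function.

```c
#include <assert.h>

/* Index of the hardware register currently visible as A<n>. */
static inline unsigned int phys_reg(unsigned int nregs, unsigned int windowbase,
				    unsigned int n)
{
	assert(nregs == 32 || nregs == 64);
	assert(windowbase < nregs / 4 && n < 16);

	/* WINDOWBASE counts quads, so scale by 4 and wrap around the file. */
	return (windowbase * 4 + n) % nregs;
}
```
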
There is a ROTW instruction that can be used to manually rotate the
window by an immediate number of quads that are added to WINDOWBASE.
Positive rotations "move" high registers into low registers
(i.e. after "ROTW 1" the register that used to be called A4 is now
A0).

There are CALL4/CALL8/CALL12 instructions to effect rotated calls
which rotate registers upward (i.e. "hiding" low registers from the
callee) by 1, 2 or 3 quads. These do not rotate the window
themselves. Instead they place the rotation amount in two places
(yes, two; see below): the 2-bit CALLINC field of the PS register, and
the top two bits of the return address placed in A0.

There is an ENTRY instruction that does the rotation. It adds CALLINC
to WINDOWBASE, at the same time copying the old (now hidden) stack
pointer in A1 into the "new" A1 in the rotated frame, subtracting an
immediate offset from it to make space for the new frame.

There is a RETW instruction that undoes the rotation. It reads the
top two bits from the return address in A0 and subtracts that value
from WINDOWBASE before returning. This is why the CALLINC bits went
in two places. They have to be stored on the stack across potentially
many calls, so they need to be GPR data that lives in registers and
can be spilled. But ENTRY isn't specified to assume a particular
return value format and is used immediately, so it makes more sense
for it to use processor state instead.

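The return-address encoding used by CALLn and RETW can be sketched as
follows. The function names are hypothetical, and treating the low 30
bits of A0 as the return PC bits is an assumption of this example
rather than something spelled out above.

```c
#include <assert.h>
#include <stdint.h>

/* Value a CALL4/8/12 (callinc 1/2/3) would leave in the callee's A0. */
static inline uint32_t call_return_a0(uint32_t return_pc, unsigned int callinc)
{
	assert(callinc >= 1 && callinc <= 3);
	return ((uint32_t)callinc << 30) | (return_pc & UINT32_C(0x3fffffff));
}

/* Rotation amount RETW recovers from the top two bits of A0. */
static inline unsigned int retw_rotation(uint32_t a0)
{
	return a0 >> 30;
}

/* WINDOWBASE after RETW, wrapping around the cyclic register file. */
static inline unsigned int retw_windowbase(unsigned int windowbase, uint32_t a0,
					   unsigned int nquads)
{
	assert(nquads == 8 || nquads == 16);
	assert(windowbase < nquads);
	return (windowbase + nquads - retw_rotation(a0)) % nquads;
}
```
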
Note that we still don't know how to detect when the register file has
wrapped around and needs to be spilled or filled. To do this there is
a WINDOWSTART register used to detect which register quads are in use.
The name "start" is somewhat confusing; this is not a pointer.
WINDOWSTART stores a bitmask with one bit per hardware quad (so it's 8
or 16 bits wide). The bit in WINDOWSTART corresponding to WINDOWBASE
will be set by the ENTRY instruction, and remain set after rotations
until cleared by a function return (by RETW, see below). Other bits
stay zero. So there is one set bit in WINDOWSTART corresponding to
each call frame that is live in hardware registers, and it will be
followed by 0, 1 or 2 zero bits that tell you how "big" (how many
quads of registers) that frame is.

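Reading a frame's size out of the WINDOWSTART bitmask might look like
the sketch below. The function name is hypothetical, and returning 0
when no later frame bit is found within three quads is a convention
chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Quads owned by the frame whose base quad is `base`: the distance to the
 * next set bit in WINDOWSTART, searched cyclically. Returns 0 if no set bit
 * follows within 3 quads (e.g. for the most recent frame).
 */
static inline unsigned int frame_quads(uint16_t windowstart, unsigned int base,
				       unsigned int nquads)
{
	assert(nquads == 8 || nquads == 16);
	assert(base < nquads && ((windowstart >> base) & 1u));

	for (unsigned int k = 1; k <= 3; k++) {
		if ((windowstart >> ((base + k) % nquads)) & 1u) {
			return k;
		}
	}
	return 0;
}
```
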
So the CPU executing RETW checks to make sure that the register quad
being brought into A0-A3 (i.e. the new WINDOWBASE) has a set bit
indicating it's valid. If it does not, the registers must have been
spilled and the CPU traps to an exception handler to fill them.

Likewise, the processor can tell if a high register is "owned" by
another call by seeing if there is a one in WINDOWSTART between that
register's quad and WINDOWBASE. If there is, the CPU traps to a spill
handler to spill one frame. Note that a frame might be only four
registers, but it's possible to hit registers 12 out from WINDOWBASE,
so it's actually possible to trap again when the instruction restarts
to spill a second quad, and even a third time at maximum.

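The ownership check for high registers can be modeled as a scan of
WINDOWSTART. This is a sketch of the rule as described above, with a
hypothetical function name; it only reports whether a spill trap would
occur, not which handler runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would touching A<n> trap to a spill handler? True if any quad after
 * WINDOWBASE, up to and including the quad holding A<n>, has its
 * WINDOWSTART bit set (i.e. belongs to another live frame).
 */
static inline bool access_needs_spill(uint16_t windowstart,
				      unsigned int windowbase,
				      unsigned int nquads, unsigned int n)
{
	assert(nquads == 8 || nquads == 16);
	assert(windowbase < nquads && n < 16);

	for (unsigned int k = 1; k <= n / 4; k++) {
		if ((windowstart >> ((windowbase + k) % nquads)) & 1u) {
			return true;
		}
	}
	return false;
}
```
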
Finally: note that hardware checks the two bits of WINDOWSTART after
the frame bit to detect how many quads are represented by the one
frame. So there are six separate exception handlers to spill/fill
1/2/3 quads of registers.

## Software & ABI

The advantage of the scheme above is that it allows the registers to
be spilled naturally into the stack by using the stack pointers
embedded in the register file. But the hardware design assumes and to
some extent enforces a fairly complicated stack layout to make that
work:

The spill area for a single frame's A0-A3 registers is not in its own
stack frame. It lies in the 16 bytes below its CALLEE's stack
pointer. This is so that the callee (and exception handlers invoked
on its behalf) can see its caller's potentially-spilled stack pointer
register (A1) on the stack and be able to walk back up on return.
Other architectures do this too by e.g. pushing the incoming stack
pointer onto the stack as a standard "frame pointer" defined in the
platform ABI. Xtensa wraps this together with the natural spill area
for register windows.

By convention spill regions always store the lowest numbered register
in the lowest address.

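Putting those two rules together, the location of a caller's spilled
A0-A3 follows directly from the callee's stack pointer. The function
name below is hypothetical; the arithmetic is only a restatement of
the layout described above.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the caller's spilled A<n> (n = 0..3), given the callee's SP.
 * The 16-byte area sits just below the callee's stack pointer, with A0 at
 * the lowest address.
 */
static inline uint32_t caller_base_spill_addr(uint32_t callee_sp, unsigned int n)
{
	assert(n < 4);
	return callee_sp - 16u + 4u * n;
}
```
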
The spill area for a frame's A4-A11 registers may or may not exist
depending on whether the call was made with CALL8/CALL12. It is legal
to write a function using only A0-A3 and CALL4 calls and ignore higher
registers. But if those 0-2 register quads are in use, they appear at
the top of the stack frame, immediately below the parent call's A0-A3
spill area.

There is no spill area for A12-A15. Those registers are always
caller-save. When using CALLn, you always need to overlap 4 registers
to provide arguments and take a return value.