# Xtensa MMU Operation

As with other elements of the architecture, paged virtual memory
management on Xtensa is somewhat unique. And there is similarly a
lack of introductory material available. This document is an attempt
to introduce the architecture at an overview/tutorial level, and to
describe Zephyr's specific implementation choices.

## General TLB Operation

The Xtensa MMU operates on top of a fairly conventional TLB cache.
The TLB stores virtual-to-physical translations for individual pages of
memory. It is partitioned into an automatically managed
4-way-set-associative bank of entries mapping 4k pages, and 3-6
"special" ways storing mappings under OS control. Some of these are
for mapping pages larger than 4k, which Zephyr does not directly
support. A few are for bootstrap and initialization, and will be
discussed below.

Like the L1 cache, the TLB is split into separate instruction and data
entries. Zephyr manages both as needed, but symmetrically. The
architecture technically supports separately-virtualized instruction
and data spaces, but the hardware page table refill mechanism (see
below) does not, and Zephyr's memory spaces are unified regardless.

The TLB may be loaded with permissions and attributes controlling
cacheability, access control based on ring (i.e. the contents of the
RING field of the PS register) and togglable write and execute access.
Memory access, even with a matching TLB entry, may therefore create
Kernel/User exceptions as desired to enforce permissions choices on
userspace code.

Live TLB entries are tagged with an 8-bit "ASID" value derived from
the ring field of the PTE that loaded them, via a simple translation
specified in the RASID special register. The intent is that each
non-kernel address space will get a separate ring 3 ASID set in RASID,
such that you can switch between them without a TLB flush.
The ASID value of ring zero is fixed at 1; it may not be changed. (An ASID
value of zero is used to tag an invalid/unmapped TLB entry at
initialization, but this mechanism isn't accessible to OS code except
in special circumstances, and in any case there is already an invalid
attribute value that can be used in a PTE).

## Virtually-mapped Page Tables

Xtensa has a unique (and, to someone exposed for the first time,
extremely confusing) "page table" format. The simplest way to begin
to explain this is just to describe the (quite simple) hardware
behavior:

On a TLB miss, the hardware immediately does a single fetch (at ring 0
privilege) from RAM by adding the "desired address right shifted by
10 bits with the bottom two bits set to zero" (i.e. the page frame
number in units of 4 bytes) to the value in the PTEVADDR special
register. If this load succeeds, then the word is treated as a PTE
with which to fill the TLB and use for a (restarted) memory access.
This is extremely simple (just one extra hardware state that does just
one thing the hardware can already do), and quite fast (only one
memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
x86).

This special "refill" fetch is otherwise identical to any other memory
access, meaning it too uses the TLB to translate from a virtual to
physical address. Which means that the page tables occupy a 4M region
of virtual, not physical, address space, in the same memory space
occupied by the running code. The 1024 pages in that range (not all
of which might be mapped in physical memory) are a linear array of
1048576 4-byte PTE entries, each describing a mapping for 4k of
virtual memory. Note especially that exactly one of those pages
contains the 1024 PTE entries for the 4M page table itself, pointed to
by PTEVADDR.

Obviously, the page table memory being virtual means that the fetch
can fail: there are 1024 possible pages in a complete page table
covering all of memory, and the ~16 entry TLB clearly won't contain
entries mapping all of them. If we are missing a TLB entry for the
page translation we want (NOT for the original requested address, we
already know we're missing that TLB entry), the hardware has exactly
one more special trick: it throws a TLB Miss exception (there are two,
one each for instruction/data TLBs, but in Zephyr they operate
identically).

The job of that exception handler is simply to ensure that the TLB has
an entry for the page table page we want. And the simplest way to do
that is to just load the faulting PTE as an address, which will then
go through the same refill process above. This second TLB fetch in
the exception handler may result in an invalid/inapplicable mapping
within the 4M page table region. This is a typical/expected runtime
fault, and simply indicates unmapped memory. The result is a TLB miss
exception from within the TLB miss exception handler (i.e. while the
EXCM bit is set). This will produce a Double Exception fault, which
is handled by the OS identically to a general Kernel/User data access
prohibited exception.

After the TLB refill exception, the original faulting instruction is
restarted, which retries the refill process, which succeeds in
fetching a new TLB entry, which is then used to service the original
memory access. (And may then result in yet another exception if it
turns out that the TLB entry doesn't permit the access requested, of
course.)

## Special Cases

The page-tables-specified-in-virtual-memory trick works very well in
practice. But it does have a chicken/egg problem with the initial
state.
Because everything depends on state in the TLB, something
needs to tell the hardware how to find a physical address using the
TLB to begin the process. Here we exploit the separate
non-automatically-refilled TLB ways to store bootstrap records.

First, note that the refill process to load a PTE requires that the 4M
space of PTE entries be resolvable by the TLB directly, without
requiring another refill. This 4M mapping is provided by a single
page of PTE entries (which itself lives in the 4M page table region!).
This page must always be in the TLB.

Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
ways (ways 7-9) with at least one 4k page mapping each. We can use
one of these to "pin" the top-level page table entry in place,
ensuring that a refill access will be able to find a PTE address.

But now note that the load from that PTE address for the refill is
done in an exception handler. And running an exception handler
requires doing a fetch via the instruction TLB. And that obviously
means that the page(s) containing the exception handler must never
require a refill exception of its own.

Ideally we would just pin the vector/handler page in the ITLB in the
same way we do for data, but somewhat inexplicably, Xtensa does not
provide 4k "pinnable" ways in the instruction TLB (frankly this seems
like a design flaw).

Instead, we load ITLB entries for vector handlers via the refill
mechanism using the data TLB, and so need the refill mechanism for the
vector page to succeed always. The way to do this is to similarly pin
the page table page containing the (single) PTE for the vector page in
the data TLB, such that instruction fetches always find their TLB
mapping via refill, without requiring an exception.

## Initialization

Unlike most other architectures, Xtensa does not have a "disable" mode
for the MMU.
Virtual address translation through the TLB is active at
all times. There therefore needs to be a mechanism for the CPU to
execute code before the OS is able to initialize a refillable page
table.

The way Xtensa resolves this (on the hardware Zephyr supports, see the
note below) is to have an 8-entry set ("way 6") of 512M pages able to
cover all of memory. These 8 entries are initialized as valid, with
attributes specifying that they are accessible only to an ASID of 1
(i.e. the fixed ring zero / kernel ASID), writable, executable, and
uncached. So at boot the CPU relies on these TLB entries to provide a
clean view of hardware memory.

But that means that enabling page-level translation requires some
care, as the CPU will throw an exception ("multi hit") if a memory
access matches more than one live entry in the TLB. The
initialization algorithm is therefore:

0. Start with a fully-initialized page table layout, including the
   top-level "L1" page containing the mappings for the page table
   itself.

1. Ensure that the initialization routine does not cross a page
   boundary (to prevent stray TLB refill exceptions), that it occupies
   a separate 4k page from the exception vectors (which we must
   temporarily double-map), and that it operates entirely in registers
   (to avoid doing memory access at inopportune moments).

2. Pin the L1 page table PTE into the data TLB. This creates a double
   mapping condition, but it is safe as nothing will use it until we
   start refilling.

3. Pin the page table page containing the PTE for the TLB miss
   exception handler into the data TLB. This will likewise not be
   accessed until the double map condition is resolved.

4. Set PTEVADDR appropriately. The CPU state to handle refill
   exceptions is now complete, but cannot be used until we resolve the
   double mappings.

5. Disable the initial/way6 data TLB entries first, by setting them to
   an ASID of zero. This is safe as the code being executed is not
   doing data accesses yet (including refills), and will resolve the
   double mapping conditions we created above.

6. Disable the initial/way6 instruction TLBs second. The very next
   instruction following the invalidation of the currently-executing
   code page will then cause a TLB refill exception, which will work
   normally because we just resolved the final double-map condition.
   (Pedantic note: if the vector page and the currently-executing page
   are in different 512M way6 pages, disable the mapping for the
   exception handlers first so the trap from our current code can be
   handled. Currently Zephyr doesn't handle this condition as in all
   reasonable hardware these regions will be near each other.)

Note: there is a different variant of the Xtensa MMU architecture
where the way 5/6 pages are immutable, and specify a set of
unchangeable mappings from the final 384M of memory to the bottom and
top of physical memory. The intent here would (presumably) be that
these would be used by the kernel for all physical memory and that the
remaining memory space would be used for virtual mappings. This
doesn't match Zephyr's architecture well, as we tend to assume
page-level control over physical memory (e.g. .text/.rodata is cached
but .data is not on SMP, etc...). And in any case we don't have any
such hardware to experiment with. But with a little address
translation we could support this.

## ASID vs. Virtual Mapping

The ASID mechanism in Xtensa works like other architectures, and is
intended to be used similarly. The intent of the design is that at
context switch time, you can simply change RASID and the page table
data, and leave any existing mappings in place in the TLB using the
old ASID value(s).
So in the common case where you switch back,
nothing needs to be flushed.

Unfortunately this runs afoul of the virtual mapping of the page
refill: data TLB entries storing the 4M page table mapping space are
stored at ASID 1 (ring 0); they can't change when the page tables
change! So this region naively would have to be flushed, which is
tantamount to flushing the entire TLB regardless (the TLB is much
smaller than the 1024-page PTE array).

The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
in virtual space, such that the page tables don't overlap. This is
expensive in virtual address space: assigning 4M of space to each of
the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
would take a full gigabyte of address space. Zephyr optimizes this a
bit by deriving a unique sequential ASID from the hardware address of
the statically allocated array of L1 page table pages.

Note, obviously, that any change of the mappings within an ASID
(e.g. to re-use it for another memory domain, or just for any runtime
mapping change other than mapping previously-unmapped pages) still
requires a TLB flush, and always will.

## SMP/Cache Interaction

A final important note is that the hardware PTE refill fetch works
like any other CPU memory access, and in particular it is governed by
the cacheability attributes of the TLB entry through which it was
loaded. This means that if the page table entries are marked
cacheable, then the hardware TLB refill process will be downstream of
the L1 data cache on the CPU. If the physical memory storing page
tables has been accessed recently by the CPU (for a refill of another
page mapped within the same cache line, or to change the tables) then
the refill will be served from the data cache and not main memory.

This may or may not be desirable depending on access patterns.
It lets the L1 data cache act as a "L2 TLB" for applications with a lot
of access variability. But it also means that the TLB entries end up
being stored twice in the same CPU, wasting transistors that could
presumably store other useful data.

But it is also important to note that the L1 data cache on Xtensa is
incoherent! The cache being used for refill reflects the last access
on the current CPU only, and not of the underlying memory being
mapped. Page table changes in the data cache of one CPU will be
invisible to the data cache of another. There is no simple way of
notifying another CPU of changes to page mappings beyond doing
system-wide flushes on all CPUs every time a memory domain is
modified.

The result is that, when SMP is enabled, Zephyr must ensure that all
page table mappings in the system are set uncached. The OS makes no
attempt to bolt on a software coherence layer.