Name | Date | Size | #Lines | LOC
---|---|---|---|---
offsets/ | 18-Mar-2025 | - | 80 | 62
startup/ | 18-Mar-2025 | - | 803 | 358
CMakeLists.txt | 18-Mar-2025 | 4.6 KiB | 115 | 98
README_MMU.txt | 18-Mar-2025 | 13.5 KiB | 269 | 226
README_WINDOWS.rst | 18-Mar-2025 | 5.3 KiB | 109 | 90
coredump.c | 18-Mar-2025 | 5.4 KiB | 208 | 143
cpu_idle.c | 18-Mar-2025 | 494 | 26 | 18
crt1.S | 18-Mar-2025 | 5.5 KiB | 195 | 73
debug_helpers_asm.S | 18-Mar-2025 | 1.1 KiB | 41 | 21
elf.c | 18-Mar-2025 | 5.4 KiB | 174 | 114
fatal.c | 18-Mar-2025 | 3.3 KiB | 160 | 131
gdbstub.c | 18-Mar-2025 | 20.7 KiB | 989 | 672
gen_vectors.py | 18-Mar-2025 | 4.8 KiB | 123 | 53
gen_zsr.py | 18-Mar-2025 | 3.7 KiB | 110 | 70
irq_manage.c | 18-Mar-2025 | 2.4 KiB | 88 | 51
irq_offload.c | 18-Mar-2025 | 1 KiB | 44 | 30
mem_manage.c | 18-Mar-2025 | 1 KiB | 42 | 26
mmu.c | 18-Mar-2025 | 6.9 KiB | 195 | 95
mpu.c | 18-Mar-2025 | 34 KiB | 1,200 | 651
prep_c.c | 18-Mar-2025 | 2.3 KiB | 96 | 47
ptables.c | 18-Mar-2025 | 30.5 KiB | 1,147 | 743
smp.c | 18-Mar-2025 | 688 | 32 | 15
syscall_helper.c | 18-Mar-2025 | 5.1 KiB | 188 | 125
thread.c | 18-Mar-2025 | 5.7 KiB | 191 | 109
timing.c | 18-Mar-2025 | 1,021 | 59 | 41
tls.c | 18-Mar-2025 | 1.3 KiB | 48 | 17
userspace.S | 18-Mar-2025 | 7.7 KiB | 355 | 221
vector_handlers.c | 18-Mar-2025 | 14.3 KiB | 512 | 314
window_vectors.S | 18-Mar-2025 | 10 KiB | 244 | 99
xcc_stubs.c | 18-Mar-2025 | 202 | 15 | 6
xtensa_asm2_util.S | 18-Mar-2025 | 14.8 KiB | 545 | 312
xtensa_backtrace.c | 18-Mar-2025 | 4.4 KiB | 171 | 106
xtensa_hifi.S | 18-Mar-2025 | 1.4 KiB | 54 | 19
xtensa_intgen.py | 18-Mar-2025 | 4.6 KiB | 139 | 87
xtensa_intgen.tmpl | 18-Mar-2025 | 1.8 KiB | 44 | 41
README_MMU.txt
# Xtensa MMU Operation

As with other elements of the architecture, paged virtual memory
management on Xtensa is somewhat unusual. And there is similarly a
lack of introductory material available. This document is an attempt
to introduce the architecture at an overview/tutorial level, and to
describe Zephyr's specific implementation choices.

## General TLB Operation

The Xtensa MMU operates on top of a fairly conventional TLB cache.
The TLB stores virtual-to-physical translations for individual pages
of memory. It is partitioned into an automatically managed
4-way-set-associative bank of entries mapping 4k pages, and 3-6
"special" ways storing mappings under OS control. Some of these are
for mapping pages larger than 4k, which Zephyr does not directly
support. A few are for bootstrap and initialization, and will be
discussed below.

Like the L1 cache, the TLB is split into separate instruction and data
entries. Zephyr manages both as needed, but symmetrically. The
architecture technically supports separately-virtualized instruction
and data spaces, but the hardware page table refill mechanism (see
below) does not, and Zephyr's memory spaces are unified regardless.

The TLB may be loaded with permissions and attributes controlling
cacheability, access control based on ring (i.e. the contents of the
RING field of the PS register) and togglable write and execute access.
Memory access, even with a matching TLB entry, may therefore create
Kernel/User exceptions as desired to enforce permissions choices on
userspace code.

Live TLB entries are tagged with an 8-bit "ASID" value derived from
the ring field of the PTE that loaded them, via a simple translation
specified in the RASID special register. The intent is that each
non-kernel address space will get a separate ring 3 ASID set in RASID,
such that you can switch between them without a TLB flush.
The ASID value of ring zero is fixed at 1; it may not be changed. (An
ASID value of zero is used to tag an invalid/unmapped TLB entry at
initialization, but this mechanism isn't accessible to OS code except
in special circumstances, and in any case there is already an invalid
attribute value that can be used in a PTE).

## Virtually-mapped Page Tables

Xtensa has a unique (and, to someone exposed for the first time,
extremely confusing) "page table" format. The simplest way to begin
to explain this is just to describe the (quite simple) hardware
behavior:

On a TLB miss, the hardware immediately does a single fetch (at ring 0
privilege) from RAM by adding the "desired address right shifted by
10 bits with the bottom two bits set to zero" (i.e. the page frame
number in units of 4 bytes) to the value in the PTEVADDR special
register. If this load succeeds, then the word is treated as a PTE
with which to fill the TLB and use for a (restarted) memory access.
This is extremely simple (just one extra hardware state that does just
one thing the hardware can already do), and quite fast (only one
memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
x86).

This special "refill" fetch is otherwise identical to any other memory
access, meaning it too uses the TLB to translate from a virtual to
physical address. Which means that the page tables occupy a 4M region
of virtual, not physical, address space, in the same memory space
occupied by the running code. The 1024 pages in that range (not all
of which might be mapped in physical memory) are a linear array of
1048576 4-byte PTE entries, each describing a mapping for 4k of
virtual memory. Note especially that exactly one of those pages
contains the 1024 PTE entries for the 4M page table itself, pointed to
by PTEVADDR.
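
The address arithmetic of the refill fetch can be sketched in a few
lines of Python. This is only an illustration of the description
above, not Zephyr code: the function names, the example PTEVADDR value
of 0x20000000, and the assumption that PTEVADDR is 4M-aligned are all
made up for the example.

```python
MASK32 = 0xFFFFFFFF     # Xtensa addresses are 32 bits wide

def pte_address(ptevaddr, vaddr):
    """Virtual address the refill fetch reads to find the PTE for vaddr.

    Follows the text: shift the address right by 10 bits, clear the
    bottom two bits, and add the result to PTEVADDR.
    """
    offset = (vaddr >> 10) & ~0x3 & MASK32
    return (ptevaddr + offset) & MASK32

def l1_page(ptevaddr):
    """Base of the single page of PTEs that maps the page table itself."""
    return pte_address(ptevaddr, ptevaddr) & ~0xFFF & MASK32

# Hypothetical value, assumed 4M-aligned so the table is one 4M region.
PTEVADDR = 0x20000000
```

With these values, `pte_address(PTEVADDR, 0x12345678)` is 0x20048D14
(page frame 0x12345 times 4 bytes, plus PTEVADDR), and
`l1_page(PTEVADDR)` is 0x20080000. The PTE that maps that page is at
0x20080200, which lies inside the page itself: this is the self-mapping
property noted above.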

Obviously, the page table memory being virtual means that the fetch
can fail: there are 1024 possible pages in a complete page table
covering all of memory, and the ~16 entry TLB clearly won't contain
entries mapping all of them. If we are missing a TLB entry for the
page translation we want (NOT for the original requested address, we
already know we're missing that TLB entry), the hardware has exactly
one more special trick: it throws a TLB Miss exception (there are two,
one each for instruction/data TLBs, but in Zephyr they operate
identically).

The job of that exception handler is simply to ensure that the TLB has
an entry for the page table page we want. And the simplest way to do
that is to just load from the address of the faulting PTE, which will
then go through the same refill process above. This second TLB fetch
in the exception handler may result in an invalid/inapplicable mapping
within the 4M page table region. This is a typical/expected runtime
fault, and simply indicates unmapped memory. The result is a TLB miss
exception from within the TLB miss exception handler (i.e. while the
EXCM bit is set). This will produce a Double Exception fault, which
is handled by the OS identically to a general Kernel/User data access
prohibited exception.

After the TLB refill exception, the original faulting instruction is
restarted, which retries the refill process, which succeeds in
fetching a new TLB entry, which is then used to service the original
memory access. (And may then result in yet another exception if it
turns out that the TLB entry doesn't permit the access requested, of
course.)

## Special Cases

The page-tables-specified-in-virtual-memory trick works very well in
practice. But it does have a chicken/egg problem with the initial
state.
Because everything depends on state in the TLB, something
needs to tell the hardware how to find a physical address using the
TLB to begin the process. Here we exploit the separate
non-automatically-refilled TLB ways to store bootstrap records.

First, note that the refill process to load a PTE requires that the 4M
space of PTE entries be resolvable by the TLB directly, without
requiring another refill. This 4M mapping is provided by a single
page of PTE entries (which itself lives in the 4M page table region!).
This page must always be in the TLB.

Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
ways (ways 7-9) with at least one 4k page mapping each. We can use
one of these to "pin" the top-level page table entry in place,
ensuring that a refill access will be able to find a PTE address.

But now note that the load from that PTE address for the refill is
done in an exception handler. And running an exception handler
requires doing a fetch via the instruction TLB. And that obviously
means that the page(s) containing the exception handler must never
require a refill exception of their own.

Ideally we would just pin the vector/handler page in the ITLB in the
same way we do for data, but somewhat inexplicably, Xtensa does not
provide 4k "pinnable" ways in the instruction TLB (frankly this seems
like a design flaw).

Instead, we load ITLB entries for vector handlers via the refill
mechanism using the data TLB, and so need the refill mechanism for the
vector page to succeed always. The way to do this is to similarly pin
the page table page containing the (single) PTE for the vector page in
the data TLB, such that instruction fetches always find their TLB
mapping via refill, without requiring an exception.

## Initialization

Unlike most other architectures, Xtensa does not have a "disable" mode
for the MMU.
Virtual address translation through the TLB is active at
all times. There therefore needs to be a mechanism for the CPU to
execute code before the OS is able to initialize a refillable page
table.

The way Xtensa resolves this (on the hardware Zephyr supports, see the
note below) is to have an 8-entry set ("way 6") of 512M pages able to
cover all of memory. These 8 entries are initialized as valid, with
attributes specifying that they are accessible only to an ASID of 1
(i.e. the fixed ring zero / kernel ASID), writable, executable, and
uncached. So at boot the CPU relies on these TLB entries to provide a
clean view of hardware memory.

But that means that enabling page-level translation requires some
care, as the CPU will throw an exception ("multi hit") if a memory
access matches more than one live entry in the TLB. The
initialization algorithm is therefore:

0. Start with a fully-initialized page table layout, including the
   top-level "L1" page containing the mappings for the page table
   itself.

1. Ensure that the initialization routine does not cross a page
   boundary (to prevent stray TLB refill exceptions), that it occupies
   a different 4k page from the exception vectors (which we must
   temporarily double-map), and that it operates entirely in registers
   (to avoid doing memory access at inopportune moments).

2. Pin the L1 page table PTE into the data TLB. This creates a double
   mapping condition, but it is safe as nothing will use it until we
   start refilling.

3. Pin the page table page containing the PTE for the TLB miss
   exception handler into the data TLB. This will likewise not be
   accessed until the double map condition is resolved.

4. Set PTEVADDR appropriately. The CPU state to handle refill
   exceptions is now complete, but cannot be used until we resolve the
   double mappings.

5. Disable the initial/way6 data TLB entries first, by setting them to
   an ASID of zero. This is safe as the code being executed is not
   doing data accesses yet (including refills), and will resolve the
   double mapping conditions we created above.

6. Disable the initial/way6 instruction TLB entries second. The very
   next instruction following the invalidation of the
   currently-executing code page will then cause a TLB refill
   exception, which will work normally because we just resolved the
   final double-map condition. (Pedantic note: if the vector page and
   the currently-executing page are in different 512M way6 pages,
   disable the mapping for the exception handlers first so the trap
   from our current code can be handled. Currently Zephyr doesn't
   handle this condition as in all reasonable hardware these regions
   will be near each other.)

Note: there is a different variant of the Xtensa MMU architecture
where the way 5/6 pages are immutable, and specify a set of
unchangeable mappings from the final 384M of memory to the bottom and
top of physical memory. The intent here would (presumably) be that
these would be used by the kernel for all physical memory and that the
remaining memory space would be used for virtual mappings. This
doesn't match Zephyr's architecture well, as we tend to assume
page-level control over physical memory (e.g. .text/.rodata is cached
but .data is not on SMP, etc...). And in any case we don't have any
such hardware to experiment with. But with a little address
translation we could support this.

## ASID vs. Virtual Mapping

The ASID mechanism in Xtensa works like other architectures, and is
intended to be used similarly. The intent of the design is that at
context switch time, you can simply change RASID and the page table
data, and leave any existing mappings in place in the TLB using the
old ASID value(s).
So in the common case where you switch back,
nothing needs to be flushed.

Unfortunately this runs afoul of the virtual mapping of the page
refill: data TLB entries storing the 4M page table mapping space are
stored at ASID 1 (ring 0); they can't change when the page tables
change! So this region naively would have to be flushed, which is
tantamount to flushing the entire TLB regardless (the TLB is much
smaller than the 1024-page PTE array).

The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
in virtual space, such that the page tables don't overlap. This is
expensive in virtual address space: assigning 4M of space to each of
the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
would take a full gigabyte of address space. Zephyr optimizes this a
bit by deriving a unique sequential ASID from the hardware address of
the statically allocated array of L1 page table pages.

Note, obviously, that any change of the mappings within an ASID
(e.g. to re-use it for another memory domain, or just for any runtime
mapping change other than mapping previously-unmapped pages) still
requires a TLB flush, and always will.

## SMP/Cache Interaction

A final important note is that the hardware PTE refill fetch works
like any other CPU memory access, and in particular it is governed by
the cacheability attributes of the TLB entry through which it was
loaded. This means that if the page table entries are marked
cacheable, then the hardware TLB refill process will be downstream of
the L1 data cache on the CPU. If the physical memory storing page
tables has been accessed recently by the CPU (for a refill of another
page mapped within the same cache line, or to change the tables) then
the refill will be served from the data cache and not main memory.

This may or may not be desirable depending on access patterns.
It
lets the L1 data cache act as an "L2 TLB" for applications with a lot
of access variability. But it also means that the TLB entries end up
being stored twice in the same CPU, wasting transistors that could
presumably store other useful data.

But it is also important to note that the L1 data cache on Xtensa is
incoherent! The cache being used for refill reflects the last access
on the current CPU only, and not of the underlying memory being
mapped. Page table changes in the data cache of one CPU will be
invisible to the data cache of another. There is no simple way of
notifying another CPU of changes to page mappings beyond doing
system-wide flushes on all CPUs every time a memory domain is
modified.

The result is that, when SMP is enabled, Zephyr must ensure that all
page table mappings in the system are set uncached. The OS makes no
attempt to bolt on a software coherence layer.
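
The incoherence problem can be illustrated with a toy Python model.
This is a sketch under loud assumptions: the `Cpu` class, the
`refill_sees_update` helper, and the PTE address and values are
invented for the example, and the "cache" is just a per-CPU dictionary
with no line size or eviction.

```python
class Cpu:
    """One core with a private write-back data cache and no coherence."""

    def __init__(self, ram):
        self.ram = ram        # shared dict: address -> word
        self.dcache = {}      # private dict: address -> word

    def load(self, addr, cached=True):
        if not cached:
            return self.ram[addr]                 # always reads main memory
        if addr not in self.dcache:
            self.dcache[addr] = self.ram[addr]    # cache fill on first use
        return self.dcache[addr]

    def store(self, addr, value, cached=True):
        if cached:
            self.dcache[addr] = value             # stays in this core's cache
        else:
            self.ram[addr] = value                # goes straight to memory


def refill_sees_update(cached):
    """True if CPU 1's refill fetch observes a PTE rewritten by CPU 0."""
    pte_addr, old_pte, new_pte = 0x1000, 0xAAAA0003, 0xBBBB0003  # made up
    ram = {pte_addr: old_pte}
    cpu0, cpu1 = Cpu(ram), Cpu(ram)
    cpu1.load(pte_addr, cached)            # CPU 1 refills from the old PTE
    cpu0.store(pte_addr, new_pte, cached)  # CPU 0 changes the mapping
    return cpu1.load(pte_addr, cached) == new_pte
```

In this model `refill_sees_update(True)` is false, because the new PTE
never leaves CPU 0's cache and CPU 1 keeps hitting its own stale copy,
while `refill_sees_update(False)` is true. That is the behavior the
uncached page table mappings rely on.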
README_WINDOWS.rst
# How Xtensa register windows work

There is a paucity of introductory material on this subject, and
Zephyr plays some tricks here that require understanding the base
layer.

## Hardware

When register windows are configured in the CPU, there are either 32
or 64 "real" registers in hardware, with 16 visible at one time.
Registers are grouped and rotated in units of 4, so there are 8 or 16
such "quads" (my term, not Tensilica's) in hardware of which 4 are
visible as A0-A15.

The first quad (A0-A3) is pointed to by a special register called
WINDOWBASE. The register file is cyclic, so for example if NREGS==64
and WINDOWBASE is 15, quads 15, 0, 1, and 2 will be visible as
(respectively) A0-A3, A4-A7, A8-A11, and A12-A15.

There is a ROTW instruction that can be used to manually rotate the
window by an immediate number of quads that are added to WINDOWBASE.
Positive rotations "move" high registers into low registers
(i.e. after "ROTW 1" the register that used to be called A4 is now
A0).

There are CALL4/CALL8/CALL12 instructions to effect rotated calls
which rotate registers upward (i.e. "hiding" low registers from the
callee) by 1, 2 or 3 quads. These do not rotate the window
themselves. Instead they place the rotation amount in two places
(yes, two; see below): the 2-bit CALLINC field of the PS register, and
the top two bits of the return address placed in A0.

There is an ENTRY instruction that does the rotation. It adds CALLINC
to WINDOWBASE, at the same time copying the old (now hidden) stack
pointer in A1 into the "new" A1 in the rotated frame, subtracting an
immediate offset from it to make space for the new frame.

There is a RETW instruction that undoes the rotation. It reads the
top two bits from the return address in A0 and subtracts that value
from WINDOWBASE before returning. This is why the CALLINC bits went
in two places.
They have to be stored on the stack across potentially
many calls, so they need to be GPR data that lives in registers and
can be spilled. But ENTRY isn't specified to assume a particular
return value format and is used immediately, so it makes more sense
for it to use processor state instead.

Note that we still don't know how to detect when the register file has
wrapped around and needs to be spilled or filled. To do this there is
a WINDOWSTART register used to detect which register quads are in use.
The name "start" is somewhat confusing; this is not a pointer.
WINDOWSTART stores a bitmask with one bit per hardware quad (so it's 8
or 16 bits wide). The bit in WINDOWSTART corresponding to WINDOWBASE
will be set by the ENTRY instruction, and remain set after rotations
until cleared by a function return (by RETW, see below). Other bits
stay zero. So there is one set bit in WINDOWSTART corresponding to
each call frame that is live in hardware registers, and it will be
followed by 0, 1 or 2 zero bits that tell you how "big" (how many
quads of registers) that frame is.

So the CPU executing RETW checks to make sure that the register quad
being brought into A0-A3 (i.e. the new WINDOWBASE) has a set bit
indicating it's valid. If it does not, the registers must have been
spilled and the CPU traps to an exception handler to fill them.

Likewise, the processor can tell if a high register is "owned" by
another call by seeing if there is a one in WINDOWSTART between that
register's quad and WINDOWBASE. If there is, the CPU traps to a spill
handler to spill one frame. Note that a frame might be only four
registers, but it's possible to hit registers 12 out from WINDOWBASE,
so it's actually possible to trap again when the instruction restarts
to spill a second quad, and even a third time at maximum.
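
The WINDOWBASE/WINDOWSTART bookkeeping described so far can be modeled
in a few lines of Python. This is a simplified sketch that follows the
prose above rather than the hardware specification; all function names
are hypothetical, and `nquads` is 8 or 16 depending on whether the CPU
has 32 or 64 physical registers.

```python
def visible_quads(windowbase, nquads=16):
    """Hardware quads currently visible as A0-A3, A4-A7, A8-A11, A12-A15."""
    return [(windowbase + i) % nquads for i in range(4)]

def bit(windowstart, quad, nquads=16):
    """WINDOWSTART bit for a quad index, wrapping around the register file."""
    return (windowstart >> (quad % nquads)) & 1

def frame_quads(windowstart, quad, nquads=16):
    """Size in quads of the frame starting at `quad`: one, plus the zero
    bits that follow its set bit, capped at 3.  (For the newest frame
    nothing follows yet, so this just returns the cap.)"""
    assert bit(windowstart, quad, nquads), "no live frame starts here"
    size = 1
    while size < 3 and not bit(windowstart, quad + size, nquads):
        size += 1
    return size

def needs_spill(windowstart, windowbase, reg, nquads=16):
    """True if touching A<reg> would hit a quad owned by another frame,
    i.e. a set bit lies between WINDOWBASE and that register's quad."""
    return any(bit(windowstart, windowbase + i, nquads)
               for i in range(1, reg // 4 + 1))

def retw_needs_fill(windowstart, windowbase, callinc, nquads=16):
    """True if RETW would rotate back to a quad whose frame was spilled."""
    return not bit(windowstart, windowbase - callinc, nquads)
```

For instance, `visible_quads(15)` gives `[15, 0, 1, 2]`, matching the
NREGS==64 example earlier. With 8 quads and live frames starting at
quads 7, 1, 3 and 5 (WINDOWSTART 0xAA, WINDOWBASE 5), touching A8 lands
on quad 7, which belongs to the oldest frame, so
`needs_spill(0xAA, 5, 8, nquads=8)` is true.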

Finally: note that hardware checks the two bits of WINDOWSTART after
the frame bit to detect how many quads are represented by the one
frame. So there are six separate exception handlers to spill/fill
1/2/3 quads of registers.

## Software & ABI

The advantage of the scheme above is that it allows the registers to
be spilled naturally into the stack by using the stack pointers
embedded in the register file. But the hardware design assumes and to
some extent enforces a fairly complicated stack layout to make that
work:

The spill area for a single frame's A0-A3 registers is not in its own
stack frame. It lies in the 16 bytes below its CALLEE's stack
pointer. This is so that the callee (and exception handlers invoked
on its behalf) can see its caller's potentially-spilled stack pointer
register (A1) on the stack and be able to walk back up on return.
Other architectures do this too by e.g. pushing the incoming stack
pointer onto the stack as a standard "frame pointer" defined in the
platform ABI. Xtensa wraps this together with the natural spill area
for register windows.

By convention spill regions always store the lowest numbered register
in the lowest address.

The spill area for a frame's A4-A11 registers may or may not exist
depending on whether the call was made with CALL8/CALL12. It is legal
to write a function using only A0-A3 and CALL4 calls and ignore higher
registers. But if those 0-2 register quads are in use, they appear at
the top of the stack frame, immediately below the parent call's A0-A3
spill area.

There is no spill area for A12-A15. Those registers are always
caller-save. When using CALLn, you always need to overlap 4 registers
to provide arguments and take a return value.
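
As an illustration of the A0-A3 spill layout, here is a small Python
sketch that walks back up a stack whose frames have all been spilled.
It is not Zephyr's backtrace code: the function names are hypothetical,
memory is modeled as a plain dictionary of 32-bit words, and the walk
simply stops at the first frame with no spilled data.

```python
def caller_spill_slot(callee_sp, reg):
    """Address where the caller's A<reg> (0-3) is spilled: the 16 bytes
    below the callee's stack pointer, lowest register at lowest address."""
    if not 0 <= reg <= 3:
        raise ValueError("only A0-A3 live in this spill area")
    return callee_sp - 16 + 4 * reg

def walk_stack(memory, sp, max_frames=16):
    """Return a list of (return_word, caller_sp, callinc), one per caller.

    `memory` maps word addresses to 32-bit values.  The top two bits of
    the spilled A0 hold the CALL4/8/12 rotation amount, as described in
    the Hardware section.
    """
    frames = []
    while len(frames) < max_frames:
        a0 = memory.get(caller_spill_slot(sp, 0))   # caller's return word
        a1 = memory.get(caller_spill_slot(sp, 1))   # caller's stack pointer
        if a0 is None or a1 is None:
            break                                   # nothing spilled here
        frames.append((a0, a1, a0 >> 30))
        sp = a1                                     # continue from the caller
    return frames
```

Given a callee stack pointer of 0x1000, the caller's A0 is read from
0xFF0 and its A1 from 0xFF4; the walk then repeats from the recovered
A1 value.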