# Xtensa MMU Operation

As with other elements of the architecture, paged virtual memory
management on Xtensa is unusual, and little introductory material is
available.  This document is an attempt to introduce the architecture
at an overview/tutorial level, and to describe Zephyr's specific
implementation choices.

## General TLB Operation

The Xtensa MMU operates on top of a fairly conventional TLB.
The TLB stores virtual to physical translations for individual pages of
memory.  It is partitioned into an automatically managed
4-way-set-associative bank of entries mapping 4k pages, and 3-6
"special" ways storing mappings under OS control.  Some of these are
for mapping pages larger than 4k, which Zephyr does not directly
support.  A few are for bootstrap and initialization, and will be
discussed below.

Like the L1 cache, the TLB is split into separate instruction and data
entries.  Zephyr manages both as needed, but symmetrically.  The
architecture technically supports separately-virtualized instruction
and data spaces, but the hardware page table refill mechanism (see
below) does not, and Zephyr's memory spaces are unified regardless.

The TLB may be loaded with permissions and attributes controlling
cacheability, access control based on ring (i.e. the contents of the
RING field of the PS register) and togglable write and execute access.
Memory access, even with a matching TLB entry, may therefore create
Kernel/User exceptions as desired to enforce permissions choices on
userspace code.

Live TLB entries are tagged with an 8-bit "ASID" value derived from
the ring field of the PTE that loaded them, via a simple translation
specified in the RASID special register.  The intent is that each
non-kernel address space will get a separate ring 3 ASID set in RASID,
such that you can switch between them without a TLB flush.  The ASID
value of ring zero is fixed at 1 and may not be changed.  (An ASID
value of zero is used to tag an invalid/unmapped TLB entry at
initialization, but this mechanism isn't accessible to OS code except
in special circumstances, and in any case there is already an invalid
attribute value that can be used in a PTE).
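
As an illustration of the RASID translation, here is a minimal sketch
that retags ring 3 with a new ASID and looks up the ASID for a given
ring.  It assumes the layout in which the ASID for ring N occupies
bits 8N+7..8N of RASID; the helper names are hypothetical and are not
Zephyr's API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the ASID for ring N lives in bits [8N+7:8N] of RASID. */
#define RASID_RING_SHIFT(ring) ((ring) * 8u)

/* Return a copy of rasid with ring 3 retagged to user_asid
 * (hypothetical helper, for illustration only).
 */
static inline uint32_t rasid_with_user_asid(uint32_t rasid, uint8_t user_asid)
{
    /* ASID 0 tags invalid entries and ASID 1 is fixed for ring 0. */
    assert(user_asid >= 2);

    rasid &= ~(0xffu << RASID_RING_SHIFT(3));
    rasid |= (uint32_t)user_asid << RASID_RING_SHIFT(3);
    return rasid;
}

/* ASID that a TLB entry loaded from a PTE with this ring would carry. */
static inline uint8_t rasid_asid_for_ring(uint32_t rasid, unsigned int ring)
{
    assert(ring < 4);
    return (uint8_t)(rasid >> RASID_RING_SHIFT(ring));
}
```

Switching address spaces in this model only changes the ring 3 byte;
entries tagged with the previous ASID stay in the TLB but stop
matching.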

## Virtually-mapped Page Tables

Xtensa has a unique (and, to someone exposed for the first time,
extremely confusing) "page table" format.  The simplest way to begin
to explain this is just to describe the (quite simple) hardware
behavior:

On a TLB miss, the hardware immediately does a single fetch (at ring 0
privilege) from RAM by adding the "desired address right shifted by
10 bits with the bottom two bits set to zero" (i.e. the page frame
number in units of 4 bytes) to the value in the PTEVADDR special
register.  If this load succeeds, then the word is treated as a PTE
with which to fill the TLB and use for a (restarted) memory access.
This is extremely simple (just one extra hardware state that does just
one thing the hardware can already do), and quite fast (only one
memory fetch vs. e.g. the 2-5 fetches required to walk a page table on
x86).
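
The address arithmetic described above can be sketched in C.  This is
only a model of the computation, not code that runs in the refill
path (which is done by hardware); the function name is hypothetical,
and PTEVADDR is assumed to be aligned to 4M.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address from which the PTE for vaddr is fetched on a TLB miss. */
static inline uint32_t pte_fetch_vaddr(uint32_t ptevaddr, uint32_t vaddr)
{
    /* Assumption: the page table region is aligned to its 4M size. */
    assert((ptevaddr & 0x3fffffu) == 0);

    /* (vaddr >> 12) is the page number and each PTE is 4 bytes, so the
     * byte offset is (vaddr >> 10) with the bottom two bits cleared.
     */
    return ptevaddr + ((vaddr >> 10) & ~0x3u);
}
```

For example, with PTEVADDR at 0x20000000 an access to 0x00001234
(page 1) fetches its PTE from 0x20000004.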

This special "refill" fetch is otherwise identical to any other memory
access, meaning it too uses the TLB to translate from a virtual to
physical address.  Which means that the page tables occupy a 4M region
of virtual, not physical, address space, in the same memory space
occupied by the running code.  The 1024 pages in that range (not all
of which might be mapped in physical memory) are a linear array of
1048576 4-byte PTE entries, each describing a mapping for 4k of
virtual memory.  Note especially that exactly one of those pages
contains the 1024 PTE entries for the 4M page table itself, pointed to
by PTEVADDR.
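
The sizes above, and the location of the self-mapping page, follow
from simple arithmetic.  The sketch below uses hypothetical macro and
function names and assumes a PTEVADDR aligned to 4M: each page table
page holds 1024 PTEs and so covers 4M of virtual space, which means
the page covering the page table region itself is page number
PTEVADDR / 4M within that region.

```c
#include <assert.h>
#include <stdint.h>

#define XT_PAGE_SIZE      4096u
#define XT_PTE_SIZE       4u
#define XT_PTES_PER_PAGE  (XT_PAGE_SIZE / XT_PTE_SIZE)   /* 1024 */
#define XT_PT_PAGES       1024u
#define XT_PT_REGION_SIZE (XT_PT_PAGES * XT_PAGE_SIZE)   /* 4M */

/* Index (0..1023) of the page table page that maps the page table
 * region itself.
 */
static inline uint32_t l1_page_index(uint32_t ptevaddr)
{
    assert(ptevaddr % XT_PT_REGION_SIZE == 0);
    return ptevaddr / XT_PT_REGION_SIZE;
}

/* Virtual address of that page. */
static inline uint32_t l1_page_vaddr(uint32_t ptevaddr)
{
    return ptevaddr + l1_page_index(ptevaddr) * XT_PAGE_SIZE;
}
```

With PTEVADDR at 0x20000000 the self-mapping page is index 128, at
virtual address 0x20080000.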

Obviously, the page table memory being virtual means that the fetch
can fail: there are 1024 possible pages in a complete page table
covering all of memory, and the ~16 entry TLB clearly won't contain
entries mapping all of them.  If we are missing a TLB entry for the
page translation we want (NOT for the original requested address, we
already know we're missing that TLB entry), the hardware has exactly
one more special trick: it throws a TLB Miss exception (there are two,
one each for instruction/data TLBs, but in Zephyr they operate
identically).

The job of that exception handler is simply to ensure that the TLB has
an entry for the page table page we want.  And the simplest way to do
that is to just issue a load from the faulting PTE's address, which will
then go through the same refill process above.  This second TLB fetch in
the exception handler may result in an invalid/inapplicable mapping
within the 4M page table region.  This is a typical/expected runtime
fault, and simply indicates unmapped memory.  The result is a TLB miss
exception from within the TLB miss exception handler (i.e. while the
EXCM bit is set).  This will produce a Double Exception fault, which
is handled by the OS identically to a general Kernel/User data access
prohibited exception.

After the TLB refill exception, the original faulting instruction is
restarted, which retries the refill process, which succeeds in
fetching a new TLB entry, which is then used to service the original
memory access.  (And may then result in yet another exception if it
turns out that the TLB entry doesn't permit the access requested, of
course.)
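
The possible outcomes of a single memory access can be summarized as
a small decision model.  This is a toy sketch with hypothetical
names: it ignores permission checks and assumes that any PTE the
hardware manages to fetch is valid.

```c
#include <assert.h>
#include <stdbool.h>

enum access_result {
    ACCESS_TLB_HIT,          /* translation already in the TLB */
    ACCESS_HW_REFILL,        /* one hardware PTE fetch, no exception */
    ACCESS_MISS_EXCEPTION,   /* handler maps the PTE page, then retry */
    ACCESS_DOUBLE_EXCEPTION, /* PTE page unmapped: treated as a fault */
};

struct tlb_state {
    bool has_target;        /* TLB entry for the requested address */
    bool has_pte_page;      /* TLB entry for the page holding its PTE */
    bool l1_maps_pte_page;  /* pinned L1 page has a valid PTE for that page */
};

/* What happens on the next attempt at the access. */
static inline enum access_result classify_access(struct tlb_state s)
{
    if (s.has_target) {
        return ACCESS_TLB_HIT;
    }
    if (s.has_pte_page) {
        return ACCESS_HW_REFILL;
    }
    if (s.l1_maps_pte_page) {
        return ACCESS_MISS_EXCEPTION;
    }
    return ACCESS_DOUBLE_EXCEPTION;
}

/* Retry until the access hits.  Returns the number of TLB miss
 * exceptions taken, or -1 if the access ends in a double exception.
 */
static inline int run_access(struct tlb_state s)
{
    int exceptions = 0;

    for (;;) {
        switch (classify_access(s)) {
        case ACCESS_TLB_HIT:
            return exceptions;
        case ACCESS_HW_REFILL:
            s.has_target = true;
            break;
        case ACCESS_MISS_EXCEPTION:
            s.has_pte_page = true;
            exceptions++;
            assert(exceptions <= 1); /* at most one per access here */
            break;
        case ACCESS_DOUBLE_EXCEPTION:
            return -1;
        }
    }
}
```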

## Special Cases

The page-tables-specified-in-virtual-memory trick works very well in
practice.  But it does have a chicken/egg problem with the initial
state.  Because everything depends on state in the TLB, something
needs to tell the hardware how to find a physical address using the
TLB to begin the process.  Here we exploit the separate
non-automatically-refilled TLB ways to store bootstrap records.

First, note that the refill process to load a PTE requires that the 4M
space of PTE entries be resolvable by the TLB directly, without
requiring another refill.  This 4M mapping is provided by a single
page of PTE entries (which itself lives in the 4M page table region!).
This page must always be in the TLB.

Thankfully, for the data TLB Xtensa provides 3 special/non-refillable
ways (ways 7-9) with at least one 4k page mapping each.  We can use
one of these to "pin" the top-level page table entry in place,
ensuring that a refill access will be able to find a PTE address.

But now note that the load from that PTE address for the refill is
done in an exception handler.  And running an exception handler
requires doing a fetch via the instruction TLB.  And that obviously
means that the page(s) containing the exception handler must never
require a refill exception of its own.

Ideally we would just pin the vector/handler page in the ITLB in the
same way we do for data, but somewhat inexplicably, Xtensa does not
provide 4k "pinnable" ways in the instruction TLB (frankly this seems
like a design flaw).

Instead, we load ITLB entries for vector handlers via the refill
mechanism using the data TLB, and so need the refill mechanism for the
vector page to succeed always.  The way to do this is to similarly pin
the page table page containing the (single) PTE for the vector page in
the data TLB, such that instruction fetches always find their TLB
mapping via refill, without requiring an exception.
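
To make the two pinned mappings concrete, the following sketch
computes which page table pages need to be pinned in the data TLB for
a given PTEVADDR and vector address.  The struct and function names
are hypothetical, and PTEVADDR is assumed to be aligned to 4M.

```c
#include <assert.h>
#include <stdint.h>

struct pinned_pages {
    uint32_t l1_page;     /* page table page mapping the page table itself */
    uint32_t vector_page; /* page table page holding the vector page's PTE */
};

/* Virtual addresses of the page table pages to pin in the data TLB. */
static inline struct pinned_pages pages_to_pin(uint32_t ptevaddr,
                                               uint32_t vector_vaddr)
{
    struct pinned_pages p;

    /* Assumption: the page table region is aligned to its 4M size. */
    assert((ptevaddr & 0x3fffffu) == 0);

    /* Each page table page holds 1024 PTEs and so covers 4M of virtual
     * space: select the page with (vaddr >> 22), then scale by 4k.
     */
    p.l1_page = ptevaddr + ((ptevaddr >> 22) << 12);
    p.vector_page = ptevaddr + ((vector_vaddr >> 22) << 12);
    return p;
}
```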

## Initialization

Unlike most other architectures, Xtensa does not have a "disable" mode
for the MMU.  Virtual address translation through the TLB is active at
all times.  There therefore needs to be a mechanism for the CPU to
execute code before the OS is able to initialize a refillable page
table.

The way Xtensa resolves this (on the hardware Zephyr supports, see the
note below) is to have an 8-entry set ("way 6") of 512M pages able to
cover all of memory.  These 8 entries are initialized as valid, with
attributes specifying that they are accessible only to an ASID of 1
(i.e. the fixed ring zero / kernel ASID), writable, executable, and
uncached.  So at boot the CPU relies on these TLB entries to provide a
clean view of hardware memory.
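
As a quick check on the arithmetic, 8 entries of 512M cover the full
4G address space, and the entry covering an address is selected by
its top three bits.  A minimal sketch, with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define WAY6_ENTRIES    8u
#define WAY6_PAGE_SHIFT 29u /* 512M == 1 << 29 */

/* Which of the 8 way 6 entries covers vaddr. */
static inline unsigned int way6_index(uint32_t vaddr)
{
    unsigned int idx = vaddr >> WAY6_PAGE_SHIFT;

    assert(idx < WAY6_ENTRIES);
    return idx;
}

/* First virtual address covered by a way 6 entry. */
static inline uint32_t way6_base(unsigned int idx)
{
    assert(idx < WAY6_ENTRIES);
    return (uint32_t)idx << WAY6_PAGE_SHIFT;
}
```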

But that means that enabling page-level translation requires some
care, as the CPU will throw an exception ("multi hit") if a memory
access matches more than one live entry in the TLB.  The
initialization algorithm is therefore:

0. Start with a fully-initialized page table layout, including the
   top-level "L1" page containing the mappings for the page table
   itself.

1. Ensure that the initialization routine does not cross a page
   boundary (to prevent stray TLB refill exceptions), that it occupies
   a different 4k page from the exception vectors (which we must
   temporarily double-map), and that it operates entirely in registers
   (to avoid doing memory access at inopportune moments).

2. Pin the L1 page table PTE into the data TLB.  This creates a double
   mapping condition, but it is safe as nothing will use it until we
   start refilling.

3. Pin the page table page containing the PTE for the TLB miss
   exception handler into the data TLB.  This will likewise not be
   accessed until the double map condition is resolved.

4. Set PTEVADDR appropriately.  The CPU state to handle refill
   exceptions is now complete, but cannot be used until we resolve the
   double mappings.

5. Disable the initial/way6 data TLB entries first, by setting them to
   an ASID of zero.  This is safe as the code being executed is not
   doing data accesses yet (including refills), and will resolve the
   double mapping conditions we created above.

6. Disable the initial/way6 instruction TLBs second.  The very next
   instruction following the invalidation of the currently-executing
   code page will then cause a TLB refill exception, which will work
   normally because we just resolved the final double-map condition.
   (Pedantic note: if the vector page and the currently-executing page
   are in different 512M way6 pages, disable the mapping for the
   exception handlers first so the trap from our current code can be
   handled.  Currently Zephyr doesn't handle this condition, as in all
   reasonable hardware these regions will be near each other.)
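
The double-mapping hazard that this ordering avoids can be shown with
a toy TLB model.  This is a sketch for illustration only: the types
and names are hypothetical, and it simplifies the hardware by
treating every entry with a nonzero ASID as live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_tlb_entry {
    uint32_t base; /* virtual base address, aligned to size */
    uint32_t size; /* page size in bytes, a power of two */
    uint8_t asid;  /* 0 means invalid/disabled */
};

/* Count live entries that match vaddr.  More than one match is the
 * condition that would raise a "multi hit" exception on access.
 */
static inline unsigned int toy_tlb_hits(const struct toy_tlb_entry *tlb,
                                        size_t n, uint32_t vaddr)
{
    unsigned int hits = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t size = tlb[i].size;

        assert(size != 0 && (size & (size - 1u)) == 0);
        if (tlb[i].asid != 0 && (vaddr & ~(size - 1u)) == tlb[i].base) {
            hits++;
        }
    }
    return hits;
}
```

With one 512M boot entry at 0x20000000 and one pinned 4k entry at
0x20080000, an address inside the pinned page has two matches.
Setting the boot entry's ASID to zero, as in step 5, leaves exactly
one.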

Note: there is a different variant of the Xtensa MMU architecture
where the way 5/6 pages are immutable, and specify a set of
unchangeable mappings from the final 384M of memory to the bottom and
top of physical memory.  The intent here would (presumably) be that
these would be used by the kernel for all physical memory and that the
remaining memory space would be used for virtual mappings.  This
doesn't match Zephyr's architecture well, as we tend to assume
page-level control over physical memory (e.g. .text/.rodata is cached
but .data is not on SMP, etc...).  And in any case we don't have any
such hardware to experiment with.  But with a little address
translation we could support this.

## ASID vs. Virtual Mapping

The ASID mechanism in Xtensa works like other architectures, and is
intended to be used similarly.  The intent of the design is that at
context switch time, you can simply change RASID and the page table
data, and leave any existing mappings in place in the TLB using the
old ASID value(s).  So in the common case where you switch back,
nothing needs to be flushed.

Unfortunately this runs afoul of the virtual mapping of the page
refill: data TLB entries storing the 4M page table mapping space are
stored at ASID 1 (ring 0), so they can't change when the page tables
change!  So this region naively would have to be flushed, which is
tantamount to flushing the entire TLB regardless (the TLB is much
smaller than the 1024-page PTE array).

The resolution in Zephyr is to give each ASID its own PTEVADDR mapping
in virtual space, such that the page tables don't overlap.  This is
expensive in virtual address space: assigning 4M of space to each of
the 256 ASIDs (actually 254 as 0 and 1 are never used by user access)
would take a full gigabyte of address space.  Zephyr optimizes this a
bit by deriving a unique sequential ASID from the hardware address of
the statically allocated array of L1 page table pages.
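
One way such a scheme could look is sketched below.  This is an
assumption-laden illustration rather than Zephyr's actual code: the
function names, the choice of 2 as the first user ASID, and the idea
of a single contiguous region holding the per-ASID page tables are
all hypothetical.  It shows an ASID derived from the position of an
L1 page within a static array, and a 4M PTEVADDR window selected from
that ASID.  (256 windows of 4M each is where the full gigabyte
figure comes from.)

```c
#include <assert.h>
#include <stdint.h>

#define XT_PAGE_SIZE      4096u
#define XT_PT_REGION_SIZE 0x400000u /* 4M of virtual space per page table */
#define FIRST_USER_ASID   2u        /* 0 = invalid, 1 = ring 0 */

/* Derive a sequential ASID from where an L1 page sits in a static
 * array of L1 pages (hypothetical scheme).
 */
static inline uint8_t asid_for_l1_page(uintptr_t l1_array_base,
                                       uintptr_t l1_page,
                                       unsigned int n_pages)
{
    uintptr_t idx = (l1_page - l1_array_base) / XT_PAGE_SIZE;

    assert(l1_page >= l1_array_base && idx < n_pages);
    assert(idx + FIRST_USER_ASID <= 255u);
    return (uint8_t)(idx + FIRST_USER_ASID);
}

/* PTEVADDR for an ASID: one non-overlapping 4M window per ASID. */
static inline uint32_t ptevaddr_for_asid(uint32_t region_base, uint8_t asid)
{
    assert(asid >= FIRST_USER_ASID);
    return region_base + (uint32_t)(asid - FIRST_USER_ASID) * XT_PT_REGION_SIZE;
}
```

Only as many 4M windows as there are L1 pages in the array are
consumed this way, rather than one for every possible ASID.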

Note, obviously, that any change of the mappings within an ASID
(e.g. to re-use it for another memory domain, or just for any runtime
mapping change other than mapping previously-unmapped pages) still
requires a TLB flush, and always will.

## SMP/Cache Interaction

A final important note is that the hardware PTE refill fetch works
like any other CPU memory access, and in particular it is governed by
the cacheability attributes of the TLB entry through which it was
loaded.  This means that if the page table entries are marked
cacheable, then the hardware TLB refill process will be downstream of
the L1 data cache on the CPU.  If the physical memory storing page
tables has been accessed recently by the CPU (for a refill of another
page mapped within the same cache line, or to change the tables) then
the refill will be served from the data cache and not main memory.

This may or may not be desirable depending on access patterns.  It
lets the L1 data cache act as a "L2 TLB" for applications with a lot
of access variability.  But it also means that the TLB entries end up
being stored twice in the same CPU, wasting transistors that could
presumably store other useful data.

But it is also important to note that the L1 data cache on Xtensa is
incoherent!  The cache being used for refill reflects the last access
on the current CPU only, and not of the underlying memory being
mapped.  Page table changes in the data cache of one CPU will be
invisible to the data cache of another.  There is no simple way of
notifying another CPU of changes to page mappings beyond doing
system-wide flushes on all CPUs every time a memory domain is
modified.

The result is that, when SMP is enabled, Zephyr must ensure that all
page table mappings in the system are set uncached.  The OS makes no
attempt to bolt on a software coherence layer.