.. _smp_arch:

Symmetric Multiprocessing
#########################

On multiprocessor architectures, Zephyr supports the use of multiple
physical CPUs running Zephyr application code. This support is
"symmetric" in the sense that no specific CPU is treated specially by
default. Any processor is capable of running any Zephyr thread, with
access to all standard Zephyr APIs supported.

No special application code needs to be written to take advantage of
this feature. If there are two Zephyr application threads runnable on
a supported dual processor device, they will both run simultaneously.

SMP configuration is controlled under the :kconfig:option:`CONFIG_SMP` kconfig
variable. This must be set to "y" to enable SMP features, otherwise
a uniprocessor kernel will be built. In general the platform default
will have enabled this anywhere it's supported. When enabled, the
number of physical CPUs available is visible at build time as
:kconfig:option:`CONFIG_MP_MAX_NUM_CPUS`. Likewise, the default for this will be the
number of available CPUs on the platform and it is not expected that
typical apps will change it. But it is legal and supported to set
this to a smaller (but obviously not larger) number for special
purposes (e.g. for testing, or to reserve a physical CPU for running
non-Zephyr code).

Synchronization
***************

At the application level, core Zephyr IPC and synchronization
primitives all behave identically under an SMP kernel. For example
semaphores used to implement blocking mutual exclusion continue to be
a proper application choice.

At the lowest level, however, Zephyr code has often used the
:c:func:`irq_lock`/:c:func:`irq_unlock` primitives to implement fine grained
critical sections using interrupt masking.
These APIs continue to
work via an emulation layer (see below), but the masking technique
does not: the fact that your CPU will not be interrupted while you are
in your critical section says nothing about whether a different CPU
will be running simultaneously and be inspecting or modifying the same
data!

Spinlocks
=========

SMP systems provide a more constrained :c:func:`k_spin_lock` primitive
that not only masks interrupts locally, as done by :c:func:`irq_lock`, but
also atomically validates that a shared lock variable has been
modified before returning to the caller, "spinning" on the check if
needed to wait for the other CPU to exit the lock. The default Zephyr
implementation of :c:func:`k_spin_lock` and :c:func:`k_spin_unlock` is built
on top of the pre-existing :c:struct:`atomic_` layer (itself usually
implemented using compiler intrinsics), though facilities exist for
architectures to define their own for performance reasons.

One important difference between IRQ locks and spinlocks is that the
earlier API was naturally recursive: the lock was global, so it was
legal to acquire a nested lock inside of a critical section.
Spinlocks are separable: you can have many locks for separate
subsystems or data structures, preventing CPUs from contending on a
single global resource. But that means that spinlocks must not be
used recursively. Code that holds a specific lock must not try to
re-acquire it or it will deadlock (it is perfectly legal to nest
**distinct** spinlocks, however). A validation layer is available to
detect and report bugs like this.

When used on a uniprocessor system, the data component of the spinlock
(the atomic lock variable) is unnecessary and elided. Except for the
recursive semantics above, spinlocks in single-CPU contexts produce
identical code to legacy IRQ locks. In fact the entirety of the
Zephyr core kernel has now been ported to use spinlocks exclusively.
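
The semantics described in this section can be illustrated with a small
stand-alone model. The sketch below is not Zephyr's implementation:
``model_spinlock``, ``model_lock``, ``model_unlock`` and
``model_irqs_enabled`` are hypothetical names, the code uses C11
``<stdatomic.h>`` directly, and interrupt masking is only simulated with a
plain flag. In Zephyr code the corresponding calls are
:c:func:`k_spin_lock`, which returns a key, and :c:func:`k_spin_unlock`,
which takes that key back.

.. code-block:: c

   #include <stdatomic.h>
   #include <stdbool.h>

   /* Hypothetical stand-in for the local CPU's interrupt-enable state. */
   static bool model_irqs_enabled = true;

   /* One lock per protected data structure; distinct locks are independent. */
   struct model_spinlock {
       atomic_bool locked;
   };

   /*
    * Mask "interrupts" locally, then spin until the shared flag has been
    * changed from unlocked to locked by this caller. The previous interrupt
    * state is returned as the key so that nested, distinct locks restore it
    * correctly.
    */
   static bool model_lock(struct model_spinlock *l)
   {
       bool key = model_irqs_enabled;

       model_irqs_enabled = false;
       while (atomic_exchange_explicit(&l->locked, true,
                                       memory_order_acquire)) {
           /* Another CPU holds the lock: keep spinning. */
       }
       return key;
   }

   /* Release the shared flag, then restore the saved interrupt state. */
   static void model_unlock(struct model_spinlock *l, bool key)
   {
       atomic_store_explicit(&l->locked, false, memory_order_release);
       model_irqs_enabled = key;
   }

   /* Example use: a counter protected by its own lock. */
   static struct model_spinlock counter_lock;
   static int counter;

   static int counter_increment(void)
   {
       bool key = model_lock(&counter_lock);
       int value = ++counter;

       model_unlock(&counter_lock, key);
       return value;
   }

Calling ``model_lock()`` a second time on ``counter_lock`` before unlocking
would spin forever, which is the deadlock described above. Taking a second,
distinct ``model_spinlock`` inside the critical section is fine, because
each unlock restores the interrupt state saved by its matching lock.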

Legacy irq_lock() emulation
===========================

For the benefit of applications written to the uniprocessor locking
API, :c:func:`irq_lock` and :c:func:`irq_unlock` continue to work compatibly on
SMP systems with identical semantics to their legacy versions. They
are implemented as a single global spinlock, with a nesting count and
the ability to be atomically reacquired on context switch into locked
threads. The kernel will ensure that only one thread across all CPUs
can hold the lock at any time, that it is released on context switch,
and that it is re-acquired when necessary to restore the lock state
when a thread is switched in. Other CPUs will spin waiting for the
release to happen.

The overhead involved in this process has measurable performance
impact, however. Unlike uniprocessor apps, SMP apps using
:c:func:`irq_lock` are not simply invoking a very short (often ~1
instruction) interrupt masking operation. That, and the fact that the
IRQ lock is global, means that code expecting to be run in an SMP
context should be using the spinlock API wherever possible.

CPU Mask
********

It is often desirable for real time applications to deliberately
partition work across physical CPUs instead of relying solely on the
kernel scheduler to decide on which threads to execute. Zephyr
provides an API, controlled by the :kconfig:option:`CONFIG_SCHED_CPU_MASK`
kconfig variable, which can associate a specific set of CPUs with each
thread, indicating on which CPUs it can run.

By default, new threads can run on any CPU. Calling
:c:func:`k_thread_cpu_mask_disable` with a particular CPU ID will prevent
that thread from running on that CPU in the future. Likewise
:c:func:`k_thread_cpu_mask_enable` will re-enable execution. There are also
:c:func:`k_thread_cpu_mask_clear` and :c:func:`k_thread_cpu_mask_enable_all` APIs
available for convenience.
For obvious reasons, these APIs are
illegal if called on a runnable thread. The thread must be blocked or
suspended, otherwise an ``-EINVAL`` will be returned.

Note that when this feature is enabled, the scheduler algorithm
involved in doing the per-CPU mask test requires that the list be
traversed in full. The kernel does not keep a per-CPU run queue.
That means that the performance benefits from the
:kconfig:option:`CONFIG_SCHED_SCALABLE` and :kconfig:option:`CONFIG_SCHED_MULTIQ`
scheduler backends cannot be realized. CPU mask processing is
available only when :kconfig:option:`CONFIG_SCHED_DUMB` is the selected
backend. This requirement is enforced in the configuration layer.

SMP Boot Process
****************

A Zephyr SMP kernel begins boot identically to a uniprocessor kernel.
Auxiliary CPUs begin in a disabled state in the architecture layer.
All standard kernel initialization, including device initialization,
happens on a single CPU before other CPUs are brought online.

Just before entering the application :c:func:`main` function, the kernel
calls :c:func:`z_smp_init` to launch the SMP initialization process. This
enumerates over the configured CPUs, calling into the architecture
layer using :c:func:`arch_cpu_start` for each one. This function is
passed a memory region to use as a stack on the foreign CPU (in
practice it uses the area that will become that CPU's interrupt
stack), the address of a local :c:func:`smp_init_top` callback function to
run on that CPU, and a pointer to a "start flag" address which will be
used as an atomic signal.

The local SMP initialization (:c:func:`smp_init_top`) on each CPU is then
invoked by the architecture layer. Note that interrupts are still
masked at this point. This routine is responsible for calling
:c:func:`smp_timer_init` to set up any needed state in the timer driver.
On
many architectures the timer is a per-CPU device and needs to be
configured specially on auxiliary CPUs. Then it waits (spinning) for
the atomic "start flag" to be released in the main thread, to
guarantee that all SMP initialization is complete before any Zephyr
application code runs, and finally calls :c:func:`z_swap` to transfer
control to the appropriate runnable thread via the standard scheduler
API.

.. figure:: smpinit.svg
   :align: center
   :alt: SMP Initialization
   :figclass: align-center

   Example SMP initialization process, showing a configuration with
   two CPUs and two app threads which begin operating simultaneously.

Interprocessor Interrupts
*************************

When running in multiprocessor environments, it is occasionally the
case that state modified on the local CPU needs to be synchronously
handled on a different processor.

One example is the Zephyr :c:func:`k_thread_abort` API, which cannot return
until the thread that had been aborted is no longer runnable. If it
is currently running on another CPU, that becomes difficult to
implement.

Another is low power idle. It is a firm requirement on many devices
that system idle be implemented using a low-power mode with as many
interrupts (including periodic timer interrupts) disabled or deferred
as is possible. If a CPU is in such a state, and on another CPU a
thread becomes runnable, the idle CPU has no way to "wake up" to
handle the newly-runnable load.

So where possible, Zephyr SMP architectures should implement an
interprocessor interrupt. The current framework is very simple: the
architecture provides at least a :c:func:`arch_sched_broadcast_ipi` call,
which when invoked will flag an interrupt on all CPUs (except the current one,
though that is allowed behavior).
If the architecture supports directed IPIs
(see :kconfig:option:`CONFIG_ARCH_HAS_DIRECTED_IPIS`), then the
architecture also provides a :c:func:`arch_sched_directed_ipi` call, which
when invoked will flag an interrupt on the specified CPUs. When an interrupt is
flagged on the CPUs, the :c:func:`z_sched_ipi` function implemented in the
scheduler will get invoked on those CPUs. The expectation is that these
APIs will evolve over time to encompass more functionality (e.g. cross-CPU
calls), and that the scheduler-specific calls here will be implemented in
terms of a more general framework.

Note that not all SMP architectures will have a usable IPI mechanism
(either missing, or just undocumented/unimplemented). In those cases
Zephyr provides fallback behavior that is correct, but perhaps
suboptimal.

Using this, :c:func:`k_thread_abort` becomes only slightly more
complicated in SMP: for the case where a thread is actually running on
another CPU (we can detect this atomically inside the scheduler), we
broadcast an IPI and spin, waiting for the thread to either become
"DEAD" or for it to re-enter the queue (in which case we terminate it
the same way we would have in uniprocessor mode). Note that the
"aborted" check happens on any interrupt exit, so there is no special
handling needed in the IPI per se. This allows us to implement a
reasonable fallback when IPI is not available: we can simply spin,
waiting until the foreign CPU receives any interrupt, though this may
be a much longer time!

Likewise idle wakeups are trivially implementable with an empty IPI
handler. If a thread is added to an empty run queue (i.e. there may
have been idle CPUs), we broadcast an IPI. A foreign CPU will then be
able to see the new thread when exiting from the interrupt and will
switch to it if available.

Without an IPI, however, a low power idle that requires an interrupt
will not work to synchronously run new threads. The workaround in
that case is more invasive: Zephyr will **not** enter the system idle
handler and will instead spin in its idle loop, testing the scheduler
state at high frequency (not spinning on it though, as that would
involve severe lock contention) for new threads. The expectation is
that power constrained SMP applications are always going to provide an
IPI, and this code will only be used for testing purposes or on
systems without power consumption requirements.

IPI Cascades
============

The kernel cannot control the order in which IPIs are processed by the CPUs
in the system. In general, this is not an issue and a single set of IPIs is
sufficient to trigger a reschedule on the N CPUs that results in them
scheduling the highest N priority ready threads to execute. When CPU masking
is used, there may be more than one valid set of threads (not to be confused
with an optimal set of threads) that can be scheduled on the N CPUs and a
single set of IPIs may be insufficient to result in any of these valid sets.

.. note::
   When CPU masking is not in play, the optimal set of threads is the same
   as the valid set of threads. However when CPU masking is in play, there
   may be more than one valid set--one of which may be optimal.

   To better illustrate the distinction, consider a 2-CPU system with ready
   threads T1 and T2 at priorities 1 and 2 respectively. Let T2 be pinned to
   CPU0 and T1 not be pinned. If CPU0 is executing T2 and CPU1 executing T1,
   then this set is both valid and optimal. However, if CPU0 is executing
   T1 and CPU1 is idling, then this too would be valid though not optimal.
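
The two-CPU example from the note can be replayed with a small stand-alone
model. This is only a sketch under stated assumptions: ``model_thread``,
``respects_masks`` and ``busy_cpus`` are hypothetical names, a thread's
permitted CPUs are represented as a plain bitmask, and the code does not
model the scheduler or IPIs. It only checks whether an assignment of threads
to CPUs honors the masks and counts how many CPUs are doing work.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define NUM_CPUS  2
   #define NO_THREAD (-1) /* the CPU is idling */

   struct model_thread {
       int prio;          /* lower value means higher priority */
       uint32_t cpu_mask; /* bit n set: thread may run on CPU n */
   };

   /* T1 (index 0) may run anywhere; T2 (index 1) is pinned to CPU0. */
   static const struct model_thread threads[] = {
       { .prio = 1, .cpu_mask = 0x3 },
       { .prio = 2, .cpu_mask = 0x1 },
   };

   /* running[cpu] is the index of the thread on that CPU, or NO_THREAD. */
   static const int optimal_set[NUM_CPUS] = { 1, 0 };         /* T2, T1 */
   static const int valid_set[NUM_CPUS]   = { 0, NO_THREAD }; /* T1, idle */
   static const int illegal_set[NUM_CPUS] = { 0, 1 };         /* T2 on CPU1 */

   /* True if every busy CPU runs a thread whose mask permits that CPU. */
   static bool respects_masks(const struct model_thread *t,
                              const int running[NUM_CPUS])
   {
       for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
           int idx = running[cpu];

           if (idx != NO_THREAD && !(t[idx].cpu_mask & (1U << cpu))) {
               return false;
           }
       }
       return true;
   }

   /* Number of CPUs that are executing a thread rather than idling. */
   static int busy_cpus(const int running[NUM_CPUS])
   {
       int count = 0;

       for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
           if (running[cpu] != NO_THREAD) {
               count++;
           }
       }
       return count;
   }

Both ``optimal_set`` and ``valid_set`` honor the masks, but only the first
keeps both CPUs busy. In the second, T2 stays ready while CPU1 idles, because
the only CPU it may use is occupied by the higher-priority T1.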

In those cases where a single set of IPIs is not sufficient to generate a valid
set, the resulting set of executing threads is expected to be close to a valid
set, and subsequent IPIs can generally be expected to correct the situation
soon. However, for cases where neither the approximation nor the delay is
acceptable, enabling :kconfig:option:`CONFIG_SCHED_IPI_CASCADE` will allow the
kernel to generate cascading IPIs until the kernel has selected a valid set of
ready threads for the CPUs.

There are three types of costs/penalties associated with the IPI cascades--and
for these reasons they are disabled by default. The first is a cost incurred
by the CPU producing the IPI when a new thread preempts the old thread as checks
must be done to compare the old thread against the threads executing on the
other CPUs. The second is a cost incurred by the CPUs receiving the IPIs as
they must be processed. The third is the apparent sputtering of a thread as it
"winks in" and then "winks out" due to cascades stemming from the
aforementioned first cost.

SMP Kernel Internals
********************

In general, Zephyr kernel code is SMP-agnostic and, like application
code, will work correctly regardless of the number of CPUs available.
But in a few areas there are notable changes in structure or behavior.


Per-CPU data
============

Many elements of the core kernel data need to be implemented for each
CPU in SMP mode. For example, the ``arch_current_thread()`` thread pointer obviously
needs to reflect what is running locally, since many threads may be
running concurrently. Likewise a kernel-provided interrupt stack
needs to be created and assigned for each physical CPU, as does the
interrupt nesting count used to detect ISR state.
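
As a rough picture of the per-CPU state listed above, the following
stand-alone sketch keeps one record per CPU in an array indexed by CPU ID.
All names here (``model_cpu``, ``model_kernel``, ``MODEL_NUM_CPUS`` and the
field names) are hypothetical and the layout is an assumption made for
illustration; it is not the definition used by the kernel.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical; stands in for the configured number of CPUs. */
   #define MODEL_NUM_CPUS 2

   struct model_thread {
       const char *name;
   };

   /* State that must exist once per CPU rather than once per system. */
   struct model_cpu {
       struct model_thread *current; /* thread running on this CPU */
       char *irq_stack;              /* this CPU's interrupt stack */
       unsigned int nested;          /* interrupt nesting count */
   };

   struct model_kernel {
       struct model_cpu cpus[MODEL_NUM_CPUS]; /* indexed by CPU ID */
   };

   static struct model_kernel model_kernel_state;

   /* A CPU is in ISR context while its own nesting count is non-zero. */
   static bool model_is_in_isr(const struct model_cpu *cpu)
   {
       return cpu->nested != 0U;
   }

With this layout, each CPU answers "what is running here?" and "am I in an
ISR?" from its own record, so one CPU taking an interrupt does not change
the answer on another.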

These fields are now moved into a separate struct :c:struct:`_cpu` instance
within the :c:struct:`_kernel` struct, which has a ``cpus[]`` array indexed by ID.
Compatibility fields are provided for legacy uniprocessor code trying
to access the fields of ``cpus[0]`` using the older syntax and assembly
offsets.

Note that an important requirement on the architecture layer is that
the pointer to this CPU struct be available rapidly when in kernel
context. The expectation is that :c:func:`arch_curr_cpu` will be
implemented using a CPU-provided register or addressing mode that can
store this value across arbitrary context switches or interrupts and
make it available to any kernel-mode code.

Similarly, where on a uniprocessor system Zephyr could simply create a
global "idle thread" at the lowest priority, in SMP we may need one
for each CPU. This makes the internal predicate test for "_is_idle()"
in the scheduler, which is a hot path performance environment, more
complicated than simply testing the thread pointer for equality with a
known static variable. In SMP mode, idle threads are distinguished by
a separate field in the thread struct.

Switch-based context switching
==============================

The traditional Zephyr context switch primitive has been :c:func:`z_swap`.
Unfortunately, this function takes no argument specifying a thread to
switch to. The expectation has always been that the scheduler has
already made its preemption decision when its state was last modified
and cached the resulting "next thread" pointer in a location where
architecture context switch primitives can find it via a simple struct
offset. That technique will not work in SMP, because the other CPU
may have modified scheduler state since the current CPU last exited
the scheduler (for example: it might already be running that cached
thread!).

Instead, the SMP "switch to" decision needs to be made synchronously
with the swap call, and as we don't want per-architecture assembly
code to be handling scheduler internal state, Zephyr requires a
somewhat lower-level context switch primitive for SMP systems:
:c:func:`arch_switch` is always called with interrupts masked, and takes
exactly two arguments. The first is an opaque (architecture defined)
handle to the context to which it should switch, and the second is a
pointer to such a handle into which it should store the handle
resulting from the thread that is being switched out.
The kernel then implements a portable :c:func:`z_swap` implementation on top
of this primitive which includes the relevant scheduler logic in a
location where the architecture doesn't need to understand it.

Similarly, on interrupt exit, switch-based architectures are expected
to call :c:func:`z_get_next_switch_handle` to retrieve the next thread to
run from the scheduler. The argument to :c:func:`z_get_next_switch_handle`
is either the interrupted thread's "handle" reflecting the same opaque type
used by :c:func:`arch_switch`, or NULL if that thread cannot be released
to the scheduler just yet. The choice between a handle value or NULL
depends on the way CPU interrupt mode is implemented.

Architectures with a large CPU register file would typically preserve only
the caller-saved registers on the current thread's stack when interrupted
in order to minimize interrupt latency, and preserve the callee-saved
registers only when :c:func:`arch_switch` is called to minimize context
switching latency. Such architectures must use NULL as the argument to
:c:func:`z_get_next_switch_handle` to determine if there is a new thread
to schedule, and follow through with their own :c:func:`arch_switch` or
derivative if so, or directly leave interrupt mode otherwise.
In the former case it is up to that switch code to store the handle
resulting from the thread that is being switched out in that thread's
"switch_handle" field after its context has fully been saved.

Architectures whose entry in interrupt mode already preserves the entire
thread state may pass that thread's handle directly to
:c:func:`z_get_next_switch_handle` and be done in one step.

Note that while SMP requires :kconfig:option:`CONFIG_USE_SWITCH`, the reverse is not
true. A uniprocessor architecture built with :kconfig:option:`CONFIG_SMP` set to No might
still decide to implement its context switching using
:c:func:`arch_switch`.

API Reference
**************

.. doxygengroup:: spinlock_apis