.. _smp_arch:

Symmetric Multiprocessing
#########################

On multiprocessor architectures, Zephyr supports the use of multiple
physical CPUs running Zephyr application code.  This support is
"symmetric" in the sense that no specific CPU is treated specially by
default.  Any processor is capable of running any Zephyr thread, with
access to all standard Zephyr APIs supported.

No special application code needs to be written to take advantage of
this feature.  If there are two Zephyr application threads runnable on
a supported dual processor device, they will both run simultaneously.

SMP configuration is controlled by the :kconfig:option:`CONFIG_SMP` kconfig
variable.  This must be set to "y" to enable SMP features; otherwise
a uniprocessor kernel will be built.  In general, the platform default
will have this enabled wherever it is supported.  When enabled, the
number of physical CPUs available is visible at build time as
:kconfig:option:`CONFIG_MP_MAX_NUM_CPUS`.  Likewise, this defaults to the
number of available CPUs on the platform, and it is not expected that
typical apps will change it.  But it is legal and supported to set
this to a smaller (but obviously not larger) number for special
purposes (e.g. for testing, or to reserve a physical CPU for running
non-Zephyr code).
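
As a quick illustration, code can check the build-time CPU count via
the ``CONFIG_MP_MAX_NUM_CPUS`` macro, and the number of CPUs active at
runtime via :c:func:`arch_num_cpus` (a minimal sketch):

.. code-block:: c

   #include <zephyr/kernel.h>

   void report_cpus(void)
   {
           /* Build-time maximum number of CPUs the kernel supports */
           printk("Kernel built for up to %d CPUs\n",
                  CONFIG_MP_MAX_NUM_CPUS);

           /* Number of CPUs actually active at runtime */
           printk("%u CPUs active\n", arch_num_cpus());
   }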

Synchronization
***************

At the application level, core Zephyr IPC and synchronization
primitives all behave identically under an SMP kernel.  For example,
semaphores used to implement blocking mutual exclusion continue to be
a proper application choice.
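
For instance, a semaphore-based critical section works unchanged on
SMP (a minimal sketch):

.. code-block:: c

   #include <zephyr/kernel.h>

   /* Binary semaphore providing blocking mutual exclusion; behaves
    * identically on uniprocessor and SMP kernels.
    */
   K_SEM_DEFINE(data_sem, 1, 1);

   static int shared_counter;

   void bump_counter(void)
   {
           k_sem_take(&data_sem, K_FOREVER);
           shared_counter++;          /* protected access */
           k_sem_give(&data_sem);
   }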

At the lowest level, however, Zephyr code has often used the
:c:func:`irq_lock`/:c:func:`irq_unlock` primitives to implement fine-grained
critical sections using interrupt masking.  These APIs continue to
work via an emulation layer (see below), but the masking technique
does not: the fact that your CPU will not be interrupted while you are
in your critical section says nothing about whether a different CPU
will be running simultaneously and be inspecting or modifying the same
data!

Spinlocks
=========

SMP systems provide a more constrained :c:func:`k_spin_lock` primitive
that not only masks interrupts locally, as done by :c:func:`irq_lock`, but
also atomically validates that a shared lock variable has been
modified before returning to the caller, "spinning" on the check if
needed to wait for the other CPU to exit the lock.  The default Zephyr
implementation of :c:func:`k_spin_lock` and :c:func:`k_spin_unlock` is built
on top of the pre-existing :c:struct:`atomic_` layer (itself usually
implemented using compiler intrinsics), though facilities exist for
architectures to define their own for performance reasons.
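
A typical use looks like this (a minimal sketch):

.. code-block:: c

   #include <zephyr/kernel.h>

   static struct k_spinlock lock;
   static int shared_state;

   void set_state(int val)
   {
           /* Masks interrupts locally and, on SMP, spins until any
            * other CPU holding the lock releases it.
            */
           k_spinlock_key_t key = k_spin_lock(&lock);

           shared_state = val;

           k_spin_unlock(&lock, key);
   }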

One important difference between IRQ locks and spinlocks is that the
earlier API was naturally recursive: the lock was global, so it was
legal to acquire a nested lock inside of a critical section.
Spinlocks are separable: you can have many locks for separate
subsystems or data structures, preventing CPUs from contending on a
single global resource.  But that means that spinlocks must not be
used recursively.  Code that holds a specific lock must not try to
re-acquire it or it will deadlock (it is perfectly legal to nest
**distinct** spinlocks, however).  A validation layer is available to
detect and report bugs like this.
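
For example, nesting two distinct locks is fine, but re-taking a held
lock deadlocks (sketch; the validation layer mentioned above is
enabled with :kconfig:option:`CONFIG_SPIN_VALIDATE`):

.. code-block:: c

   #include <zephyr/kernel.h>

   static struct k_spinlock list_lock;
   static struct k_spinlock stats_lock;

   void remove_and_count(void)
   {
           k_spinlock_key_t list_key = k_spin_lock(&list_lock);

           /* ... unlink an element from the list ... */

           /* Nesting a *different* lock is legal */
           k_spinlock_key_t stats_key = k_spin_lock(&stats_lock);
           /* ... update statistics ... */
           k_spin_unlock(&stats_lock, stats_key);

           /* k_spin_lock(&list_lock) here would deadlock: this CPU
            * already holds it.
            */
           k_spin_unlock(&list_lock, list_key);
   }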

When used on a uniprocessor system, the data component of the spinlock
(the atomic lock variable) is unnecessary and elided.  Except for the
recursive semantics above, spinlocks in single-CPU contexts produce
identical code to legacy IRQ locks.  In fact the entirety of the
Zephyr core kernel has now been ported to use spinlocks exclusively.

Legacy irq_lock() emulation
===========================

For the benefit of applications written to the uniprocessor locking
API, :c:func:`irq_lock` and :c:func:`irq_unlock` continue to work compatibly on
SMP systems with identical semantics to their legacy versions.  They
are implemented as a single global spinlock, with a nesting count and
the ability to be atomically reacquired on context switch into locked
threads.  The kernel will ensure that only one thread across all CPUs
can hold the lock at any time, that it is released on context switch,
and that it is re-acquired when necessary to restore the lock state
when a thread is switched in.  Other CPUs will spin waiting for the
release to happen.

The overhead involved in this process has measurable performance
impact, however.  Unlike uniprocessor apps, SMP apps using
:c:func:`irq_lock` are not simply invoking a very short (often ~1
instruction) interrupt masking operation.  That, and the fact that the
IRQ lock is global, means that code expecting to be run in an SMP
context should be using the spinlock API wherever possible.
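
The legacy pattern itself is unchanged; only its cost profile differs
on SMP (sketch):

.. code-block:: c

   #include <zephyr/kernel.h>

   void legacy_critical_section(void)
   {
           /* On uniprocessor builds this is a cheap interrupt mask;
            * on SMP it also acquires a single global spinlock.
            */
           unsigned int key = irq_lock();

           /* ... critical section ... */

           irq_unlock(key);
   }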

CPU Mask
********

It is often desirable for real time applications to deliberately
partition work across physical CPUs instead of relying solely on the
kernel scheduler to decide on which threads to execute.  Zephyr
provides an API, controlled by the :kconfig:option:`CONFIG_SCHED_CPU_MASK`
kconfig variable, which can associate a specific set of CPUs with each
thread, indicating on which CPUs it can run.

By default, new threads can run on any CPU.  Calling
:c:func:`k_thread_cpu_mask_disable` with a particular CPU ID will prevent
that thread from running on that CPU in the future.  Likewise
:c:func:`k_thread_cpu_mask_enable` will re-enable execution.  There are also
:c:func:`k_thread_cpu_mask_clear` and :c:func:`k_thread_cpu_mask_enable_all` APIs
available for convenience.  For obvious reasons, these APIs are
illegal if called on a runnable thread.  The thread must be blocked or
suspended, otherwise ``-EINVAL`` will be returned.
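
For example, a thread can be pinned to CPU 0 before it starts (a
minimal sketch, assuming :kconfig:option:`CONFIG_SCHED_CPU_MASK` is
enabled):

.. code-block:: c

   #include <zephyr/kernel.h>

   K_THREAD_STACK_DEFINE(worker_stack, 1024);
   static struct k_thread worker;

   static void worker_fn(void *p1, void *p2, void *p3)
   {
           /* ... per-CPU work ... */
   }

   void start_pinned_worker(void)
   {
           /* Create the thread with a K_FOREVER delay so it is not
            * yet runnable while the mask is being changed.
            */
           k_thread_create(&worker, worker_stack,
                           K_THREAD_STACK_SIZEOF(worker_stack),
                           worker_fn, NULL, NULL, NULL,
                           K_PRIO_PREEMPT(5), 0, K_FOREVER);

           /* Clear the mask, then allow only CPU 0 */
           k_thread_cpu_mask_clear(&worker);
           k_thread_cpu_mask_enable(&worker, 0);

           k_thread_start(&worker);
   }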

Note that when this feature is enabled, the scheduler algorithm
involved in doing the per-CPU mask test requires that the list be
traversed in full.  The kernel does not keep a per-CPU run queue.
That means that the performance benefits from the
:kconfig:option:`CONFIG_SCHED_SCALABLE` and :kconfig:option:`CONFIG_SCHED_MULTIQ`
scheduler backends cannot be realized.  CPU mask processing is
available only when :kconfig:option:`CONFIG_SCHED_DUMB` is the selected
backend.  This requirement is enforced in the configuration layer.

SMP Boot Process
****************

A Zephyr SMP kernel begins boot identically to a uniprocessor kernel.
Auxiliary CPUs begin in a disabled state in the architecture layer.
All standard kernel initialization, including device initialization,
happens on a single CPU before other CPUs are brought online.

Just before entering the application :c:func:`main` function, the kernel
calls :c:func:`z_smp_init` to launch the SMP initialization process.  This
enumerates over the configured CPUs, calling into the architecture
layer using :c:func:`arch_cpu_start` for each one.  This function is
passed a memory region to use as a stack on the foreign CPU (in
practice it uses the area that will become that CPU's interrupt
stack), the address of a local :c:func:`smp_init_top` callback function to
run on that CPU, and a pointer to a "start flag" address which will be
used as an atomic signal.

The local SMP initialization (:c:func:`smp_init_top`) on each CPU is then
invoked by the architecture layer.  Note that interrupts are still
masked at this point.  This routine is responsible for calling
:c:func:`smp_timer_init` to set up any needed state in the timer driver.  On
many architectures the timer is a per-CPU device and needs to be
configured specially on auxiliary CPUs.  Then it waits (spinning) for
the atomic "start flag" to be released in the main thread, to
guarantee that all SMP initialization is complete before any Zephyr
application code runs, and finally calls :c:func:`z_swap` to transfer
control to the appropriate runnable thread via the standard scheduler
API.
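
In outline, the flow looks roughly like this (a schematic sketch, not
the actual kernel source; the stack and flag handling are simplified):

.. code-block:: c

   static atomic_t start_flag;

   static void smp_init_top(void *arg);

   /* Main CPU, just before main(): start each auxiliary CPU */
   void z_smp_init(void)
   {
           (void)atomic_clear(&start_flag);

           for (int i = 1; i < CONFIG_MP_MAX_NUM_CPUS; i++) {
                   /* Hand the new CPU its interrupt stack and the
                    * local init callback to run.
                    */
                   arch_cpu_start(i, z_interrupt_stacks[i],
                                  CONFIG_ISR_STACK_SIZE,
                                  smp_init_top, &start_flag);
           }

           /* Release all auxiliary CPUs into the scheduler at once */
           (void)atomic_set(&start_flag, 1);
   }

   /* Auxiliary CPU, called by the architecture layer with
    * interrupts still masked
    */
   static void smp_init_top(void *arg)
   {
           smp_timer_init();       /* per-CPU timer setup */

           while (!atomic_get((atomic_t *)arg)) {
                   /* spin until the main CPU releases the flag */
           }

           /* ... then z_swap() into the scheduler's chosen thread */
   }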

.. figure:: smpinit.svg
   :align: center
   :alt: SMP Initialization
   :figclass: align-center

   Example SMP initialization process, showing a configuration with
   two CPUs and two app threads which begin operating simultaneously.

Interprocessor Interrupts
*************************

When running in multiprocessor environments, it is occasionally the
case that state modified on the local CPU needs to be synchronously
handled on a different processor.

One example is the Zephyr :c:func:`k_thread_abort` API, which cannot return
until the thread that had been aborted is no longer runnable.  If it
is currently running on another CPU, that becomes difficult to
implement.

Another is low power idle.  It is a firm requirement on many devices
that system idle be implemented using a low-power mode with as many
interrupts (including periodic timer interrupts) disabled or deferred
as is possible.  If a CPU is in such a state, and on another CPU a
thread becomes runnable, the idle CPU has no way to "wake up" to
handle the newly-runnable load.

So where possible, Zephyr SMP architectures should implement an
interprocessor interrupt.  The current framework is very simple: the
architecture provides at least an :c:func:`arch_sched_broadcast_ipi` call,
which when invoked will flag an interrupt on all CPUs (except the current one,
though that is allowed behavior). If the architecture supports directed IPIs
(see :kconfig:option:`CONFIG_ARCH_HAS_DIRECTED_IPIS`), then the
architecture also provides an :c:func:`arch_sched_directed_ipi` call, which
when invoked will flag an interrupt on the specified CPUs. When an interrupt is
flagged on the CPUs, the :c:func:`z_sched_ipi` function implemented in the
scheduler will get invoked on those CPUs. The expectation is that these
APIs will evolve over time to encompass more functionality (e.g. cross-CPU
calls), and that the scheduler-specific calls here will be implemented in
terms of a more general framework.
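
A hypothetical architecture-layer implementation might look like this
(the mailbox register and ISR wiring below are invented for
illustration only):

.. code-block:: c

   #include <zephyr/kernel.h>
   #include <zephyr/sys/sys_io.h>

   /* Hypothetical SoC register that raises a software interrupt on
    * every other CPU when written.
    */
   #define MY_SOC_IPI_TRIGGER_REG  0x4000f000UL
   #define MY_SOC_IPI_ALL_CPUS     0xffffffffUL

   void arch_sched_broadcast_ipi(void)
   {
           sys_write32(MY_SOC_IPI_ALL_CPUS, MY_SOC_IPI_TRIGGER_REG);
   }

   /* The ISR installed on each CPU for the IPI line only needs to
    * forward to the scheduler's (kernel-internal) hook.
    */
   static void ipi_isr(const void *arg)
   {
           ARG_UNUSED(arg);
           z_sched_ipi();
   }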

Note that not all SMP architectures will have a usable IPI mechanism
(either missing, or just undocumented/unimplemented).  In those cases
Zephyr provides fallback behavior that is correct, but perhaps
suboptimal.

Using this, :c:func:`k_thread_abort` becomes only slightly more
complicated in SMP: for the case where a thread is actually running on
another CPU (we can detect this atomically inside the scheduler), we
broadcast an IPI and spin, waiting for the thread to either become
"DEAD" or for it to re-enter the queue (in which case we terminate it
the same way we would have in uniprocessor mode).  Note that the
"aborted" check happens on any interrupt exit, so there is no special
handling needed in the IPI per se.  This allows us to implement a
reasonable fallback when IPI is not available: we can simply spin,
waiting until the foreign CPU receives any interrupt, though this may
be a much longer time!

Likewise idle wakeups are trivially implementable with an empty IPI
handler.  If a thread is added to an empty run queue (i.e. there may
have been idle CPUs), we broadcast an IPI.  A foreign CPU will then be
able to see the new thread when exiting from the interrupt and will
switch to it if available.

Without an IPI, however, a low power idle state that can be exited
only by an interrupt cannot be used to run new threads synchronously.
The workaround in that case is more invasive: Zephyr will **not**
enter the system idle handler and will instead spin in its idle loop,
testing the scheduler state at high frequency (not spinning on it
though, as that would involve severe lock contention) for new threads.
The expectation is that power constrained SMP applications are always
going to provide an IPI, and this code will only be used for testing
purposes or on systems without power consumption requirements.

IPI Cascades
============

The kernel cannot control the order in which IPIs are processed by the CPUs
in the system. In general, this is not an issue and a single set of IPIs is
sufficient to trigger a reschedule on the N CPUs that results in them
scheduling the highest N priority ready threads to execute. When CPU masking
is used, there may be more than one valid set of threads (not to be confused
with an optimal set of threads) that can be scheduled on the N CPUs, and a
single set of IPIs may be insufficient to result in any of these valid sets.

.. note::
    When CPU masking is not in play, the optimal set of threads is the same
    as the valid set of threads. However when CPU masking is in play, there
    may be more than one valid set, one of which may be optimal.

    To better illustrate the distinction, consider a 2-CPU system with ready
    threads T1 and T2 at priorities 1 and 2 respectively. Let T2 be pinned to
    CPU0 and T1 not be pinned. If CPU0 is executing T2 and CPU1 executing T1,
    then this set is both valid and optimal. However, if CPU0 is executing
    T1 and CPU1 is idling, then this too would be valid though not optimal.

In those cases where a single set of IPIs is not sufficient to generate a valid
set, the resulting set of executing threads is expected to be close to a valid
set, and subsequent IPIs can generally be expected to correct the situation
soon. However, for cases where neither the approximation nor the delay are
acceptable, enabling :kconfig:option:`CONFIG_SCHED_IPI_CASCADE` will allow the
kernel to generate cascading IPIs until the kernel has selected a valid set of
ready threads for the CPUs.

There are three types of costs/penalties associated with IPI cascades, and
for these reasons they are disabled by default. The first is a cost incurred
by the CPU producing the IPI when a new thread preempts the old thread, as
checks must be done to compare the old thread against the threads executing on
the other CPUs. The second is a cost incurred by the CPUs receiving the IPIs
as they must be processed. The third is the apparent sputtering of a thread as
it "winks in" and then "winks out" due to cascades stemming from the
aforementioned first cost.

SMP Kernel Internals
********************

In general, Zephyr kernel code is SMP-agnostic and, like application
code, will work correctly regardless of the number of CPUs available.
But in a few areas there are notable changes in structure or behavior.

Per-CPU data
============

Many elements of the core kernel data need to be implemented for each
CPU in SMP mode.  For example, the ``arch_current_thread()`` thread pointer obviously
needs to reflect what is running locally, as there are now multiple
threads running concurrently.  Likewise a kernel-provided interrupt stack
needs to be created and assigned for each physical CPU, as does the
interrupt nesting count used to detect ISR state.

These fields are now moved into a separate struct :c:struct:`_cpu` instance
within the :c:struct:`_kernel` struct, which has a ``cpus[]`` array indexed by ID.
Compatibility fields are provided for legacy uniprocessor code trying
to access the fields of ``cpus[0]`` using the older syntax and assembly
offsets.

Note that an important requirement on the architecture layer is that
the pointer to this CPU struct be available rapidly when in kernel
context.  The expectation is that :c:func:`arch_curr_cpu` will be
implemented using a CPU-provided register or addressing mode that can
store this value across arbitrary context switches or interrupts and
make it available to any kernel-mode code.
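
For kernel-internal code, that lookup is a single fast call (a minimal
sketch; this interface is internal and not part of the stable
application API):

.. code-block:: c

   /* Identify the CPU this code is running on.  arch_curr_cpu()
    * returns the current struct _cpu instance from _kernel.cpus[].
    */
   unsigned int my_cpu_id(void)
   {
           return arch_curr_cpu()->id;
   }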

Similarly, where on a uniprocessor system Zephyr could simply create a
global "idle thread" at the lowest priority, in SMP we may need one
for each CPU.  This makes the internal predicate test for "_is_idle()"
in the scheduler, which is a hot performance path, more complicated
than simply testing the thread pointer for equality with a known
static variable.  In SMP mode, idle threads are distinguished by
a separate field in the thread struct.

Switch-based context switching
==============================

The traditional Zephyr context switch primitive has been :c:func:`z_swap`.
Unfortunately, this function takes no argument specifying a thread to
switch to.  The expectation has always been that the scheduler has
already made its preemption decision when its state was last modified
and cached the resulting "next thread" pointer in a location where
architecture context switch primitives can find it via a simple struct
offset.  That technique will not work in SMP, because the other CPU
may have modified scheduler state since the current CPU last exited
the scheduler (for example: it might already be running that cached
thread!).

Instead, the SMP "switch to" decision needs to be made synchronously
with the swap call, and as we don't want per-architecture assembly
code to be handling scheduler internal state, Zephyr requires a
somewhat lower-level context switch primitive for SMP systems:
:c:func:`arch_switch` is always called with interrupts masked, and takes
exactly two arguments.  The first is an opaque (architecture defined)
handle to the context to which it should switch, and the second is a
pointer to such a handle into which it should store the handle
resulting from the thread that is being switched out.
The kernel then provides a portable :c:func:`z_swap` implementation on top
of this primitive which includes the relevant scheduler logic in a
location where the architecture doesn't need to understand it.
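
The contract is small (the declaration below matches the architecture
interface; the commented steps are an illustrative outline):

.. code-block:: c

   /* Called with interrupts masked.  An architecture implementation
    * is expected to:
    *
    * 1. Save the outgoing context's registers.
    * 2. Store the outgoing context's handle through *switched_from
    *    once its state is fully saved (other CPUs may pick the
    *    thread up the moment this store is visible).
    * 3. Restore the context identified by switch_to and resume it.
    */
   static inline void arch_switch(void *switch_to, void **switched_from);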

Similarly, on interrupt exit, switch-based architectures are expected
to call :c:func:`z_get_next_switch_handle` to retrieve the next thread to
run from the scheduler. The argument to :c:func:`z_get_next_switch_handle`
is either the interrupted thread's "handle" reflecting the same opaque type
used by :c:func:`arch_switch`, or NULL if that thread cannot be released
to the scheduler just yet. The choice between a handle value or NULL
depends on the way CPU interrupt mode is implemented.

Architectures with a large CPU register file would typically preserve only
the caller-saved registers on the current thread's stack when interrupted
in order to minimize interrupt latency, and preserve the callee-saved
registers only when :c:func:`arch_switch` is called to minimize context
switching latency. Such architectures must use NULL as the argument to
:c:func:`z_get_next_switch_handle` to determine if there is a new thread
to schedule, and follow through with their own :c:func:`arch_switch` or
derivative if so, or directly leave interrupt mode otherwise.
In the former case it is up to that switch code to store the handle
resulting from the thread that is being switched out in that thread's
"switch_handle" field after its context has fully been saved.

Architectures whose entry in interrupt mode already preserves the entire
thread state may pass that thread's handle directly to
:c:func:`z_get_next_switch_handle` and be done in one step.
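
Schematically, for an architecture that saves the full thread state on
interrupt entry (C pseudocode for what is normally assembly; the names
``isr_exit`` and ``do_context_restore`` are illustrative):

.. code-block:: c

   void isr_exit(struct k_thread *interrupted)
   {
           /* Full state is already saved, so the interrupted thread
            * can be handed back to the scheduler immediately.
            */
           void *next =
                   z_get_next_switch_handle(interrupted->switch_handle);

           /* Restore and resume whichever context the scheduler
            * picked (possibly the same one).
            */
           do_context_restore(next);
   }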

Note that while SMP requires :kconfig:option:`CONFIG_USE_SWITCH`, the reverse is not
true.  A uniprocessor architecture built with :kconfig:option:`CONFIG_SMP` set to No might
still decide to implement its context switching using
:c:func:`arch_switch`.

API Reference
**************

.. doxygengroup:: spinlock_apis
