.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security usecases (one
group of tasks doesn't trust another), or for performance usecases (some
workloads may benefit from running on the same core as they don't need the same
hardware resources of the shared core, or may prefer different cores if they
do share hardware resource needs). This document only describes the security
usecase.

Security usecase
----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance. An improvement is not guaranteed, though it has
been seen with a number of real world workloads. In theory, core scheduling
aims to perform at least as well as when Hyper Threading is disabled. In
practice, this is mostly the case though not always, as synchronizing
scheduling decisions across 2 or more CPUs in a core involves additional
overhead, especially when the system is lightly loaded. When
``total_threads <= N_CPUS/2``, where N_CPUS is the total number of CPUs, the
extra overhead may cause core scheduling to perform more poorly than with SMT
disabled. Always measure the performance of your workloads.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
              unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task for which the operation applies.

arg4:
    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
    will be performed for all tasks in the task group of ``pid``.

arg5:
    userspace pointer to an unsigned long for storing the cookie returned by
    ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have the ptrace access mode
``PTRACE_MODE_READ_REALCREDS`` to that process.

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs, thus setting a cookie for the
'initial' script/executable/daemon will place every spawned child in the
same core-sched group.

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO`` to inherit a cookie
from a specified task or share a cookie with a task. In combination this allows
a simple helper program to pull a cookie from a task in an existing core
scheduling group and share it with already running tasks.

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected. The idle task is globally trusted.

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected.
Siblings on receiving the IPI will
switch to the new task immediately. If an idle task is selected for a sibling,
then the sibling is considered to be in a `forced idle` state, i.e., it may
have tasks on its own runqueue to run, however it will still have to run idle.
More on this in the next section.

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues had tasks that were incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or VMEXIT
in the case of guests. At most, this would only leak some scheduler metadata
which may not be worth protecting. It is also possible that the IPI is received
too late on some architectures, but this has not been observed in the case of
x86.
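
The core-wide selection and forced-idle rule described above can be modelled
in a few lines of C. This is only an illustrative sketch, not the kernel's
implementation: the types and the names ``pick_task`` and ``core_pick`` are
hypothetical, a larger ``prio`` value is assumed to mean higher priority, and
scheduling classes, fairness and IPI delivery are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int prio;              /* assumption: larger value = higher priority */
    unsigned long cookie;  /* 0 = untagged */
};

struct runqueue {
    const struct task *tasks;  /* runnable tasks of one sibling */
    size_t nr_tasks;
};

struct pick {
    const struct task *task;   /* NULL means the idle task was selected */
    int forced_idle;           /* 1 if idle although tasks are runnable */
};

/* Return the highest priority task on rq. If match_cookie is non-zero,
 * only tasks whose cookie equals 'cookie' are considered. */
static const struct task *pick_task(const struct runqueue *rq,
                                    int match_cookie, unsigned long cookie)
{
    const struct task *best = NULL;

    for (size_t i = 0; i < rq->nr_tasks; i++) {
        const struct task *t = &rq->tasks[i];

        if (match_cookie && t->cookie != cookie)
            continue;
        if (!best || t->prio > best->prio)
            best = t;
    }
    return best;
}

/* Select one task per sibling so that all selected tasks share a cookie. */
void core_pick(const struct runqueue *rqs, size_t nr_siblings,
               struct pick *out)
{
    const struct task *max = NULL;

    assert(rqs != NULL && out != NULL);

    /* Step 1: find the core-wide highest priority task. */
    for (size_t i = 0; i < nr_siblings; i++) {
        const struct task *t = pick_task(&rqs[i], 0, 0);

        if (t && (!max || t->prio > max->prio))
            max = t;
    }

    /* Step 2: every sibling runs its best task with the same cookie,
     * or idles. Idling with a non-empty runqueue is a forced idle. */
    for (size_t i = 0; i < nr_siblings; i++) {
        out[i].task = max ? pick_task(&rqs[i], 1, max->cookie) : NULL;
        out[i].forced_idle = (out[i].task == NULL && rqs[i].nr_tasks > 0);
    }
}
```

In this model a sibling whose runqueue holds only tasks with a different
cookie ends up with ``task == NULL`` and ``forced_idle == 1``, which
corresponds to the forced idle state described above.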

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be a small window of time during which untrusted tasks run
concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
micro-architectural buffers after the attacker starts to run, and this is a
possibility for data leak.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between the siblings
running in user mode and the others running in kernel mode. Even though all
siblings run tasks which trust each other, when the kernel is executing
code on behalf of a task, it cannot trust the code running in the
sibling. Such attacks are possible for any combination of sibling CPU modes
(host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPUs are configured to not trust each
other (by tagging them separately), then the guest to guest attacks would go
away. Or it could be a system admin policy which considers guest to guest
attacks as a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust every other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been posted
to LKML to solve this, but it is debatable whether such windows are practically
exploitable, and whether the performance overhead of the prototypes is worth
it (not to mention the added code complexity).

Other Use cases
---------------
The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled.
There are other use cases where this feature could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
  that use SIMD instructions etc.
- Gang scheduling: Requirements for a group of tasks that need to be scheduled
  together could also be realized using core scheduling. One example is vCPUs of
  a VM.