scaling.rst - OpenGrok cross reference for /Linux-v5.10/Documentation/networking/scaling.rst

1 .. SPDX-License-Identifier: GPL-2.0
12 networking stack to increase parallelism and improve performance for
13 multi-processor systems.
17 - RSS: Receive Side Scaling
18 - RPS: Receive Packet Steering
19 - RFS: Receive Flow Steering
20 - Accelerated Receive Flow Steering
21 - XPS: Transmit Packet Steering
28 (multi-queue). On reception, a NIC can send different packets to different
29 queues to distribute processing among CPUs. The NIC distributes packets by
30 applying a filter to each packet that assigns it to one of a small number
31 of logical flows. Packets for each flow are steered to a separate receive
33 generally known as “Receive-side Scaling” (RSS). The goal of RSS and
34 the other scaling techniques is to increase performance uniformly.
35 Multi-queue distribution can also be used for traffic prioritization, but
39 and/or transport layer headers, for example a 4-tuple hash over
41 implementation of RSS uses a 128-entry indirection table where each entry
47 Some advanced NICs allow steering packets to queues based on
49 can be directed to their own receive queue. Such “n-tuple” filters can
50 be configured from ethtool (--config-ntuple).
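The indirection-table lookup described above can be sketched in Python. This is a minimal illustration, not driver code: the names ``build_indirection_table`` and ``select_rx_queue`` are hypothetical, the even round-robin fill and the low-order-bits mask are assumptions about a typical 128-entry table, and the flow hash is taken as given rather than computed with a real NIC hash function.

```python
TABLE_SIZE = 128  # assumed size, matching the 128-entry example in the text


def build_indirection_table(num_queues, size=TABLE_SIZE):
    """Spread queue numbers evenly over the table (assumed default layout)."""
    if num_queues < 1:
        raise ValueError("need at least one receive queue")
    return [i % num_queues for i in range(size)]


def select_rx_queue(flow_hash, table):
    """Use the low-order bits of the hash as the table index."""
    # The mask only works when the table size is a power of two.
    assert len(table) & (len(table) - 1) == 0
    return table[flow_hash & (len(table) - 1)]


table = build_indirection_table(num_queues=4)
queue = select_rx_queue(0x9E3779B9, table)
```

Giving one queue more table entries than another would weight that queue more heavily, which is one way to read the remark about modifying the indirection table.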
54 -----------------
56 The driver for a multi-queue capable NIC typically provides a kernel
57 module parameter for specifying the number of hardware queues to
59 num_queues. A typical RSS configuration would be to have one receive queue
61 one for each memory domain, where a memory domain is a set of CPUs that
66 default mapping is to distribute the queues evenly in the table, but the
68 commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
69 indirection table could be done to give different queues different
77 this to notify a CPU when new packets arrive on the given queue. The
78 signaling path for PCIe devices uses message signaled interrupts (MSI-X),
79 which can route each interrupt to a particular CPU. The active mapping
80 of queues to IRQs can be determined from /proc/interrupts. By default,
81 an IRQ may be handled on any CPU. Because a non-negligible part of packet
83 to spread receive interrupts between CPUs. To manually adjust the IRQ
84 affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst. Some systems
95 is to allocate as many queues as there are CPUs in the system (or the
96 NIC maximum, if lower). The most efficient high-rate configuration
97 is likely the one with the smallest number of receive queues where no
98 receive queue overflows due to a saturated CPU, because in default
102 Per-CPU load can be observed using the mpstat utility, but note that on
105 initial tests, so limit the number of queues to the number of CPU cores
115 interrupt handler, RPS selects the CPU to perform protocol processing
121 2) software filters can easily be added to hash over new protocols
123    introduce inter-processor interrupts (IPIs))
130 The first step in determining the target CPU for RPS is to calculate a
131 flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
137 skb->hash and can be used elsewhere in the stack as a hash of the
140 Each receive hardware queue has an associated list of CPUs to which
144 and the packet is queued to the tail of that CPU’s backlog queue. At
145 the end of the bottom half routine, IPIs are sent to any CPUs for which
146 packets have been queued to their backlog queue. The IPI wakes backlog
152 -----------------
156 explicitly configured. The list of CPUs to which RPS may forward traffic
159   /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
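The value written to this file is a bitmap of CPUs in hexadecimal. A minimal sketch of building such a mask from a list of CPU numbers follows. ``cpu_mask`` is a hypothetical helper, and the comma-separated 32-bit grouping is the form the kernel uses when printing large bitmaps, assumed here to be accepted on write as well.

```python
def cpu_mask(cpus):
    """Return a hex bitmap string with one bit set per CPU number."""
    value = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        value |= 1 << cpu
    # Split into 32-bit words, least significant word first.
    words = []
    while True:
        words.append(value & 0xFFFFFFFF)
        value >>= 32
        if value == 0:
            break
    # Emit the most significant word unpadded, the rest zero-padded.
    head = format(words[-1], "x")
    tail = [format(w, "08x") for w in reversed(words[:-1])]
    return ",".join([head] + tail)


print(cpu_mask([0, 1, 2, 3]))  # f
print(cpu_mask([32]))          # 1,00000000
```

The resulting string is what would be written, with sufficient privileges, to the ``rps_cpus`` file of the chosen receive queue.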
163 CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
170 For a single queue device, a typical RPS configuration would be to set
171 the rps_cpus to the CPUs in the same memory domain of the interrupting
173 the system. At high interrupt rate, it might be wise to exclude the
176 For a multi-queue system, if RSS is configured so that a hardware
177 receive queue is mapped to each CPU, then RPS is probably redundant
184 --------------
187 reordering. The trade-off to sending all packets from the same flow
188 to the same CPU is CPU load imbalance if flows vary in packet rate.
190 common server workloads with many concurrent connections, such
199 net.core.netdev_max_backlog), the kernel starts a per-flow packet
213 turned on. It is implemented for each CPU independently (to avoid lock
220 Per-flow rate is calculated by hashing each packet into a hashtable
221 bucket and incrementing a per-bucket counter. The hash function is
223 be much larger than the number of CPUs, flow limit has finer-grained
236 Flow limit is useful on systems with many concurrent connections,
241 The feature depends on the input packet queue length to exceed
243 Setting net.core.netdev_max_backlog to either 1000 or 10000
253 (RFS). The goal of RFS is to increase the data cache hit rate by steering
254 kernel processing of packets to the CPU where the application thread
256 to enqueue packets onto the backlog of another CPU and to wake up that
261 flows to the CPUs where those flows are being processed. The flow hash
262 (see RPS section above) is used to calculate the index into this table.
263 The CPU recorded in each entry is the one which last processed the flow.
264 If an entry does not hold a valid CPU, then packets mapped to that entry
265 are steered using plain RPS. Multiple table entries may point to the
266 same CPU. Indeed, with many flows and few CPUs, it is very likely that
267 a single application thread handles flows with many different flow hashes.
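The table lookup described above can be sketched as follows. This is only an illustration under stated assumptions: ``NO_CPU``, ``rfs_target_cpu`` and ``record_flow`` are hypothetical names, the table size is assumed to be a power of two so that a mask can be used, and the plain-RPS fallback is modeled as the flow hash modulo the length of the configured CPU list.

```python
NO_CPU = -1  # hypothetical marker for "entry holds no valid CPU"


def rfs_target_cpu(flow_hash, sock_flow_table, rps_cpu_list):
    """Pick the CPU for a packet: the table entry if valid, else plain RPS."""
    mask = len(sock_flow_table) - 1  # table size assumed to be a power of two
    desired = sock_flow_table[flow_hash & mask]
    if desired != NO_CPU:
        return desired
    # Assumed plain-RPS fallback: hash modulo the configured CPU list.
    return rps_cpu_list[flow_hash % len(rps_cpu_list)]


def record_flow(flow_hash, cpu, sock_flow_table):
    """Remember the CPU that last processed this flow."""
    sock_flow_table[flow_hash & (len(sock_flow_table) - 1)] = cpu


table = [NO_CPU] * 8
record_flow(0x1234, 3, table)
```

Because the index is derived from the hash, two different flows (here ``0x1234`` and ``0x2234``) can land on the same entry and therefore on the same CPU.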
271 Each table value is a CPU index that is updated during calls to recvmsg
275 When the scheduler moves a thread to a new CPU while it has outstanding
276 receive packets on the old CPU, packets may arrive out of order. To
277 avoid this, RFS uses a second flow table to track outstanding packets
278 for each flow: rps_dev_flow_table is a table specific to each hardware
293 entry i is actually selected by hash and multiple flows may hash to the
302 the current CPU is updated to match the desired CPU if one of the
305   - The current CPU's queue head counter >= the recorded tail counter
307   - The current CPU is unset (>= nr_cpu_ids)
308   - The current CPU is offline
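The three conditions can be expressed as a small function. The sketch below is not the kernel implementation: ``NR_CPU_IDS``, ``next_cpu`` and its parameters are hypothetical names, and the counters are modeled as plain integers, ignoring the wraparound that real counters have to handle.

```python
NR_CPU_IDS = 8  # hypothetical number of possible CPU ids


def next_cpu(current, desired, head_counter, last_qtail, online):
    """Return the CPU a packet should be sent to under the three rules above.

    current      -- CPU currently recorded for the flow (>= NR_CPU_IDS if unset)
    desired      -- CPU on which the application last consumed the flow
    head_counter -- dict mapping CPU -> queue head counter of its backlog
    last_qtail   -- tail counter recorded when the flow was last enqueued
    online       -- set of CPUs that are online
    """
    if current == desired:
        return current
    unset = current >= NR_CPU_IDS
    if unset or current not in online or head_counter[current] >= last_qtail:
        return desired  # nothing from this flow is still queued on the old CPU
    return current      # keep the old CPU to preserve packet order


heads = {0: 10, 1: 5}
```

With ``heads`` as above, a flow whose recorded tail counter is 12 stays on CPU 0, because CPU 0 has only processed up to 10; once the recorded tail is 10 or lower, the flow may move.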
310 After this check, the packet is sent to the (possibly updated) current
311 CPU. These rules aim to ensure that a flow only moves to a new CPU when
313 packets could arrive later than those about to be processed on the new
318 -----------------
326 The number of entries in the per-queue flow table is set through::
328   /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
334 Both of these need to be set before RFS is enabled for a receive queue.
335 Values for both are rounded up to the nearest power of two. The
342 would normally be configured to the same value as rps_sock_flow_entries.
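For a multi-queue device, the sizing arithmetic can be sketched as below. The helper names are hypothetical, the even split of the global table size across queues is an assumed policy, and the figure of 16 queues is chosen only for illustration.

```python
def round_up_pow2(n):
    """Round a positive integer up to the nearest power of two."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(sock_flow_entries, num_queues):
    """Split the global table size evenly across receive queues."""
    return round_up_pow2(max(1, sock_flow_entries // num_queues))


print(per_queue_flow_cnt(32768, 16))  # 2048
print(round_up_pow2(3000))            # 4096
```

The second call shows the rounding: a requested value of 3000 would take effect as 4096.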
343 For a multi-queue device, the rps_flow_cnt for each queue might be
345 queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
353 Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
354 balancing mechanism that uses soft state to steer flows based on where
357 directly to a CPU local to the thread consuming the data. The target CPU
359 which is local to the application thread’s CPU in the cache hierarchy.
361 To enable accelerated RFS, the networking stack calls the
362 ndo_rx_flow_steer driver function to communicate the desired hardware
366 method to program the NIC to steer the packets.
369 rps_dev_flow_table. The stack consults a CPU to hardware queue map which
370 is maintained by the NIC driver. This is an auto-generated reverse map of
373 to populate the map. For each CPU, the corresponding queue in the map is
374 set to be one whose processing CPU is closest in cache locality.
378 -----------------------------
383 of CPU to queues is automatically deduced from the IRQ affinities
391 This technique should be enabled whenever one wants to use RFS and the
399 which transmit queue to use when transmitting a packet on a multi-queue
401 a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
402 to hardware transmit queue(s).
406 The goal of this mapping is usually to assign queues
407 exclusively to a subset of CPUs, where the transmit completions for
418 This mapping is used to pick the transmit queue based on the receive
420 queues can be mapped to a set of transmit queues (many:many), although
423 busy polling multi-threaded workloads where there are challenges in
424 associating a given CPU with a given application thread. The application
425 threads are not pinned to CPUs and each thread handles packets
428 transmit queue corresponding to the associated receive queue has benefits
430 the same queue-association that a given application is polling on. This
437 CPUs/receive-queues that may use that queue to transmit. The reverse
438 mapping, from CPUs to transmit queues or from receive-queues to transmit
441 called to select a queue. This function uses the ID of the receive queue
442 for the socket connection for a match in the receive queue-to-transmit queue
444 running CPU as a key into the CPU-to-queue lookup table. If the
446 queues match, one is selected by using the flow hash to compute an index
453 This transmit queue is used for subsequent packets sent on the flow to
455 of calling get_xps_queues() over all packets in the flow. To avoid
457 skb->ooo_okay is set for a packet in the flow. This flag indicates that
465 -----------------
469 how, XPS is configured at device init. The mapping of CPUs/receive-queues
470 to transmit queue can be inspected and configured using sysfs:
474   /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
476 For selection based on receive-queues map::
478   /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
485 has no effect, since there is no choice in this case. In a multi-queue
486 system, XPS is preferably configured so that each CPU maps onto one queue.
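A CPU-to-queue assignment of this kind can be sketched as follows. ``xps_cpu_masks`` is a hypothetical helper, and assigning CPU ``c`` to queue ``c % num_queues`` is only an illustrative policy that ignores cache topology; it does guarantee that each CPU maps onto exactly one queue.

```python
def xps_cpu_masks(num_cpus, num_queues):
    """Return one hex CPU bitmap per transmit queue."""
    masks = [0] * num_queues
    for cpu in range(num_cpus):
        # Each CPU contributes its bit to exactly one queue's mask.
        masks[cpu % num_queues] |= 1 << cpu
    return [format(m, "x") for m in masks]


print(xps_cpu_masks(4, 4))  # ['1', '2', '4', '8']
print(xps_cpu_masks(4, 2))  # ['5', 'a']
```

Entry ``n`` of the returned list is the value that would be written to the ``xps_cpus`` file of transmit queue ``n``.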
487 If there are as many queues as there are CPUs in the system, then each
488 queue can also map onto one CPU, resulting in exclusive pairings that
490 best CPUs to share a given queue are probably those that share the cache
494 For transmit queue selection based on receive queue(s), XPS has to be
495 explicitly configured by mapping receive-queue(s) to transmit queue(s). If the
496 user configuration for receive-queue map does not apply, then the transmit
503 These are rate-limitation mechanisms implemented by hardware. Currently
504 a max-rate attribute is supported; it is set by writing a Mbps value to::
506   /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
522 - Tom Herbert (therbert@google.com)
523 - Willem de Bruijn (willemb@google.com)