.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

====================================
Marvell OcteonTx2 RVU Kernel Drivers
====================================

Copyright (c) 2020 Marvell International Ltd.

Contents
========

- `Overview`_
- `Drivers`_
- `Basic packet flow`_
- `Devlink health reporters`_
- `Quality of service`_

Overview
========

The resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
resources from the network, crypto and other functional blocks into
PCI-compatible physical and virtual functions. Each functional block
in turn has multiple local functions (LFs) for provisioning to PCI devices.
RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
functions (VFs). PF0 is called the administrative / admin function (AF)
and has privileges to provision RVU functional blocks' LFs to each of the
PFs/VFs.

RVU managed networking functional blocks
 - Network pool or buffer allocator (NPA)
 - Network interface controller (NIX)
 - Network parser CAM (NPC)
 - Schedule/Synchronize/Order unit (SSO)
 - Loopback interface (LBK)

RVU managed non-networking functional blocks
 - Crypto accelerator (CPT)
 - Scheduled timers unit (TIM)
 - Schedule/Synchronize/Order unit (SSO)
   Used for both networking and non-networking use cases

Resource provisioning examples
 - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device.
 - A PF/VF with CPT-LF resource works as a pure crypto offload device.

RVU functional blocks are highly configurable as per software requirements.

Firmware sets up the following before the kernel boots
 - Enables the required number of RVU PFs based on the number of physical links.
 - The number of VFs per PF is either static or configurable at compile time.
   Based on the config, firmware assigns VFs to each of the PFs.
 - Also assigns MSIX vectors to each of the PFs and VFs.
 - These are not changed after kernel boot.
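
The VF counts assigned by firmware should be visible from the host once the
kernel is up. A minimal sketch, assuming the standard PCI sysfs attributes
``sriov_totalvfs`` and ``sriov_numvfs``; the helper name ``list_sriov`` and its
optional directory argument are illustrative and not part of any driver:

```shell
# Hypothetical helper: print "<PCI address> <enabled VFs>/<maximum VFs>"
# for every SR-IOV capable PCI function found under the given directory.
# The directory defaults to the usual sysfs location.
list_sriov() {
    base="${1:-/sys/bus/pci/devices}"
    for dev in "$base"/*; do
        # Skip functions that do not expose the SR-IOV attributes.
        [ -r "$dev/sriov_totalvfs" ] || continue
        printf '%s %s/%s\n' "$(basename "$dev")" \
            "$(cat "$dev/sriov_numvfs")" "$(cat "$dev/sriov_totalvfs")"
    done
}
```

Calling ``list_sriov`` with no argument walks ``/sys/bus/pci/devices``; the
output is not specific to RVU and lists every SR-IOV capable function.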

Drivers
=======

The Linux kernel will have multiple drivers registering to different PFs and VFs
of RVU. With respect to networking there will be three flavours of drivers.

Admin Function driver
---------------------

As mentioned above, RVU PF0 is called the admin function (AF). This driver
supports resource provisioning and configuration of functional blocks.
It doesn't handle any I/O. It sets up a few basic things, but most of the
functionality is achieved via configuration requests from PFs and VFs.

PFs/VFs communicate with AF via a shared memory region (mailbox). Upon
receiving requests, AF does resource provisioning and other HW configuration.
AF is always attached to the host kernel, but PFs and their VFs may be used by
the host kernel itself, or attached to VMs or to userspace applications like
DPDK etc. So AF has to handle provisioning/configuration requests sent
by any device from any domain.

The AF driver also interacts with the underlying firmware to
 - Manage physical ethernet links, i.e. CGX LMACs.
 - Retrieve information like speed, duplex, autoneg etc.
 - Retrieve PHY EEPROM and stats.
 - Configure FEC, PAM modes.
 - etc.

From the pure networking side, the AF driver supports the following functionality.
 - Map a physical link to a RVU PF to which a netdev is registered.
 - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
   for regular networking functionality.
 - Flow control (pause frames) enable/disable/config.
 - HW PTP timestamping related config.
 - NPC parser profile config, basically how to parse pkt and what info to extract.
 - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
 - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
 - Defines receive side scaling (RSS) algorithms.
 - Defines segmentation offload algorithms (e.g. TSO).
 - VLAN stripping, capture and insertion config.
 - SSO and TIM blocks config which provide packet scheduling support.
 - Debugfs support, to check current resource provisioning, current status of
   NPA pools, NIX RQ, SQ and CQs, various stats etc. which helps in debugging issues.
 - And many more.

Physical Function driver
------------------------

This RVU PF handles IO, is mapped to a physical ethernet link and this
driver registers a netdev. This supports SR-IOV. As said above, this driver
communicates with AF via a mailbox. To retrieve information from physical
links this driver talks to AF, and AF gets that info from firmware and responds
back, i.e. the PF driver cannot talk to firmware directly.

Supports ethtool for configuring links, RSS, queue count, queue size,
flow control, ntuple filters, dump PHY EEPROM, config FEC etc.

Virtual Function driver
-----------------------

There are two types of VFs: VFs that share the physical link with their parent
SR-IOV PF and VFs which work in pairs using internal HW loopback channels (LBK).

Type1:
 - These VFs and their parent PF share a physical link and are used for outside communication.
 - VFs cannot communicate with AF directly, they send mbox messages to the PF and the PF
   forwards them to AF. AF, after processing, responds back to the PF and the PF forwards
   the reply to the VF.
 - From a functionality point of view there is no difference between PF and VF as the same type
   of HW resources are attached to both. But the user would be able to configure a few things only
   from the PF as the PF is treated as owner/admin of the link.

Type2:
 - RVU PF0, i.e. the admin function, creates these VFs and maps them to the loopback block's channels.
 - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair, i.e. pkts sent out of
   VF0 will be received by VF1 and vice versa.
 - These VFs can be used by applications or virtual machines to communicate between them
   without sending traffic outside. There is no switch present in HW, hence the support
   for loopback VFs.
 - These communicate directly with AF (PF0) via mbox.

Except for the IO channels or links used for packet reception and transmission there is
no other difference between these VF types. The AF driver takes care of IO channel mapping,
hence the same VF driver works for both types of devices.

Basic packet flow
=================

Ingress
-------

1. CGX LMAC receives packet.
2. Forwards the packet to the NIX block.
3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
6. Packet is DMA'ed and driver is notified.

Egress
------

1. Driver prepares a send descriptor and submits to SQ for transmission.
2. The SQ is already configured (by AF) to transmit on a specific link/channel.
3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
4. NIX block transmits the pkt on the designated channel.
5. NPC MCAM entries can be installed to divert pkt onto a different channel.

Devlink health reporters
========================

NPA Reporters
-------------
The NPA reporters are responsible for reporting and recovering the following group of errors:

1. GENERAL events

   - Error due to operation of unmapped PF.
   - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).

2. ERROR events

   - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

   ~# devlink health
   pci/0002:01:00.0:
     reporter hw_npa_intr
       state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_gen
       state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_err
       state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_ras
       state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

   ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_gen
    NPA_AF_GENERAL:
            NPA General Interrupt Reg : 1
            NIX0: free disabled RX
   ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_intr
    NPA_AF_RVU:
            NPA RVU Interrupt Reg : 1
            Unmap Slot Error
   ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_err
    NPA_AF_ERR:
            NPA Error Interrupt Reg : 4096
            AQ Doorbell Error


NIX Reporters
-------------
The NIX reporters are responsible for reporting and recovering the following group of errors:

1. GENERAL events

   - Receive mirror/multicast packet drop due to insufficient buffer.
   - SMQ Flush operation.

2. ERROR events

   - Memory Fault due to WQE read/write from multicast/mirror buffer.
   - Receive multicast/mirror replication list error.
   - Receive packet on an unmapped PF.
   - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
   - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer.
   - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

   ~# ./devlink health
   pci/0002:01:00.0:
     reporter hw_npa_intr
       state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_gen
       state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_err
       state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
     reporter hw_npa_ras
       state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
     reporter hw_nix_intr
       state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
     reporter hw_nix_gen
       state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
     reporter hw_nix_err
       state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
     reporter hw_nix_ras
       state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

   ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
    NIX_AF_RVU:
            NIX RVU Interrupt Reg : 1
            Unmap Slot Error
   ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
    NIX_AF_GENERAL:
            NIX General Interrupt Reg : 1
            Rx multicast pkt drop
   ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
    NIX_AF_ERR:
            NIX Error Interrupt Reg : 64
            Rx on unmapped PF_FUNC


Quality of service
==================


Hardware algorithms used in scheduling
--------------------------------------

The octeontx2 silicon and CN10K transmit interface consists of five transmit levels,
starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse the MDQ, TL4 to TL1
levels. Each level contains an array of queues to support scheduling and shaping.
The hardware uses the below algorithms depending on the priority of scheduler queues.
Once the user creates tc classes with different priorities, the driver configures the
schedulers allocated to the class with the specified priority along with the rate-limiting
configuration.

1. Strict Priority

      -  Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
         using strict priority.

2. Round Robin

      -  Active MDQs having the same priority level are chosen using round robin.


Setup HTB offload
-----------------

1. Enable HW TC offload on the interface::

        # ethtool -K <interface> hw-tc-offload on

2. Create htb root::

        # tc qdisc add dev <interface> clsact
        # tc qdisc replace dev <interface> root handle 1: htb offload

3. Create tc classes with different priorities::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7

4. Create tc classes with same priorities and different quantum::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416

        # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
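
Steps 1 to 3 can be collected into a small script. A minimal sketch, assuming a
POSIX shell; the helper names ``run`` and ``setup_htb_offload`` and the ``APPLY``
variable are hypothetical, and the commands are only printed unless ``APPLY=1``
is set:

```shell
#!/bin/sh
# Hypothetical wrapper around steps 1-3; not part of the driver or iproute2.

# Print the command by default; execute it only when APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# $1: network interface name
setup_htb_offload() {
    iface="$1"
    # Step 1: enable HW TC offload.
    run ethtool -K "$iface" hw-tc-offload on
    # Step 2: create the htb root.
    run tc qdisc add dev "$iface" clsact
    run tc qdisc replace dev "$iface" root handle 1: htb offload
    # Step 3: two classes with different priorities.
    run tc class add dev "$iface" parent 1: classid 1:1 htb rate 10Gbit prio 1
    run tc class add dev "$iface" parent 1: classid 1:2 htb rate 10Gbit prio 7
}
```

For example, ``setup_htb_offload eth0`` prints the five commands for review
(``eth0`` is a placeholder interface name), while ``APPLY=1 setup_htb_offload eth0``
executes them, which requires root privileges and a device that supports HTB offload.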