Coherent Accelerator Interface (CXL)
====================================

Introduction
============

    The coherent accelerator interface is designed to allow the
    coherent connection of accelerators (FPGAs and other devices) to a
    POWER system. These devices need to adhere to the Coherent
    Accelerator Interface Architecture (CAIA).

    IBM refers to this as the Coherent Accelerator Processor Interface
    or CAPI. In the kernel it's referred to by the name CXL to avoid
    confusion with the ISDN CAPI subsystem.

    Coherent in this context means that the accelerator and CPUs can
    both access system memory directly and with the same effective
    addresses.


Hardware overview
=================

         POWER8/9             FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

    The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
    unit which is part of the PCIe Host Bridge (PHB). Linux manages it
    through calls into OPAL and doesn't program the CAPP directly.

    The FPGA (or coherently attached device) consists of two parts:
    the POWER Service Layer (PSL) and the Accelerator Function Unit
    (AFU). The AFU is used to implement specific functionality behind
    the PSL. The PSL, among other things, provides memory address
    translation services to allow each AFU direct access to userspace
    memory.

    The AFU is the core part of the accelerator (e.g. the compression
    or crypto function). The kernel has no knowledge of the function
    of the AFU. Only userspace interacts directly with the AFU.

    The PSL provides the translation and interrupt services that the
    AFU needs. This is what the kernel interacts with.
    For example, if the AFU needs to read a particular effective
    address, it sends that address to the PSL. The PSL then translates
    it, fetches the data from memory and returns it to the AFU. If the
    PSL has a translation miss, it interrupts the kernel and the kernel
    services the fault. The context in which the fault is serviced
    depends on who owns that acceleration function.

    POWER8 <-----> PSL Version 8 is compliant to the CAIA Version 1.0.
    POWER9 <-----> PSL Version 9 is compliant to the CAIA Version 2.0.
    This PSL Version 9 provides new features such as:
    * Interaction with the nest MMU on the P9 chip.
    * Native DMA support.
    * Supports sending ASB_Notify messages for host thread wakeup.
    * Supports Atomic operations.
    * ....

    Cards with a PSL9 won't work on a POWER8 system and cards with a
    PSL8 won't work on a POWER9 system.

AFU Modes
=========

    The AFU supports two programming modes: dedicated and AFU
    directed. An AFU may support one or both modes.

    When using dedicated mode only one MMU context is supported. In
    this mode, only one userspace process can use the accelerator at
    a time.

    When using AFU directed mode, up to 16K simultaneous contexts can
    be supported. This means up to 16K simultaneous userspace
    applications may use the accelerator (although specific AFUs may
    support fewer). In this mode, the AFU sends a 16 bit context ID
    with each of its requests. This tells the PSL which context is
    associated with each operation. If the PSL can't translate an
    operation, the ID can also be accessed by the kernel so it can
    determine the userspace context associated with an operation.


MMIO space
==========

    A portion of the accelerator MMIO space can be directly mapped
    from the AFU to userspace. Either the whole space can be mapped or
    just a per context portion.
    The hardware is self describing, hence the kernel can determine
    the offset and size of the per context portion.


Interrupts
==========

    AFUs may generate interrupts that are destined for userspace. These
    are received by the kernel as hardware interrupts and passed on to
    userspace via the read syscall documented below.

    Data storage faults and error interrupts are handled by the kernel
    driver.


Work Element Descriptor (WED)
=============================

    The WED is a 64-bit parameter passed to the AFU when a context is
    started. Its format is up to the AFU, hence the kernel has no
    knowledge of what it represents. Typically it will be the
    effective address of a work queue or status block where the AFU
    and userspace can share control and status information.




User API
========

1. AFU character devices

    For AFUs operating in AFU directed mode, two character device
    files will be created. /dev/cxl/afu0.0m will correspond to a
    master context and /dev/cxl/afu0.0s will correspond to a slave
    context. Master contexts have access to the full MMIO space an
    AFU provides. Slave contexts have access to only the per process
    MMIO space an AFU provides.

    For AFUs operating in dedicated process mode, the driver will
    only create a single character device per AFU called
    /dev/cxl/afu0.0d. This will have access to the entire MMIO space
    that the AFU provides (like master contexts in AFU directed).

    The types described below are defined in include/uapi/misc/cxl.h

    The following file operations are supported on both slave and
    master devices.

    A userspace library libcxl is available here:
        https://github.com/ibm-capi/libcxl
    This provides a C interface to this kernel API.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API.
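
    As a minimal sketch of this step, the helper below opens one of
    the AFU device files with open(2). The helper name
    cxl_open_context() is hypothetical and is not part of the kernel
    API or of libcxl. From userspace, open(2) reports failure by
    returning -1 and setting errno; the helper folds that into a
    negative errno value, so an errno of ENOSPC shows up as -ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/*
 * Open a cxl AFU character device (for example /dev/cxl/afu0.0s).
 * Returns the file descriptor on success or a negative errno value
 * on failure. Hypothetical helper, for illustration only.
 */
static int cxl_open_context(const char *path)
{
        int fd;

        assert(path != NULL);

        fd = open(path, O_RDWR);
        if (fd < 0)
                return -errno;  /* e.g. -ENOENT if the device node is absent */

        return fd;
}
```

    The returned descriptor is what the ioctl, mmap and read calls
    described in the rest of this section operate on.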

    A dedicated mode AFU only has one context and only allows the
    device to be opened once.

    An AFU directed mode AFU can have many contexts. The device can be
    opened once for each context that is available.

    When all available contexts are allocated the open call will fail
    and return -ENOSPC.

    Note: IRQs need to be allocated for each context, which may limit
          the number of contexts that can be created, and therefore
          how many times the device can be opened. The POWER8 CAPP
          supports 2040 IRQs and 3 are used by the kernel, so 2037 are
          left. If 1 IRQ is needed per context, then only 2037
          contexts can be allocated. If 4 IRQs are needed per context,
          then only 2037/4 = 509 contexts can be allocated.


ioctl
-----

    CXL_IOCTL_START_WORK:
        Starts the AFU context and associates it with the current
        process. Once this ioctl is successfully executed, all memory
        mapped into this process is accessible to this AFU context
        using the same effective addresses. No additional calls are
        required to map/unmap memory. The AFU memory context will be
        updated as userspace allocates and frees memory. This ioctl
        returns once the AFU context is started.

        Takes a pointer to a struct cxl_ioctl_start_work:

                struct cxl_ioctl_start_work {
                        __u64 flags;
                        __u64 work_element_descriptor;
                        __u64 amr;
                        __s16 num_interrupts;
                        __s16 reserved1;
                        __s32 reserved2;
                        __u64 reserved3;
                        __u64 reserved4;
                        __u64 reserved5;
                        __u64 reserved6;
                };

            flags:
                Indicates which optional fields in the structure are
                valid.

            work_element_descriptor:
                The Work Element Descriptor (WED) is a 64-bit argument
                defined by the AFU. Typically this is an effective
                address pointing to an AFU specific structure
                describing what work to perform.

            amr:
                Authority Mask Register (AMR), same as the powerpc
                AMR.
                This field is only used by the kernel when the
                corresponding CXL_START_WORK_AMR value is specified in
                flags. If not specified the kernel will use a default
                value of 0.

            num_interrupts:
                Number of userspace interrupts to request. This field
                is only used by the kernel when the corresponding
                CXL_START_WORK_NUM_IRQS value is specified in flags.
                If not specified the minimum number required by the
                AFU will be allocated. The min and max number can be
                obtained from sysfs.

            reserved fields:
                For ABI padding and future extensions

    CXL_IOCTL_GET_PROCESS_ELEMENT:
        Get the current context id, also known as the process element.
        The value is returned from the kernel as a __u32.


mmap
----

    An AFU may have an MMIO space to facilitate communication with the
    AFU. If it does, the MMIO space can be accessed via mmap. The size
    and contents of this area are specific to the particular AFU. The
    size can be discovered via sysfs.

    In AFU directed mode, master contexts are allowed to map all of
    the MMIO space and slave contexts are allowed to only map the per
    process MMIO space associated with the context. In dedicated
    process mode the entire MMIO space can always be mapped.

    This mmap call must be done after the START_WORK ioctl.

    Care should be taken when accessing MMIO space. Only 32 and 64-bit
    accesses are supported by POWER8. Also, the AFU will be designed
    with a specific endianness, so all MMIO accesses should consider
    endianness (the endian(3) helpers such as le64toh() and be64toh()
    are recommended). These endian issues equally apply to shared
    memory queues the WED may describe.


read
----

    Reads events from the AFU. Blocks if no events are pending
    (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
    unrecoverable error or if the card is removed.
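
    The events returned by a single read() are packed back to back,
    and the size field of each event header gives the offset to the
    next event, as described in the rest of this section. The sketch
    below walks such a buffer. It is an illustration only: struct
    ev_header is a local mirror of struct cxl_event_header so the
    example builds without the kernel headers, and the names
    ev_handler and walk_events() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct cxl_event_header from
 * include/uapi/misc/cxl.h, declared here only so the sketch builds
 * without the kernel headers.
 */
struct ev_header {
        uint16_t type;
        uint16_t size;
        uint16_t process_element;
        uint16_t reserved1;
};

/* Hypothetical per-event callback; payload points just past the header. */
typedef void (*ev_handler)(const struct ev_header *hdr,
                           const void *payload, void *arg);

/*
 * Walk 'len' bytes previously filled in by read() and call 'fn'
 * (if not NULL) once per event. Returns the number of events seen,
 * or -1 if a header is truncated or carries an impossible size.
 */
static int walk_events(const unsigned char *buf, size_t len,
                       ev_handler fn, void *arg)
{
        size_t off = 0;
        int count = 0;

        assert(buf != NULL || len == 0);

        while (off < len) {
                struct ev_header hdr;

                if (len - off < sizeof(hdr))
                        return -1;
                /* Copy the header out to avoid unaligned accesses. */
                memcpy(&hdr, buf + off, sizeof(hdr));

                /* size includes the header and must fit in the buffer. */
                if (hdr.size < sizeof(hdr) || hdr.size > len - off)
                        return -1;

                if (fn)
                        fn(&hdr, buf + off + sizeof(hdr), arg);

                off += hdr.size;        /* start of the next event */
                count++;
        }

        return count;
}
```

    A caller would pass the buffer and the byte count returned by
    read(), and dispatch on hdr->type inside the callback.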

    read() will always return an integral number of events.

    The buffer passed to read() must be at least 4K bytes.

    The result of the read will be a buffer of one or more events.
    Each event is of type struct cxl_event and varies in size.

            struct cxl_event {
                    struct cxl_event_header header;
                    union {
                            struct cxl_event_afu_interrupt irq;
                            struct cxl_event_data_storage fault;
                            struct cxl_event_afu_error afu_error;
                    };
            };

    The struct cxl_event_header is defined as:

            struct cxl_event_header {
                    __u16 type;
                    __u16 size;
                    __u16 process_element;
                    __u16 reserved1;
            };

        type:
            This defines the type of event. The type determines how
            the rest of the event is structured. These types are
            described below and defined by enum cxl_event_type.

        size:
            This is the size of the event in bytes including the
            struct cxl_event_header. The start of the next event can
            be found at this offset from the start of the current
            event.

        process_element:
            Context ID of the event.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_AFU_INTERRUPT then the event
    structure is defined as:

            struct cxl_event_afu_interrupt {
                    __u16 flags;
                    __u16 irq; /* Raised AFU interrupt number */
                    __u32 reserved1;
            };

        flags:
            These flags indicate which optional fields are present
            in this struct. Currently all fields are mandatory.

        irq:
            The IRQ number sent by the AFU.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_DATA_STORAGE then the event
    structure is defined as:

            struct cxl_event_data_storage {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 addr;
                    __u64 dsisr;
                    __u64 reserved3;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct.
            Currently all fields are mandatory.

        addr:
            The address that the AFU unsuccessfully attempted to
            access. Valid accesses will be handled transparently by the
            kernel but invalid accesses will generate this event.

        dsisr:
            This field gives information on the type of fault. It is a
            copy of the DSISR from the PSL hardware when the address
            fault occurred. The form of the DSISR is as defined in the
            CAIA.

        reserved fields:
            For future extensions

    If the event type is CXL_EVENT_AFU_ERROR then the event structure
    is defined as:

            struct cxl_event_afu_error {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 error;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        error:
            Error status from the AFU. Defined by the AFU.

        reserved fields:
            For future extensions and padding


2. Card character device (PowerVM guest only)

    In a PowerVM guest, an extra character device is created for the
    card. The device is only used to write (flash) a new image on the
    FPGA accelerator. Once the image is written and verified, the
    device tree is updated and the card is reset to reload the updated
    image.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API. The device can only be opened once.

ioctl
-----

CXL_IOCTL_DOWNLOAD_IMAGE:
CXL_IOCTL_VALIDATE_IMAGE:
    Starts and controls flashing a new FPGA image. Partial
    reconfiguration is not supported (yet), so the image must contain
    a copy of the PSL and AFU(s). Since an image can be quite large,
    the caller may have to iterate, splitting the image into smaller
    chunks.
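
    The chunked iteration mentioned above can be sketched as follows,
    using the data, len_data and len_image fields described later in
    this section. This is an illustration under stated assumptions,
    not the driver's required calling sequence: struct adapter_image
    is a local mirror of struct cxl_adapter_image, the names
    submit_fn, send_image() and count_bytes() are hypothetical, and
    the chunk size is chosen by the caller. The callback stands in
    for the real ioctl() call on the card device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct cxl_adapter_image, declared here only so
 * the sketch builds without the kernel headers.
 */
struct adapter_image {
        uint64_t flags;
        uint64_t data;
        uint64_t len_data;
        uint64_t len_image;
        uint64_t reserved1;
        uint64_t reserved2;
        uint64_t reserved3;
        uint64_t reserved4;
};

/* Hypothetical callback standing in for the ioctl() on the card device. */
typedef int (*submit_fn)(const struct adapter_image *ai, void *arg);

/*
 * Split 'image' into pieces of at most 'chunk_len' bytes and submit
 * each piece. Returns the number of pieces submitted, or the first
 * negative value returned by 'submit'.
 */
static int send_image(const void *image, size_t image_len,
                      size_t chunk_len, submit_fn submit, void *arg)
{
        const unsigned char *p = image;
        size_t off = 0;
        int calls = 0;

        assert(chunk_len > 0);
        assert(submit != NULL);

        while (off < image_len) {
                struct adapter_image ai;
                size_t n = image_len - off;
                int rc;

                if (n > chunk_len)
                        n = chunk_len;

                memset(&ai, 0, sizeof(ai));     /* flags and reserved stay 0 */
                ai.data = (uint64_t)(uintptr_t)(p + off);
                ai.len_data = n;                /* size of this piece */
                ai.len_image = image_len;       /* full size of the image */

                rc = submit(&ai, arg);
                if (rc < 0)
                        return rc;

                off += n;
                calls++;
        }

        return calls;
}

/* Example callback: only adds up the bytes it is handed. */
static int count_bytes(const struct adapter_image *ai, void *arg)
{
        *(uint64_t *)arg += ai->len_data;
        return 0;
}
```

    In a real caller the callback would issue the ioctl with a
    pointer to the filled-in structure instead of counting bytes.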

    Takes a pointer to a struct cxl_adapter_image:
        struct cxl_adapter_image {
            __u64 flags;
            __u64 data;
            __u64 len_data;
            __u64 len_image;
            __u64 reserved1;
            __u64 reserved2;
            __u64 reserved3;
            __u64 reserved4;
        };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    data:
        Pointer to a buffer with part of the image to write to the
        card.

    len_data:
        Size of the buffer pointed to by data.

    len_image:
        Full size of the image.


Sysfs Class
===========

    A cxl sysfs class is added under /sys/class/cxl to facilitate
    enumeration and tuning of the accelerators. Its layout is
    described in Documentation/ABI/testing/sysfs-class-cxl


Udev rules
==========

    The following udev rules could be used to create a symlink to the
    most logical chardev to use in any programming mode (afuX.Yd for
    dedicated, afuX.Ys for AFU directed), since the API is virtually
    identical for each:

        SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
        SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
                          KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"