Coherent Accelerator Interface (CXL)
====================================

Introduction
============

    The coherent accelerator interface is designed to allow the
    coherent connection of accelerators (FPGAs and other devices) to a
    POWER system. These devices need to adhere to the Coherent
    Accelerator Interface Architecture (CAIA).

    IBM refers to this as the Coherent Accelerator Processor Interface
    or CAPI. In the kernel it's referred to by the name CXL to avoid
    confusion with the ISDN CAPI subsystem.

    Coherent in this context means that the accelerator and CPUs can
    both access system memory directly and with the same effective
    addresses.


Hardware overview
=================

         POWER8/9             FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

    The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
    unit which is part of the PCIe Host Bridge (PHB). Linux manages
    the CAPP through calls into OPAL and does not program it
    directly.

    The FPGA (or coherently attached device) consists of two parts:
    the POWER Service Layer (PSL) and the Accelerator Function Unit
    (AFU). The AFU is used to implement specific functionality behind
    the PSL. The PSL, among other things, provides memory address
    translation services to allow each AFU direct access to userspace
    memory.

    The AFU is the core part of the accelerator (e.g. the compression
    or crypto function). The kernel has no knowledge of the function
    of the AFU. Only userspace interacts directly with the AFU.

    The PSL provides the translation and interrupt services that the
    AFU needs. This is what the kernel interacts with. For example, if
    the AFU needs to read a particular effective address, it sends
    that address to the PSL, which translates it, fetches the data
    from memory and returns it to the AFU. If the PSL has a
    translation miss, it interrupts the kernel and the kernel services
    the fault. The context in which this fault is serviced depends on
    who owns that acceleration function.

    POWER8 <-----> PSL Version 8 is compliant with CAIA Version 1.0.
    POWER9 <-----> PSL Version 9 is compliant with CAIA Version 2.0.
    PSL Version 9 provides new features such as:
    * Interaction with the nest MMU on the P9 chip.
    * Native DMA support.
    * Support for sending ASB_Notify messages for host thread wakeup.
    * Support for atomic operations.
    * ....

    Cards with a PSL9 won't work on a POWER8 system and cards with a
    PSL8 won't work on a POWER9 system.

AFU Modes
=========

    The AFU supports two programming modes: dedicated and AFU
    directed. An AFU may support one or both modes.

    When using dedicated mode only one MMU context is supported. In
    this mode, only one userspace process can use the accelerator at
    a time.

    When using AFU directed mode, up to 16K simultaneous contexts can
    be supported. This means up to 16K simultaneous userspace
    applications may use the accelerator (although specific AFUs may
    support fewer). In this mode, the AFU sends a 16-bit context ID
    with each of its requests. This tells the PSL which context is
    associated with each operation. If the PSL can't translate an
    operation, the ID can also be accessed by the kernel so it can
    determine the userspace context associated with an operation.


MMIO space
==========

    A portion of the accelerator MMIO space can be directly mapped
    from the AFU to userspace. Either the whole space can be mapped or
    just a per-context portion. The hardware is self-describing, so
    the kernel can determine the offset and size of the per-context
    portion.


Interrupts
==========

    AFUs may generate interrupts that are destined for userspace. These
    are received by the kernel as hardware interrupts and passed on to
    userspace through the read syscall documented below.

    Data storage faults and error interrupts are handled by the kernel
    driver.


Work Element Descriptor (WED)
=============================

    The WED is a 64-bit parameter passed to the AFU when a context is
    started. Its format is up to the AFU, so the kernel has no
    knowledge of what it represents. Typically it will be the
    effective address of a work queue or status block where the AFU
    and userspace can share control and status information.


User API
========

1. AFU character devices

    For AFUs operating in AFU directed mode, two character device
    files will be created. /dev/cxl/afu0.0m will correspond to a
    master context and /dev/cxl/afu0.0s will correspond to a slave
    context. Master contexts have access to the full MMIO space an
    AFU provides. Slave contexts have access to only the per-process
    MMIO space an AFU provides.

    For AFUs operating in dedicated process mode, the driver will
    only create a single character device per AFU called
    /dev/cxl/afu0.0d. This will have access to the entire MMIO space
    that the AFU provides (like master contexts in AFU directed).

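    The naming scheme above can be captured in a small helper. This is
    only a sketch: cxl_afu_path() is a hypothetical function (it is not
    part of the kernel API or of libcxl), and it assumes that the two
    numbers in afuX.Y identify the card and the AFU on that card.

```c
#include <stdio.h>

/*
 * Build the character device path for an AFU, following the naming
 * used in this document: /dev/cxl/afu<card>.<afu><suffix>, where the
 * suffix is 'm' (master), 's' (slave) or 'd' (dedicated).
 * Hypothetical helper; the card/afu meaning of X.Y is an assumption.
 * Returns 0 on success, -1 on an unknown suffix or a too-small buffer.
 */
static int cxl_afu_path(char *buf, size_t len,
                        unsigned int card, unsigned int afu, char suffix)
{
        int n;

        if (suffix != 'm' && suffix != 's' && suffix != 'd')
                return -1;

        n = snprintf(buf, len, "/dev/cxl/afu%u.%u%c", card, afu, suffix);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```
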
    The types described below are defined in include/uapi/misc/cxl.h.

    The following file operations are supported on both slave and
    master devices.

    A userspace library libcxl is available here:
	https://github.com/ibm-capi/libcxl
    This provides a C interface to this kernel API.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API.

    A dedicated mode AFU only has one context and only allows the
    device to be opened once.

    An AFU directed mode AFU can have many contexts; the device can be
    opened once for each context that is available.

    When all available contexts are allocated the open call will fail
    and return -ENOSPC.

    Note: IRQs need to be allocated for each context, which may limit
          the number of contexts that can be created, and therefore
          how many times the device can be opened. The POWER8 CAPP
          supports 2040 IRQs and 3 are used by the kernel, so 2037 are
          left. If 1 IRQ is needed per context, then only 2037
          contexts can be allocated. If 4 IRQs are needed per context,
          then only 2037/4 = 509 contexts can be allocated.

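    The arithmetic in the note can be written out as follows. This is
    a minimal sketch using only the POWER8 figures quoted above; the
    macro and function names are made up for illustration and do not
    exist in the driver.

```c
/* Figures quoted in the note above for the POWER8 CAPP. */
#define CAPP_TOTAL_IRQS  2040u
#define KERNEL_IRQS      3u

/*
 * Upper bound on the number of contexts when every context needs
 * irqs_per_context interrupts. Integer division rounds down, since a
 * context cannot be started with only part of its interrupts.
 */
static unsigned int max_contexts(unsigned int irqs_per_context)
{
        if (irqs_per_context == 0)
                return 0;       /* not meaningful for this estimate */

        return (CAPP_TOTAL_IRQS - KERNEL_IRQS) / irqs_per_context;
}
```

    For example, max_contexts(1) gives 2037 and max_contexts(4) gives
    509, matching the numbers in the note.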

ioctl
-----

    CXL_IOCTL_START_WORK:
        Starts the AFU context and associates it with the current
        process. Once this ioctl is successfully executed, all memory
        mapped into this process is accessible to this AFU context
        using the same effective addresses. No additional calls are
        required to map/unmap memory. The AFU memory context will be
        updated as userspace allocates and frees memory. This ioctl
        returns once the AFU context is started.

        Takes a pointer to a struct cxl_ioctl_start_work:

                struct cxl_ioctl_start_work {
                        __u64 flags;
                        __u64 work_element_descriptor;
                        __u64 amr;
                        __s16 num_interrupts;
                        __s16 reserved1;
                        __s32 reserved2;
                        __u64 reserved3;
                        __u64 reserved4;
                        __u64 reserved5;
                        __u64 reserved6;
                };

            flags:
                Indicates which optional fields in the structure are
                valid.

            work_element_descriptor:
                The Work Element Descriptor (WED) is a 64-bit argument
                defined by the AFU. Typically this is an effective
                address pointing to an AFU specific structure
                describing what work to perform.

            amr:
                Authority Mask Register (AMR), same as the powerpc
                AMR. This field is only used by the kernel when the
                corresponding CXL_START_WORK_AMR value is specified in
                flags. If not specified the kernel will use a default
                value of 0.

            num_interrupts:
                Number of userspace interrupts to request. This field
                is only used by the kernel when the corresponding
                CXL_START_WORK_NUM_IRQS value is specified in flags.
                If not specified the minimum number required by the
                AFU will be allocated. The min and max number can be
                obtained from sysfs.

            reserved fields:
                For ABI padding and future extensions

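    As an illustration of how the flags select the optional fields,
    the sketch below fills in a local mirror of the structure. It is
    not the uapi header: the struct uses <stdint.h> types so that it
    builds anywhere, the flag values are assumptions that should be
    taken from include/uapi/misc/cxl.h, and prepare_start_work() is a
    hypothetical helper. Zeroing the reserved fields is a conservative
    choice rather than something this document requires.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work as shown above. */
struct start_work {
        uint64_t flags;
        uint64_t work_element_descriptor;
        uint64_t amr;
        int16_t  num_interrupts;
        int16_t  reserved1;
        int32_t  reserved2;
        uint64_t reserved3;
        uint64_t reserved4;
        uint64_t reserved5;
        uint64_t reserved6;
};

/* Assumed flag values; use the definitions from the uapi header. */
#define START_WORK_AMR       0x1ULL
#define START_WORK_NUM_IRQS  0x2ULL

/*
 * Prepare a start-work request. wed is passed as an effective address.
 * A negative num_irqs leaves the NUM_IRQS flag clear, so the kernel
 * would allocate the minimum number required by the AFU. The AMR flag
 * is never set here, so the default AMR of 0 applies.
 */
static void prepare_start_work(struct start_work *w, const void *wed,
                               int16_t num_irqs)
{
        memset(w, 0, sizeof(*w));       /* reserved fields stay zero */
        w->work_element_descriptor = (uint64_t)(uintptr_t)wed;

        if (num_irqs >= 0) {
                w->flags |= START_WORK_NUM_IRQS;
                w->num_interrupts = num_irqs;
        }
}
```

    With the real header, the equivalent structure would then be
    passed by pointer to the CXL_IOCTL_START_WORK ioctl on the file
    descriptor returned by open.
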
    CXL_IOCTL_GET_PROCESS_ELEMENT:
        Get the current context ID, also known as the process element.
        The value is returned from the kernel as a __u32.


mmap
----

    An AFU may have an MMIO space to facilitate communication with the
    AFU. If it does, the MMIO space can be accessed via mmap. The size
    and contents of this area are specific to the particular AFU. The
    size can be discovered via sysfs.

    In AFU directed mode, master contexts are allowed to map all of
    the MMIO space and slave contexts are allowed to only map the
    per-process MMIO space associated with the context. In dedicated
    process mode the entire MMIO space can always be mapped.

    This mmap call must be done after the START_WORK ioctl.

    Care should be taken when accessing MMIO space. Only 32-bit and
    64-bit accesses are supported by POWER8. Also, the AFU will be
    designed with a specific endianness, so all MMIO accesses should
    consider endianness (the endian(3) helpers such as le64toh() and
    be64toh() are recommended). These endian issues equally apply to
    shared memory queues the WED may describe.

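    A minimal sketch of endian-aware 64-bit accessors built on the
    endian(3) helpers is shown below. It assumes an AFU whose
    registers are little-endian, which is purely an example choice;
    reg_read64_le() and reg_write64_le() are hypothetical names. Any
    ordering barriers a real MMIO access might need are out of scope
    here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* exposes le64toh()/htole64() in glibc */
#endif
#include <endian.h>
#include <stdint.h>

/* Read a 64-bit little-endian register and return it in host order. */
static uint64_t reg_read64_le(const volatile uint64_t *reg)
{
        return le64toh(*reg);
}

/* Write a host-order value to a 64-bit little-endian register. */
static void reg_write64_le(volatile uint64_t *reg, uint64_t value)
{
        *reg = htole64(value);
}
```

    In real use the pointer would come from the mmap call described
    above. A big-endian AFU would use be64toh() and htobe64() instead,
    and the same conversions apply to fields in a shared memory queue.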

read
----

    Reads events from the AFU. Blocks if no events are pending
    (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
    unrecoverable error or if the card is removed.

    read() will always return an integral number of events.

    The buffer passed to read() must be at least 4K bytes.

    The result of the read will be a buffer of one or more events.
    Each event is of type struct cxl_event and varies in size.

            struct cxl_event {
                    struct cxl_event_header header;
                    union {
                            struct cxl_event_afu_interrupt irq;
                            struct cxl_event_data_storage fault;
                            struct cxl_event_afu_error afu_error;
                    };
            };

    The struct cxl_event_header is defined as:

            struct cxl_event_header {
                    __u16 type;
                    __u16 size;
                    __u16 process_element;
                    __u16 reserved1;
            };

        type:
            This defines the type of event. The type determines how
            the rest of the event is structured. These types are
            described below and defined by enum cxl_event_type.

        size:
            This is the size of the event in bytes including the
            struct cxl_event_header. The start of the next event can
            be found at this offset from the start of the current
            event.

        process_element:
            Context ID of the event.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_AFU_INTERRUPT then the event
    structure is defined as:

            struct cxl_event_afu_interrupt {
                    __u16 flags;
                    __u16 irq; /* Raised AFU interrupt number */
                    __u32 reserved1;
            };

        flags:
            These flags indicate which optional fields are present
            in this struct. Currently all fields are mandatory.

        irq:
            The IRQ number sent by the AFU.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_DATA_STORAGE then the event
    structure is defined as:

            struct cxl_event_data_storage {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 addr;
                    __u64 dsisr;
                    __u64 reserved3;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        addr:
            The address that the AFU unsuccessfully attempted to
            access. Valid accesses will be handled transparently by the
            kernel but invalid accesses will generate this event.

        dsisr:
            This field gives information on the type of fault. It is a
            copy of the DSISR from the PSL hardware when the address
            fault occurred. The form of the DSISR is as defined in the
            CAIA.

        reserved fields:
            For future extensions.

    If the event type is CXL_EVENT_AFU_ERROR then the event structure
    is defined as:

            struct cxl_event_afu_error {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 error;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        error:
            Error status from the AFU. Defined by the AFU.

        reserved fields:
            For future extensions and padding.

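    Because events vary in size, a buffer returned by read() has to be
    walked using the size field of each header. The sketch below shows
    one way to do that. It uses a local mirror of struct
    cxl_event_header with <stdint.h> types, and the numeric event type
    values are assumptions about enum cxl_event_type that should be
    checked against the uapi header. count_events() is a hypothetical
    helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header as shown above. */
struct event_header {
        uint16_t type;
        uint16_t size;
        uint16_t process_element;
        uint16_t reserved1;
};

/* Assumed numbering of enum cxl_event_type; check the uapi header. */
enum {
        EVENT_AFU_INTERRUPT = 1,
        EVENT_DATA_STORAGE  = 2,
        EVENT_AFU_ERROR     = 3,
};

/*
 * Walk len bytes returned by read() and count events per type.
 * counts[] must have at least 4 entries and is only incremented.
 * Returns the number of events seen, or -1 if a header is truncated
 * or carries a size that cannot be valid.
 */
static int count_events(const unsigned char *buf, size_t len,
                        unsigned int counts[4])
{
        size_t off = 0;
        int n = 0;

        while (off < len) {
                struct event_header hdr;

                if (len - off < sizeof(hdr))
                        return -1;
                /* Copy the header out to avoid unaligned access. */
                memcpy(&hdr, buf + off, sizeof(hdr));
                if (hdr.size < sizeof(hdr) || hdr.size > len - off)
                        return -1;
                if (hdr.type < 4)
                        counts[hdr.type]++;
                off += hdr.size;        /* next event starts here */
                n++;
        }
        return n;
}
```

    A real consumer would switch on the type and interpret the bytes
    after the header as the matching structure described above,
    instead of only counting.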

2. Card character device (PowerVM guest only)

    In a PowerVM guest, an extra character device is created for the
    card. The device is only used to write (flash) a new image on the
    FPGA accelerator. Once the image is written and verified, the
    device tree is updated and the card is reset to reload the updated
    image.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API. The device can only be opened once.

ioctl
-----

CXL_IOCTL_DOWNLOAD_IMAGE:
CXL_IOCTL_VALIDATE_IMAGE:
    Starts and controls flashing a new FPGA image. Partial
    reconfiguration is not supported (yet), so the image must contain
    a copy of the PSL and AFU(s). Since an image can be quite large,
    the caller may have to iterate, splitting the image into smaller
    chunks.

    Takes a pointer to a struct cxl_adapter_image:
        struct cxl_adapter_image {
            __u64 flags;
            __u64 data;
            __u64 len_data;
            __u64 len_image;
            __u64 reserved1;
            __u64 reserved2;
            __u64 reserved3;
            __u64 reserved4;
        };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    data:
        Pointer to a buffer with part of the image to write to the
        card.

    len_data:
        Size of the buffer pointed to by data.

    len_image:
        Full size of the image.

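    The chunked iteration mentioned above might be set up as in the
    following sketch. It only shows how data, len_data and len_image
    could be filled in for each chunk. It uses a local mirror of the
    structure with <stdint.h> types, next_chunk() is a hypothetical
    helper, and the chunk size is chosen by the caller. Which ioctl
    is issued for which chunk is not specified in this document and
    is left out.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_adapter_image as shown above. */
struct adapter_image {
        uint64_t flags;
        uint64_t data;
        uint64_t len_data;
        uint64_t len_image;
        uint64_t reserved1;
        uint64_t reserved2;
        uint64_t reserved3;
        uint64_t reserved4;
};

/*
 * Describe the chunk of image that starts at offset and is at most
 * chunk_max bytes long. Returns 1 if a chunk was produced, or 0 when
 * offset is at or past the end of the image (or chunk_max is 0).
 */
static int next_chunk(struct adapter_image *ai, const unsigned char *image,
                      uint64_t image_len, uint64_t offset,
                      uint64_t chunk_max)
{
        uint64_t remaining;

        if (offset >= image_len || chunk_max == 0)
                return 0;

        remaining = image_len - offset;
        memset(ai, 0, sizeof(*ai));     /* flags and reserved stay zero */
        ai->data = (uint64_t)(uintptr_t)(image + offset);
        ai->len_data = remaining < chunk_max ? remaining : chunk_max;
        ai->len_image = image_len;      /* always the full image size */
        return 1;
}
```

    A caller would loop, advancing offset by len_data after each
    chunk, until next_chunk() returns 0.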

Sysfs Class
===========

    A cxl sysfs class is added under /sys/class/cxl to facilitate
    enumeration and tuning of the accelerators. Its layout is
    described in Documentation/ABI/testing/sysfs-class-cxl.


Udev rules
==========

    The following udev rules could be used to create a symlink to the
    most logical chardev to use in any programming mode (afuX.Yd for
    dedicated, afuX.Ys for AFU directed), since the API is virtually
    identical for each:

	SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
	SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
	                  KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"
