Oracle Data Analytics Accelerator (DAX)
---------------------------------------

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats. A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:
    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
-------------------

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses. The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates if the request was submitted successfully or if
there was an error.
One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished, but
the M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later. The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it. The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
-----------------

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical. The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs. When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. So a CCB may contain
either the virtual or real addresses of the buffers or a combination
of them. An "address type" field is available for each address that
may be given in the CCB.
In all cases, the Hypervisor will translate
all the addresses to physical before dispatching to hardware. Address
translations are performed using the context of the process initiating
the request.


The Driver API
--------------

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use. This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests. When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.

CCB_DEQUEUE

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources.
No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL

Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.

Submission of an array of CCBs for execution

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information. The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.

MMAP

The mmap() function provides access to the completion area allocated
in the driver. Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.
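
The return-value rules given above for submitting an array of CCBs can
be condensed into a small helper. The following is only a minimal
sketch and is not part of the driver or libdax: the names
submit_outcome and classify_submit() are hypothetical, and CCB_SIZE is
assumed to be 64 bytes (a CCB being 8 64-bit words).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

#define CCB_SIZE 64      /* assumption: one CCB is 8 64-bit words */

/* Hypothetical names, for illustration only. */
enum submit_outcome {
        SUBMIT_FAILED,    /* write() returned -1; consult errno */
        SUBMIT_PARTIAL,   /* some CCBs not accepted; call read() for status */
        SUBMIT_COMPLETE   /* all CCBs accepted; do not call read() */
};

/*
 * Interpret the return value of write()/pwrite() for a submission of
 * 'requested' bytes, where 'requested' is a multiple of CCB_SIZE.
 */
enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
        assert(requested > 0 && requested % CCB_SIZE == 0);

        if (ret < 0)
                return SUBMIT_FAILED;
        if ((size_t)ret < requested)
                return SUBMIT_PARTIAL;
        return SUBMIT_COMPLETE;
}
```

A caller would pass the value returned by write() or pwrite() together
with the byte count it requested, and call read() only when the result
is SUBMIT_PARTIAL.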


Completion of a Request
-----------------------

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.


Application Life Cycle of a DAX Submission
------------------------------------------

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
------------------

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism.
All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page. A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.


CCB Structure
-------------
A CCB is an array of 8 64-bit words.
Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs:

        struct ccb {
                u64 control;
                u64 completion;
                u64 input0;
                u64 access;
                u64 input1;
                u64 op_data;
                u64 output;
                u64 table;
        };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

The first word (control) is examined by the driver for the following:
 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns


Example Code
------------

The DAX is accessible to both user and kernel code. The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2.
The simplest
procedure is to attempt to open both, as only one will succeed:

        fd = open("/dev/oradax1", O_RDWR);
        if (fd < 0)
                fd = open("/dev/oradax2", O_RDWR);
        if (fd < 0) {
                /* No DAX found */
        }

Next, the completion area must be mapped:

        completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page, since as explained above, the DAX is strictly constrained by
virtual page boundaries. In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, so the output bitmap is the input
bitmap inverted.

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail.

        ccb->control =       /* Table 36.1, CCB Header Format */
                  (2L << 48)     /* command = Scan Value */
                | (3L << 40)     /* output address type = primary virtual */
                | (3L << 34)     /* primary input address type = primary virtual */
                             /* Section 36.2.1, Query CCB Command Formats */
                | (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
                | (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
                | (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
                | (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
                | (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

        ccb->completion = 0;    /* Completion area address, to be filled in by driver */

        ccb->input0 = (unsigned long) input; /* primary input address */

        ccb->access =       /* Section 36.2.1.2, Data Access Control */
                  (2 << 24)    /* Primary input length format = bits */
                | (nbits - 1); /* number of bits in primary input stream, minus 1 */

        ccb->input1 = 0;       /* secondary input address, unused */

        ccb->op_data = 0;      /* scan criteria (value to be matched) */

        ccb->output = (unsigned long) output;  /* output address */

        ccb->table = 0;        /* table address, unused */

The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status:

        if (pwrite(fd, ccb, 64, 0) != 64) {
                struct ccb_exec_result status;
                read(fd, &status, sizeof(status));
                /* bail out */
        }

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document.

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                        : "=r" (status)
                        : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2.

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation:

        struct dax_command cmd;
        cmd.command = CCB_DEQUEUE;
        if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
                /* bail out */
        }

Finally, normal program cleanup should be done, i.e., unmapping the
completion area, closing the dax device, freeing memory, etc.

[Kernel example]

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB:

        ccb->control |=      /* Table 36.1, CCB Header Format */
                (3L << 32);     /* completion area address type = primary virtual */

        ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.

#include <asm/hypervisor.h>

        hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
                                 HV_CCB_QUERY_CMD |
                                 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
                                 HV_CCB_VA_PRIVILEGED,
                                 0, &bytes_accepted, &status_data);

        if (hv_rv != HV_EOK) {
                /* hv_rv is an error code, status_data contains */
                /* potential additional status, see 36.3.1.1 */
        }

After the submission, the completion area polling code is identical to
that in user land:

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                        : "=r" (status)
                        : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

The output bitmap is ready for consumption immediately after the
completion status indicates success.
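
The buffer rules that apply to both examples (no buffer may cross a
hardware page boundary, and the output buffer must be 64-byte aligned
with a size that is a multiple of 64 bytes) can be checked before a
CCB is built. The following is only a minimal sketch:
within_one_page() and output_buffer_ok() are hypothetical helper
names, not part of the driver or libdax, and the caller is assumed to
know the hardware page size that backs the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DAX_OUTPUT_ALIGN 64   /* output is written in 64-byte cache lines */

/*
 * Hypothetical helper: true if [addr, addr + len) lies within a single
 * page of size page_size, which must be a power of two.
 */
bool within_one_page(uintptr_t addr, size_t len, size_t page_size)
{
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

        if (len == 0 || len > page_size)
                return false;
        /* The first and last byte must share the same page number. */
        return (addr / page_size) == ((addr + len - 1) / page_size);
}

/* Hypothetical helper: output needs 64-byte alignment and size as well. */
bool output_buffer_ok(uintptr_t addr, size_t len, size_t page_size)
{
        return (addr % DAX_OUTPUT_ALIGN) == 0 &&
               (len % DAX_OUTPUT_ALIGN) == 0 &&
               within_one_page(addr, len, page_size);
}
```

For a buffer on the synthesized 8MB huge page, passing a page_size of
4MB models the rule that only the first half or the second half of the
page may be used in a single request.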