vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, while vfio is the means.

Unlike other hardware architectures, s390 has defined a unified I/O
access method, the so-called Channel I/O. It has its own access
patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so that it can be managed by the vfio
framework. We also add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device) to do further address translation and
to perform I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information and references can be found here:
- A good start to know Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For vfio mediated device framework:
- Documentation/vfio-mediated-device.txt

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport.
This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, for the majority of devices, which
use the standard Channel I/O based mechanism, we also need to provide
the functionality of passing them through to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture has implemented a so-called channel subsystem that
provides a unified view of the devices physically attached to the
systems. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka. DASDs), tapes,
communication controllers, etc., they can all be accessed by a
well-defined access method and they present I/O completion in a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which can be used to point out
the format of the CCW and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted.
The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:
- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion will be signaled to the host with I/O interruptions.
  The result will be copied as an IRB to user space to pass it back to
  the guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:
- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev can then
  be created by vfio_ccw and added to the mediated bus. It is the vfio
  device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and store I/O interrupt results for user
  space to retrieve.
  To notify user space of an I/O completion, it offers
  an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a
  database of the iova<->vaddr mappings in this operation. The
  vfio_pin_pages and vfio_unpin_pages interfaces are exported from the
  vfio iommu backend for the physical devices to pin and unpin pages on
  demand.

Below is a high-level block diagram.

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together:
1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   under the device node in sysfs will be created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and store I/O interrupt results for user space to retrieve. The
definition of the region is:

struct ccw_io_region {
#define ORB_AREA_SIZE 12
        __u8    orb_area[ORB_AREA_SIZE];
#define SCSW_AREA_SIZE 12
        __u8    scsw_area[SCSW_AREA_SIZE];
#define IRB_AREA_SIZE 96
        __u8    irb_area[IRB_AREA_SIZE];
        __u32   ret_code;
} __packed;

While starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region.

vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (starting with 'cp_') to do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses.
  These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by replacing the
  guest physical addresses with their corresponding host physical
  addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls:
    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS
  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing it to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier to notify the user space program of the I/O completion in an
  asynchronous way.

The use of vfio-ccw is not limited to QEMU, but QEMU is definitely a
good example for understanding how these patches work. Here is a little
bit more detail on how an I/O request triggered by the QEMU guest will
be handled (without error handling).

Explanation:
Q1-Q7: QEMU side process.
K1-K6: Kernel side process.

Q1. Get I/O region info during initialization.
Q2. Setup event notifier and handler to handle I/O completion.

... ...

Q3. Intercept a ssch instruction.
Q4. Write the guest channel program and ORB to the I/O region.
    K1. Copy from guest to kernel.
    K2. Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3. With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4. Return the ssch CC code.
Q5. Return the CC code to the guest.

... ...

    K5. Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6. Signal QEMU to retrieve the result.
Q6. Get the signal and event handler reads out the result from the I/O
    region.
Q7. Update the irb for the guest.

Limitations
-----------

The current vfio-ccw implementation focuses on supporting the basic
commands needed to implement block device functionality (read/write) of
DASD/ECKD devices only. Some commands may need special handling in the
future, for example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can now bring the
passed-through DASD/ECKD device online in a guest and use it as a block
device.

While the current code allows the guest to start channel programs via
START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is
not yet implemented.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.txt
5. Documentation/vfio.txt
6. Documentation/vfio-mediated-device.txt
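
Example: preparing the I/O region
---------------------------------

As an illustration of the region layout given in the "vfio-ccw I/O
region" section, the following is a minimal user space sketch, not the
kernel or QEMU implementation. It mirrors struct ccw_io_region with
stdint types and a compiler packed attribute (both stand-ins assumed
here for __u8/__u32 and __packed), and the helper
ccw_io_region_prepare() is a hypothetical name introduced only for
this example. The ORB and SCSW are treated as opaque byte buffers.

```c
#include <stdint.h>
#include <string.h>

/* Sizes taken from the region definition in this document. */
#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/*
 * User space mirror of struct ccw_io_region. uint8_t/uint32_t and the
 * GCC/Clang packed attribute are assumptions standing in for the
 * kernel's __u8/__u32 and __packed.
 */
struct ccw_io_region {
    uint8_t  orb_area[ORB_AREA_SIZE];
    uint8_t  scsw_area[SCSW_AREA_SIZE];
    uint8_t  irb_area[IRB_AREA_SIZE];
    uint32_t ret_code;
} __attribute__((packed));

/* Layout check: 12 + 12 + 96 + 4 bytes, no padding. */
_Static_assert(sizeof(struct ccw_io_region) == 124,
               "unexpected ccw_io_region size");

/*
 * Hypothetical helper: build a region image for starting an I/O request.
 * orb  - the guest ORB (ORB_AREA_SIZE bytes)
 * scsw - the SCSW of the virtual subchannel (SCSW_AREA_SIZE bytes)
 * irb_area and ret_code are cleared, since they carry results only.
 * Returns 0 on success, -1 if any argument is NULL.
 */
int ccw_io_region_prepare(struct ccw_io_region *region,
                          const uint8_t *orb,
                          const uint8_t *scsw)
{
    if (!region || !orb || !scsw)
        return -1;

    memset(region, 0, sizeof(*region));
    memcpy(region->orb_area, orb, ORB_AREA_SIZE);
    memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
    return 0;
}
```

A user space program would then write the prepared image to the vfio
device at the region offset reported by VFIO_DEVICE_GET_REGION_INFO
(step Q4 above), and later read irb_area back after being signaled
(step Q6). Those steps need a real device and are left out of the
sketch.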