==================================
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 defines a unified I/O access
method called Channel I/O. It has its own access patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group so that it can be managed by the vfio
framework. We also add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device), which does further address
translation and performs the I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information can be found here:

- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem is also a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For the vfio mediated device framework:

- Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. Most devices on s390 use the standard
Channel I/O based mechanism, and we also need to be able to pass them
through to a QEMU virtual machine. This includes devices that don't
have a virtio counterpart (e.g. tape drives) or that have specific
characteristics which guests want to exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Although the s390 hardware platform knows about a huge variety
of different peripheral attachments like disk devices (aka DASDs),
tapes, communication controllers, etc., they can all be accessed by a
well-defined access method, and they present I/O completion in a
unified way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which specifies the format of
the CCWs and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with an SSCH (start subchannel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).
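
As a rough illustration of these terms, the sketch below models a
format-1 CCW (command code, flags, count, data address) and counts the
CCWs of a channel program by following the chaining flags. The names
sketch_ccw1 and sketch_chain_len are made up for this example and are
not kernel interfaces; branching via TIC (transfer in channel) is
deliberately ignored::

  #include <stddef.h>
  #include <stdint.h>

  #define SKETCH_CCW_FLAG_DC 0x80 /* data chaining */
  #define SKETCH_CCW_FLAG_CC 0x40 /* command chaining */

  /* Format-1 CCW: command code, flags, byte count, data address. */
  struct sketch_ccw1 {
          uint8_t  cmd_code;
          uint8_t  flags;
          uint16_t count;
          uint32_t cda;
  } __attribute__((packed, aligned(8)));

  /*
   * Count the CCWs of a channel program stored in ccw[0..max-1].
   * The program ends at the first CCW that has neither chaining
   * flag set; at most max CCWs are inspected.
   */
  static size_t sketch_chain_len(const struct sketch_ccw1 *ccw, size_t max)
  {
          size_t n = 0;

          while (n < max) {
                  uint8_t flags = ccw[n].flags;

                  n++;
                  if (!(flags & (SKETCH_CCW_FLAG_DC | SKETCH_CCW_FLAG_CC)))
                          break;
          }
          return n;
  }

A program of three CCWs in which only the first two have the
command-chaining flag set is reported as having length three.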

Back to vfio-ccw, in short:

- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions. The
  result is copied as an IRB to user space, which passes it back to the
  guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with an mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:

- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers with the mdev framework as
  a parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev can then
  be created by vfio_ccw and added to the mediated bus. It is the vfio
  device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and to store I/O interrupt results for user
  space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself with the mdev framework as an mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For an mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database
  of the iova<->vaddr mappings in this operation. The vfio iommu
  backend exports the vfio_pin_pages and vfio_unpin_pages interfaces
  for the physical devices to pin and unpin pages on demand.
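
The "store the translations, resolve them later" idea can be pictured
with a toy lookup table. This is only a user-space model written for
this document: struct sketch_mapping and sketch_iova_to_vaddr are
hypothetical names, and the real backend uses its own data structures
and additionally pins the pages::

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* One recorded mapping: [iova, iova + size) -> [vaddr, vaddr + size). */
  struct sketch_mapping {
          uint64_t iova;
          uint64_t vaddr;
          uint64_t size;
  };

  /*
   * Translate iova using the recorded mappings. Returns true and
   * stores the result in *vaddr if some mapping covers iova.
   */
  static bool sketch_iova_to_vaddr(const struct sketch_mapping *maps,
                                   size_t nr, uint64_t iova,
                                   uint64_t *vaddr)
  {
          for (size_t i = 0; i < nr; i++) {
                  if (iova >= maps[i].iova &&
                      iova - maps[i].iova < maps[i].size) {
                          *vaddr = maps[i].vaddr + (iova - maps[i].iova);
                          return true;
                  }
          }
          return false;
  }

An address that no recorded mapping covers is rejected, which mirrors
the fact that only memory mapped beforehand can be used for I/O.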

Below is a high-level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together:

1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) with the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks with the mdev framework. Mdev related file
   nodes under the device node in sysfs are then created for the
   subchannel device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.


VFIO-CCW Regions
----------------

The vfio-ccw driver exposes MMIO regions to accept requests from and return
results to userspace.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and to store I/O interrupt results for user space to retrieve. The
definition of the region is::

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;

When starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the virtual
subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region.

This region is always available.
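
A minimal user-space sketch of starting a request through this region
might look as follows. It redeclares the structure with fixed-width
types so that it is self-contained; sketch_submit_io is a hypothetical
helper, and region_offset is assumed to be the region's offset within
the device file descriptor as reported by VFIO_DEVICE_GET_REGION_INFO.
How the driver reports errors for such a write is not modelled here::

  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define ORB_AREA_SIZE 12
  #define SCSW_AREA_SIZE 12
  #define IRB_AREA_SIZE 96

  /* Same layout as the region definition above. */
  struct ccw_io_region {
          uint8_t  orb_area[ORB_AREA_SIZE];
          uint8_t  scsw_area[SCSW_AREA_SIZE];
          uint8_t  irb_area[IRB_AREA_SIZE];
          uint32_t ret_code;
  } __attribute__((packed));

  /*
   * Fill orb_area and scsw_area and write the whole region at
   * region_offset. Returns 0 if the full region was written and
   * -1 otherwise.
   */
  static int sketch_submit_io(int fd, off_t region_offset,
                              const uint8_t orb[ORB_AREA_SIZE],
                              const uint8_t scsw[SCSW_AREA_SIZE])
  {
          struct ccw_io_region region;
          ssize_t n;

          memset(&region, 0, sizeof(region));
          memcpy(region.orb_area, orb, ORB_AREA_SIZE);
          memcpy(region.scsw_area, scsw, SCSW_AREA_SIZE);

          n = pwrite(fd, &region, sizeof(region), region_offset);
          return n == (ssize_t)sizeof(region) ? 0 : -1;
  }

With the sizes above, the packed region occupies 12 + 12 + 96 + 4 = 124
bytes.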

vfio-ccw cmd region
-------------------

The vfio-ccw cmd region is used to accept asynchronous instructions
from userspace::

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.
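
As a small sketch of how user space could prepare such a request, the
helper below fills a command region for either instruction. The enum
and the function sketch_make_cmd are invented for this example; the
structure is redeclared with fixed-width types to keep the snippet
self-contained::

  #include <stdint.h>

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

  /* Same layout as the region definition above. */
  struct ccw_cmd_region {
          uint32_t command;
          uint32_t ret_code;
  } __attribute__((packed));

  /* Hypothetical selector for the two supported instructions. */
  enum sketch_async_op {
          SKETCH_OP_HALT,
          SKETCH_OP_CLEAR,
  };

  /* Build the region contents for a HALT or CLEAR SUBCHANNEL request. */
  static struct ccw_cmd_region sketch_make_cmd(enum sketch_async_op op)
  {
          struct ccw_cmd_region region = {
                  .command = (op == SKETCH_OP_CLEAR) ?
                             VFIO_CCW_ASYNC_CMD_CSCH :
                             VFIO_CCW_ASYNC_CMD_HSCH,
                  .ret_code = 0,
          };

          return region;
  }

The resulting eight bytes would then be written to the cmd region of
the device file descriptor.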

vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (prefixed with `cp_`) to do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses. These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by replacing the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls::

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, which does further CCW translation
  before issuing it to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier that notifies the user space program of the I/O completion
  asynchronously.
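
The note on IDALs in the CCW translation item can be made concrete with
a little arithmetic. An IDAL (indirect data address list) replaces the
single data address of a CCW with a list of indirect data address words
(IDAWs), each covering at most one block. The sketch below computes how
many IDAWs a buffer needs, assuming 4K blocks; the block size and the
name sketch_idaw_count are assumptions of this example and are not taken
from the driver::

  #include <stdint.h>

  #define SKETCH_IDA_BLOCK_SIZE 4096u /* assumed block size per IDAW */

  /*
   * Number of IDAWs needed for the buffer [addr, addr + len): one
   * per block that the buffer touches. A zero-length buffer needs
   * none.
   */
  static uint32_t sketch_idaw_count(uint64_t addr, uint32_t len)
  {
          uint64_t first, last;

          if (len == 0)
                  return 0;
          first = addr / SKETCH_IDA_BLOCK_SIZE;
          last = (addr + len - 1) / SKETCH_IDA_BLOCK_SIZE;
          return (uint32_t)(last - first + 1);
  }

For example, a 4096-byte buffer starting at 0x1800 crosses one block
boundary and therefore needs two IDAWs, while the same buffer starting
at 0x1000 needs only one.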

The use of vfio-ccw is not limited to QEMU, but QEMU is a good example
for understanding how the pieces work together. Here is a little more
detail on how an I/O request triggered by the QEMU guest is handled
(without error handling).

Explanation:

- Q1-Q7: QEMU side process.
- K1-K6: Kernel side process.

Q1.
    Get I/O region info during initialization.

Q2.
    Set up an event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept an ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
        Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal and event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.
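
Steps Q2 and Q6 can be sketched from the user-space side roughly as
follows. This is not QEMU's code: sketch_enable_io_irq and
sketch_wait_irb are hypothetical helpers, the region structure is
redeclared locally, region_offset is assumed to come from
VFIO_DEVICE_GET_REGION_INFO, and VFIO_CCW_IO_IRQ_INDEX is assumed to be
provided by the installed linux/vfio.h header::

  #include <linux/vfio.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define ORB_AREA_SIZE 12
  #define SCSW_AREA_SIZE 12
  #define IRB_AREA_SIZE 96

  struct ccw_io_region {
          uint8_t  orb_area[ORB_AREA_SIZE];
          uint8_t  scsw_area[SCSW_AREA_SIZE];
          uint8_t  irb_area[IRB_AREA_SIZE];
          uint32_t ret_code;
  } __attribute__((packed));

  /* Q2: ask the kernel to signal efd when an I/O result is ready. */
  static int sketch_enable_io_irq(int device_fd, int efd)
  {
          size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
          struct vfio_irq_set *set = calloc(1, argsz);
          int32_t fd = efd;
          int ret;

          if (!set)
                  return -1;
          set->argsz = argsz;
          set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
                       VFIO_IRQ_SET_ACTION_TRIGGER;
          set->index = VFIO_CCW_IO_IRQ_INDEX;
          set->start = 0;
          set->count = 1;
          memcpy(set->data, &fd, sizeof(fd));

          ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
          free(set);
          return ret;
  }

  /* Q6: block until signaled, then read the region and copy the IRB. */
  static int sketch_wait_irb(int efd, int device_fd, off_t region_offset,
                             uint8_t irb[IRB_AREA_SIZE])
  {
          struct ccw_io_region region;
          uint64_t events;

          if (read(efd, &events, sizeof(events)) != (ssize_t)sizeof(events))
                  return -1;
          if (pread(device_fd, &region, sizeof(region), region_offset) !=
              (ssize_t)sizeof(region))
                  return -1;
          memcpy(irb, region.irb_area, IRB_AREA_SIZE);
          return 0;
  }

A real consumer would typically add the eventfd to its main loop rather
than block in read(), and would then use the IRB to update the guest's
view of the subchannel (Q7).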

Limitations
-----------

The current vfio-ccw implementation focuses on supporting basic commands
needed to implement block device functionality (read/write) of DASD/ECKD
devices only. Some commands may need special handling in the future, for
example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can now bring the
passed-through DASD/ECKD device online in a guest and use it as a block
device.

The current code allows the guest to start channel programs via
START SUBCHANNEL, and to issue HALT SUBCHANNEL and CLEAR SUBCHANNEL.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.rst
5. Documentation/driver-api/vfio.rst
6. Documentation/driver-api/vfio-mediated-device.rst