1Kernel Connection Multiplexor
2-----------------------------
3
Kernel Connection Multiplexor (KCM) is a mechanism that provides a message-based
interface over TCP for generic application protocols. With KCM, an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.
8
9KCM implements an NxM multiplexor in the kernel as diagrammed below:
10
11+------------+   +------------+   +------------+   +------------+
12| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
13+------------+   +------------+   +------------+   +------------+
14      |                 |               |                |
15      +-----------+     |               |     +----------+
16                  |     |               |     |
17               +----------------------------------+
18               |           Multiplexor            |
19               +----------------------------------+
20                 |   |           |           |  |
21       +---------+   |           |           |  ------------+
22       |             |           |           |              |
23+----------+  +----------+  +----------+  +----------+ +----------+
24|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
25+----------+  +----------+  +----------+  +----------+ +----------+
26      |              |           |            |             |
27+----------+  +----------+  +----------+  +----------+ +----------+
28| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
29+----------+  +----------+  +----------+  +----------+ +----------+
30
31KCM sockets
32-----------
33
The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered functionally equivalent, and I/O
operations on different sockets may be done in parallel without the need for
synchronization between threads in userspace.
38
39Multiplexor
40-----------
41
42The multiplexor provides the message steering. In the transmit path, messages
43written on a KCM socket are sent atomically on an appropriate TCP socket.
44Similarly, in the receive path, messages are constructed on each TCP socket
45(Psock) and complete messages are steered to a KCM socket.
46
47TCP sockets & Psocks
48--------------------
49
TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket. This structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.
53
54Connected mode semantics
55------------------------
56
Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.
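
As an illustration of these semantics, the following is a minimal sketch of
sending and receiving one message per call. The helper names send_framed and
recv_framed are hypothetical, and the 4-byte big-endian length header is an
assumed application framing (KCM itself does not mandate one).

```c
#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send one application message as a single datagram.
 * Assumed framing: 4-byte big-endian payload length, then the payload.
 * Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t send_framed(int kcmfd, const void *payload, uint32_t len)
{
	uint32_t hdr = htonl(len);
	struct iovec iov[2] = {
		{ .iov_base = &hdr, .iov_len = sizeof(hdr) },
		{ .iov_base = (void *)payload, .iov_len = len },
	};
	struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

	/* Header and payload go out together as one message. */
	return sendmsg(kcmfd, &msg, 0);
}

/* Receive one message (header included) into buf. */
ssize_t recv_framed(int kcmfd, void *buf, size_t buflen)
{
	return recv(kcmfd, buf, buflen, 0);
}
```

The header is part of the message handed to the socket, since the receiving
side needs it in the TCP stream to find message boundaries.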
61
62Socket types
63------------
64
65KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
66
67Message delineation
68-------------------
69
70Messages are sent over a TCP stream with some application protocol message
71format that typically includes a header which frames the messages. The length
72of a received message can be deduced from the application protocol header
73(often just a simple length field).
74
75A TCP stream must be parsed to determine message boundaries. Berkeley Packet
76Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
77BPF program must be specified. The program is called at the start of receiving
78a new message and is given an skbuff that contains the bytes received so far.
79It parses the message header and returns the length of the message. Given this
80information, KCM will construct the message of the stated length and deliver it
81to a KCM socket.
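
To make the length computation concrete, here is a userspace model of what a
length-prefix parser returns. It is not a BPF program; the function name
framed_msg_len and the 4-byte big-endian header are assumptions chosen for
illustration, and returning 0 for an incomplete header is a convention of this
sketch only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed header: 4-byte big-endian payload length at offset 0.
 * Returns the total message length (header plus payload), or 0 when
 * fewer than 4 bytes are available and the length is not yet known.
 */
size_t framed_msg_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload_len;

	if (avail < 4)
		return 0;

	payload_len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		      ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	return (size_t)payload_len + 4;
}
```

For a stream beginning with the bytes 00 00 00 05, the function reports a
9-byte message: 4 bytes of header followed by 5 bytes of payload.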
82
83TCP socket management
84---------------------
85
When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).
94
KCM limits the maximum receive message size to the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.
101
102A timeout may be set for assembling messages on a receive socket. The timeout
103value is taken from the receive timeout of the attached TCP socket (this is set
104by SO_RCVTIMEO). If the timer expires before assembly is complete an error
105(ETIMEDOUT) is posted on the socket.
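
Both limits are ordinary socket options on the TCP socket. A minimal sketch of
setting them is shown here; the helper name tcp_set_kcm_limits and its
parameters are hypothetical, and it presumably makes most sense to call it
before the socket is attached.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * max_msg_bytes: requested receive buffer size (SO_RCVBUF), which
 *                bounds the largest message KCM will assemble.
 * timeout_sec:   receive timeout (SO_RCVTIMEO), used as the message
 *                assembly timeout.
 * Returns 0 on success, -1 with errno set on failure.
 */
int tcp_set_kcm_limits(int tcpfd, int max_msg_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
		       &max_msg_bytes, sizeof(max_msg_bytes)) < 0)
		return -1;

	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Note that Linux adjusts the value passed to SO_RCVBUF (it is doubled and
capped by a system-wide maximum), so the effective limit can be read back with
getsockopt if the exact figure matters.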
106
107User interface
108==============
109
110Creating a multiplexor
111----------------------
112
A new multiplexor and its initial KCM socket are created by a socket call:
114
115  socket(AF_KCM, type, protocol)
116
117  - type is either SOCK_DGRAM or SOCK_SEQPACKET
118  - protocol is KCMPROTO_CONNECTED
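
A minimal sketch of the call with basic argument and error checking follows.
The wrapper name kcm_open is hypothetical, and the AF_KCM fallback definition
is only an assumption for C libraries whose headers predate KCM.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/kcm.h>   /* KCMPROTO_CONNECTED */

#ifndef AF_KCM
#define AF_KCM 41        /* fallback for older C library headers */
#endif

/*
 * Create a new multiplexor together with its first KCM socket.
 * Returns the socket fd, or -1 with errno set (for example when the
 * running kernel has no KCM support).
 */
int kcm_open(int type)
{
	/* Only the two socket types listed above are accepted. */
	if (type != SOCK_DGRAM && type != SOCK_SEQPACKET) {
		errno = EINVAL;
		return -1;
	}

	return socket(AF_KCM, type, KCMPROTO_CONNECTED);
}
```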
119
120Cloning KCM sockets
121-------------------
122
123After the first KCM socket is created using the socket call as described
124above, additional sockets for the multiplexor can be created by cloning
125a KCM socket. This is accomplished by an ioctl on a KCM socket:
126
  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
    newkcmfd = info.fd;
140
141Attach transport sockets
142------------------------
143
144Attaching of transport sockets to a multiplexor is performed by calling an
145ioctl on a KCM socket for the multiplexor. e.g.:
146
  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);
161
The kcm_attach structure contains:
  fd: file descriptor for the TCP socket being attached
  bpf_fd: file descriptor for the compiled BPF program that has been loaded
165
166Unattach transport sockets
167--------------------------
168
169Unattaching a transport socket from a multiplexor is straightforward. An
170"unattach" ioctl is done with the kcm_unattach structure as the argument:
171
  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;      /* TCP socket to unattach */

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);
184
185Disabling receive on KCM socket
186-------------------------------
187
188A setsockopt is used to disable or enable receiving on a KCM socket.
189When receive is disabled, any pending messages in the socket's
190receive buffer are moved to other sockets. This feature is useful
191if an application thread knows that it will be doing a lot of
192work on a request and won't be able to service new messages for a
193while. Example use:
194
  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));
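
Since the same option both disables and enables receive, a small wrapper can
cover both directions. This is a sketch only: the name kcm_set_recv_disabled
is hypothetical, and the SOL_KCM fallback value is an assumption for older C
library headers.

```c
#include <sys/socket.h>
#include <linux/kcm.h>   /* KCM_RECV_DISABLE */

#ifndef SOL_KCM
#define SOL_KCM 281      /* fallback for older C library headers */
#endif

/*
 * disabled != 0: stop receiving on this KCM socket.
 * disabled == 0: resume receiving.
 * Returns 0 on success, -1 with errno set on failure.
 */
int kcm_set_recv_disabled(int kcmfd, int disabled)
{
	int val = disabled ? 1 : 0;

	return setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE,
			  &val, sizeof(val));
}
```

A worker thread might call kcm_set_recv_disabled(kcmfd, 1) before starting a
long request and kcm_set_recv_disabled(kcmfd, 0) once it is ready for new
messages.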
198
BPF programs for message delineation
------------------------------------
201
202BPF programs can be compiled using the BPF LLVM backend. For example,
203the BPF program for parsing Thrift is:
204
  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";
215
216Use in applications
217===================
218
KCM accelerates application layer protocols. Specifically, it allows
applications to use a message-based interface for sending and receiving
messages. The kernel provides the necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message-based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.
227
228Configurations
229--------------
230
In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance, copyin and copyout of data are
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to that thread's epoll set (similar to how
SO_REUSEPORT is used to allow multiple listener sockets on the same port).
237
In an NxM configuration, multiple connections are established to the
same destination. These are used for simple load balancing.
240
241Message batching
242----------------
243
244The primary purpose of KCM is load balancing between KCM sockets and hence
245threads in a nominal use case. Perfect load balancing, that is steering
246each received message to a different KCM socket or steering each sent
247message to a different TCP socket, can negatively impact performance
248since this doesn't allow for affinities to be established. Balancing
249based on groups, or batches of messages, can be beneficial for performance.
250
On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket:
  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.
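
The second method can be sketched as follows. The helper name send_batch is
hypothetical, each iovec is assumed to hold one complete application message,
and the MSG_BATCH fallback value is an assumption for older C library headers.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000   /* fallback for older C library headers */
#endif

/*
 * Send `count` messages as one batch: every message except the last
 * carries MSG_BATCH. Returns the number of messages sent, or -1 with
 * errno set if the very first send fails.
 */
int send_batch(int kcmfd, struct iovec *msgs, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		struct msghdr mh = { .msg_iov = &msgs[i], .msg_iovlen = 1 };
		int flags = (i + 1 < count) ? MSG_BATCH : 0;

		if (sendmsg(kcmfd, &mh, flags) < 0)
			return i ? (int)i : -1;
	}

	return (int)count;
}
```

The last message is sent without MSG_BATCH, which marks the end of the group.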
258
On receive, the KCM module attempts to queue all messages received during
a single TCP receive ready callback on the same KCM socket. The targeted
KCM socket changes at each receive ready callback. The application
does not need to configure this.
263
264Error handling
265--------------
266
An application should include a thread to monitor errors raised on
the TCP connections. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect), it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
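
A monitoring thread along these lines could be sketched as below. The function
names monitor_tcp_sock and reap_failed_socks are hypothetical, and the sketch
treats any reported event (error or hangup) as fatal for the connection.

```c
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sockios.h>  /* SIOCPROTOPRIVATE, used by SIOCKCM* */
#include <linux/kcm.h>      /* struct kcm_unattach, SIOCKCMUNATTACH */

/* Register an attached TCP socket for error notification only. */
int monitor_tcp_sock(int epfd, int tcpfd)
{
	struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

	return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Wait up to timeout_ms for failures, then unattach and close each
 * failed TCP socket. Returns the number of sockets handled, or -1
 * if epoll_wait itself fails.
 */
int reap_failed_socks(int epfd, int kcmfd, int timeout_ms)
{
	struct epoll_event evs[16];
	int i, n = epoll_wait(epfd, evs, 16, timeout_ms);

	for (i = 0; i < n; i++) {
		struct kcm_unattach info = { .fd = evs[i].data.fd };

		/* Unattach first, then close, as described above. */
		ioctl(kcmfd, SIOCKCMUNATTACH, &info);
		close(info.fd);
	}

	return n;
}
```

Whether to open a replacement connection after closing is left to the
application.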
276
277TCP connection monitoring
278-------------------------
279
In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except when only one TCP socket
is attached). However, the application does retain
an open file descriptor to the socket, so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
286