Kernel Connection Multiplexor
-----------------------------

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                |                |                |
      +-----------+    |                |     +----------+
                  |    |                |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |  |            |            |  |
      +----------+  |            |            |  +-----------+
      |             |            |            |              |
+----------+  +----------+  +----------+  +----------+  +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
+----------+  +----------+  +----------+  +----------+  +----------+
      |             |            |            |              |
+----------+  +----------+  +----------+  +----------+  +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
+----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
-----------

The KCM sockets provide the user interface to the multiplexor. All the KCM
sockets bound to a multiplexor are considered to have equivalent function, and
I/O operations on different sockets may be done in parallel without the need
for synchronization between threads in userspace.

Multiplexor
-----------

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
--------------------

TCP sockets may be bound to a KCM multiplexor.
A Psock structure is allocated for each bound TCP socket; this structure holds
the state for constructing messages on receive as well as other connection
specific information for KCM.

Connected mode semantics
------------------------

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
------------

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket.
When the application gets the error notification for a TCP socket, it should
unattach the socket from KCM and then handle the error condition (the typical
response is to close the socket and create a new connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call:

  socket(AF_KCM, type, protocol)

  - type is either SOCK_DGRAM or SOCK_SEQPACKET
  - protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket:

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
    newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor.
For example:

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:
  fd: file descriptor for the TCP socket being attached
  bpf_fd: file descriptor for the compiled BPF program that was downloaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done on a KCM socket with the kcm_unattach structure as
the argument:

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;      /* TCP socket to unattach */

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use:

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is:

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       /* Message length: 32-bit length field plus the 4 bytes it occupies */
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols.
Specifically, it allows applications to use a message based interface for
sending and receiving messages. The kernel provides necessary assurances that
messages are sent and received atomically. This relieves much of the burden
applications have in mapping a message based protocol onto the TCP stream. KCM
also makes application layer messages a unit of work in the kernel for the
purposes of steering and scheduling, which in turn allows a simpler networking
model in multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and inserted into the epoll set (similar to how
SO_REUSEPORT is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.
  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
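
As an illustration of the monitoring described above, the following is a
minimal sketch of how an application might read the retransmission count of
an attached TCP socket through the file descriptor it retains. It assumes the
Linux TCP_INFO socket option and the tcpi_total_retrans field of struct
tcp_info from netinet/tcp.h. The helper name tcp_total_retrans is
hypothetical and is not part of KCM.

```c
#define _GNU_SOURCE
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* struct tcp_info, TCP_INFO */
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of KCM): read the total number of
 * retransmissions recorded on a TCP socket.
 *
 * Returns 0 on success and stores the count in *retrans. Returns -1 on
 * failure, leaving *retrans untouched and errno set by getsockopt().
 */
int tcp_total_retrans(int tcpfd, unsigned int *retrans)
{
	struct tcp_info info;
	socklen_t len = sizeof(info);

	memset(&info, 0, sizeof(info));

	if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
		return -1;

	*retrans = info.tcpi_total_retrans;
	return 0;
}
```

A monitoring thread could call such a helper periodically for each attached
TCP socket and compare successive values. What counts as a high
retransmission rate is left to the application.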