Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

 *	Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that pass
        addresses between kernel and user space use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 *	Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 *	sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        AF_RDS and PF_RDS are the domain type to be used with socket(2)
        to create RDS sockets. SOL_RDS is the socket-level to be used
        with setsockopt(2) and getsockopt(2) for RDS-specific socket
        options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDBUF bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVBUF option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and to a
        transport, if one has not already been selected via the
        SO_RDS_TRANSPORT socket option.

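        As an illustration, a minimal sketch of socket creation and
        binding might look as follows (the address and port are made-up
        examples; this assumes a kernel and libc that know about
        PF_RDS):

            #include <stdio.h>
            #include <string.h>
            #include <arpa/inet.h>
            #include <netinet/in.h>
            #include <sys/socket.h>

            int main(void)
            {
                    struct sockaddr_in sin;
                    int fd;

                    /* SOCK_SEQPACKET: RDS preserves message boundaries. */
                    fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
                    if (fd < 0) {
                            perror("socket");
                            return 1;
                    }

                    memset(&sin, 0, sizeof(sin));
                    sin.sin_family = AF_INET;
                    sin.sin_addr.s_addr = inet_addr("10.0.0.1"); /* example IP */
                    sin.sin_port = htons(4000);                  /* example port */

                    /* Binding also selects and attaches the transport. */
                    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                            perror("bind");
                            return 1;
                    }
                    return 0;
            }
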
  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDBUF will
        fail with EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDBUF threshold will fail with
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will fail with ENOBUFS.

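        A hedged sketch of a single send that distinguishes the error
        cases above (the helper name is illustrative, not part of the
        RDS API):

            #include <errno.h>
            #include <string.h>
            #include <netinet/in.h>
            #include <sys/socket.h>
            #include <sys/uio.h>

            /* Send one datagram on bound RDS socket 'fd' to 'dest'. */
            static int rds_send_one(int fd, struct sockaddr_in *dest,
                                    void *buf, size_t len)
            {
                    struct iovec iov = { .iov_base = buf, .iov_len = len };
                    struct msghdr msg;

                    memset(&msg, 0, sizeof(msg));
                    msg.msg_name = dest;            /* per-message destination */
                    msg.msg_namelen = sizeof(*dest);
                    msg.msg_iov = &iov;
                    msg.msg_iovlen = 1;

                    if (sendmsg(fd, &msg, 0) >= 0)
                            return 0;               /* queued */
                    switch (errno) {
                    case EMSGSIZE: /* larger than SO_SNDBUF: will never fit */
                    case EAGAIN:   /* send queue full: poll and retry later */
                    case ENOBUFS:  /* destination congested: back off */
                    default:
                            return -1;
                    }
            }
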
  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        receive queue accounting is adjusted, and if the queue length
        drops below SO_RCVBUF, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in the manpages.

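        A hedged sketch of a receive that also walks any attached
        notifications (the helper name is illustrative; the exact cmsg
        types are documented in rds(7) and <linux/rds.h>):

            #include <string.h>
            #include <sys/socket.h>
            #include <linux/rds.h>

            /* Receive one datagram plus any RDS control messages. */
            static ssize_t rds_recv_one(int fd, void *buf, size_t len)
            {
                    char cbuf[256];         /* room for notifications */
                    struct iovec iov = { .iov_base = buf, .iov_len = len };
                    struct msghdr msg;
                    struct cmsghdr *cmsg;
                    ssize_t n;

                    memset(&msg, 0, sizeof(msg));
                    msg.msg_iov = &iov;
                    msg.msg_iovlen = 1;
                    msg.msg_control = cbuf;
                    msg.msg_controllen = sizeof(cbuf);

                    n = recvmsg(fd, &msg, 0);
                    if (n < 0)
                            return n;

                    /* Walk any notifications the kernel attached. */
                    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                            if (cmsg->cmsg_level != SOL_RDS)
                                    continue;
                            /* cmsg->cmsg_type identifies the notification,
                             * e.g. a congestion update or RDMA completion. */
                    }
                    return n;
            }
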
  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        it - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

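        A sketch of that pattern (a fragment, not a complete program;
        'fd' is a bound RDS socket and 'msg' a prepared struct msghdr
        as in the sendmsg example above):

            for (;;) {
                    struct pollfd pfd = { .fd = fd,
                                          .events = POLLIN | POLLOUT };

                    if (poll(&pfd, 1, -1) < 0)
                            break;                  /* hard error */
                    if (pfd.revents & POLLOUT) {
                            if (sendmsg(fd, &msg, 0) >= 0)
                                    break;          /* queued successfully */
                            if (errno != ENOBUFS && errno != EAGAIN)
                                    break;          /* hard error */
                            /* ENOBUFS: destination congested, keep going. */
                    }
                    /* POLLIN may be a congestion update or data; drain
                     * the receive queue (see recvmsg above) before
                     * retrying the send. */
            }
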
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This can be used to cancel outstanding messages if the
        application detects a timeout. For instance, if it tried to
        send a message, and the remote host is unreachable, RDS will
        keep trying forever. The application may decide it's not worth
        it, and cancel the operation. In this case, it would use
        RDS_CANCEL_SENT_TO to nuke any pending messages.

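        For example (a fragment; 'dest' is the sockaddr_in previously
        passed to sendmsg for the destination being cancelled):

            if (setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                           &dest, sizeof(dest)) < 0)
                    perror("RDS_CANCEL_SENT_TO");
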
  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
        Set or read an integer defining the underlying
        encapsulating transport to be used for RDS packets on the
        socket. When setting the option, the integer argument may be
        one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
        value, RDS_TRANS_NONE will be returned on an unbound socket.
        This socket option may only be set exactly once on the socket,
        prior to binding it via the bind(2) system call. Attempts to
        set SO_RDS_TRANSPORT on a socket for which the transport has
        been previously attached explicitly (by SO_RDS_TRANSPORT) or
        implicitly (via bind(2)) will return an error of EOPNOTSUPP.
        An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
        always return EINVAL.

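        For example, pinning a socket to the TCP transport before
        bind(2) (a fragment; the constants come from <linux/rds.h>):

            int trans = RDS_TRANS_TCP;

            if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                           &trans, sizeof(trans)) < 0)
                    perror("SO_RDS_TRANSPORT"); /* e.g. EOPNOTSUPP if a
                                                 * transport is already
                                                 * attached */
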
RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          CONG_BITMAP - this is a congestion update bitmap
          ACK_REQUIRED - receiver must ack this packet
          RETRANSMITTED - packet has previously been sent
      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          reserved for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.

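    Put together, the field list above sketches a layout roughly like
    the following (a sketch only; the authoritative definition of
    'struct rds_header', including exact field widths, is in rds.h):

        struct rds_header {
                __be64  h_sequence;     /* per-packet sequence number */
                __be64  h_ack;          /* piggybacked ack */
                __be32  h_len;          /* payload length, sans header */
                __be16  h_sport;        /* source port */
                __be16  h_dport;        /* destination port */
                u8      h_flags;        /* CONG_BITMAP, ACK_REQUIRED, ... */
                u8      h_credit;       /* credits returned to the peer */
                u8      h_padding[4];   /* reserved for future use */
                __sum16 h_csum;         /* header checksum */
                u8      h_exthdr[16];   /* optional extension space
                                         * (size here is a sketch) */
        };
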
  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received.  The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory.  This opens a window for message loss if the HCA is
      disabled for any reason between sending the ack and DMAing and
      processing the message.  This is only a potential issue if another
      HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets.  Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.

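      In outline, the scheme works as sketched below (illustrative
      pseudo-C, not the kernel code; the helper is hypothetical).  One
      credit corresponds to one receive buffer posted at the peer, and
      newly posted buffers are advertised back in h_credit:

          if (conn->send_credits == 0)
                  return -EAGAIN;       /* no buffer at peer: hold off */
          conn->send_credits--;         /* consume one posted buffer */
          /* return any credits we have newly granted to the peer */
          hdr->h_credit = newly_posted_recv_buffers(conn);
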
  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value.  Only
      the payload bytes in the message are accounted for.  If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested.  All sends attempted to this socket's address
      should then block or return EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs.  An application that does encounter
      this "back-pressure" is considered buggy.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested.  As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up.  This
      avoids allocation in the interrupt handling path which queues
      messages on sockets.  The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently.  This
      is much easier to implement than some finer-grained
      communication of per-port congestion.  The sender does a very
      inexpensive bit test to test if the port it's about to send to
      is congested or not.
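
      That bit test amounts to something like the following (a sketch,
      not the kernel's implementation; the bitmap covers the full
      16-bit port space of the destination address):

          /* One bit per port: nonzero if 'port' is congested. */
          static int port_congested(const unsigned char *map,
                                    unsigned short port)
          {
                  return map[port / 8] & (1 << (port % 8));
          }
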
RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
  struct rds_socket
    per-socket information
  struct rds_connection
    per-connection information
  struct rds_transport
    pointers to transport-specific functions
  struct rds_statistics
    non-transport-specific statistics
  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.
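
  Expressed as code, that state set corresponds to an enum along these
  lines (a sketch; the kernel's own constants live in net/rds/rds.h
  and may differ in naming and order):

      enum rds_conn_state {
              RDS_CONN_DOWN,          /* no connection established */
              RDS_CONN_CONNECTING,    /* first send triggered connect */
              RDS_CONN_DISCONNECTING, /* tearing down before reconnect */
              RDS_CONN_ERROR,         /* transport error on connection */
              RDS_CONN_UP,            /* established, sends may flow */
      };
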
The send path
=============

  rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection allocated and connected if not already
    rds_message placed on send queue
    send worker awoken
  rds_send_worker()
    calls rds_send_xmit() until queue is empty
  rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
  rds_ib_xmit()
    allocates work requests from send ring
    adds any new send credits available to peer (h_credits)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    posts send to connection's queue pair

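  Reduced to illustrative pseudo-C (hypothetical helpers, not the
  kernel source), the loop driven by the send worker has roughly this
  shape:

      /* Illustrative shape of rds_send_xmit(), not the kernel code. */
      while (!queue_empty(&conn->send_queue)) {
              if (conn->pending_cong_map)
                      xmit_cong_map(conn);       /* bitmap goes first */
              msg = queue_head(&conn->send_queue);
              if (send_buffers_filling(conn))
                      msg->hdr.h_flags |= ACK_REQUIRED;
              conn->transport->xmit(conn, msg);  /* e.g. rds_ib_xmit() */
      }
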
The recv path
=============

  rds_ib_recv_cq_comp_handler()
    looks at recv completions
    unmaps recv buffer from device
    if no errors, calls rds_ib_process_recv()
    refills recv ring
  rds_ib_process_recv()
    validates header checksum
    copies header to rds_ib_incoming struct if start of a new datagram
    adds to ibinc's fraglist
    if completed datagram:
      updates cong map if datagram was cong update
      calls rds_recv_incoming() otherwise
      notes if ack is required
  rds_recv_incoming()
    drops duplicate packets
    responds to pings
    finds the sock associated with this datagram
    adds to sock queue
    wakes up sock
    does some congestion calculations
  rds_recvmsg()
    copies data into user iovec
    handles CMSGs
    returns to application

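  The core of rds_recv_incoming() can be pictured like this
  (illustrative pseudo-C with hypothetical helpers, not the kernel
  source):

      /* Drop what was already delivered, answer pings, then queue. */
      if (be64_to_cpu(inc->i_hdr.h_sequence) < conn->c_next_rx_seq)
              return;                   /* duplicate, already delivered */
      if (inc->i_hdr.h_dport == 0)
              send_pong(conn);          /* a ping targets dest port 0 */
      else
              queue_and_wake(find_rds_sock(inc), inc);
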
Multipath RDS (mprds)
=====================
  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP is implemented by multiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved. This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it
  (a) is upper-bounded by the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqueues and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport-private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.

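  A sketch of what that path selection amounts to (illustrative only;
  the hash function and field names are stand-ins, not the kernel's):

      /* Hash the bound endpoint onto one of the negotiated paths. */
      static struct rds_conn_path *pick_path(struct rds_connection *conn,
                                             __u32 bound_addr,
                                             __u16 bound_port)
      {
              __u32 h = jhash_2words(bound_addr, bound_port, 0);

              return &conn->c_path[h % conn->c_npaths];
      }
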
  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., a packet to RDS dest
  port 0) with the ping packet having an RDS extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored.  In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.
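
  In other words, the negotiation boils down to something like the
  following on receipt of the probe (a sketch, not the kernel code):

      /* Agree on the smaller path count; without an NPATHS exthdr
       * from the peer, fall back to a single path. */
      if (peer_sent_npaths_exthdr)
              conn->c_npaths = min(local_npaths, remote_npaths);
      else
              conn->c_npaths = 1;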