1Thrift Compact protocol encoding 2================================ 3 4<!-- 5-------------------------------------------------------------------- 6 7Licensed to the Apache Software Foundation (ASF) under one 8or more contributor license agreements. See the NOTICE file 9distributed with this work for additional information 10regarding copyright ownership. The ASF licenses this file 11to you under the Apache License, Version 2.0 (the 12"License"); you may not use this file except in compliance 13with the License. You may obtain a copy of the License at 14 15 http://www.apache.org/licenses/LICENSE-2.0 16 17Unless required by applicable law or agreed to in writing, 18software distributed under the License is distributed on an 19"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 20KIND, either express or implied. See the License for the 21specific language governing permissions and limitations 22under the License. 23 24-------------------------------------------------------------------- 25--> 26 27This documents describes the wire encoding for RPC using the Thrift *compact protocol*. 28 29The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and 30[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementation however, 31should behave the same. 32 33For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf). 34 35# Contents 36 37* Compact protocol 38 * Base types 39 * Message 40 * Struct 41 * List and Set 42 * Map 43* BNF notation used in this document 44 45# Compact protocol 46 47## Base types 48 49### Integer encoding 50 51The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, and the _var int_. 52 53Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative 54numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1, 55-2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back: 56 57```scala 58def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31) 59def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1) 60def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63) 61def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1) 62``` 63 64The zigzag int is then encoded as a *var int*, also known as *Unsigned LEB128*. Var ints take 1 to 5 bytes (int32) or 651 to 10 bytes (int64). The process consists in taking a Big Endian unsigned integer, left-padding the bit-string to 66make it a multiple of 7 bits, splitting it into 7-bit groups, prefixing the most-significant 7-bit group with the 0 67bit, prefixing the remaining 7-bit groups with the 1 bit and encoding the resulting bit-string in Little Endian. 68 69For example, the integer 50399 is encoded as follows: 70 71``` 7250399 = 1100 0100 1101 1111 (Big Endian representation) 73 = 00000 1100 0100 1101 1111 (Left-padding) 74 = 0000011 0001001 1011111 (7-bit groups) 75 = 00000011 10001001 11011111 (Most-significant bit prefixes) 76 = 11011111 10001001 00000011 (Little Endian representation) 77 = 0xDF 0x89 0x03 78``` 79 80Var ints are sometimes used directly inside the compact protocol to represent positive numbers. 81 82To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply 83uses a single byte as in the binary protocol. 84 85### Enum encoding 86 87The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32. 88 89### Binary encoding 90 91Binary is sent as follows: 92 93``` 94Binary protocol, binary data, 1+ bytes: 95+--------+...+--------+--------+...+--------+ 96| byte length | bytes | 97+--------+...+--------+--------+...+--------+ 98``` 99 100Where: 101 102* `byte length` is the length of the byte array, using var int encoding (must be >= 0). 103* `bytes` are the bytes of the byte array. 104 105### String encoding 106 107*String*s are first encoded to UTF-8, and then send as binary. They do not 108include a NUL delimiter. 109 110### Double encoding 111 112Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit 113layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the int64 114in 8 bytes in big endian order, the compact protocol encodes it in little endian order - this is due to an early 115implementation bug that finally became the de-facto standard. 116 117### Boolean encoding 118 119Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set, 120list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8; 121true as `1` and false as `0`. 122 123### Universal unique identifier encoding 124 125Values of `uuid` type are expected as 16-byte binary in big endian (or "network") order. Byte order conversion 126might be necessary on certain platforms, e.g. Windows holds GUIDs in a complex record-like structure whose 127memory layout differs. 128 129*Note*: Since the length is fixed, no `byte length` prefix is necessary and the field is always 16 bytes long. 130 131 132## Message 133 134A `Message` on the wire looks as follows: 135 136``` 137Compact protocol Message (4+ bytes): 138+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ 139|pppppppp|mmmvvvvv| seq id | name length | name | 140+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ 141``` 142 143Where: 144 145* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82. 146* `mmm` is the message type, an unsigned 3 bit integer. 147* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`. 148* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int. 149* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0). 150* `name` is the method name to invoke, a UTF-8 encoded string. 151 152Message types are encoded with the following values: 153 154* _Call_: 1 155* _Reply_: 2 156* _Exception_: 3 157* _Oneway_: 4 158 159### Struct 160 161A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and 162is followed by the encoded field value. The encoding can be summarized by the following BNF: 163 164``` 165struct ::= ( field-header field-value )* stop-field 166field-header ::= field-type field-id 167``` 168 169Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any 170order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also 171possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to 172determine how to decode the field value. 173 174Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility. 175 176The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has 177another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking. 178Other implementation may perform this check and then either ignore the field, or return a protocol exception. 179 180A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded. 181 182An *Exception* is encoded exactly the same as a struct. 183 184### Struct encoding 185 186``` 187Compact protocol field header (short form) and field value: 188+--------+--------+...+--------+ 189|ddddtttt| field value | 190+--------+--------+...+--------+ 191 192Compact protocol field header (1 to 3 bytes, long form) and field value: 193+--------+--------+...+--------+--------+...+--------+ 194|0000tttt| field id | field value | 195+--------+--------+...+--------+--------+...+--------+ 196 197Compact protocol stop field: 198+--------+ 199|00000000| 200+--------+ 201``` 202 203Where: 204 205* `dddd` is the field id delta, an unsigned 4 bits integer, strictly positive. 206* `tttt` is field-type id, an unsigned 4 bit integer. 207* `field id` the field id, a signed 16 bit integer encoded as zigzag int. 208* `field-value` the encoded field value. 209 210The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the 211first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive). 212 213The following field-types can be encoded: 214 215* `BOOLEAN_TRUE`, encoded as `1` 216* `BOOLEAN_FALSE`, encoded as `2` 217* `I8`, encoded as `3` 218* `I16`, encoded as `4` 219* `I32`, encoded as `5` 220* `I64`, encoded as `6` 221* `DOUBLE`, encoded as `7` 222* `BINARY`, used for binary and string fields, encoded as `8` 223* `LIST`, encoded as `9` 224* `SET`, encoded as `10` 225* `MAP`, encoded as `11` 226* `STRUCT`, used for both structs and union fields, encoded as `12` 227* `UUID`, encoded as `13` 228 229Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no 230length (0 bytes). 231 232## List and Set 233 234List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the 235encoded elements. 236 237``` 238Compact protocol list header (1 byte, short form) and elements: 239+--------+--------+...+--------+ 240|sssstttt| elements | 241+--------+--------+...+--------+ 242 243Compact protocol list header (2+ bytes, long form) and elements: 244+--------+--------+...+--------+--------+...+--------+ 245|1111tttt| size | elements | 246+--------+--------+...+--------+--------+...+--------+ 247``` 248 249Where: 250 251* `ssss` is the size, 4 bit unsigned int, values `0` - `14` 252* `tttt` is the element-type, a 4 bit unsigned int 253* `size` is the size, a var int (int32), positive values `15` or higher 254* `elements` are the encoded elements 255 256The short form should be used when the length is in the range 0 - 14 (inclusive). 257 258The following element-types are used (see note below): 259 260* `BOOL`, encoded as `2` 261* `I8`, encoded as `3` 262* `I16`, encoded as `4` 263* `I32`, encoded as `5` 264* `I64`, encoded as `6` 265* `DOUBLE`, encoded as `7` 266* `BINARY`, used for binary and string fields, encoded as `8` 267* `LIST`, encoded as `9` 268* `SET`, encoded as `10` 269* `MAP`, encoded as `11` 270* `STRUCT`, used for structs and union fields, encoded as `12` 271* `UUID`, encoded as `13` 272 273*Note*: Although field-types and element-types lists are currently very similar, there is _no guarantee_ that this will 274remain true after new types are added. 275 276The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: 2772147483647). 278 279## Map 280 281Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed 282by the encoded elements. The encoding follows this BNF: 283 284``` 285map ::= empty-map | non-empty-map 286empty-map ::= `0` 287non-empty-map ::= size key-element-type value-element-type (key value)+ 288``` 289 290``` 291Compact protocol map header (1 byte, empty map): 292+--------+ 293|00000000| 294+--------+ 295 296Compact protocol map header (2+ bytes, non empty map) and key value pairs: 297+--------+...+--------+--------+--------+...+--------+ 298| size |kkkkvvvv| key value pairs | 299+--------+...+--------+--------+--------+...+--------+ 300``` 301 302Where: 303 304* `size` is the size, a var int (int32), strictly positive values 305* `kkkk` is the key element-type, a 4 bit unsigned int 306* `vvvv` is the value element-type, a 4 bit unsigned int 307* `key value pairs` are the encoded keys and values 308 309The element-types are the same as for lists. The full list is included in the 'List and set' section. 310 311The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: 3122147483647). 313 314# BNF notation used in this document 315 316The following BNF notation is used: 317 318* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times 319* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times 320* a pipe `|` between items represents choice, the first matching item is selected 321* parenthesis `(` and `)` are used for grouping multiple items 322