zcbor architecture
==================

Since zcbor is a Python script that generates C code, this document is split into two sections:

1. Architecture of the Python script
2. Architecture of the generated C code

Architecture of the Python script
=================================

The `zcbor.py` script is located in [zcbor/zcbor.py](zcbor/zcbor.py).

The functionality is spread across 5 classes:

1. CddlParser
2. CddlXcoder (inherits from CddlParser)
3. DataTranslator (inherits from CddlXcoder)
4. CodeGenerator (inherits from CddlXcoder)
5. CodeRenderer

CddlParser
----------

Each CddlParser object represents a CDDL type.
Since CDDL types can contain other types, CddlParser recursively parses a CDDL string, and spawns new instances of itself to represent the (child) types it contains.
The two most important member variables in CddlParser are `self.value` and `self.type`.
`self.type` is a string representing the base CDDL type. The options are listed below, with the corresponding CBOR types given in the form #majortype.val:

 - `"UINT"` (#0)
 - `"INT"` (#0 or #1)
 - `"NINT"` (#1)
 - `"BSTR"` (#2)
 - `"TSTR"` (#3)
 - `"FLOAT"` (#7.26 or #7.27)
 - `"BOOL"` (#7.20 or #7.21)
 - `"NIL"` (#7.22)
 - `"UNDEF"` (#7.23)
 - `"LIST"` (#4)
 - `"MAP"` (#5)
 - `"GROUP"` (N/A)
 - `"UNION"` (N/A)
 - `"ANY"` (#0 - #5 or #7)
 - `"OTHER"` (N/A)

`"OTHER"` means another type defined in the same document, and is used as a pointer to that type definition.
The CDDL code that can give rise to these types is described in the [README](README.md).

`self.value` means different things for different values of `self.type`.
For example, for `"UINT"`, `self.value` holds the value dictated by the type, or `None` if different values are allowed.
In the following example, `Foo` has a `self.value` of 5, and `Bar` has `None`.

```cddl
Foo = 5
Bar = uint
```

For container types, i.e. `"LIST"`, `"MAP"`, `"GROUP"`, and `"UNION"`, `self.value` contains a list of their contents.
The code usually refers to the elements/contents in `self.value` as "children".
For `"OTHER"`, `self.value` is a string with the name of the type it refers to.
The following example shows use of both container and `"OTHER"` types.

```cddl
Foo = uint
Bar = [*Foo]
```

This will spawn 3 CddlParser objects:

1. `Foo`, which has `self.type = "UINT"` and `self.value = None`
2. An anonymous child of `Bar`, which has `self.type = "OTHER"` and `self.value = "Foo"`
3. `Bar`, which has `self.type = "LIST"`, and whose `self.value` is a Python `list` containing the above object.
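The three objects above can be pictured with a small stand-in class. This is only an illustrative sketch: `TypeNode` and its `name` field are hypothetical and not part of zcbor, and the class models nothing but the `type` and `value` members described here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TypeNode:
    """Hypothetical stand-in for a CddlParser object (type and value only)."""
    type: str                   # base CDDL type, e.g. "UINT", "LIST", "OTHER"
    value: Any = None           # meaning depends on `type`
    name: Optional[str] = None  # None for anonymous children


# Foo = uint
foo = TypeNode(type="UINT", value=None, name="Foo")
# The anonymous element inside [*Foo]; it only points at Foo by name.
bar_child = TypeNode(type="OTHER", value="Foo")
# Bar = [*Foo]; container types keep their children in a Python list.
bar = TypeNode(type="LIST", value=[bar_child], name="Bar")

for child in bar.value:
    print(child.type, "->", child.value)
```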

CDDL supports many other constraints on the types, and these all have member variables in CddlParser, e.g. `self.min_qty` and `self.max_qty`, which are the minimum and maximum quantity/repetitions of this type.

Children of `"MAP"` objects come in key/value pairs.
These are represented such that the values are children of the `"MAP"` object, and the keys can be found as `self.key` in these children.

There is a member called `self.my_types`, which is a dict of all the types defined in the CDDL file.
The elements are in the form `<name>: <CddlParser object>`.
`"OTHER"` objects look into `self.my_types[self.value]` to find their own definition.
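A minimal sketch of how an `"OTHER"` object could find its definition through a shared dict, and of how a map key hangs off its value object. `Node` and `resolve()` are hypothetical names used for illustration; only `type`, `value`, `key` and `my_types` come from the description above.

```python
class Node:
    """Hypothetical, simplified stand-in for a CddlParser object."""

    def __init__(self, my_types, type_, value=None, key=None):
        self.my_types = my_types  # dict shared by all nodes: <name>: <Node>
        self.type = type_
        self.value = value
        self.key = key            # only set on children of a "MAP"

    def resolve(self):
        """Follow "OTHER" references until a concrete type is reached."""
        node = self
        while node.type == "OTHER":
            node = node.my_types[node.value]
        return node


# Models:  Foo = uint
#          Bar = {"id" => Foo}
my_types = {}
my_types["Foo"] = Node(my_types, "UINT")
key = Node(my_types, "TSTR", value="id")
val = Node(my_types, "OTHER", value="Foo", key=key)  # the value carries its key
my_types["Bar"] = Node(my_types, "MAP", value=[val])

child = my_types["Bar"].value[0]
print(child.key.value, child.resolve().type)
```

Keeping the key on the value object means a `"MAP"` can store its children in the same kind of list as a `"LIST"`.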

The actual parsing of the CDDL happens with regular expressions.
There are one or more expressions for each base type.
The expressions consume a number of characters from the input string, and also capture a substring to use as the value of the type.
For container types, the regex will match the whole type, and then recursively parse the children within the matched string.
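The consume-and-capture approach can be sketched for a tiny subset of CDDL. The patterns, the `parse()` function and the tuple output below are assumptions made for this example. They are not the expressions used in `zcbor.py`, and they ignore nesting, quantifiers and most of the syntax.

```python
import re

# Hypothetical mini-grammar: (regex, type) pairs, tried in order.
PATTERNS = [
    (re.compile(r"\[(?P<item>[^\[\]]*)\]"), "LIST"),  # non-nested lists only
    (re.compile(r"uint\b"), "UINT"),                  # any uint, no fixed value
    (re.compile(r"(?P<item>\d+)"), "UINT"),           # literal, e.g. 5
    (re.compile(r"(?P<item>[A-Za-z_]\w*)"), "OTHER"),  # name of another type
]


def parse(text):
    """Parse comma-separated elements into a list of (type, value) tuples."""
    nodes = []
    text = text.strip()
    while text:
        for regex, type_ in PATTERNS:
            match = regex.match(text)
            if match:
                break
        else:
            raise ValueError(f"cannot parse: {text!r}")
        item = match.groupdict().get("item")  # the captured substring, if any
        if type_ == "LIST":
            value = parse(item)  # recurse into the matched string
        elif type_ == "UINT" and item is not None:
            value = int(item)
        else:
            value = item  # None for plain `uint`, the name for "OTHER"
        nodes.append((type_, value))
        # Consume the matched characters and any separator that follows.
        text = text[match.end():].lstrip(", \t\n")
    return nodes


print(parse("[Foo, 5, uint]"))
```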

CddlXcoder
----------

CddlXcoder inherits from CddlParser, and provides common functionality for DataTranslator and CodeGenerator.

Most of the functionality falls into one of two categories:

- Common names for members in generated code. A single type may need multiple member variables in the generated code to describe it, such as:
   - the value
   - the key associated with this value
   - the number of times it repeats
- Condition functions that make inferences about the type based on the member variables in CddlParser, such as:
   - `key_var_condition()`: Whether it needs a key member
   - `is_unambiguous()`: Whether the type is completely specified, i.e. whether we know beforehand exactly how the encoding will look (e.g. `Foo = 5`).
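To give a feel for what such condition functions look like, here is a simplified sketch. The method names match the ones listed above, but the `Node` class and the rules inside the methods are assumptions chosen for illustration and may differ from the real implementations.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    """Hypothetical stand-in holding a few CddlParser member variables."""
    type: str
    value: Any = None
    key: Optional["Node"] = None
    min_qty: int = 1
    max_qty: int = 1

    def is_unambiguous(self):
        """True if the encoding is fully known in advance (assumed rules)."""
        if self.min_qty != self.max_qty:
            return False  # the number of repetitions can vary
        if self.type in ("NIL", "UNDEF"):
            return True   # only one possible encoding
        if self.type in ("OTHER", "UNION", "ANY"):
            return False  # kept simple: no lookup of the referenced type
        if self.type in ("LIST", "MAP", "GROUP"):
            return all(child.is_unambiguous() for child in self.value)
        return self.value is not None  # e.g. Foo = 5

    def key_var_condition(self):
        """A key member is only needed if the key itself can vary (assumed)."""
        return self.key is not None and not self.key.is_unambiguous()


print(Node("UINT", 5).is_unambiguous(), Node("UINT").is_unambiguous())
```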

DataTranslator
--------------

DataTranslator is for handling and manipulating CBOR on the "host".
For example, the user can compose data in YAML or JSON files and have them converted to CBOR and validated against the CDDL.
Or they can decode binary CBOR and write Python code to inspect it, or just convert it back into YAML or JSON.

DataTranslator inherits from CddlXcoder and allows converting data between a number of different representations, such as CBOR/YAML/JSON strings, as well as internal Python representations.
While the conversion is happening, the data is validated against the CDDL description.

This class relies heavily on decoding libraries for CBOR/YAML/JSON:

- [cbor2](https://pypi.org/project/cbor2/)
- [PyYAML](https://pypi.org/project/PyYAML/)
- [json](https://docs.python.org/3/library/json.html)

All three use the same internal representation of the decoded data, so it's trivial to convert between them.
The representation for all three is 1-to-1 with the corresponding Python types (list -> list, map -> dict, uint -> int, bstr -> bytes, etc.).
The only non-built-in Python class used is `CBORTag`, for CBOR tags.

One caveat is that CBOR supports more features than YAML/JSON, namely:

- non-text map keys
- bytes
- tags

In YAML/JSON, these are converted to maps in the following way:

- `{<key>: <value>}` => `{keyval<i>: {"key": <key>, "val": <value>}}` (i is an integer unique within this map)
- `<bytestring>` => `{"bstr": "<hex representation of bytestring>"}`, or
- `<bytestring>` => `{"bstr": <CBOR decoding of bytestring>}` if `<bytestring>` is encodable as CBOR.
- `<tag, value>` => `{"tag": <tag>, "val": <value>}` where `<tag>` is the actual tag (a number), and `<value>` is the tagged value (the following CBOR object).
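The conversion rules above can be sketched as a small recursive function. This is not the DataTranslator code: `to_json_compatible()` is a hypothetical name, `Tag` is a local stand-in for `CBORTag` so that the example runs with the standard library alone, and the variant that decodes a byte string as nested CBOR is left out. Applying the `keyval<i>` rule only to non-text keys is also an assumption.

```python
import json
from collections import namedtuple

# Local stand-in for CBORTag, so that no third-party package is needed.
Tag = namedtuple("Tag", ["tag", "value"])


def to_json_compatible(obj):
    """Rewrite CBOR-only constructs as plain maps, following the rules above."""
    if isinstance(obj, Tag):
        return {"tag": obj.tag, "val": to_json_compatible(obj.value)}
    if isinstance(obj, bytes):
        return {"bstr": obj.hex()}
    if isinstance(obj, list):
        return [to_json_compatible(item) for item in obj]
    if isinstance(obj, dict):
        result = {}
        for i, (key, val) in enumerate(obj.items()):
            if isinstance(key, str):
                result[key] = to_json_compatible(val)
            else:
                # Non-text key: store the pair under a name unique in this map.
                result[f"keyval{i}"] = {
                    "key": to_json_compatible(key),
                    "val": to_json_compatible(val),
                }
        return result
    return obj  # int, str, float, bool and None map directly


decoded = {1: b"\x01\x02", "t": Tag(42, [True, None])}
print(json.dumps(to_json_compatible(decoded), indent=2))
```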

Finally, DataTranslator can also generate a separate internal representation using `namedtuple`s to allow browsing CBOR data by the names given in the CDDL.
(This is more analogous to how the data is accessed in the C code.)
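As a rough picture of what name-based browsing looks like, consider the sketch below. The CDDL type `Pet` and its field names are invented for the example, and how DataTranslator actually derives names and builds its tuples is not shown here.

```python
from collections import namedtuple

# Hypothetical CDDL:  Pet = [name: tstr, age: uint]
Pet = namedtuple("Pet", ["name", "age"])

decoded = ["Rex", 4]  # what a CBOR/YAML/JSON decoder would return for a list
pet = Pet(*decoded)

# Access by the names from the CDDL instead of by list index.
print(pet.name, pet.age)
```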

DataTranslator functionality is tested in [tests/scripts](tests/scripts).

CodeGenerator
-------------

CodeGenerator, like DataTranslator, inherits from CddlXcoder.
Its primary purpose is to construct the individual decoding/encoding functions for the types specified in the given CDDL document.
It also constructs the struct definitions used to hold the decoded data or the data to be encoded.

CodeGenerator contains optimizations to reduce both the verbosity of the code and the level of indirection in the types.
For example:

 - If the type is unambiguous (i.e. its value is predetermined, like in `Foo = 5`), the code will validate it, but CodeGenerator won't include the actual value in the encompassing struct definition.
 - If a `"GROUP"` or `"UNION"` has only one child, it can be removed as a level of indirection.
 - If the type needs only a single member variable (i.e. no additional `foo_count` or `foo_key` etc.), that variable can instead be added to the parent struct, and its decoding/encoding code moved into the parent's function.
 - `"UNION"`s are typically implemented as anonymous `union`s, which removes one level of indirection when accessing them.

A CodeGenerator object operates in one of two modes: `"encode"` or `"decode"`.
The generated code for the two is not very different, but they call into different libraries.

Base types, like `"UINT"`, `"BOOL"`, and `"FLOAT"`, are represented by native C types. `"BSTR"`s and `"TSTR"`s are represented by a custom `struct zcbor_string`, which is just a `uint8_t` pointer and a length.
These types are decoded/encoded with C code that is not generated.
More on this in "Architecture of the generated C code" below.

When a type is repeated (`max_qty > 1` or `max_qty > min_qty`), there needs to be a distinction between `repeated_foo()` and `foo()` (these can be either encoding or decoding functions).
`repeated_foo()` concerns itself with the individual value, while `foo()` concerns itself with the value including repetitions.
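The split between the two functions can be modelled in a few lines. The real functions are generated C code operating on a state variable. The Python bodies below, the `items`/`pos` arguments and the default quantities are assumptions made to keep the sketch short.

```python
def repeated_foo(items, pos):
    """Handle ONE Foo (here a uint). Return the new position, or None on failure."""
    if pos < len(items) and isinstance(items[pos], int) and items[pos] >= 0:
        return pos + 1
    return None


def foo(items, pos, min_qty=0, max_qty=3):
    """Handle Foo including repetitions: min_qty to max_qty values in a row."""
    count = 0
    while count < max_qty:
        new_pos = repeated_foo(items, pos)
        if new_pos is None:
            break  # no more Foo values; whether that is fine depends on min_qty
        pos, count = new_pos, count + 1
    return (pos, count) if count >= min_qty else None


print(foo([1, 2, "x"], 0))
```

With this split, `repeated_foo()` never needs to know about quantities, and `foo()` never needs to know what a single value looks like.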

When invoking CodeGenerator, the user must decide which types they will need direct access to decode/encode.
These types are called "entry types", and they are typically the "outermost" types, i.e. the types that the data as a whole is expected to have.

The user can also use entry types when there are `"BSTR"`s that are CBOR encoded, specified as `Foo = bstr .cbor Bar`.
Usually such strings are automatically decoded/encoded by the generated code, and the resulting objects are part of the encompassing struct.
However, if the user instead wants to manually decode/encode such strings, they can add them to `self.entry_types`.
In this case, the strings will be stored as a regular `struct zcbor_string` instead of being decoded/encoded.

CodeRenderer
------------

CodeRenderer is a standalone class that takes the result of the CodeGenerator class and constructs files.
There are 3 files constructed:

- The C file with the decoding/encoding functions.
- The H file with the public API to some functions in the C file.
- The H file with all the struct definitions (the type file). If both decoding and encoding files are generated for the same CDDL, they can share the same type file.

CodeRenderer conducts some pruning and deduplication of the list of types and functions received from CodeGenerator.


Architecture of the generated C code
====================================

In the generated C file, each type from the CDDL file gets its own decoding/encoding function, unless it is a trivial type, like `Foo = uint`.
These functions are all `static`.
In addition, all entry types get public wrapper functions.

All decoding/encoding functions operate on a state variable of type `zcbor_state_t` which keeps track of:

- The current position in the payload, and the end of the payload.
- The current position in a list or map, and the maximum expected number of elements.
- A list of backup states, used for saving states so they can be restored if decoding/encoding fails while processing an optional element.

Each function returns a `bool` indicating whether it was successful at decoding/encoding.
In most cases, a failure in one function will result in a failure of the whole operation.

However, in the following scenarios, a failure is fine because we don't know ahead of time whether the object will be found or not:

- An object with an unknown number of repetitions (`min_qty` and `max_qty` are not the same).
- `"UNION"`s, since only one of the children should be present.

In these cases, the code attempts to decode the object. If it fails, it restores the state to before the attempt, and then tries decoding the next candidate type.
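The attempt/restore pattern can be modelled as follows. This is a Python sketch of the idea only: `State` is a heavily reduced, hypothetical model of `zcbor_state_t`, the decoders handle just the shortest CBOR encodings of uints and text strings, and none of the function names exist in zcbor.

```python
class State:
    """Hypothetical, much reduced model of zcbor_state_t."""

    def __init__(self, payload):
        self.payload = payload
        self.pos = 0       # current position in the payload
        self.backups = []  # saved positions

    def backup(self):
        self.backups.append(self.pos)

    def restore(self):
        self.pos = self.backups.pop()


def decode_uint(state, result):
    """Decode a CBOR uint in the range 0..23 (major type 0, value in the header)."""
    if state.pos >= len(state.payload):
        return False
    header = state.payload[state.pos]
    if (header >> 5) != 0 or (header & 0x1F) > 23:
        return False
    result.append(header & 0x1F)
    state.pos += 1
    return True


def decode_tstr(state, result):
    """Decode a CBOR text string shorter than 24 bytes (major type 3)."""
    if state.pos >= len(state.payload):
        return False
    header = state.payload[state.pos]
    end = state.pos + 1 + (header & 0x1F)
    if (header >> 5) != 3 or (header & 0x1F) > 23 or end > len(state.payload):
        return False
    result.append(state.payload[state.pos + 1:end].decode("utf-8"))
    state.pos = end
    return True


def two_uints(state, result):
    return decode_uint(state, result) and decode_uint(state, result)


def uint_then_tstr(state, result):
    return decode_uint(state, result) and decode_tstr(state, result)


def decode_union(state, result, candidates):
    """Try each candidate in turn, restoring the state after every failure."""
    for candidate in candidates:
        state.backup()
        scratch = []  # partial results of a failed attempt are thrown away
        if candidate(state, scratch):
            state.backups.pop()  # success: the backup is no longer needed
            result.extend(scratch)
            return True
        state.restore()
    return False


state = State(bytes([0x05, 0x63]) + b"foo")  # uint 5 followed by tstr "foo"
result = []
print(decode_union(state, result, [two_uints, uint_then_tstr]), result)
```

In this payload, `two_uints` consumes the first byte before it fails on the second, so the restore is what allows `uint_then_tstr` to start again from the beginning.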

All generated functions take the form of a single `if` statement.
This `if` statement performs boolean algebra on statements depending on the children (typically only container types get a generated function).
The assignment of values in the structs mostly happens in the non-generated code.
The generated code mostly combines and validates calls into the non-generated code or other generated functions.
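The shape of such a function can be illustrated with a Python model. The generated code itself is C and looks different in detail. The helper `expect()`, the dict-based state, the decoder names and the CDDL type `Foo = [uint, tstr, ?bool]` are all assumptions made for this sketch.

```python
def expect(kind):
    """Build a stand-in for a non-generated decoder: consume one item of type
    `kind` from state["items"] and store it in the result list."""
    def decoder(state, result):
        items, pos = state["items"], state["pos"]
        if pos < len(items) and type(items[pos]) is kind:
            result.append(items[pos])  # value assignment happens here
            state["pos"] = pos + 1
            return True
        return False
    return decoder


uint_decode = expect(int)
tstr_decode = expect(str)
bool_decode = expect(bool)


def decode_foo(state, result):
    """Model of a generated function for a hypothetical Foo = [uint, tstr, ?bool]."""
    # One if statement whose condition combines the children's results.
    if not (uint_decode(state, result)
            and tstr_decode(state, result)
            and (bool_decode(state, result) or True)):  # optional element
        return False
    return True


state = {"items": [1, "a"], "pos": 0}
result = []
print(decode_foo(state, result), result)
```

Because `and` short-circuits, the first failing child stops the evaluation, which mirrors how a failure in one function normally fails the whole operation.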

All functions (generated and not) have the same API structure: `bool <name>(zcbor_state_t *state, <type> *result)`.
The number of arguments is kept to a minimum to reduce code size.

The exceptions to the API structure are `zcbor_multi_decode`/`zcbor_multi_encode` and `zcbor_present_decode`/`zcbor_present_encode`.
These functions accept function pointers with the above API and run them multiple times.
When this happens, the function pointers are cast to a generic function pointer type, and processed without knowledge of the type.

The non-generated files provide decoding/encoding functions for all the basic types except `"OTHER"`.
There are also housekeeping functions for managing state and logging.
This code is documented in the header files in the [include](include) folder.

The C tests for the code generation can be found in the [tests/decode](tests/decode) and [tests/encode](tests/encode) folders.