<!-- mdformat off(b/169948621#comment2) -->

<!--ts-->

*   [Pre-allocated tensors](#pre-allocated-tensors)
    *   [Background](#background)
    *   [Current status](#current-status)
    *   [Proposed implementation](#proposed-implementation)
    *   [Performance overview](#performance-overview)
        *   [Cycle aspect](#cycle-aspect)
        *   [Memory aspect](#memory-aspect)
            <!-- Semi-automated TOC generation with instructions from https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc -->

<!--te-->

# Pre-allocated tensors

## Background

Tensors are allocated differently depending on the type of tensor. Weight
tensors are located in the flatbuffer, which is allocated by the application
that calls TensorFlow Lite Micro. EvalTensors are allocated in the tensor
arena, either offline planned as specified in the flatbuffer metadata
(described in this
[RFC](https://docs.google.com/document/d/16aTSHL5wxsq99t6adVbBz1U3K8Y5tBDAvs16iroZDEU)),
or allocated at runtime by the
[memory planner](https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/memory_planner)
(online planned), see this
[RFC](https://docs.google.com/document/d/1akpqu0uiPQshmCrnV6dOEFgYM4tCCnI8Zce85PnjHMI).
The tensor arena is allocated by the MicroAllocator in TensorFlow Lite Micro,
and the model buffer (represented by a .tflite file) is allocated by the
application using TensorFlow Lite Micro. An illustration of this can be seen in
the image below.

![Image of two blocks](../images/preallocated_tensors/preallocated_tensors_bg_1.png)
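
For context, the sketch below shows a typical TensorFlow Lite Micro setup in
which the application owns both of these buffers: the model data (weights) and
the tensor arena. The model symbol, arena size, and operator list are
placeholders, and constructor arguments may differ slightly between TFLM
versions.

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Model buffer: the weight tensors live here. Typically a C array generated
// from the .tflite file; g_model_data is a placeholder name.
extern const unsigned char g_model_data[];

// Tensor arena: the EvalTensors live here. Owned by the application.
constexpr size_t kTensorArenaSize = 100 * 1024;  // Placeholder size.
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];

void Setup() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators the model needs (placeholder set).
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddAveragePool2D();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  // The EvalTensors are planned into the application-provided arena here.
  interpreter.AllocateTensors();
}
```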

In some use cases it could be advantageous to place some of the EvalTensors
outside of the tensor arena, for example:

*   When sensor output data is stored in its own dedicated buffer, outside the
    tensor arena, and therefore needs to be copied into the tensor arena before
    inference.
*   When the tensor is to be consumed from a memory location outside the tensor
    arena, e.g. a separate memory bank used by a DSP.

Details regarding the impact on the number of clock cycles and memory
consumption can be found under “Performance overview”. In this RFC we present
an option to allow an application to provide pre-allocated buffers to
TensorFlow Lite Micro for selected tensors. An illustration of the resulting
memory layout with pre-allocated tensors can be seen in the figure below.

![Image of three blocks](../images/preallocated_tensors/preallocated_tensors_bg_2.png)

## Current status

The purpose of pre-allocating tensors is to reduce the number of clock cycles,
and our initial motivation for this feature was that avoiding the copying of
the buffer described in the Background section would reduce the number of
cycles consumed by the application.

Our second motivation was that by using a buffer outside of the memory arena,
there was an opportunity to significantly reduce the required size of the
memory arena.

An initial investigation into these matters, using the person detection model
as an example, indicates that the performance gain might not be very
significant in many use cases. The reduction in the number of clock cycles
looks to be ~1%. Details regarding this can be found in the Performance
overview section.

The reduction in the size of the memory arena is not straightforward to
estimate. As described in the Performance overview section, it depends on the
size of the other tensors in the network. In the worst-case scenario it might
not reduce the memory arena size at all. If the pre-allocated buffer is much
larger than the second largest buffer, the reduction in size may be
significant.

Therefore, our current position is that the performance gain expected from
pre-allocating tensors does not motivate the increased complexity that this
feature would introduce to the TensorFlow Lite Micro framework.

## Proposed implementation

MicroAllocator initializes the data field of all tensors to nullptr, and during
the allocation process it only allocates buffers for tensors whose data field
is still nullptr. The application tells the MicroInterpreter which tensor is
pre-allocated, and supplies a memory buffer, using the
RegisterPreallocatedTensor() function.
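
To make that allocation rule concrete, below is an illustrative sketch of the
behavior described above. It is not actual MicroAllocator code: the
EvalTensorInfo struct, the function names, and the simple bump allocation are
stand-ins for the real arena planning, but they show how tensors that already
have a data pointer would be skipped.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch only -- not actual MicroAllocator code.
struct EvalTensorInfo {
  void* data;    // nullptr until a buffer has been assigned.
  size_t bytes;  // Number of bytes this tensor needs.
};

// Simplified stand-in for the real memory planner: a plain bump allocator.
void* AllocateFromArena(uint8_t* arena, size_t arena_size, size_t* offset,
                        size_t bytes) {
  if (*offset + bytes > arena_size) {
    return nullptr;  // Out of arena memory.
  }
  void* result = arena + *offset;
  *offset += bytes;
  return result;
}

void PlanTensors(EvalTensorInfo* tensors, int count, uint8_t* arena,
                 size_t arena_size) {
  size_t offset = 0;
  for (int i = 0; i < count; ++i) {
    if (tensors[i].data != nullptr) {
      continue;  // Pre-allocated by the application: leave it untouched.
    }
    tensors[i].data = AllocateFromArena(arena, arena_size, &offset,
                                        tensors[i].bytes);
  }
}
```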

The MicroInterpreter then assigns the pre-allocated buffer to the tensor's data
field. If the tensor in question is marked as offline planned, as described in
this
[RFC](https://docs.google.com/document/d/16aTSHL5wxsq99t6adVbBz1U3K8Y5tBDAvs16iroZDEU),
the MicroInterpreter should not pre-allocate it, and should instead return an
error.

If multiple tensors are to be pre-allocated, multiple calls to
RegisterPreallocatedTensor() are required. An example can be seen in the MSC
below.

![MSC](../images/preallocated_tensors/preallocated_tensors_impl1.png)
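
As a further illustration of the proposed flow, here is a minimal sketch of how
an application might register pre-allocated buffers.
RegisterPreallocatedTensor() does not exist in TensorFlow Lite Micro today; its
signature (tensor index, buffer, size), the buffer names, and the sizes used
below are assumptions made for illustration only.

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Buffers owned by the application, e.g. filled directly by a sensor driver
// or read by a DSP from its own memory bank (placeholder sizes).
alignas(16) static uint8_t sensor_buffer[96 * 96];
alignas(16) static uint8_t dsp_output_buffer[1024];

TfLiteStatus SetupWithPreallocatedTensors(
    tflite::MicroInterpreter* interpreter) {
  // Hypothetical call: have the model input tensor (index 0 here) use
  // sensor_buffer instead of space inside the tensor arena.
  TfLiteStatus status = interpreter->RegisterPreallocatedTensor(
      /*tensor_index=*/0, sensor_buffer, sizeof(sensor_buffer));
  if (status != kTfLiteOk) {
    // E.g. the tensor is offline planned and therefore may not be
    // pre-allocated.
    return status;
  }

  // One call per pre-allocated tensor; index 42 is a made-up example.
  status = interpreter->RegisterPreallocatedTensor(
      /*tensor_index=*/42, dsp_output_buffer, sizeof(dsp_output_buffer));
  if (status != kTfLiteOk) {
    return status;
  }

  // All remaining tensors (data field still nullptr) are planned into the
  // tensor arena as usual.
  return interpreter->AllocateTensors();
}
```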

## Performance overview

### Cycle aspect

In this section we try to estimate the number of clock cycles one memcpy()
takes in relation to the total inference time for the person_detection model.
The reason for looking closer at this model is that it has a relatively large
input data size, which should make the cycle consumption of a memcpy()
relatively large. Please note that these numbers are approximate and based on
calculations, not actual benchmarking numbers.

A word-aligned memcpy() copies somewhere between 1 and 4 bytes per cycle
depending on which CPU is used. The input size for the person_detection model
is 96x96 = 9216 bytes. On a reference system without accelerators, one memcpy()
of 9216 bytes corresponds to, as an order of magnitude, ~0.01% of the total
number of clock cycles for one inference. The ratio will differ depending on
the input size and the number of inferences/second.

When using an accelerator, the total inference time will be significantly
shorter, which means that the memcpy() call will consume a larger part of the
total inference time. Approximations show that one memcpy() of 9216 bytes will
consume ~1% of the total execution time for a reference system utilizing an ML
HW accelerator.
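
As a rough sanity check of these ratios (an estimate derived from the numbers
above, not a measurement): assuming ~1 byte per cycle, copying 9216 bytes costs
on the order of 9,000 cycles. For that copy to amount to ~0.01% of an
inference, the inference itself must be on the order of 9216 / 0.0001 ≈ 10^8
cycles, and the ~1% figure for an accelerated system similarly implies an
inference on the order of 9216 / 0.01 ≈ 10^6 cycles.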

### Memory aspect

In this section we'll look at the memory saving aspects of pre-allocating
tensors outside the tensor arena. The default memory planner in TensorFlow Lite
Micro is the
[GreedyPlanner](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/memory_planner/greedy_memory_planner.h)
(see this
[RFC](https://docs.google.com/document/d/1akpqu0uiPQshmCrnV6dOEFgYM4tCCnI8Zce85PnjHMI)).
A good tool for understanding the tensor layout in the tensor arena is the
[PrintMemoryPlan API](https://github.com/tensorflow/tflite-micro/blob/73c5fa4d2bfbfd974552957818de2ab18ff42f39/tensorflow/lite/micro/memory_planner/greedy_memory_planner.h#L84).
If we print the calculated memory layout for the
[person detection model](https://storage.googleapis.com/download.tensorflow.org/data/tf_lite_micro_person_data_int8_grayscale_2020_06_23.zip),
the tensor arena looks like this at each layer:

```
Layer 1:
00000000000000000000000000tttttttttttttt........................................
Layer 2:
00000000000000000000000000...........................999999999999999999999999999
Layer 3:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa999999999999999999999999999
Layer 4:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbb..............
Layer 5:
cccccccccccccccccccccccccc...........................bbbbbbbbbbbbb..............
Layer 6:
ccccccccccccccccccccccccccddddddddddddddddddddddddddd...........................
```

The horizontal axis shows the offset from the start of the tensor arena. The
vertical axis shows the execution order. The dots are "unused" memory for that
specific layer. The letters and numbers represent the EvalTensor index, mapped
to 0-9, then a-z. 't' is the input tensor of layer 1 (equivalent to the input
data to the model) and '0' is the output tensor of layer 1. Hence, '0' is also
the input tensor to layer 2, and '9' is the output tensor of layer 2. And so
on.

The reason for showing this illustration is that it makes obvious that it is
**the largest combination of simultaneously used tensors in your model that
defines how large the tensor arena needs to be.** In this example, it is
Layer 3.

The combined size of tensors 'a' and '9' defines the size needed for the tensor
arena. As a consequence, to save tensor arena memory by pre-allocation, we must
start by pre-allocating tensor 'a' or '9' outside the arena. This will make the
total size of the tensor arena smaller, which will reduce the total memory
footprint of TensorFlow Lite Micro, provided that the pre-allocated tensor is
already allocated outside of the memory arena anyway, as in the examples given
in the Background section.
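
As a hypothetical numeric example (the sizes below are made up for illustration
and do not come from the model): if tensor 'a' is 55 kB and tensor '9' is
27 kB, Layer 3 requires roughly 82 kB of arena. Pre-allocating 'a' outside the
arena removes it from every layer, and the arena then only needs to cover the
largest remaining layer, for example '0' plus '9' in Layer 2, so the peak
requirement can drop considerably. Conversely, if 'a' and '9' were of similar
size and the other layers were almost as large, pre-allocating a single tensor
would barely shrink the arena.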