<!-- mdformat off(b/169948621#comment2) -->

<!--ts-->

* [Pre-allocated tensors](#pre-allocated-tensors)
   * [Background](#background)
   * [Current status](#current-status)
   * [Proposed implementation](#proposed-implementation)
   * [Performance overview](#performance-overview)
      * [Cycle aspect](#cycle-aspect)
      * [Memory aspect](#memory-aspect)

<!-- Semi-automated TOC generation with instructions from https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc -->

<!--te-->

# Pre-allocated tensors

## Background

Tensors are allocated differently depending on the type of tensor. Weight
tensors are located in the flatbuffer, which is allocated by the application
that calls TensorFlow Lite Micro. EvalTensors are allocated in the tensor arena,
either offline planned as specified in the flatbuffer's metadata (described in
this
[RFC](https://docs.google.com/document/d/16aTSHL5wxsq99t6adVbBz1U3K8Y5tBDAvs16iroZDEU)),
or allocated during runtime by the
[memory planner](https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/memory_planner)
(online planned); see this
[RFC](https://docs.google.com/document/d/1akpqu0uiPQshmCrnV6dOEFgYM4tCCnI8Zce85PnjHMI).
The tensor arena is allocated by the MicroAllocator in TensorFlow Lite Micro,
and the model buffer (represented by a .tflite file) is allocated by the
application using TensorFlow Lite Micro. An illustration of this can be seen in
the image below.

![Image of two blocks](../images/preallocated_tensors/preallocated_tensors_bg_1.png)

In some use cases it could be advantageous to place some of the EvalTensors
outside of the tensor arena, for example:

* When sensor output data is stored in its own dedicated buffer, outside the
  tensor arena, and therefore needs to be copied into the tensor arena before
  inference (a sketch of this copy is shown at the end of this section).
* When the tensor is to be consumed from a memory location outside the tensor
  arena, e.g. a separate memory bank used by a DSP.

Details regarding the impact on the number of clock cycles and memory
consumption can be found under "Performance overview". In this RFC we present an
option that allows an application to provide pre-allocated buffers to TensorFlow
Lite Micro for selected tensors. An illustration of the resulting memory layout
with pre-allocated tensors can be seen in the figure below.

![Image of three blocks](../images/preallocated_tensors/preallocated_tensors_bg_2.png)
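For reference, the copy described in the first example above looks roughly like
the sketch below with the current TensorFlow Lite Micro API. This is a minimal
sketch only: the interpreter setup (model, op resolver, tensor arena) is
omitted, and `sensor_buf` is a hypothetical application-owned buffer.

```cpp
#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Copies sensor data into the model's input tensor (which lives in the tensor
// arena) and runs one inference. The caller must ensure that sensor_buf holds
// at least interpreter.input(0)->bytes bytes.
TfLiteStatus RunInferenceOnSensorData(tflite::MicroInterpreter& interpreter,
                                      const int8_t* sensor_buf) {
  TfLiteTensor* input = interpreter.input(0);

  // This is the copy that pre-allocating the input tensor would remove.
  std::memcpy(input->data.int8, sensor_buf, input->bytes);

  return interpreter.Invoke();
}
```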
## Current status

Our initial motivation for this feature was to reduce the number of clock
cycles consumed by the application, by avoiding the copying of the buffer
described in the Background section.

Our second motivation was that by using a buffer outside of the memory arena,
there was an opportunity to significantly reduce the required size of the
memory arena.

An initial investigation into these matters, using the person detection model
as an example, indicates that the performance gain might not be very
significant in many use cases. The reduction in the number of clock cycles
looks to be ~1%. Details regarding this can be found in the Performance
overview section.

The reduction in the size of the memory arena is not straightforward to
estimate. As described in the Performance overview section, it depends on the
size of the other tensors in the network. In the worst case scenario it might
not reduce the memory arena size at all. If the pre-allocated buffer is much
larger than the second largest buffer, then the reduction in size may be
significant.

Therefore, our current position is that the performance gain expected from
pre-allocating the tensors does not motivate the increased complexity that this
feature would introduce to the TensorFlow Lite Micro framework.

## Proposed implementation

MicroAllocator initializes the data field of all tensors to nullptr, and during
the allocation process only allocates the tensors whose data field is still
nullptr. The application tells the MicroInterpreter which tensor is
pre-allocated, and supplies a memory buffer for it, using the
RegisterPreallocatedTensor() function.

The MicroInterpreter then assigns the pre-allocated buffer to the tensor's data
field. If the tensor in question is marked as offline planned, as described in
this
[RFC](https://docs.google.com/document/d/16aTSHL5wxsq99t6adVbBz1U3K8Y5tBDAvs16iroZDEU),
the MicroInterpreter should not pre-allocate it, and should instead return an
error.

If multiple tensors are to be pre-allocated, multiple calls to
RegisterPreallocatedTensor() are required. An example can be seen in the MSC
below.

![MSC](../images/preallocated_tensors/preallocated_tensors_impl1.png)
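The RFC does not pin down the exact signature of RegisterPreallocatedTensor(),
so the sketch below shows only one possible shape of the call: it assumes a
MicroInterpreter method taking a tensor index, a buffer pointer and a buffer
size, and it assumes registration happens before AllocateTensors(). The names
`kInputTensorIndex`, `kSensorBufSize` and `sensor_buf` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Hypothetical usage of the proposed RegisterPreallocatedTensor() API; the
// signature below is an assumption made for illustration only.
constexpr int kInputTensorIndex = 0;        // Tensor to pre-allocate.
constexpr size_t kSensorBufSize = 96 * 96;  // Must match the tensor's byte size.

// Buffer owned by the application, e.g. filled directly by a sensor via DMA.
alignas(16) int8_t sensor_buf[kSensorBufSize];

TfLiteStatus PreallocateInputTensor(tflite::MicroInterpreter& interpreter) {
  // Registering the buffer first lets the MicroAllocator skip this tensor
  // during allocation (its data field is no longer nullptr).
  TfLiteStatus status = interpreter.RegisterPreallocatedTensor(
      kInputTensorIndex, sensor_buf, kSensorBufSize);
  if (status != kTfLiteOk) {
    // E.g. the tensor is offline planned, or the buffer does not fit.
    return status;
  }
  return interpreter.AllocateTensors();
}
```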
## Performance overview

### Cycle aspect

In this section we try to estimate the number of clock cycles one memcpy()
takes in relation to the total inference time for the person_detection model.
The reason for looking closer at this model is that it has a relatively large
input data size, which should make the cycle consumption of a memcpy()
relatively large. Please note that these numbers are approximate and based on
calculations, not on actual benchmarks.

A word-aligned memcpy() copies somewhere between 1 and 4 bytes per cycle,
depending on which CPU is used. The input size of the person_detection model is
96x96 = 9216 bytes, so one copy of the input costs roughly 2300 - 9200 cycles.
On a reference system without accelerators, one memcpy() of 9216 bytes
corresponds, in order of magnitude, to ~0.01% of the total number of clock
cycles for one inference. The ratio will differ depending on the input size and
the number of inferences per second.

When using an accelerator, the total inference time will be significantly
shorter, which means that the memcpy() call will consume a larger share of the
total inference time. Approximations show that one memcpy() of 9216 bytes will
consume ~1% of the total execution time for a reference system utilizing an ML
HW accelerator.

### Memory aspect

In this section we look at the memory saving aspects of pre-allocating tensors
outside the tensor arena. The default memory planner in TensorFlow Lite Micro
is the
[GreedyMemoryPlanner](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/memory_planner/greedy_memory_planner.h)
(see this
[RFC](https://docs.google.com/document/d/1akpqu0uiPQshmCrnV6dOEFgYM4tCCnI8Zce85PnjHMI)).
A good tool for understanding the tensor layout in the tensor arena is the
[PrintMemoryPlan API](https://github.com/tensorflow/tflite-micro/blob/73c5fa4d2bfbfd974552957818de2ab18ff42f39/tensorflow/lite/micro/memory_planner/greedy_memory_planner.h#L84).
If we print the calculated memory layout for the
[person detection model](https://storage.googleapis.com/download.tensorflow.org/data/tf_lite_micro_person_data_int8_grayscale_2020_06_23.zip),
the tensor arena looks like this at each layer:

```
Layer 1:
00000000000000000000000000tttttttttttttt........................................
Layer 2:
00000000000000000000000000...........................999999999999999999999999999
Layer 3:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa999999999999999999999999999
Layer 4:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbb..............
Layer 5:
cccccccccccccccccccccccccc...........................bbbbbbbbbbbbb..............
Layer 6:
ccccccccccccccccccccccccccddddddddddddddddddddddddddd...........................
```

The horizontal axis shows the offset from the start of the tensor arena. The
vertical axis shows execution order. The dots are "unused" memory for that
specific layer. The letters and numbers represent the EvalTensor index, mapped
to 0-9, then a-z. 't' is the input tensor of layer 1 (equivalent to the input
data of the model) and '0' is the output tensor of layer 1. Hence, '0' is also
the input tensor of layer 2, and '9' is the output tensor of layer 2, and so on.

The reason for showing this illustration is that it makes one thing obvious: it
is **the largest combination of simultaneously used tensors in your model that
defines how large the tensor arena needs to be.** In this example, that
combination occurs at layer 3.

The combined size of tensors 'a' and '9' defines the size needed for the tensor
arena. As a consequence, to save tensor arena memory by pre-allocation, we must
start by pre-allocating tensor 'a' or '9' outside the arena. This will make the
total size of the tensor arena smaller, which will reduce the total memory
footprint of TensorFlow Lite Micro if the pre-allocated tensor is already
allocated outside of the memory arena, like in the examples given in the
Background section.
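The sketch below illustrates this reasoning: the arena must be at least as
large as the largest sum of simultaneously live tensors, so pre-allocating a
tensor outside the arena only helps if that tensor is part of the largest sum.
The sizes used are the column counts from the diagram above, standing in for
byte sizes; alignment and planner overhead are ignored, and the lifetimes are
read off the diagram by hand.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct TensorLifetime {
  char name;         // Symbol used in the diagram above.
  int first_layer;   // First layer in which the tensor is live.
  int last_layer;    // Last layer in which the tensor is live.
  int size;          // "Size" in diagram columns.
  bool preallocated; // True if the tensor lives outside the arena.
};

// Returns the arena size needed: the maximum, over all layers, of the summed
// sizes of the tensors that are live in that layer and not pre-allocated.
int RequiredArenaSize(const std::vector<TensorLifetime>& tensors,
                      int num_layers) {
  int required = 0;
  for (int layer = 1; layer <= num_layers; ++layer) {
    int live = 0;
    for (const TensorLifetime& t : tensors) {
      if (!t.preallocated && t.first_layer <= layer && layer <= t.last_layer) {
        live += t.size;
      }
    }
    required = std::max(required, live);
  }
  return required;
}

int main() {
  std::vector<TensorLifetime> tensors = {
      {'0', 1, 2, 26, false}, {'t', 1, 1, 14, false}, {'9', 2, 3, 27, false},
      {'a', 3, 4, 53, false}, {'b', 4, 5, 13, false}, {'c', 5, 6, 26, false},
      {'d', 6, 6, 27, false}};

  // Layer 3 ('a' + '9') is the largest combination: 53 + 27 = 80 columns.
  printf("Arena needed: %d columns\n", RequiredArenaSize(tensors, 6));

  // Pre-allocating 'a' shrinks the arena to the next largest combination,
  // layer 2 or layer 6: 26 + 27 = 53 columns.
  tensors[3].preallocated = true;
  printf("Arena needed with 'a' pre-allocated: %d columns\n",
         RequiredArenaSize(tensors, 6));

  // Pre-allocating a tensor outside the largest combination, e.g. 'b', would
  // not reduce the required arena size at all.
  return 0;
}
```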