<!-- mdformat off(b/169948621#comment2) -->

<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->

<!--ts-->

*   [Summary](#summary)
*   [High-Level Steps](#high-level-steps)
    *   [Why Not Optimize the Portable Reference Kernels?](#why-not-optimize-the-portable-reference-kernels)
*   [Software Architecture](#software-architecture)
    *   [Hardware-specific NN library](#hardware-specific-nn-library)
    *   [Optimized Kernels](#optimized-kernels)
    *   [Build System Integration](#build-system-integration)
    *   [Testing and Continuous Integration](#testing-and-continuous-integration)

<!-- Added by: advaitjain, at: Wed 17 Feb 2021 02:14:16 PM PST -->

<!--te-->

# Summary

This guide describes the recommended high-level architecture and steps for
adding hardware-specific optimized kernels to TfLite Micro.

The goal of these optimizations, and of the process we recommend for getting
them merged into the TfLite Micro codebase, is a measurable and documented
performance improvement on a benchmark of interest.

Once the optimizations are merged they will indeed be used for more than just
that benchmark, but the context for why the optimizations were added remains
important.

# High-Level Steps

1.  Pick a benchmark whose performance you would like to measure and improve.

    *   Existing benchmarks are in the [benchmarks directory](../benchmarks).
    *   If none of the existing benchmarks capture your use-case, then please
        create a GitHub issue or start a thread on micro@tensorflow.org to
        figure out how to add a new benchmark.
    *   If adding a publicly-available benchmark to the TFLM codebase is
        determined to be infeasible, then a fall-back would be to have an
        internal benchmark that can be used to document the benefits of adding
        the optimizations via PR descriptions.
    *   Adding optimized code without any associated benchmarks will need very
        strong justification and will most likely not be permitted.

1.  Do the groundwork and architecture needed to be able to add optimizations
    for your target (more details in the
    [software architecture](#software-architecture) section).

1.  Create one pull request for each optimized kernel, with the PR description
    clearly stating the commands that were used to measure the performance
    improvement.

    *   This context is important even if the toolchain is proprietary and
        there are currently only a few users.
        *   See [this PR](https://github.com/tensorflow/tensorflow/pull/47098)
            as an example.
        *   At a minimum, the latency with and without the particular optimized
            kernel should be documented.
            [Additional context](https://github.com/tensorflow/tensorflow/pull/46746)
            may also be desirable.
    *   Here is some
        [general guidance](https://testing.googleblog.com/2017/09/code-health-providing-context-with.html)
        on writing
        [good PR descriptions](https://google.github.io/eng-practices/review/developer/cl-descriptions.html).

## Why Not Optimize the Portable Reference Kernels?

We would like to explicitly point out (as have others) that the reference
kernel implementations are not performant and that there are plenty of
opportunities to speed them up. This is by design: the reference kernels are
meant to be a shared starting point that is then optimized in target-specific
optimized kernel implementations.

Two previous discussions on this topic are in
[PR #42477](https://github.com/tensorflow/tensorflow/pull/42477) and
[PR #45227](https://github.com/tensorflow/tensorflow/pull/45227).

Our current point of view on this topic is that while optimizing shared
reference code in a portable manner is attractive, we are making an explicit
choice not to go down that path and to instead rely on target-specific
optimized implementations. The TFLM codebase has a growing list of optimized
kernel implementations, and we are investing in making the process of adding
new implementations smoother.

# Software Architecture

The optimized kernel architecture is composed of the following three modules:

1.  Hardware-specific NN library
1.  Optimized Kernels
1.  Build System Integration

## Hardware-specific NN library

This library uses knowledge of the hardware and compiler to implement the
underlying operations. Examples of this are
[CMSIS-NN](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN) from
ARM and [NNLib](https://github.com/foss-xtensa/nnlib-hifi4) from Cadence.

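As a purely illustrative sketch, such a library typically exposes plain C
entry points that operate on raw buffers and quantization parameters, without
depending on any TFLM types. The function name and signature below are
hypothetical and are not taken from CMSIS-NN or NNLib:

```cpp
#include <cstdint>

// Hypothetical NN library entry point for a quantized (int8) fully-connected
// layer. Real libraries define their own names, argument orders, and
// conventions; the point is that the interface is expressed purely in terms
// of raw buffers and quantization parameters, independent of TFLM.
extern "C" int hw_fully_connected_s8(
    const int8_t* input, const int8_t* weights, const int32_t* bias,
    int8_t* output,
    int batches, int input_depth, int output_depth,
    int32_t input_offset, int32_t output_offset,
    int32_t output_multiplier, int output_shift,
    int32_t activation_min, int32_t activation_max);  // Returns 0 on success.
```
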
The benefits of having this API separation are:

1.  The NN library does not need to follow the style guide of the rest of the
    TFLM code.
1.  Releases of the NN library can be made independently of TFLM.
1.  The same NN library can be used and tested independently of TFLM.
1.  The maintainers of the NN library have full control over the development
    process that they would like to follow.

## Optimized Kernels

These will be (hopefully thin) wrappers that act as the glue between TFLM and
the NN library.

The goal here is to delegate as much work as possible to the NN library while
still allowing the two APIs (TFLM and NN library) to be independent of each
other. If there is a performance degradation due to this (for example,
unnecessary memory copies) then we can evaluate those on a case-by-case basis.

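For illustration only, a thin wrapper's Eval function might look roughly like
the sketch below, assuming the hypothetical `hw_fully_connected_s8()` entry
point sketched in the previous section. The `TfLite*` types come from
`tensorflow/lite/c/common.h`; the `OpData` struct and the `GetInt8Data()` /
`GetBiasData()` / `GetInt8OutputData()` accessors are stand-ins for the
per-node data and tensor helpers a real TFLM kernel would use (for example,
the helpers in `kernels/kernel_util.h`):

```cpp
#include <cstdint>

#include "tensorflow/lite/c/common.h"

// Hypothetical NN library entry point (see the previous section).
extern "C" int hw_fully_connected_s8(
    const int8_t* input, const int8_t* weights, const int32_t* bias,
    int8_t* output, int batches, int input_depth, int output_depth,
    int32_t input_offset, int32_t output_offset, int32_t output_multiplier,
    int output_shift, int32_t activation_min, int32_t activation_max);

namespace optimized_kernel_sketch {

// Stand-in for the parameters a real kernel would compute once in Prepare()
// and stash in node->user_data.
struct OpData {
  int batches, input_depth, output_depth;
  int32_t input_offset, output_offset;
  int32_t output_multiplier;
  int output_shift;
  int32_t activation_min, activation_max;
};

// Stand-ins for tensor accessors; a real kernel would use the helpers in
// tensorflow/lite/micro/kernels/kernel_util.h instead.
const int8_t* GetInt8Data(TfLiteContext* context, TfLiteNode* node, int index);
const int32_t* GetBiasData(TfLiteContext* context, TfLiteNode* node, int index);
int8_t* GetInt8OutputData(TfLiteContext* context, TfLiteNode* node, int index);

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const OpData* data = static_cast<const OpData*>(node->user_data);

  const int8_t* input = GetInt8Data(context, node, /*index=*/0);
  const int8_t* weights = GetInt8Data(context, node, /*index=*/1);
  const int32_t* bias = GetBiasData(context, node, /*index=*/2);
  int8_t* output = GetInt8OutputData(context, node, /*index=*/0);

  // Delegate all of the heavy lifting to the hardware-specific NN library.
  const int status = hw_fully_connected_s8(
      input, weights, bias, output, data->batches, data->input_depth,
      data->output_depth, data->input_offset, data->output_offset,
      data->output_multiplier, data->output_shift, data->activation_min,
      data->activation_max);

  return status == 0 ? kTfLiteOk : kTfLiteError;
}

}  // namespace optimized_kernel_sketch
```
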
This code will be reviewed and merged in the TFLM GitHub repository and must
follow the development style of the TFLM codebase.

Some amount of refactoring of the existing code may be needed to ensure that
code is suitably shared between the reference and optimized kernels. There is
currently no fixed recipe for this refactoring, and we will evaluate it on a
case-by-case basis during the PR review.

For example, to add an optimized implementation of `fully_connected` for the
Xtensa Fusion F1, the steps were:

*   [PR 1](https://github.com/tensorflow/tensorflow/pull/45464): refactor for
    reference fallbacks and a baseline latency.
*   [PR 2](https://github.com/tensorflow/tensorflow/pull/46242): refactor to
    share code between reference and optimized kernels.
*   [PR 3](https://github.com/tensorflow/tensorflow/pull/46411): add the code
    needed to use the optimized NN lib and document the latency improvement.

## Build System Integration

This module is the least well-defined, but we strongly recommend the
following:

1.  A single target makefile.inc for all the architectures that you would like
    to support, along with an optional target-specific
    [system_setup.cc](../arduino/system_setup.cc). See
    [cortex_m_generic_makefile.inc](../tools/make/targets/cortex_m_generic_makefile.inc)
    and [xtensa_makefile.inc](../tools/make/targets/xtensa_makefile.inc) as
    examples.

1.  A single `ext_libs.inc` (and associated scripts) that downloads any
    external dependencies (including the NN library). For example:

    *   [cmsis_nn.inc](../tools/make/ext_libs/cmsis_nn.inc) and
        [cmsis_download.sh](../tools/make/ext_libs/cmsis_download.sh)
    *   [xtensa.inc](../tools/make/ext_libs/xtensa.inc) and
        [xtensa_download.sh](../tools/make/ext_libs/xtensa_download.sh)

1.  The optimized kernels will then live in a kernels subdirectory (e.g.,
    [kernels/cmsis_nn](../kernels/cmsis_nn) and
    [kernels/xtensa](../kernels/xtensa)).

Two development workflows that the TFLM team would like to encourage and
support:

1.  Export a static library + headers into a target-specific development
    environment.

    *   Build a static libtensorflow-microlite.a using the TFLM makefile with:
        `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
        OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite`
    *   Use the static library and any TFLM headers as part of the overall
        application (with its own build system); see the sketch after this
        list.

1.  Integrate TFLM with an IDE:

    *   This has historically been done using the TFLM Makefile’s support for
        project generation.

    *   However, given the learning curve and high maintenance overhead, we
        are moving away from supporting project generation via the Makefile
        and are encouraging future IDE integrations to be done outside of the
        TFLM Makefiles.

    *   The TFLM team is currently working through the details on this topic.

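As a minimal sketch of the first workflow, application code built against the
exported static library and headers might look something like the following.
This assumes the current TFLM C++ API (`tflite::MicroMutableOpResolver` and
`tflite::MicroInterpreter`); the exact interpreter constructor arguments have
varied across TFLM versions, so check `micro_interpreter.h` in the tree you
exported from. `g_model_data` is a placeholder for the flatbuffer model
provided by the application:

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder: flatbuffer model data provided by the application.
extern const unsigned char g_model_data[];

namespace {
// Working memory for the interpreter; size this for your model.
constexpr int kTensorArenaSize = 16 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];
}  // namespace

int RunInferenceOnce() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the ops the model needs. The optimized implementations are
  // selected at build time via OPTIMIZED_KERNEL_DIR, not in application code.
  tflite::MicroMutableOpResolver<1> resolver;
  resolver.AddFullyConnected();

  // Note: constructor arguments differ between TFLM versions.
  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                       kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return -1;

  // Fill interpreter.input(0) with quantized input data here, then run.
  if (interpreter.Invoke() != kTfLiteOk) return -1;
  return 0;
}
```

The application's own build system then only needs to add the exported headers
to its include path and link against libtensorflow-microlite.a.
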
## Testing and Continuous Integration

The kernel tests are the primary method of ensuring that the optimized kernel
implementations are accurate.

Currently, most of the tests require the optimized output to be bit-exact with
the output of the quantized reference implementation. We can revisit this
requirement if it ends up imposing a high latency cost.

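In other words, for a given input the optimized kernel must produce exactly
the same quantized output as the reference kernel. The snippet below is a
hedged illustration of that requirement; the actual kernel tests are written
with the TFLM test infrastructure rather than this hypothetical
`ExpectBitExact()` helper:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper illustrating the bit-exactness requirement: every
// element of the optimized kernel's int8 output must match the reference
// kernel's output exactly.
bool ExpectBitExact(const int8_t* reference_output,
                    const int8_t* optimized_output, int num_elements) {
  for (int i = 0; i < num_elements; ++i) {
    if (reference_output[i] != optimized_output[i]) {
      std::printf("Mismatch at element %d: reference=%d optimized=%d\n", i,
                  reference_output[i], optimized_output[i]);
      return false;
    }
  }
  return true;
}
```
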
We strongly encourage optimized kernel implementations to have an associated
continuous build that runs through all the unit tests and publishes a build
badge to the
[TFLM community supported builds](../README.md#community-supported-builds)
table. Running the unit tests once a day is often a good place to start.