.. _safety_overview:

Zephyr Safety Overview
########################

Introduction
************

This document provides an overview of the safety-relevant activities of the Zephyr Project and of
what the project and the Zephyr Safety Working Group / Committee aim to achieve.

It is intended for people who are interested in the functional safety development of the Zephyr
RTOS and for project members who want to contribute to the safety aspects of the project.

Overview
********

This section gives the reader an overview of the general goal of the safety certification, which
standard we aim to comply with, and which quality standards and processes need to be implemented
to reach such a certification.

Safety Document update
**********************

This document is a living document and may evolve over time as new requirements, guidelines, or
processes are introduced.

#. Changes are submitted by the interested parties via pull requests to the Zephyr
   documentation repository.

#. The Zephyr Safety Committee reviews these changes and either provides feedback or accepts
   them.

#. Once accepted, these changes become part of the document.

General safety scope
********************

The general goal of the Safety Committee is to achieve certification according to the `IEC 61508
<https://en.wikipedia.org/wiki/IEC_61508>`__ standard at Safety Integrity Level (SIL) 3 /
Systematic Capability (SC) 3 for a limited source scope (see certification scope TBD). Since the
code base is pre-existing, we use the route 3s/1s approach defined by the IEC 61508 standard.

Route 3s
   *Assessment of non-compliant development.* This is basically route 1s applied to existing
   sources.

Route 1s
   *Compliant development. Compliance with the requirements of this standard for the avoidance and
   control of systematic faults in software.*

Summary of the IEC 61508 standard
=================================

The IEC 61508 standard is a widely recognized international standard for the functional safety of
electrical, electronic, and programmable electronic safety-related systems. The key safety aspects
of the standard include:

#. **Hazard and Risk Analysis**: The IEC 61508 standard requires a thorough analysis of potential
   hazards and risks associated with a system in order to determine the appropriate level of safety
   measures needed to reduce those risks to acceptable levels.

#. **Safety Integrity Level (SIL)**: The standard introduces the concept of Safety Integrity Level
   (SIL) to classify the level of risk reduction required for each safety function. The higher the
   SIL, the greater the level of risk reduction required.

#. **System Design**: The IEC 61508 standard requires a systematic approach to system design that
   includes the identification of safety requirements, the development of a safety plan, and the
   use of appropriate safety techniques and measures to ensure that the system meets the required
   SIL.

#. **Verification and Validation**: The standard requires rigorous testing and evaluation of the
   safety-related system to ensure that it meets the specified SIL and other safety requirements.
   This includes verification of the system design, validation of the system's functionality, and
   ongoing monitoring and maintenance of the system.

#. **Documentation and Traceability**: The IEC 61508 standard requires a comprehensive
   documentation process to ensure that all aspects of the safety-related system are fully
   documented and that there is full traceability from the safety requirements to the final system
   design and implementation.

Overall, the IEC 61508 standard provides a framework for the design, development, and
implementation of safety-related systems that aims to reduce the risk of accidents and improve
overall safety. By following the standard, organizations can demonstrate that their safety-related
systems are designed and implemented to the safety integrity level required for their application.

Why IEC 61508?
==============

The IEC 61508 standard was selected because it serves as a foundational functional safety standard
applicable across various industry sectors. It provides a robust framework that can be used as a
base for sector-specific standards in different industries. This makes IEC 61508 particularly
relevant for Zephyr, as the operating system's versatility allows it to be used across a wide range
of industry sectors.

The following diagram illustrates the relationship between the IEC 61508 standard and other related
standards:

.. figure:: images/IEC-61508-basis.svg
   :align: center
   :alt: IEC 61508 relation to other standards
   :figclass: align-center

   IEC 61508 relation to other standards

Quality
*******

Quality is a mandatory expectation for software across the industry. The code base of the project
must achieve various software quality goals in order to be considered an auditable code base from a
safety perspective and to be usable for certification purposes. However, software quality is not an
additional requirement caused by functional safety standards. Functional safety considers quality
an existing pre-condition, and therefore the "quality managed" status should be pursued for any
project regardless of its functional safety goals. The following list describes the quality goals
which need to be reached to achieve an auditable code base:

1. Basic software quality standards

   a. :ref:`coding_guidelines` (including: static code analysis, coding style, etc.)
   b. :ref:`safety_requirements` and requirements tracing
   c. Test coverage

2. Software architecture design principles

   a. Layered architecture model
   b. Encapsulated components
   c. Encapsulated single functionality (if not reasonable and manageable in safety)

Basic software quality standards - Safety view
==============================================

In this chapter the Safety Committee describes why the quality goals listed above are needed as a
pre-condition and what needs to be done to achieve an auditable code base from the safety
perspective. Generally speaking, all of these quality measures are used to minimize the error rate
during code development.

Coding Guidelines
-----------------

The coding guidelines are the basis for a common understanding, a unified ruleset, and a unified
development style for industrial software products. For safety, the coding guidelines are essential
and serve another purpose besides providing a unified ruleset: it is necessary to prove that the
developers follow a unified development style in order to prevent **systematic errors** in the
process of developing software and thus to minimize the overall **error rate** of the complete
software system.

The **IEC 61508 standard** also sets a pre-condition and recommendation towards the use of coding
standards / guidelines to reduce the likelihood of errors.

The project TSC and the Safety Committee of the project agreed to implement
a staged and incremental approach for complying with a set of coding rules (AKA
Coding Guidelines) to improve quality and consistency of the code base. Below
are the agreed upon stages:

Stage I (COMPLETED)
  Coding guideline rules are available to be followed and referenced, but are
  not enforced: CI does not check them and reviewers/approvers cannot block
  pull requests due to violations.

Stage II
  Reviewers/approvers can block pull requests across the codebase due to
  violations of the coding guidelines.

  Begin enforcement on a limited scope of the code base. Initially, this would be
  the safety certification scope. For rules that are easily applied across the
  codebase, we should not limit compliance to the initial scope. This step
  requires tooling, CI setup and an enforcement strategy.

Stage III
  Revisit the coding guideline rules and, based on experience from previous
  stages, refine/iterate on selected rules.

Stage IV
  Expand enforcement to the wider codebase. Exceptions may be granted for some
  areas of the codebase with a proper justification. Exceptions require
  TSC approval.

.. note::

    Coding guideline rules may be removed/changed at any time by filing a
    GitHub issue/RFC.

.. important::

    **Current stage:**
    The prerequisites to complete **Stage II** are currently being looked at:
    the tooling is being evaluated, and the CI setup and `enforcement strategy
    <https://github.com/zephyrproject-rtos/zephyr/issues/58903>`__ are being worked on.

Requirements and requirements tracing
-------------------------------------

Requirements and requirements management are important not only for software development in
general, but also for safety. On the one hand, requirements specify in detail and on a technical
level what the software should do. On the other hand, they are a necessary tool to verify whether
the described functionality is implemented as expected. For this purpose, the requirements are
traced down to the code level. With requirements management and tracing in place, it can be
verified whether the functionality has been implemented and tested correctly, which minimizes the
systematic error rate.

The IEC 61508 standard also highly recommends requirements and requirements tracing, which in
practice makes them a must-have for the certification.
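
As a minimal sketch of what tracing can look like at code level, the following C fragment links
requirement identifiers to the tests that verify them and counts the requirements that no test
traces to. All identifiers (``REQ-THREAD-001``, ``test_thread_create``, ``count_untraced`` and so
on) are hypothetical and only illustrate the idea; they do not reflect the project's actual
requirement IDs or tooling.

```c
#include <stddef.h>
#include <string.h>

/* One trace link: a test case that claims to verify a requirement.
 * All IDs below are hypothetical placeholders, not real Zephyr IDs. */
struct trace_link {
    const char *req_id;
    const char *test_id;
};

static const char *const requirements[] = {
    "REQ-THREAD-001",
    "REQ-THREAD-002",
    "REQ-SEM-001",
};

static const struct trace_link links[] = {
    { "REQ-THREAD-001", "test_thread_create" },
    { "REQ-THREAD-001", "test_thread_create_invalid_prio" },
    { "REQ-SEM-001",    "test_sem_take_timeout" },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Returns 1 if at least one test is linked to req_id, otherwise 0. */
int is_traced(const char *req_id)
{
    for (size_t i = 0; i < ARRAY_LEN(links); i++) {
        if (strcmp(links[i].req_id, req_id) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Counts requirements that no test traces to; a non-zero result is a gap. */
size_t count_untraced(void)
{
    size_t missing = 0;

    for (size_t i = 0; i < ARRAY_LEN(requirements); i++) {
        if (!is_traced(requirements[i])) {
            missing++;
        }
    }
    return missing;
}
```

With the data above, ``count_untraced()`` returns 1 because nothing traces to ``REQ-THREAD-002``.
Such a gap would have to be closed or justified before an audit.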

Test coverage
-------------

High test coverage, in turn, provides evidence that the code does what it was developed for and
does not execute any unforeseen instructions. If the entire code base is tested with high (ideally
100%) coverage, faulty changes are also detected quickly, which further reduces the error rate.
Note, however, that safety imposes specific requirements on test coverage: the IEC 61508 standard
prescribes several metrics for the SIL 3 / SC 3 target. Among other things, the following must be
fulfilled:

* Structural test coverage (entry points) 100%
* Structural test coverage (statements) 100%
* Structural test coverage (branches) 100%

If 100% cannot be reached (e.g. statement coverage of defensive code), the affected part needs to
be described and justified in the documentation.
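
The difference between these metrics can be illustrated with a small C function. This is a sketch
only; ``clamp_non_negative`` is a hypothetical helper and not part of the Zephyr code base.

```c
/* Hypothetical helper: maps negative inputs to 0, leaves others unchanged. */
int clamp_non_negative(int value)
{
    int result = value;

    /* One decision with two outcomes (true / false), but no else branch. */
    if (value < 0) {
        result = 0;
    }
    return result;
}
```

Calling only ``clamp_non_negative(-5)`` enters the function (entry point coverage) and executes
every statement (statement coverage), but the ``false`` outcome of the ``if`` is never taken, so
branch coverage is incomplete. Adding a call such as ``clamp_non_negative(7)`` closes that gap.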

Software architecture design principles
=======================================

To create and maintain a structured software product, the software architecture must also be
designed and implemented in accordance with safety standards, because some designs and
implementations are not reasonable in a safety context. Only then can the overall software and code
base be used as auditable code. Most of these architectural design principles are already
implemented in the Zephyr project, but they still need to be verified by the Safety Committee /
Safety Working Group and the safety architect.

Layered architecture model
--------------------------

The **IEC 61508 standard** strongly recommends a modular approach to software architecture. This
approach has been pursued in the Zephyr project from the beginning with its layered architecture.
The idea behind this architecture is to organize modules or components with similar functionality
into layers. As a result, each layer can be assigned a specific role in the system. For safety,
this model has the advantage that the interfaces between components and layers can be shown at a
very high level. This makes it possible to determine which functionalities are safety-relevant and
to limit the safety scope accordingly. Furthermore, various analyses and documentation that are
important for certification and for the responsible certification body can be built on top of this
architecture.

Encapsulated components
-----------------------

Encapsulated components are an essential part of the architecture design for safety. The most
important aspect is the separation of safety-relevant components from non-safety-relevant
components, including their associated interfaces. This ensures that the components have no
**repercussions** on other components.
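
As an illustration of the principle, the following C sketch shows one common way to encapsulate a
component: its state is private to one source file and can only be changed through a narrow,
validating interface. The names ``limit_set``, ``limit_get`` and the limits used are hypothetical
and do not correspond to an existing Zephyr API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Public interface: the only way other components can reach the state. */
bool limit_set(uint32_t limit_ms);
uint32_t limit_get(void);

/* Private part: file-scope (static) data is invisible to other source files. */
#define LIMIT_MIN_MS 10U
#define LIMIT_MAX_MS 1000U

static uint32_t current_limit_ms = LIMIT_MIN_MS;

bool limit_set(uint32_t limit_ms)
{
    /* Reject out-of-range input at the component boundary. */
    if ((limit_ms < LIMIT_MIN_MS) || (limit_ms > LIMIT_MAX_MS)) {
        return false;
    }
    current_limit_ms = limit_ms;
    return true;
}

uint32_t limit_get(void)
{
    return current_limit_ms;
}
```

Because callers cannot write ``current_limit_ms`` directly, an invalid request such as
``limit_set(5000U)`` is rejected and leaves the stored value untouched.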

Encapsulated single functionality (if not reasonable and manageable in safety)
------------------------------------------------------------------------------

Another requirement for the overall system and software environment is that individual
functionalities can be disabled within components. If a functionality is absolutely unacceptable
for safety (e.g. complete dynamic memory management), it must be possible to turn it off. The
Zephyr Project already offers this possibility through Kconfig and its flexible configurability.
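
The following C sketch shows the general pattern of compiling a functionality out via a
configuration macro. The option name ``CONFIG_APP_DYNAMIC_BUFFERS`` and the functions
``buffer_get`` and ``buffer_capacity`` are assumptions made for this example only; they are not
existing Zephyr symbols.

```c
#include <stddef.h>

/* Hypothetical option name. In a Zephyr build, Kconfig turns enabled options
 * into CONFIG_* preprocessor macros; here the macro is left undefined, which
 * models a safety configuration with dynamic allocation switched off. */
/* #define CONFIG_APP_DYNAMIC_BUFFERS 1 */

#define BUF_SIZE 64U

#ifdef CONFIG_APP_DYNAMIC_BUFFERS
#include <stdlib.h>

/* Non-safety build: the buffer comes from the heap. */
unsigned char *buffer_get(void)
{
    return malloc(BUF_SIZE);
}
#else
/* Safety build: malloc() is not compiled in at all; storage is static. */
static unsigned char static_buf[BUF_SIZE];

unsigned char *buffer_get(void)
{
    return static_buf;
}
#endif

/* Capacity is identical in both configurations. */
size_t buffer_capacity(void)
{
    return BUF_SIZE;
}
```

With the option disabled, the heap-based variant is removed by the preprocessor, so the unwanted
functionality is absent from the resulting binary and not merely unused.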

Processes and workflow
**********************

.. figure:: images/zephyr-safety-process.svg
   :align: center
   :alt: Safety process and workflow overview
   :figclass: align-center

   Safety process and workflow overview

The diagram describes the rough process defined by the Safety Committee to ensure safety in the
development of the Zephyr project. To ensure understanding, a few points need to be highlighted and
some details explained regarding the roles of the safety architect and the Safety Committee in the
whole process. The diagram only describes the paths that are possible when a change is related to
safety.

#. On the main branch, the safety scope of the project should be identified, which typically
   represents a small subset of the entire code base. This subset should then be made auditable
   during normal development on “main”, which means that special attention is paid to quality goals
   (`Quality`_) and safety processes within this scope. The safety architect works alongside the
   Technical Steering Committee (TSC) in this area, monitoring the development process to ensure
   that the architecture meets the safety requirements.

#. At this point, the safety architect plays an increasingly important role. For PRs/issues that
   fall within the safety scope, the safety architect should ideally be involved in the discussions
   and decisions on minor changes in the safety scope, in order to be able to react to
   safety-relevant changes that are not conformant. If a pull request or issue introduces a
   significant and influential change or improvement that requires extended discussion or
   decision-making, the safety architect should bring it to the attention of the Safety Committee
   or the Technical Steering Committee (TSC) as appropriate, so that they can decide on the best
   course of action.

#. This section describes the certification side. At this point, the code base has to be in an
   "auditable" state, and ideally no further changes should be necessary or made to the code base.
   There is still a path from the main branch to this area. This is needed in case a serious bug is
   found or an important change is implemented on the main branch in the safety scope after the LTS
   and the auditable branch were created. In this case, the Safety Committee, together with the
   safety architect, must decide whether this bug fix or change should be integrated into the LTS
   so that it can also be integrated into the auditable branch. This integration can take three
   forms: a code change only, an update to the safety documentation only, or both.

#. This describes the safety process required for the certification itself. Here, the final
   analyses, tests, and documents that are prescribed by the certification body and the targeted
   standard are conducted and created. If the certification body approves everything at this stage
   and the safety process is completed, a safety release can be created and published.

#. This transition from the auditable branch to the main branch should only occur in exceptional
   circumstances, specifically when something has been identified during the certification process
   that needs to be quickly adapted on the “auditable” branch in order to obtain certification. In
   order to prevent this issue from arising again during the next certification, there needs to be
   a path to merge these changes back into the main branch so that they are not lost, and to have
   them ready for the next certification if necessary.

.. important::
   Safety should neither block the project nor limit its room to grow in any way.

.. important::
   **TODO:** Find and define ways, guidelines and processes which are suitable for safety but
   have minimal impact on the daily work of the maintainers, reviewers and contributors, as well as
   the safety architect.
