..
 #
 # Critical Application Monitoring (CAM)
 #
 # SPDX-FileCopyrightText: Copyright 2023-2024 Arm Limited
 # and/or its affiliates
 #
 # SPDX-License-Identifier: BSD-3-Clause
 #

.. _overview:

########
Overview
########

************
Introduction
************

Automotive compute platforms hosting advanced features such as Advanced Driver
Assistance Systems (ADAS) and Autonomous Drive (AD) stacks require increasingly
complex, higher-performance CPUs to handle their demanding workloads. In these
environments, detecting application runtime faults is one strategy used to
achieve the required system reliability goals.

To help reach these goals, automotive systems benefit from the addition of a
Safety Island: a separate compute subsystem that provides a higher-safety-level
compute area for system and application monitoring services.

The Critical Application Monitoring (CAM) project demonstrates an application
observation mechanism hosted on a Safety Island which can improve the overall
system fault coverage.

**********************
Principle of Operation
**********************

Critical applications often follow a pattern where the workloads are split into
multiple periodic tasks chained together to produce a feature pipeline. CAM's
principle of operation revolves around this pattern: such tasks generate
periodic events, which are then monitored by the CAM monitoring service. The
two main classes of issues that can be detected are:

* Temporal issues: Events arriving outside the expected period
* Logical issues: Events arriving out of order

.. image:: images/cam_overview.svg
   :align: center

The diagram above shows the main steps and components involved. Later sections
of this document describe the individual components in more detail.

**cam-service** is the monitoring agent that executes on the higher-safety
cores in the Safety Island. **cam-service** exposes a socket-based
communication channel.
Critical applications use this communication channel to stream their periodic
events (heartbeats).

The **libcam** library provides a high-level API that implements the message
protocol used to communicate with **cam-service**, as well as other features.

The **stream configuration file** defines the number of events and their timing
characteristics according to the requirements of the critical application. With
the help of **cam-tool**, this file is converted into a binary format (**stream
deployment data**), which is then deployed in the Safety Island to be consumed
by **cam-service**.

**cam-service** implements a driver interface to communicate with a **fault
manager**, a system-specific component (software or hardware) responsible for
taking any action.

Many aspects of the CAM implementation revolve around time. The main goal of
CAM is to ensure that a certain piece of code in a critical application
executes periodically at a specific frequency. When this timing is violated,
the critical application is deemed to be malfunctioning.

.. image:: images/cam_timings.svg
   :align: center

The diagram above describes the following sequence:

* Within the monitored piece of code, an event is created at some point in time
  (T\ :sub:`0`)
* The **stream deployment data** provides **cam-service** with the event
  period. Together with a start sequence, a timer is set up to trigger at
  T\ :sub:`n`
* The event arrival time in **cam-service** (T\ :sub:`a`) is expected to be ` section for the API documentation.

===========
cam-service
===========

**cam-service** is an application used for monitoring all the event streams
sent by critical applications. It is intended to run on a higher-safety-level
subsystem.

**cam-service** can be built for both Linux and Zephyr RTOS. The Linux port is
primarily intended to provide an environment for easier development and
validation.
The Zephyr RTOS port more closely resembles a production deployment, where a
real-time operating system is more suitable.

**Interface**

**cam-service** exposes a socket interface which implements the
:ref:`Stream Message Protocol `. This is the communication channel available
for critical applications to send events. Each connection from an application
spawns a new thread. One or more streams of events are initialized on a
per-connection basis.

**Event Stream**

An event stream defines a set of periodic events to be monitored by
**cam-service**. As part of the system deployment, **cam-service** must have
access to the stream deployment data files of each critical application. Note
that each critical application can have one or more event streams.

During initialization, critical applications 'create' an event stream on the
connection using specific commands in the message protocol. Each stream is
uniquely identified by a UUID. **cam-service** uses the UUID to find the
matching stream deployment data file among those available to it. The alarm and
timing settings found in that file are then used for monitoring.

Refer to :ref:`CSD_File` section for more information on event streams.

**Fault Handling**

**cam-service** is able to raise various faults when it observes an exception
in the event streams. These include:

* Stream state fault: Stream message which does not match the expected state.
* Stream event logic fault: Stream event out of order.
* Stream event temporal fault: Stream event timeout.

In addition, when an unrecoverable error occurs in **cam-service** itself, it
also reports the fault.

The fault module in **cam-service** is divided into a front-end and a back-end.
The back-end implements a driver interface allowing platform-specific drivers
to receive faults from **cam-service**. This allows custom modules (both
software and hardware) to better accommodate the safety workflow required in a
given system.
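
The three stream fault classes listed above can be illustrated with a small,
self-contained Python sketch. It is not the **cam-service** implementation:
the ``StreamMonitor`` class, the ``Fault`` enum, the ``FaultDriver`` callable
type and every parameter name are hypothetical. As a further simplification,
the deadline is checked when the next event arrives rather than by a timer.

.. code-block:: python

   from dataclasses import dataclass
   from enum import Enum
   from typing import Callable, List


   class Fault(Enum):
       STATE = "stream state fault"
       LOGIC = "stream event logic fault"
       TEMPORAL = "stream event temporal fault"


   # Hypothetical back-end driver: any callable that receives a fault.
   FaultDriver = Callable[[Fault, str], None]


   @dataclass
   class StreamMonitor:
       """Toy monitor for a single stream (illustration only)."""
       event_ids: List[int]   # expected event order within one cycle
       timeout_us: int        # maximum gap allowed between two events
       driver: FaultDriver    # back-end that receives raised faults
       started: bool = False
       next_index: int = 0
       deadline_us: int = 0

       def start(self, now_us: int) -> None:
           self.started = True
           self.next_index = 0
           self.deadline_us = now_us + self.timeout_us

       def on_event(self, event_id: int, now_us: int) -> None:
           if not self.started:
               # Message does not match the expected stream state.
               self.driver(Fault.STATE, f"event {event_id} before start")
               return
           if now_us > self.deadline_us:
               self.driver(Fault.TEMPORAL, f"event {event_id} arrived late")
           if event_id != self.event_ids[self.next_index]:
               self.driver(Fault.LOGIC, f"event {event_id} out of order")
           else:
               self.next_index = (self.next_index + 1) % len(self.event_ids)
           # Re-arm the deadline for the following event.
           self.deadline_us = now_us + self.timeout_us


   raised = []
   monitor = StreamMonitor([0, 1], 1000, lambda fault, msg: raised.append(fault))
   monitor.on_event(0, 0)      # stream not started yet
   monitor.start(0)
   monitor.on_event(0, 500)    # in order and on time
   monitor.on_event(0, 900)    # event 1 was expected
   monitor.on_event(1, 2500)   # deadline was 1900

Passing the driver in as a callable loosely mirrors the front-end/back-end
split: the monitoring logic stays the same while the platform decides what
receiving a fault means.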

========
cam-tool
========

**cam-tool** is CAM's Swiss army knife.

**File conversion**

CAM Stream Configuration (CSC) files are written in YAML format. **cam-tool**
can convert these into CAM Stream Deployment (CSD) files ready for deployment.

Refer to :ref:`CSC_File` and :ref:`CSD_File` for more information on CSC and
CSD file specifications.

**Event Log Analysis**

**libcam** supports a log mode where the stream of events can be saved into a
CAM Stream Event Log (CSEL) file. **cam-tool** has a simple analysis mode
capable of reading these files to produce an initial CSC file with pre-set data
extracted from the logs.

Refer to :ref:`CSEL_File` and :ref:`CSC_File` for more information on CSEL and
CSC file specifications.

**Deployment**

Stream Deployment files are meant to be deployed into the system following all
security- and safety-relevant processes adopted by the target platform. To
simplify the development lifecycle, however, **cam-service** has the option to
allow deployments over the network using the same communication channel used by
the event streams. **cam-tool** supports sending deployment files directly to
**cam-service** using this feature.

Execute **cam-tool** with ``--help`` for details on all possible parameters and
features.

===========
Test Suites
===========

**libcam**, **cam-uuid** and **cam-service** have their own CUnit-based unit
tests. These are built and run using CMake's CTest support.

**cam-tool** has Pytest-based tests.

**cam-app-example** has a Python-based set of scripts used as integration
tests. These tests are capable of launching both **cam-app-example** and
**cam-service** on Linux.

For more details on how to build and run the tests, refer to
:ref:`development_validation`.

*********************************
Contributions and Issue Reporting
*********************************

This project does not currently have a process for contributions.
To report issues with the repository such as potential bugs, security concerns,
or feature requests, submit an Issue via `GitLab Issues`_, following the
project's template.

********************
Feedback and Support
********************

To request support, contact Arm at support@arm.com. Arm licensees may also
contact Arm via their partner managers.