Overview

Introduction

Automotive compute platforms hosting advanced features such as Advanced Driver Assistance Systems (ADAS) and Autonomous Drive (AD) stacks require increasingly complex, higher-performance CPUs to meet their demanding workloads. In these environments, detecting application runtime faults is one strategy used to achieve the required system reliability goals.

To help reach these goals, automotive systems benefit from the addition of a Safety Island: a separate compute sub-system that provides a higher-safety-level compute area for system and application monitoring services.

The Critical Application Monitoring (CAM) project demonstrates an application observation mechanism hosted on a Safety Island which can improve the overall system fault coverage.

Principle of Operation

Critical applications often follow a pattern in which the workload is split into multiple periodic tasks chained together to form a feature pipeline. CAM’s principle of operation revolves around this pattern: the tasks generate periodic events, which are then monitored by the CAM monitoring service (a minimal sketch of this pattern follows the list below). The two main classes of issues that can be detected are:

  • Temporal issues: Events arriving outside the expected period

  • Logical issues: Events arriving out of order
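As an illustration of the pattern described above, the sketch below shows a minimal periodic task that performs one unit of work per cycle and emits a heartbeat event. This is a conceptual example only: do_work and send_heartbeat are hypothetical placeholders, not part of the CAM API.

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical placeholders: in a real system, send_heartbeat() would
     * be a call into libcam and do_work() the monitored piece of code. */
    extern void do_work(void);
    extern void send_heartbeat(uint32_t event_id);

    /* A typical periodic task in a feature pipeline: do one unit of work,
     * emit a heartbeat event, then sleep until the next fixed period. */
    void periodic_task(uint32_t event_id, long period_ms)
    {
        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (;;) {
            do_work();                 /* the monitored piece of code */
            send_heartbeat(event_id);  /* observed by the monitoring service */

            /* Advance the wake-up time by one full period to avoid drift. */
            next.tv_nsec += period_ms * 1000000L;
            next.tv_sec  += next.tv_nsec / 1000000000L;
            next.tv_nsec %= 1000000000L;
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }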

_images/cam_overview.svg

The diagram above shows the main steps and components involved. Later sections of this document describe the individual components in more detail.

cam-service is the monitoring agent that executes on the higher-safety cores in the Safety Island. cam-service exposes a socket-based communication channel.

Critical applications use this communication channel to stream their periodic events (heartbeats). The libcam library provides a high-level API that implements the message protocol used to communicate with cam-service, as well as other features.

The stream configuration file defines the number of events and their timing characteristics according to the requirements of the critical application. With the help of cam-tool, this file is converted into a binary format (stream deployment data) which is then deployed in the Safety Island to be consumed by cam-service.
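The exact layouts are specified in the Stream Configuration File Format and Stream Deployment File Format sections. Purely as an illustration, the kind of information carried by the stream deployment data might be modeled as below; every field name here is hypothetical.

    #include <stdint.h>

    /* Hypothetical model of stream deployment data; the real binary
     * layout is defined by the CSD file specification. */
    struct cam_event_timing {
        uint32_t event_id;    /* identifier of the periodic event    */
        uint64_t period_ns;   /* expected period between occurrences */
        uint64_t margin_ns;   /* tolerated deviation from the period */
    };

    struct cam_stream_deployment {
        uint8_t  uuid[16];                /* unique stream identifier   */
        uint32_t num_events;              /* number of monitored events */
        struct cam_event_timing events[]; /* per-event timing data      */
    };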

cam-service implements a driver interface to communicate with a fault manager: a system-specific component (software or hardware) responsible for taking any corrective action.

Many aspects of the CAM implementation revolve around time. The main goal for CAM is to ensure that a certain piece of code in a critical application executes periodically at a specific frequency. When the expected timing is violated, the critical application is deemed to be malfunctioning.

_images/cam_timings.svg

The diagram above illustrates the following sequence (a simplified code sketch follows the list):

  • Within the monitored piece of code, an event is created at some point in time (T0)

  • The stream deployment data provides cam-service with the event period. Together with a start sequence, a timer is set up to trigger at Tn

  • The event arrival time at cam-service (Ta) is expected to be earlier than Tn (Ta < Tn), allowing the service to disable the timer for this event.

  • If the event deadline (Tn) is missed, cam-service raises a fault with the Fault Manager, which is responsible for acting accordingly (at Tm)
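Conceptually, the per-event handling in cam-service could be sketched as follows. This is a simplified illustration of the sequence above, not the actual implementation; all identifiers are hypothetical stand-ins for the real timer and fault machinery.

    #include <stdint.h>

    /* Hypothetical per-event state tracked by the monitoring service. */
    struct monitored_event {
        uint8_t   uuid[16];   /* stream the event belongs to              */
        uint32_t  id;         /* event identifier within the stream       */
        void     *timer;      /* handle for the timer armed to fire at Tn */
    };

    extern void timer_disarm(void *timer);
    extern void arm_next_deadline(struct monitored_event *ev);
    extern void raise_fault(const uint8_t uuid[16], uint32_t event_id);

    /* Called when the event message arrives at Ta; expected Ta < Tn. */
    void on_event_arrival(struct monitored_event *ev)
    {
        timer_disarm(ev->timer);  /* deadline met: cancel the Tn timer   */
        arm_next_deadline(ev);    /* schedule the next expected deadline */
    }

    /* Called if the timer armed for Tn fires before the event arrives. */
    void on_deadline_expiry(struct monitored_event *ev)
    {
        raise_fault(ev->uuid, ev->id);  /* temporal fault, acted on at Tm */
    }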

There are different approaches to defining (T0), which affects how the deadline (Tn) is calculated. The nature of the critical application’s workloads dictates which approach is most suitable. CAM defines two main approaches:

  • Fixed periodic timers - Once the period of the events is known, events are expected to happen on a fixed (non-drifting) time interval. This operating mode covers strict/hard deadlines.

  • On-demand scheduled timers - Events are expected to happen within a period since the last event, with a +/- margin. Note that the deadline can be fixed relative to either when the previous event happened or when the event message arrived at the CAM service. When the deadline is based on when the previous event happened, the drift can only be negative (events can only happen earlier).

The current CAM implementation supports only on-demand scheduled timers, with deadlines based on when the previous event happened.
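Under the supported mode, the latest acceptable arrival time for the next event can be derived from the previous event time, the configured period, and the margin. A minimal sketch, assuming nanosecond timestamps and the hypothetical field names used earlier:

    #include <stdint.h>

    /* Sketch of the on-demand deadline calculation: the next deadline is
     * anchored to when the previous event happened (T0), plus the
     * configured period and margin. Because the anchor is the previous
     * event time rather than the message arrival time, events can only
     * drift earlier, never later. */
    static inline uint64_t next_deadline_ns(uint64_t prev_event_ns,
                                            uint64_t period_ns,
                                            uint64_t margin_ns)
    {
        return prev_event_ns + period_ns + margin_ns;  /* Tn = T0 + period + margin */
    }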

The time it takes to process the different steps depends on many variables in both software and hardware. The acceptable latencies and reaction times are ultimately defined by product requirements.

Components

The Critical Application Monitoring source code can be found at https://gitlab.arm.com/automotive-and-industrial/safety-island/critical-app-monitoring

The main components of the project can be seen in the diagram below and are described in the next sub-sections.

For further detailed information on the file formats, message protocol and libcam API, refer to Development Manual section.

_images/cam_components.svg

Build System

CAM uses CMake as its build system to build and install the entire project. The project targets Linux, with the exception of cam-service, which targets both Linux and Zephyr RTOS.

Apart from cam-tool, which is written in Python, the libraries, cam-service and cam-app-example are C-based and can be compiled with the GCC toolchain.

For more details on how to build the project and run an example application, refer to Getting Started section.

Documentation

The documentation directory in the project contains all the source code for this documentation.

The project’s main documentation is based on Sphinx. The libcam API is documented at source code level using Doxygen and is then automatically integrated into the main documentation.

Refer to Critical Application Monitoring Documentation for the latest published version of the documentation.

cam-uuid

cam-service uses Universally Unique Identifiers (UUIDs) to uniquely identify the event streams from the critical applications deployed in the system.

cam-uuid is a small library used by libcam and cam-service to manipulate UUIDs in code.

cam-app-example

cam-app-example is an example of how to use libcam. It is also a testing application used to validate cam-service.

cam-app-example accepts a number of command-line parameters to simulate different numbers of stream events and their respective timings.

It also supports error injection into the stream events to trigger fault events in cam-service.

Execute cam-app-example with --help for details on all possible parameters and features.

libcam

libcam is a C based library available to critical applications. It offers a simple, thread-safe and modular API. The build system installs both static and dynamic versions of the library. Applications need to include a single cam.h header file in their projects.

The library allows applications to create different topologies to communicate with cam-service. Applications can create one or more connections (via sockets) with cam-service. A connection supports one or more streams of events. Apart from being able to send events, applications can also control the state of the event streams (start, stop or end).
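The flow described above might look roughly like the sketch below. This is an illustration only: apart from the cam.h header, every identifier here is hypothetical; refer to the Libcam API section for the real function names and signatures.

    #include <stdint.h>

    /* Hypothetical stand-ins for the real libcam API (declared in cam.h). */
    extern void *hypothetical_cam_connect(const char *service_address);
    extern void *hypothetical_cam_stream_create(void *conn, const uint8_t uuid[16]);
    extern int   hypothetical_cam_stream_start(void *stream);
    extern int   hypothetical_cam_event_send(void *stream, uint32_t event_id);
    extern int   hypothetical_cam_stream_end(void *stream);
    extern void  hypothetical_cam_disconnect(void *conn);
    extern void  do_pipeline_step(void);

    int run_monitored_pipeline(const uint8_t stream_uuid[16])
    {
        /* One connection to cam-service can carry one or more event streams. */
        void *conn = hypothetical_cam_connect("cam-service-address");
        if (conn == NULL)
            return -1;

        void *stream = hypothetical_cam_stream_create(conn, stream_uuid);
        hypothetical_cam_stream_start(stream);

        for (int i = 0; i < 100; i++) {
            do_pipeline_step();                      /* monitored work     */
            hypothetical_cam_event_send(stream, 0);  /* periodic heartbeat */
        }

        hypothetical_cam_stream_end(stream);
        hypothetical_cam_disconnect(conn);
        return 0;
    }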

The library provides a calibration mode where events are saved into a CAM Stream Event Log (CSEL) file instead of being sent to cam-service. Such logs can later be used for timing analysis using cam-tool.

Refer to Libcam API section for the API documentation.

cam-service

cam-service is an application used for monitoring all the event streams sent by critical applications. The main goal is for cam-service to run on a higher-safety-level subsystem.

cam-service can be built for both Linux and Zephyr RTOS. The Linux port is primarily intended to provide an environment for easier development and validation. The Zephyr RTOS port provides an experience closer to a real production deployment, where a real-time operating system is more suitable.

Interface

cam-service exposes a socket interface which implements the Stream Message Protocol. This is the communication channel available for critical applications to send events. Each connection from an application spawns a new thread. One or more streams of events are initialized on a per-connection basis.
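As a generic illustration of this threading model (not the actual implementation), a POSIX accept loop that serves each connection on its own thread could look like this; handle_connection is a placeholder for the code that would implement the Stream Message Protocol.

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/socket.h>

    /* Placeholder: would parse Stream Message Protocol messages and
     * initialize one or more event streams for this connection. */
    extern void *handle_connection(void *arg);

    void serve(int listen_fd)
    {
        for (;;) {
            int client_fd = accept(listen_fd, NULL, NULL);
            if (client_fd < 0)
                continue;

            /* One thread per application connection. */
            pthread_t tid;
            if (pthread_create(&tid, NULL, handle_connection,
                               (void *)(intptr_t)client_fd) == 0)
                pthread_detach(tid);
        }
    }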

Event Stream

An event stream defines a set of periodic events to be monitored by cam-service. As part of the system deployment, cam-service must have access to the stream deployment data files of each critical application. Note that each critical application can have one or more event streams. During initialization, critical applications ‘create’ an event stream on the connection using specific commands in the message protocol. Each stream is uniquely identified by a UUID. cam-service uses the UUID to match the stream against the deployment data files available to it. The alarm and timing information found in the corresponding file is then used for the monitoring.

Refer to Stream Deployment File Format section for more information on event streams.

Fault Handling

cam-service is able to raise various faults when it observes an exception in the event streams. These include:

  • Stream state fault: Stream message which does not match expected state.

  • Stream event logic fault: Stream event out of order.

  • Stream event temporal fault: Stream event timeout.

In addition, cam-service also reports a fault when an unrecoverable error occurs in the service itself.

The fault module in cam-service is divided into a front-end and a back-end. The back-end implements a driver interface allowing platform-specific drivers to receive faults from cam-service. This allows custom modules (both software and hardware) to better accommodate the safety workflow required in a given system.
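A hypothetical sketch of what such a back-end driver interface could look like is shown below; the fault kinds mirror the list above, but the names and the structure are illustrative assumptions, not the actual cam-service API.

    #include <stdint.h>

    /* Illustrative fault kinds, mirroring the faults listed above. */
    enum cam_fault_kind {
        CAM_FAULT_STREAM_STATE,     /* message does not match expected state */
        CAM_FAULT_EVENT_LOGIC,      /* stream event out of order             */
        CAM_FAULT_EVENT_TEMPORAL,   /* stream event timeout                  */
        CAM_FAULT_SERVICE_INTERNAL, /* unrecoverable error in cam-service    */
    };

    /* Hypothetical driver interface a platform-specific back-end might
     * implement to receive faults from the front-end. */
    struct cam_fault_driver {
        const char *name;
        int (*init)(void);
        void (*report)(enum cam_fault_kind kind,
                       const uint8_t stream_uuid[16],
                       uint32_t event_id);
    };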

cam-tool

cam-tool is CAM’s Swiss army knife.

File conversion

CAM Stream Configuration (CSC) files are written in YAML format. cam-tool can convert these into CAM Stream Deployment (CSD) files ready for deployment.

Refer to Stream Configuration File Format and Stream Deployment File Format for more information on CSC and CSD file specifications.

Event Log Analysis

libcam supports a log mode where the stream of events can be saved into a CAM Stream Event Log (CSEL) log file. cam-tool has a simple analysis mode capable of reading these files to provide an initial CSC file with pre-set data extracted from the logs.

Refer to Stream Event Log File Format and Stream Configuration File Format for more information on the CSEL and CSC file specifications.

Deployment

Stream Deployment files are meant to be deployed into the system following all security- and safety-relevant processes adopted by the target platform.

However, to simplify the development lifecycle, cam-service has the option to allow deployments over the network using the same communication channel used by the event streams. cam-tool supports sending deployment files directly to cam-service using this feature.

Execute cam-tool with --help for details on all possible parameters and features.

Test Suites

libcam, cam-uuid and cam-service have their own CUnit-based unit tests. These are built and run using CMake’s CTest support.

cam-tool has a Pytest-based test suite.

cam-app-example has a Python based set of scripts used as integration tests. These tests are capable of launching both cam-app-example and cam-service on Linux.

For more details on how to build and run the tests, refer to Validation.

Contributions and Issue Reporting

This project does not currently have a process in place for contributions.

To report issues with the repository such as potential bugs, security concerns, or feature requests, submit an Issue via GitLab Issues, following the project’s template.

Feedback and Support

To request support contact Arm at support@arm.com. Arm licensees may also contact Arm via their partner managers.