Written by Deepak Molly Mathew (Fraunhofer ITWM)
In earlier CAPE updates, we introduced the main elements of CAPE:
Edge Micro Data Centers (EMDCs) and embedded High Performance Servers (eHPS) as a new “unit of computing” at the edge, providing a fully composable hardware platform based on COM-HPC modules and PCIe/CXL, and a cloud-agnostic software stack that describes infrastructure and applications as code. This update focuses on Edge AI, one of the CAPE validation use cases.
The main goal is:
Run modern deep-learning pipelines directly on edge servers
and send only compact, privacy-preserving information to the cloud.
We target the processing of real-world video streams by running vision-language models on the CAPE hardware and software stack.
Why Edge AI?
Many smart-city and industrial operators already deploy large numbers of cameras and sensors. Today, data is often sent to a central cloud, processed by deep-learning services, and then converted into dashboards or alarms.
Video processing in particular comes with several challenges:
- Latency: For safety-critical use cases, a cloud round trip of more than 100 ms can be too slow.
- Bandwidth and cost: Continuous upload of video from many cameras is expensive.
- Privacy and data sovereignty: In Europe, operators often want to pre-process raw data on-premise and transmit only lightweight detection results to centralized servers.
The Edge AI use case moves deep-learning inference to the edge, so that processing happens near the sensors. Traditional cloud servers still support data evaluation but are used mainly for aggregation, long-term analytics, and (re)training.
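To make the bandwidth and privacy argument concrete, the following minimal sketch compares the size of one uncompressed Full HD frame with a compact detection message. The message schema (`site`, `camera`, `label`, `score`, `timestamp`) and the helper names `make_event` and `raw_frame_bytes` are hypothetical illustrations, not CAPE's actual format.

```python
import json


def make_event(site: str, camera: str, label: str, score: float, timestamp: float) -> bytes:
    """Serialize one detection result as a compact JSON message (hypothetical schema)."""
    event = {
        "site": site,
        "camera": camera,
        "label": label,
        "score": round(score, 3),
        "timestamp": timestamp,
    }
    # Compact separators avoid spending bytes on whitespace.
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


def raw_frame_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size in bytes of one uncompressed frame with 8 bits per channel."""
    return width * height * channels


# Illustrative values only; they are not measurements from CAPE.
event = make_event("site-a", "cam-07", "accident", 0.9132, 1700000000.0)
frame = raw_frame_bytes(1920, 1080)
print(len(event), "bytes per event vs", frame, "bytes per raw frame")
```

Cameras usually stream compressed video, so the real gap is smaller than this raw-frame comparison suggests, but the message still contains no image data at all.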
Scenario: video analytics at the edge
We work with a smart-city video analytics pipeline for anomaly and event detection, covering scenarios such as unusual human behavior, safety incidents, and irregular traffic. The anomaly labels are predefined (e.g. fight, vandalism, accident, normal) and expressed as short text prompts, which are encoded by a vision–language model (e.g. CLIP). This leverages the model’s strong semantic knowledge to interpret scenes consistently across edge sites, reducing the need for retraining and avoiding the transmission of raw video to the cloud.
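The prompt-based labeling can be sketched as follows: each label prompt and each video frame (or clip) is mapped to an embedding, and the label whose embedding is most similar to the frame embedding wins. This is a minimal sketch that assumes the embeddings have already been computed by a vision-language model; the four-dimensional vectors, the `temperature` value, and the function names are made-up stand-ins for illustration, not real CLIP outputs or the project's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_labels(frame_embedding, prompt_embeddings, temperature=0.1):
    """Turn frame-to-prompt similarities into a probability per label (softmax)."""
    logits = {
        label: cosine(frame_embedding, emb) / temperature
        for label, emb in prompt_embeddings.items()
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy embeddings (assumption): in practice these come from the text and image encoders.
prompts = {
    "fight": [1.0, 0.0, 0.0, 0.0],
    "vandalism": [0.0, 1.0, 0.0, 0.0],
    "accident": [0.0, 0.0, 1.0, 0.0],
    "normal": [0.0, 0.0, 0.0, 1.0],
}
frame = [0.1, 0.2, 0.9, 0.3]

probs = score_labels(frame, prompts)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because the labels are plain text prompts, changing the label set only requires re-encoding the prompts rather than retraining the model.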
To adapt to previously unseen scenarios or labels, the pipeline extends to a federated Weakly Supervised Video Anomaly Detection (WSVAD) setup. This enables partial model retraining at the edge, where raw data remains local, and only model parameter updates are shared with the cloud.
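The text does not specify how the cloud combines the parameter updates. As one common option, the sketch below assumes a sample-weighted average in the style of federated averaging (FedAvg). The parameter name `head.weight`, the sample counts, and the update values are hypothetical.

```python
def aggregate_updates(updates):
    """Average parameter deltas from several edge sites, weighted by local sample count.

    `updates` is a list of (num_samples, delta) pairs, where `delta` maps a
    parameter name to a list of floats. Raw video never appears here.
    """
    total = sum(num_samples for num_samples, _ in updates)
    if total <= 0:
        raise ValueError("at least one site must contribute samples")
    aggregated = {}
    for num_samples, delta in updates:
        weight = num_samples / total
        for name, values in delta.items():
            acc = aggregated.setdefault(name, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += weight * value
    return aggregated


def apply_update(params, delta):
    """Add an aggregated delta to the global parameters; untouched parameters stay as-is."""
    return {
        name: [p + d for p, d in zip(values, delta.get(name, [0.0] * len(values)))]
        for name, values in params.items()
    }


# Only a small trainable part of the model is updated (partial retraining).
global_params = {"head.weight": [0.0, 1.0]}
site_updates = [
    (100, {"head.weight": [0.2, -0.2]}),  # edge site A
    (300, {"head.weight": [0.6, 0.2]}),   # edge site B
]

new_params = apply_update(global_params, aggregate_updates(site_updates))
print(new_params)
```

Weighting by sample count gives sites with more local data more influence; other aggregation rules would fit the same interface.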
Mapping to CAPE’s hardware platforms
Our main implementation focus is a RISC-V deep-learning accelerator running on an FPGA, integrated into the eHPS. We use CXL for host–accelerator synchronization and kernel offload, aiming for lower overhead compared to a classic loosely coupled PCIe model. The host prepares and schedules workloads, while compute-heavy deep learning inference kernels run on the FPGA-based RISC-V engine. CXL enables efficient buffer sharing and fast coordination between host and accelerator.
Evaluation Concept
We evaluate the combined benefits of RISC-V and CXL by running representative kernels and comparing the results against commercial off-the-shelf (COTS) accelerator solutions, focusing on offload overhead, performance, and energy efficiency. This use case helps show where open RISC-V acceleration plus CXL can offer a strong, practical path for low-latency edge AI within CAPE’s hardware platforms.
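The three evaluation criteria can be expressed as simple per-kernel metrics. The definitions below are one plausible reading, assumed for illustration: offload overhead as the share of end-to-end time not spent computing, performance as throughput, and energy efficiency as inferences per joule. The `KernelRun` fields and the example numbers are placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class KernelRun:
    total_s: float      # end-to-end wall time, including host-accelerator offload
    compute_s: float    # time spent executing the kernel on the accelerator
    avg_power_w: float  # average power draw during the run, in watts
    inferences: int     # number of inferences completed


def offload_overhead(run: KernelRun) -> float:
    """Fraction of the end-to-end time spent outside the kernel itself."""
    return (run.total_s - run.compute_s) / run.total_s


def throughput(run: KernelRun) -> float:
    """Inferences per second over the whole run."""
    return run.inferences / run.total_s


def inferences_per_joule(run: KernelRun) -> float:
    """Energy efficiency: energy in joules is average power times run time."""
    return run.inferences / (run.avg_power_w * run.total_s)


# Placeholder values for illustration only.
run = KernelRun(total_s=2.0, compute_s=1.5, avg_power_w=10.0, inferences=100)
print(offload_overhead(run), throughput(run), inferences_per_joule(run))
```

Computing the same three numbers for each accelerator under test keeps the comparison between the RISC-V engine and COTS solutions on equal terms.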


