From FIWARE Forge Wiki

Name: FIWARE.OpenSpecification.Data.CompressedDomainVideoAnalysis
Chapter: Data/Context Management
Catalogue-Link to Implementation: N/A
Owner: Siemens AG, Marcus Laumer



This document contains a self-contained open specification of a FIWARE Generic Enabler. Please also consult the FIWARE Product Vision and similar pages in order to understand the complete context of the FIWARE platform.

FIWARE WIKI editorial remark:
This page corresponds to Release 3 of FIWARE. The latest version associated to the latest Release is linked from FIWARE Architecture


Copyright © 2012-2014 by Siemens AG

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.


In the media era of the web, much content is user-generated (UGC) and spans every possible kind, from amateur to professional recordings of nature, parties, and more. In this context, video content analysis provides several advantages: it can classify content for later search, or provide additional information about the content itself.

The Compressed Domain Video Analysis GE consists of a set of tools for analyzing video streams in the compressed domain, i.e., the received streams are either processed directly without prior decoding, or only a few relevant elements of the stream are parsed for use within the analysis.

Target Usage

The target users of the Compressed Domain Video Analysis GE are all applications that want to extract meaningful information from video content and that need to automatically find characteristics in video data. The GE can work for previously stored video data as well as for video data streams (e.g., received from a camera in real time).

User roles in different industries addressed by this Generic Enabler are:

  • Telecom industry: Identify characteristics in video content recorded by single mobile users; identify commonalities in the recordings across several mobile users (e.g., within the same cell).
  • Mobile users: (Semi-) automated annotation of recorded video content, point of interest recognition and tourist information in augmented reality scenarios, social services (e.g., facial recognition).
  • IT companies: Automated processing of video content in databases.
  • Surveillance industry: Automated detection of relevant events (e.g., alarms, etc.).
  • Marketing industry: Object/brand recognition and sales information offered (shops near user, similar products, etc.).

Basic Concepts

Block-Based Hybrid Video Coding

Video coding is required whenever a sequence of pictures has to be stored or transferred efficiently. The most common method to compress video content is the so-called block-based hybrid video coding technique. A single frame of the raw video content is divided into several smaller blocks, and each block is processed individually. Hybrid means that both the encoder and the decoder consist of a combination of motion compensation and prediction error coding techniques. A block diagram of a hybrid video coder is depicted in the figure below.

Block diagram of a block-based hybrid video coder

A hybrid video coder can be divided into several generic components:

  • Coder Control: Controls all other components to fulfill pre-defined stream properties, like a certain bit rate or quality. (Indicated by colored block corners)
  • Intra-Frame Encoder: This component usually performs a transform to the frequency domain, followed by quantization and scaling of the transform coefficients.
  • Intra-Frame Decoder: To avoid drift between encoder and decoder, the encoder includes a decoder; this component reverses the previous encoding step.
  • In-Loop Filter: This filter component could be a set of consecutive filters. The most common filter operation here is deblocking.
  • Motion Estimator: Comparing blocks of the current frame with regions in previous and/or subsequent frames permits modeling the motion between these frames.
  • Motion Compensator: According to the results of the Motion Estimator, this component compensates the estimated motion by creating a predictor for the current block.
  • Intra-Frame Predictor: If the control decides to use intra-frame coding techniques, this component creates a predictor for the current block by just using neighbouring blocks of the current frame.
  • Entropy Encoder: The information gathered during the encoding process is entropy encoded in this component. Usually, a resource-efficient variable length coding technique (e.g., CAVLC in H.264/AVC) or even an arithmetic coder (e.g., CABAC in H.264/AVC) is used.

During the encoding process, the predicted video data p[x,y,k] (where x and y are the Cartesian coordinates of a sample within the k-th frame) is subtracted from the raw video data r[x,y,k]. The resulting prediction error signal e[x,y,k] = r[x,y,k] − p[x,y,k] is then intra-frame and entropy encoded.

The decoder within the encoder adds the encoded-and-decoded error signal e'[x,y,k] to the predicted video data p[x,y,k] to obtain the reconstructed video data r'[x,y,k] = p[x,y,k] + e'[x,y,k]. These reconstructed frames are stored in the Frame Buffer. During the motion compensation process, frames preceding and/or following the current frame ( r'[x,y,k+i], i ∈ Z \ {0} ) are extracted from the buffer.
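The encoder-side signal relations above can be sketched numerically. The block values and the quantization model below are hypothetical placeholders, used only to show how the reconstruction r' differs from the raw data r because the error signal is coded lossily:

```python
import numpy as np

# Illustrative 4x4 block: raw samples r and a (hypothetical) flat prediction p.
r = np.array([[10, 12, 11, 13],
              [12, 14, 13, 15],
              [11, 13, 12, 14],
              [13, 15, 14, 16]], dtype=np.int16)
p = np.full((4, 4), 12, dtype=np.int16)

# Encoder side: prediction error e[x,y,k] = r[x,y,k] - p[x,y,k]
e = r - p

# Transform coding and quantization are lossy; model them here by
# coarse rounding with a quantization step size of 2.
e_reconstructed = (np.round(e / 2.0) * 2).astype(np.int16)

# Decoder side (also present inside the encoder):
# r'[x,y,k] = p[x,y,k] + e'[x,y,k]
r_reconstructed = p + e_reconstructed

# The reconstruction deviates from the raw data by at most the rounding error.
print(int(np.abs(r - r_reconstructed).max()))  # → 1
```

The same additions and subtractions occur in the real coder per block, only with a transform (e.g., DCT) between the subtraction and the quantization.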

Compressed Domain Video Analysis

In the literature, there are several techniques for different video post-processing steps. Most of them operate in the so-called pixel domain, meaning that all processing is performed directly on the actual pixel values of a video image. To this end, all compressed video data has to be decoded before analysis algorithms can be applied. A simple processing chain of pixel domain approaches is depicted in the figure below.

A simple pixel domain processing chain

The simplest way of analyzing video content is to watch it on an appropriate display. For example, a surveillance camera could transmit images of a security-relevant area to be evaluated by a watchman. Although this mode obviously finds application in practice, two major problems keep it from being applicable to all systems. First, someone needs to keep track of the monitors at all times; the mode is therefore real-time capable, but also quite expensive. Second, it is hardly scalable: if a surveillance system has a large number of cameras installed, it is nearly impossible to keep track of all monitors at the same time, so the efficiency of this mode decreases as the number of sources increases.

Besides manual analysis of video content, automated analysis has become more and more important in recent years. First, the video content received from the network has to be decoded, and the decoded video frames are stored in a frame buffer so that they can be accessed during the analysis procedure. Based on these video frames, an analysis algorithm, e.g., object detection and tracking, can be performed. A main advantage over manual analysis is that this mode is usually easily scalable and less expensive. However, due to the decoding process, the frame buffer operations, and the usually high computing time of pixel domain detection algorithms, this mode is not always real-time capable and furthermore has a high complexity.

Due to the limitations of pixel domain approaches, more and more attempts were made to transfer the video analysis procedures from pixel domain to compressed domain. Working within compressed domain means to work directly on compressed data. The following figure gives an example for a compressed domain processing chain.

A simple compressed domain processing chain

Because the preceding decoder is omitted, the received data can be processed directly. An integrated syntax parser extracts the required elements from the data stream and passes them to the analysis. As a result, the analysis becomes less computationally intensive, since the costly decoding process does not have to be passed through completely. Furthermore, this solution consumes fewer resources, as video frames no longer need to be stored in a buffer. Compared to pixel domain techniques, the resulting approach is usually more efficient and scales better.


The Compressed Domain Video Analysis GE consists of a set of tools for analyzing video streams in the compressed domain. Its purpose is to avoid costly video content decoding prior to the actual analysis. Thereby, the tool set processes video streams by analyzing compressed or just partially decoded syntax elements. The main benefit is its very fast analysis due to a hierarchical architecture. The following figure illustrates the functional blocks of the GE. Note that codoan is the name of the tool that represents the reference implementation of this GE. Therefore, in some figures one will find the term codoan instead of CDVA GE.

CDVA GE – Functional description

The components of the Compressed Domain Video Analysis GE are Media Interface, Media (Stream) Analysis, Metadata Interface, Control, and the API. They are described in detail in the following subsections. A realization of a Compressed Domain Video Analysis GE consists of a composition of different types of realizations for the five building blocks (i.e., components). The core functionality of the realization is determined by the selection of the Media (Stream) Analysis component (and the related subcomponents). Input and output format are determined by the selection of the inbound and outbound interface component, i.e., Media Interface and Metadata Interface components. The interfaces are stream-oriented.

Media Interface

The Media Interface receives media data in different formats. Several streams/files can be accessed in parallel (e.g., different RTP sessions can be handled). Two different usage scenarios are considered:

  • Media Storage: A multimedia file has already been generated and is stored on a server in a file system or in a database. For analysis, the media file can be accessed independently of the original timing. This means that analysis can happen slower or faster than real-time and random access on the timed media data can be performed. The corresponding subcomponent is able to process the following file types:
    • RTP dump file format used by the RTP Tools, as described in [rtpdump]
    • An ISO-based file format (e.g., MP4), according to ISO/IEC 14496-12 [ISO08], is envisioned
  • Streaming Device: A video stream is generated by a device (e.g., a video camera) and streamed over a network using dedicated transport protocols (e.g., RTP, DASH). For analysis, the media stream can be accessed only in its original timing, since the stream is generated in real time. The corresponding subcomponent is able to process the following stream types:
    • Real-time Transport Protocol (RTP) packet streams as standardized in RFC 3550 [RFC3550]. Payload formats to describe the contained compression format can be further specified (e.g., RFC 6184 [RFC6184] for the H.264/AVC payload).
    • Media sessions established using RTSP (RFC 2326 [RFC2326])
    • HTTP-based video streams (e.g., REST-like APIs). URLs/URIs could be used to identify the relevant media resources (envisioned).
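For the streaming scenario, the Media Interface has to interpret at least the fixed RTP header of each incoming packet before any analysis can use fields such as the marker bit or the timestamp. A minimal sketch of parsing that 12-byte header as defined in RFC 3550 (the sample packet bytes are made up for illustration):

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header defined in RFC 3550."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,            # should be 2 for RFC 3550 streams
        "padding": (b0 >> 5) & 0x1,
        "extension": (b0 >> 4) & 0x1,
        "csrc_count": b0 & 0x0F,
        "marker": b1 >> 7,             # often set on the last packet of a frame
        "payload_type": b1 & 0x7F,     # e.g., mapped to H.264/AVC via RFC 6184
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

# Synthetic example packet: version 2, marker set, payload type 96, seq 1.
pkt = bytes([0x80, 0xE0, 0x00, 0x01]) \
    + (1234).to_bytes(4, "big") + (0xDEADBEEF).to_bytes(4, "big")
h = parse_rtp_header(pkt)
print(h["version"], h["marker"], h["payload_type"], h["sequence_number"])
# → 2 1 96 1
```

Grouping packets into frames (e.g., by identical timestamp) is then a straightforward next step on top of these fields.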

Note that according to the scenario (file or stream) the following component either operates in the Media Analysis or Media Stream Analysis mode. Some subcomponents of the Media (Stream) Analysis component are codec-independent. Subcomponents on a lower abstraction level are able to process H.264/AVC video streams.

Media (Stream) Analysis

The main component is the Media (Stream) Analysis component. The GE operates in the compressed domain, i.e., the video data is analyzed without prior decoding. This allows for low-complexity and therefore resource-efficient processing and analysis of the media stream. The analysis can happen on different semantic layers of the compressed media (e.g., packet layer, symbol layer, etc.). The higher (i.e., more abstract) the layer, the lower the necessary computing power. Some schemes work codec-agnostic (i.e., across a variety of compression/media formats) while other schemes require a specific compression format.

Currently the following subcomponents are integrated:

  • Event (Change) Detection
    • Receiving RTP packets and evaluating their size and number per frame leads to a robust detection of global changes
    • Codec-independent
    • No decoding required
    • For more details see [CDA]
  • Moving Object Detection
    • Analyzing H.264/AVC video streams
    • Evaluating syntax elements leads to a robust detection of moving objects
    • For more details see [ODA]
  • Person Detection
    • Special case of Moving Object Detection
    • If previous knowledge about the actual objects exists, i.e., the objects are persons, the detection can be further enhanced
  • Object Tracking
    • Detected objects will be tracked over several subsequent frames
    • Works with preceding Moving Object Detection and Person Detection subcomponents
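To illustrate the codec-independent Event (Change) Detection idea, the sketch below flags frames whose aggregated packet size deviates strongly from a running average. The window size, threshold, and trace are placeholder assumptions, not the actual algorithm from [CDA]:

```python
def detect_global_changes(frame_sizes, window=10, threshold=2.0):
    """Flag frame indices whose aggregated RTP payload size is much larger
    than the running average over the preceding `window` frames.
    Illustrative only; the real [CDA] detector uses a more robust model."""
    changes = []
    for i, size in enumerate(frame_sizes):
        if i < window:
            continue  # not enough history yet
        avg = sum(frame_sizes[i - window:i]) / window
        if avg > 0 and size / avg > threshold:
            changes.append(i)
    return changes

# Synthetic trace: steady frame sizes, then a scene-wide change at frame 30.
# (A global change forces the encoder to spend many more bits on that frame.)
sizes = [1000] * 30 + [5000] + [1200] * 9
print(detect_global_changes(sizes))  # → [30]
```

The key point matches the description above: only packet sizes and counts per frame are needed, so no decoding and no codec-specific parsing is required.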

In principle, the analysis operations can be done in real time. In practical implementations, this depends on computational resources, the complexity of the algorithm and the quality of the implementation. In general, low complexity implementations are targeted for the realization of this GE. In some more sophisticated realizations of this GE (e.g., crawling through a multimedia database), a larger time span of the stream is needed for analysis. In this case, real-time processing is in principle not possible and also not intended.

Metadata Interface

The Metadata Interface uses a metadata format suitable for subsequent processing. The format could, for instance, be HTTP-based (e.g., RESTful APIs) or XML-based. An XML example is given in section Main Interactions.

The Media (Stream) Analysis component detects either events or moving objects/persons. The Metadata Interface accordingly provides information about detected global changes and moving objects/persons within the analyzed streams. This information is sent to previously registered Sinks. Sinks can be added by Users of the GE by sending corresponding requests to the API.


Control

The Control component is used to control the aforementioned components of the Compressed Domain Video Analysis GE. Furthermore, it processes requests received via the API. Thereby, it creates and handles a separate instance of the GE for each stream to be analyzed.


API

The RESTful API defines an interface that enables Users of the GE to request several operations via standard HTTP requests. These operations are described in detail in the following section.

Main Interactions

The API is a RESTful API that permits easy integration with web services or other components requiring analyses of compressed video streams. The following operations are defined:

  • Returns the current version of the CDVA GE implementation.
  • Lists all available instances.
  • Creates a new instance. The URI of the compressed video stream, and whether events and/or moving objects/persons should be detected (and tracked), have to be provided with this request.
  • Returns information about a created instance.
  • Destroys a previously created and stopped instance.
  • Starts the corresponding instance. This actually starts the analysis.
  • Stops a previously started instance.
  • Returns the current configuration of an instance. This includes the configurations of all activated analysis algorithms.
  • Configures one or more analysis algorithms of the corresponding instance.
  • Lists all registered sinks of the corresponding instance.
  • Adds a new sink to an instance. A URI has to be provided that enables the GE to notify the sink in case of detections.
  • Returns information about a previously added sink.
  • Removes a previously added sink from an instance. Once the sink is removed, it will no longer be notified in case of detections.

The following figure shows an example of a typical usage scenario (two analyzer instances are attached to different media sources). Note that responses and notifications are not shown for reasons of clarity and comprehensibility.

CDVA GE – Usage scenario

First of all, Sink 1 requests a list of all created instances (listInstances). As no instance has been created so far, Sink 1 creates (createInstance) and configures (configureInstance) a new instance for analyzing a specific video stream. To get notified in case of events and/or moving objects/persons, Sink 1 adds itself as a sink to this instance (addSink). The actual analysis is finally started by sending a startInstance request.

During the analysis of the video stream, a second sink, Sink 2, also requests a list of instances (listInstances). As Sink 2 is also interested in the results of the analysis Sink 1 previously started, it also adds itself to this instance (addSink), just before Sink 1 removes itself from the instance (removeSink) to not get notified anymore. Additionally, Sink 2 wants another video stream to be analyzed and therefore creates (createInstance) and configures (configureInstance) a new instance, adds itself to this instance (addSink) and starts the analysis (startInstance).

While receiving the results of the second analysis, Sink 2 removes itself from the first instance (removeSink) and requests to stop the analysis (stopInstance) and to destroy the instance (destroyInstance). Note that the instance will only be destroyed if all sinks have been removed. At the end of this scenario, Sink 2 finally removes itself from the second instance (removeSink) and also requests to stop the analysis of this instance (stopInstance) and to destroy this instance (destroyInstance).
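The instance/sink lifecycle in this scenario can be modeled as a small state machine. The class below is a hypothetical sketch (the method names mirror the operations in the scenario; the URIs are made up), not the actual GE implementation:

```python
class AnalyzerInstance:
    """Minimal model of the instance/sink lifecycle described above."""

    def __init__(self, stream_uri: str):
        self.stream_uri = stream_uri
        self.sinks = set()
        self.state = "created"

    def add_sink(self, sink_uri: str) -> None:
        self.sinks.add(sink_uri)

    def remove_sink(self, sink_uri: str) -> None:
        self.sinks.discard(sink_uri)

    def start(self) -> None:
        self.state = "running"

    def stop(self) -> None:
        self.state = "stopped"

    def destroy(self) -> None:
        # The scenario notes an instance is only destroyed once all
        # sinks have been removed and the analysis has been stopped.
        if self.sinks:
            raise RuntimeError("sinks still registered")
        if self.state != "stopped":
            raise RuntimeError("instance must be stopped first")
        self.state = "destroyed"

# Replaying part of the scenario with hypothetical URIs:
inst = AnalyzerInstance("rtsp://camera-1/stream")   # createInstance
inst.add_sink("http://sink1/notify")                # addSink (Sink 1)
inst.start()                                        # startInstance
inst.add_sink("http://sink2/notify")                # addSink (Sink 2)
inst.remove_sink("http://sink1/notify")             # removeSink (Sink 1)
inst.remove_sink("http://sink2/notify")             # removeSink (Sink 2)
inst.stop()                                         # stopInstance
inst.destroy()                                      # destroyInstance
print(inst.state)  # → destroyed
```

Calling destroy() while sinks are still registered, or before stop(), raises an error, which mirrors the constraint stated in the scenario.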

Event and object metadata that are sent to registered sinks are encapsulated in an XML-based Scene Description format, according to the ONVIF specifications [ONVIF]. The XML root element is called MetadataStream. The following code block depicts a brief example to illustrate the XML structure:

<?xml version="1.0" encoding="UTF-8"?>
<MetadataStream xmlns="">
  <Frame UtcTime="2012-05-10T18:12:05.432Z" codoan:FrameNumber="100">
    <Object ObjectId="0">
      <BoundingBox bottom="15.0" top="5.0" right="25.0" left="15.0"/>
      <CenterOfGravity x="20.0" y="10.0"/>
    </Object>
    <Object ObjectId="1">
      <BoundingBox bottom="25.0" top="15.0" right="35.0" left="25.0"/>
      <CenterOfGravity x="30.0" y="20.0"/>
    </Object>
  </Frame>
</MetadataStream>

Note that not all elements are mandatory to compose a valid XML document according to the corresponding ONVIF XML Schema.
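A sink receiving such metadata could extract the detected objects with a standard XML parser. In the sketch below, the namespace URIs are placeholders (the real ones come from the ONVIF schema, which the example above omits):

```python
import xml.etree.ElementTree as ET

# Namespace URIs are placeholders; the actual URIs are defined by the ONVIF schema.
xml_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<MetadataStream xmlns="urn:example:metadatastream"
                xmlns:codoan="urn:example:codoan">
  <Frame UtcTime="2012-05-10T18:12:05.432Z" codoan:FrameNumber="100">
    <Object ObjectId="0">
      <BoundingBox bottom="15.0" top="5.0" right="25.0" left="15.0"/>
      <CenterOfGravity x="20.0" y="10.0"/>
    </Object>
    <Object ObjectId="1">
      <BoundingBox bottom="25.0" top="15.0" right="35.0" left="25.0"/>
      <CenterOfGravity x="30.0" y="20.0"/>
    </Object>
  </Frame>
</MetadataStream>"""

ns = {"m": "urn:example:metadatastream"}
root = ET.fromstring(xml_doc)

# Print the id and center of gravity of every detected object in the frame.
for obj in root.findall(".//m:Object", ns):
    cog = obj.find("m:CenterOfGravity", ns)
    print(obj.get("ObjectId"), cog.get("x"), cog.get("y"))
# → 0 20.0 10.0
# → 1 30.0 20.0
```

A real sink would receive these documents as notification payloads and feed the bounding boxes and centers of gravity into its own application logic.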

Basic Design Principles

  • Critical product attributes for the Compressed Domain Video Analysis GE are especially high detection rates with only few false positives, and low-complexity operation.
  • Partitioning into independent functional blocks enables the GE to support a variety of analysis methods on several media types and to be easily extended by new features. Several operations can even be combined.
  • Low-complexity algorithms and implementations enable the GE to perform very fast analyses and to be highly scalable.
  • GE implementations support performing parallel analyses using different subcomponents.


[ISO08] ISO/IEC 14496-12:2008, Information technology – Coding of audio-visual objects – Part 12: ISO base media file format, Oct. 2008.
[RFC2326] H. Schulzrinne, A. Rao, and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, Apr. 1998.
[RFC3550] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications", RFC 3550, Jul. 2003.
[RFC6184] Y.-K. Wang, R. Even, T. Kristensen, R. Jesup, "RTP Payload Format for H.264 Video", RFC 6184, May 2011.
[CDA] M. Laumer, P. Amon, A. Hutter, and A. Kaup, "A Compressed Domain Change Detection Algorithm for RTP Streams in Video Surveillance Applications", MMSP 2011, Oct. 2011.
[ODA] M. Laumer, P. Amon, A. Hutter, and A. Kaup, "Compressed Domain Moving Object Detection Based on H.264/AVC Macroblock Types", VISAPP 2013, Feb. 2013.
[ONVIF] ONVIF Specifications
[rtpdump] rtpdump format specified by RTP Tools

Detailed Specifications

Following is a list of Open Specifications linked to this Generic Enabler. Specifications labeled as "PRELIMINARY" are considered stable but subject to minor changes derived from lessons learned during the last iterations of the development of a first reference implementation planned for the current Major Release of FI-WARE. Specifications labeled as "DRAFT" are planned for future Major Releases of FI-WARE but are provided for the sake of future users.

Open API Specifications

Re-utilised Technologies/Specifications

The following technologies/specifications are incorporated in this GE:

  • ISO/IEC 14496-12:2008, Information technology – Coding of audio-visual objects – Part 12: ISO base media file format
  • Real-Time Transport Protocol (RTP) / RTP Control Protocol (RTCP) as defined in RFC 3550
  • Real-Time Streaming Protocol (RTSP) as defined in RFC 2326
  • RTP Payload Format for H.264 Video as defined in RFC 6184
  • ONVIF Specifications
  • rtpdump format as defined in RTP Tools

Terms and definitions

This section comprises a summary of terms and definitions introduced in the previous sections. It intends to establish a vocabulary that will help to carry out discussions internally and with third parties (e.g., Use Case projects in the EU FP7 Future Internet PPP). For a summary of terms and definitions managed at the overall FI-WARE level, please refer to FIWARE Global Terms and Definitions.

  • Data refers to information that is produced, generated, collected or observed and that may be relevant for processing, carrying out further analysis and knowledge extraction. Data in FIWARE has an associated data type and value. FIWARE will support a set of built-in basic data types similar to those existing in most programming languages. Values linked to basic data types supported in FIWARE are referred to as basic data values. As an example, basic data values like '2', '7' or '365' belong to the integer basic data type.
  • A data element refers to data whose value is defined as consisting of a sequence of one or more <name, type, value> triplets, referred to as data element attributes, where the type and value of each attribute is either mapped to a basic data type and a basic data value or mapped to the data type and value of another data element.
  • Context in FIWARE is represented through context elements. A context element extends the concept of data element by associating an EntityId and EntityType to it, uniquely identifying the entity (which in turn may map to a group of entities) in the FIWARE system to which the context element information refers. In addition, there may be some attributes as well as meta-data associated to attributes that we may define as mandatory for context elements as compared to data elements. Context elements are typically created containing the value of attributes characterizing a given entity at a given moment. As an example, a context element may contain values of some of the attributes “last measured temperature”, “square meters” and “wall color” associated to a room in a building. Note that there might be many different context elements referring to the same entity in a system, each containing the value of a different set of attributes. This allows that different applications handle different context elements for the same entity, each containing only those attributes of that entity relevant to the corresponding application. It will also allow representing updates on set of attributes linked to a given entity: each of these updates can actually take the form of a context element and contain only the value of those attributes that have changed.
  • An event is an occurrence within a particular system or domain; it is something that has happened, or is contemplated as having happened, in that domain. Events typically lead to the creation of some data or context element describing or representing the events, thus allowing them to be processed. As an example, a sensor device may be measuring the temperature and pressure of a given boiler, sending a context element every five minutes associated to that entity (the boiler) that includes the value of these two attributes (temperature and pressure). The creation and sending of the context element is an event, i.e., what has occurred. Since the data/context elements that are generated linked to an event are the way events get visible in a computing system, it is common to refer to these data/context elements simply as "events".
  • A data event refers to an event leading to creation of a data element.
  • A context event refers to an event leading to creation of a context element.
  • An event object is a programming entity that represents an event in a computing system [EPIA], like event-aware GEs. Event objects allow performing operations on events, also known as event processing. Event objects are defined as a data element (or a context element) representing an event, to which a number of standard event object properties (similar to a header) are internally associated. These standard event object properties support certain event processing functions.