
FIWARE.ArchitectureDescription.Data.CompressedDomainVideoAnalysis R3


FIWARE WIKI editorial remark:
This page corresponds to Release 3 of FIWARE. The latest version associated to the latest Release is linked from FIWARE Architecture



Copyright © 2012-2014 by Siemens AG

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.


In the media era of the web, much content is user-generated (UGC) and covers every conceivable kind of material, from amateur to professional recordings of nature, parties, and so on. In such a context, video content analysis offers several advantages: it helps classify content for later search and provides additional information about the content itself.

The Compressed Domain Video Analysis GE consists of a set of tools for analyzing video streams in the compressed domain, i.e., the received streams are either processed directly without prior decoding, or only a few relevant elements of the stream are parsed and used within the analysis.

Target Usage

The target users of the Compressed Domain Video Analysis GE are all applications that want to extract meaningful information from video content and that need to automatically find characteristics in video data. The GE can work for previously stored video data as well as for video data streams (e.g., received from a camera in real time).

User roles in different industries addressed by this Generic Enabler are:

  • Telecom industry: Identify characteristics in video content recorded by single mobile users; identify commonalities in the recordings across several mobile users (e.g., within the same cell).
  • Mobile users: (Semi-) automated annotation of recorded video content, point of interest recognition and tourist information in augmented reality scenarios, social services (e.g., facial recognition).
  • IT companies: Automated processing of video content in databases.
  • Surveillance industry: Automated detection of relevant events (e.g., alarms).
  • Marketing industry: Object/brand recognition and sales information offered (shops near user, similar products, etc.).

Basic Concepts

Block-Based Hybrid Video Coding

Video coding is always required if a sequence of pictures has to be stored or transferred efficiently. The most common method to compress video content is the so-called block-based hybrid video coding technique. A single frame of the raw video content is divided into several smaller blocks and each block is processed individually. Hybrid means that the encoder as well as the decoder consists of a combination of motion compensation and prediction error coding techniques. A block diagram of a hybrid video coder is depicted in the figure below.

Block diagram of a block-based hybrid video coder

A hybrid video coder can be divided into several generic components:

  • Coder Control: Controls all other components to fulfill pre-defined stream properties, like a certain bit rate or quality (indicated by colored block corners in the diagram).
  • Intra-Frame Encoder: This component usually performs a transform to the frequency domain, followed by quantization and scaling of the transform coefficients.
  • Intra-Frame Decoder: To avoid a drift between encoder and decoder, the encoder includes a decoder. Therefore, this component reverses the previous encoding step.
  • In-Loop Filter: This filter component may consist of several consecutive filters. The most common filter operation here is deblocking.
  • Motion Estimator: Comparing blocks of the current frame with regions in previous and/or subsequent frames permits modeling the motion between these frames.
  • Motion Compensator: According to the results of the Motion Estimator, this component compensates the estimated motion by creating a predictor for the current block.
  • Intra-Frame Predictor: If the control decides to use intra-frame coding techniques, this component creates a predictor for the current block by just using neighbouring blocks of the current frame.
  • Entropy Encoder: The information gathered during the encoding process is entropy encoded in this component. Usually, a resource-efficient variable length coding technique (e.g., CAVLC in H.264/AVC) or even an arithmetic coder (e.g., CABAC in H.264/AVC) is used.

During the encoding process, the predicted video data p[x,y,k] (where x and y are the Cartesian coordinates within the k-th sample, i.e., frame) is subtracted from the raw video data r[x,y,k]. The resulting prediction error signal e[x,y,k] is then intra-frame and entropy encoded.

The decoder within the encoder adds the encoded and subsequently decoded error signal e'[x,y,k] to the predicted video data p[x,y,k] to obtain the reconstructed video data r'[x,y,k]. These reconstructed frames are stored in the Frame Buffer. During the motion compensation process, frames that precede and/or follow the current frame ( r'[x,y,k+i], i ∈ Z \ {0} ) are extracted from the buffer.
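
The relations e = r - p and r' = e' + p can be traced in a few lines of Python. The following is a minimal sketch, not the coder from the figure: it assumes one-dimensional frames, uses the previous reconstructed frame as the only predictor (no motion search), and replaces transform, quantization, and entropy coding with a plain uniform quantizer. All function names are illustrative.

```python
def quantize(value, step):
    """Uniform quantizer with reconstruction; stands in for the lossy
    transform/quantization stage of the Intra-Frame Encoder."""
    return round(value / step) * step


def encode_sequence(raw_frames, step):
    """Closed-loop predictive coding of a list of frames (lists of samples).

    Returns the quantized prediction errors e' and the encoder-side
    reconstructions r'.
    """
    prediction = [0] * len(raw_frames[0])   # nothing to predict from yet
    errors, reconstructions = [], []
    for raw in raw_frames:
        # e = r - p, followed by the lossy step that yields e'
        error = [quantize(r - p, step) for r, p in zip(raw, prediction)]
        # r' = e' + p, computed by the decoder inside the encoder
        recon = [e + p for e, p in zip(error, prediction)]
        errors.append(error)
        reconstructions.append(recon)
        prediction = recon                  # frame buffer holds r', not r
    return errors, reconstructions


def decode_sequence(errors):
    """Stand-alone decoder: repeats r' = e' + p with the same predictor."""
    prediction = [0] * len(errors[0])
    output = []
    for error in errors:
        recon = [e + p for e, p in zip(error, prediction)]
        output.append(recon)
        prediction = recon
    return output
```

Because the encoder predicts from r' rather than from r, decode_sequence reproduces the encoder-side reconstructions exactly. This is the drift-free behaviour that the Intra-Frame Decoder inside the encoder is meant to guarantee.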

Compressed Domain Video Analysis

In the literature, there are several techniques for different video post-processing steps. Most of them operate in the so-called pixel domain, which means that any processing is performed directly on the actual pixel values of a video image. To this end, all compressed video data has to be decoded before analysis algorithms can be applied. A simple processing chain of pixel domain approaches is depicted in the figure below.

A simple pixel domain processing chain

The simplest way of analyzing video content is to watch it on a suitable display. For example, a surveillance camera could transmit images of a security-relevant area to be evaluated by a guard. Although this mode is used in practice, it does not suit all systems because of two major problems. First, someone has to watch the monitors at all times. This mode is therefore real-time capable but quite expensive. Second, it hardly scales. If a surveillance system has a large number of cameras installed, it is nearly impossible to watch all monitors at the same time, so the efficiency of this mode decreases as the number of sources grows.

Besides manual analysis of video content, automated analysis has become increasingly important in recent years. First, the video content received from the network has to be decoded. The decoded video frames are stored in a frame buffer so that they can be accessed during the analysis. Based on these frames, an analysis algorithm (e.g., object detection and tracking) can be run. The main advantage over manual analysis is that this mode usually scales easily and is less expensive. However, because of the decoding process, the frame buffer operations, and the usually high computing time of pixel domain detection algorithms, this mode is not always real-time capable and is also highly complex.

Due to the limitations of pixel domain approaches, more and more attempts have been made to move video analysis procedures from the pixel domain to the compressed domain. Working in the compressed domain means operating directly on the compressed data. The following figure gives an example of a compressed domain processing chain.

A simple compressed domain processing chain

Because the preceding decoder is omitted, it is possible to work directly with the received data. At the same time, the integrated syntax parser makes it possible to extract individual required elements from the data stream and to use them for the analysis. As a result, the analysis becomes less computationally intensive because the costly decoding process does not always have to be run in full. Furthermore, this solution consumes fewer resources since the video frames no longer have to be stored in a buffer. Compared to pixel domain techniques, the resulting technique is usually more efficient and more scalable.
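
To give an idea of what such a syntax parser reads, the sketch below extracts two kinds of H.264/AVC syntax elements without decoding any picture data: the one-byte NAL unit header and unsigned Exp-Golomb codes, ue(v), which H.264/AVC uses for many header fields. The bit layouts follow the H.264/AVC specification. The class and function names are illustrative and do not come from the GE. Removal of emulation prevention bytes is left out for brevity.

```python
class BitReader:
    """Reads single bits, fixed-length fields, and ue(v) codes from bytes."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def read_bit(self) -> int:
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1   # most significant bit first
        self.pos += 1
        return bit

    def read_bits(self, count: int) -> int:
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value

    def read_ue(self) -> int:
        """Unsigned Exp-Golomb: n leading zeros, a one, then n info bits."""
        leading_zeros = 0
        while self.read_bit() == 0:
            leading_zeros += 1
        return (1 << leading_zeros) - 1 + self.read_bits(leading_zeros)


def parse_nal_header(first_byte: int) -> dict:
    """Splits the first byte of an H.264/AVC NAL unit into its three fields."""
    return {
        "forbidden_zero_bit": first_byte >> 7,
        "nal_ref_idc": (first_byte >> 5) & 0x03,
        "nal_unit_type": first_byte & 0x1F,   # e.g., 5 = IDR slice, 7 = SPS
    }
```

For example, parse_nal_header(0x65) reports nal_unit_type 5, and a BitReader over the bytes 0xA6 0x40 yields the ue(v) values 0, 1, 2, and 3 in sequence. An analysis step that only needs header-level elements can stop here and skip the rest of the decoding chain.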


The Compressed Domain Video Analysis GE consists of a set of tools for analyzing video streams in the compressed domain. Its purpose is to avoid costly video content decoding prior to the actual analysis. To this end, the tool set processes video streams by analyzing compressed or only partially decoded syntax elements. The main benefit is very fast analysis, made possible by a hierarchical architecture. The following figure illustrates the functional blocks of the GE. Note that codoan is the name of the tool that represents the reference implementation of this GE. Therefore, the term codoan appears in some figures instead of CDVA GE.

CDVA GE – Functional description

The components of the Compressed Domain Video Analysis GE are Media Interface, Media (Stream) Analysis, Metadata Interface, Control, and the API. They are described in detail in the following subsections. A realization of a Compressed Domain Video Analysis GE consists of a composition of different types of realizations for the five building blocks (i.e., components). The core functionality of the realization is determined by the selection of the Media (Stream) Analysis component (and the related subcomponents). Input and output format are determined by the selection of the inbound and outbound interface component, i.e., Media Interface and Metadata Interface components. The interfaces are stream-oriented.

Media Interface

The Media Interface receives the media data in different formats. Several streams/files can be accessed in parallel (e.g., different RTP sessions can be handled). Two different usage scenarios are considered:

  • Media Storage: A multimedia file has already been generated and is stored on a server in a file system or in a database. For analysis, the media file can be accessed independently of the original timing. This means that analysis can happen slower or faster than real-time and random access on the timed media data can be performed. The corresponding subcomponent is able to process the following file types:
    • RTP dump file format used by the RTP Tools, as described in [rtpdump]
    • An ISO-based file format (e.g., MP4), according to ISO/IEC 14496-12 [ISO08], is envisioned
  • Streaming Device: A video stream is generated by a device (e.g., a video camera) and streamed over a network using dedicated transport protocols (e.g., RTP, DASH). For analysis, the media stream can be accessed only in its original timing, since the stream is generated in real time. The corresponding subcomponent is able to process the following stream types:
    • Real-time Transport Protocol (RTP) packet streams as standardized in RFC 3550 [RFC3550]. Payload formats to describe the contained compression format can be further specified (e.g., RFC 6184 [RFC6184] for the H.264/AVC payload).
    • Media sessions established using RTSP (RFC 2326 [RFC2326])
    • HTTP-based video streams (e.g., REST-like APIs). URLs/URIs could be used to identify the relevant media resources (envisioned).
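
As an illustration of the RTP stream type listed above, the following sketch parses the 12-byte fixed RTP header defined in RFC 3550. The field layout is taken from the RFC. The function name and the returned dictionary are illustrative choices and not part of the GE. Header extensions are detected but not skipped, so the returned payload is only exact for packets without an extension.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parses the fixed RTP header (RFC 3550, section 5.1) of one packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: V/P/X/CC, M/PT, sequence number, timestamp, SSRC
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F
    header_length = 12 + 4 * csrc_count      # CSRC list follows the fixed part
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": csrc_count,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[header_length:],
    }
```

All packets that carry data of the same video frame share one RTP timestamp, which is what allows packets to be grouped per frame without inspecting the payload.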

Note that, depending on the scenario (file or stream), the following component operates in either Media Analysis or Media Stream Analysis mode. Some subcomponents of the Media (Stream) Analysis component are codec-independent. Subcomponents on a lower abstraction level are able to process H.264/AVC video streams.

Media (Stream) Analysis

The main component is the Media (Stream) Analysis component. The GE operates in the compressed domain, i.e., the video data is analyzed without prior decoding. This allows for low-complexity and therefore resource-efficient processing and analysis of the media stream. The analysis can happen on different semantic layers of the compressed media (e.g., packet layer, symbol layer, etc.). The higher (i.e., more abstract) the layer, the lower the necessary computing power. Some schemes are codec-agnostic (i.e., they work across a variety of compression/media formats), while other schemes require a specific compression format.

Currently the following subcomponents are integrated:

  • Event (Change) Detection
    • Receiving RTP packets and evaluating their size and number per frame leads to a robust detection of global changes
    • Codec-independent
    • No decoding required
    • For more details see [CDA]
  • Moving Object Detection
    • Analyzing H.264/AVC video streams
    • Evaluating syntax elements leads to a robust detection of moving objects
    • For more details see [ODA]
  • Person Detection
    • Special case of Moving Object Detection
    • If previous knowledge about the actual objects exists, i.e., the objects are persons, the detection can be further enhanced
  • Object Tracking
    • Detected objects will be tracked over several subsequent frames
    • Works with preceding Moving Object Detection and Person Detection subcomponents

In principle, the analysis operations can be done in real time. In practical implementations, this depends on computational resources, the complexity of the algorithm and the quality of the implementation. In general, low complexity implementations are targeted for the realization of this GE. In some more sophisticated realizations of this GE (e.g., crawling through a multimedia database), a larger time span of the stream is needed for analysis. In this case, real-time processing is in principle not possible and also not intended.

Metadata Interface

The Metadata Interface uses a metadata format suitable for subsequent processing. The format could, for instance, be HTTP-based (e.g., RESTful APIs) or XML-based. An XML example is given in section Main Interactions.

The Media (Stream) Analysis subcomponent either detects events or moving objects/persons. Therefore, the Metadata Interface provides information about detected global changes and moving objects/persons within the analyzed streams. This information is sent to previously registered Sinks. Sinks can be added by Users of the GE by sending corresponding requests to the API.


Control

The Control component controls the aforementioned components of the Compressed Domain Video Analysis GE. Furthermore, it processes requests received via the API. To do so, it creates and manages a separate instance of the GE for each stream to be analyzed.


API

The RESTful API defines an interface that enables Users of the GE to request several operations using standard HTTP requests. These operations are described in detail in the following section.

Main Interactions

The API is a RESTful API that permits easy integration with web services or other components requiring analyses of compressed video streams. The following operations are defined:

  • Returns the current version of the CDVA GE implementation.
  • listInstances: Lists all available instances.
  • createInstance: Creates a new instance. The request has to provide the URI of the compressed video stream and state whether events and/or moving objects/persons should be detected (and tracked).
  • Returns information about a created instance.
  • destroyInstance: Destroys a previously created and stopped instance.
  • startInstance: Starts the corresponding instance. This actually starts the analysis.
  • stopInstance: Stops a previously started instance.
  • Returns the current configuration of an instance. This includes the configurations of all activated analysis algorithms.
  • configureInstance: Configures one or more analysis algorithms of the corresponding instance.
  • Lists all registered sinks of the corresponding instance.
  • addSink: Adds a new sink to an instance. The request has to provide a URI that enables the GE to notify the sink in case of detections.
  • Returns information about a previously added sink.
  • removeSink: Removes a previously added sink from an instance. Once the sink is removed, it is no longer notified in case of detections.

The following figure shows an example of a typical usage scenario (two analyzer instances are attached to different media sources). Note that responses and notifications are not shown for reasons of clarity and comprehensibility.

CDVA GE – Usage scenario

First, Sink 1 requests a list of all created instances (listInstances). As no instance has been created so far, Sink 1 creates (createInstance) and configures (configureInstance) a new instance for analyzing a specific video stream. To be notified of events and/or moving objects/persons, Sink 1 adds itself as a sink to this instance (addSink). The analysis is then started by sending a startInstance request.

During the analysis of the video stream, a second sink, Sink 2, also requests a list of instances (listInstances). As Sink 2 is also interested in the results of the analysis that Sink 1 started, it adds itself to this instance (addSink), just before Sink 1 removes itself from the instance (removeSink) to stop receiving notifications. Additionally, Sink 2 wants another video stream to be analyzed and therefore creates (createInstance) and configures (configureInstance) a new instance, adds itself to this instance (addSink), and starts the analysis (startInstance).

While receiving the results of the second analysis, Sink 2 removes itself from the first instance (removeSink), requests to stop the analysis (stopInstance), and requests to destroy the instance (destroyInstance). Note that an instance is only destroyed once all sinks have been removed. At the end of this scenario, Sink 2 removes itself from the second instance (removeSink) and also requests to stop the analysis of this instance (stopInstance) and to destroy it (destroyInstance).
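
The lifecycle rules in this scenario can be modelled in memory as shown below. This is a hedged sketch of the state handling only: the class, method, and parameter names are hypothetical, no HTTP requests are made, and the actual resource paths and payloads of the RESTful API are not reproduced here. The two destroy conditions (instance stopped, no sinks registered) are the ones stated in this section.

```python
class InstanceRegistry:
    """Hypothetical in-memory model of analyzer instances and their sinks."""

    def __init__(self):
        self._instances = {}
        self._next_id = 1

    def list_instances(self):
        return sorted(self._instances)

    def create_instance(self, stream_uri):
        instance_id = self._next_id
        self._next_id += 1
        self._instances[instance_id] = {
            "stream_uri": stream_uri, "config": {},
            "sinks": set(), "running": False,
        }
        return instance_id

    def configure_instance(self, instance_id, **settings):
        self._instances[instance_id]["config"].update(settings)

    def add_sink(self, instance_id, sink_uri):
        self._instances[instance_id]["sinks"].add(sink_uri)

    def remove_sink(self, instance_id, sink_uri):
        self._instances[instance_id]["sinks"].discard(sink_uri)

    def start_instance(self, instance_id):
        self._instances[instance_id]["running"] = True

    def stop_instance(self, instance_id):
        self._instances[instance_id]["running"] = False

    def destroy_instance(self, instance_id):
        """Destroys only a stopped instance with no sinks; reports success."""
        instance = self._instances[instance_id]
        if instance["running"] or instance["sinks"]:
            return False
        del self._instances[instance_id]
        return True
```

Replaying the first half of the scenario against this model, a destroy request is refused while the instance is running or while Sink 2 is still registered. It succeeds only after removeSink and stopInstance have both been processed.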

Event and object metadata that are sent to registered sinks are encapsulated in an XML-based Scene Description format, according to the ONVIF specifications [ONVIF]. The XML root element is called MetadataStream. The following code block gives a brief example to illustrate the XML structure:

<?xml version="1.0" encoding="UTF-8"?>
<MetadataStream xmlns="http://www.onvif.org/ver10/schema"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns:codoan="http://www.fi-ware.eu/data/cdva/codoan/schema"
                xsi:schemaLocation="http://www.fi-ware.eu/data/cdva/codoan/schema http://cdvideo.testbed.fi-ware.eu/codoan/schema/codoan_onvif.xsd">

  <Frame UtcTime="2012-05-10T18:12:05.432Z" codoan:FrameNumber="100">
    <Object ObjectId="0">
      <BoundingBox bottom="15.0" top="5.0" right="25.0" left="15.0"/>
      <CenterOfGravity x="20.0" y="10.0"/>
    </Object>
    <Object ObjectId="1">
      <BoundingBox bottom="25.0" top="15.0" right="35.0" left="25.0"/>
      <CenterOfGravity x="30.0" y="20.0"/>
    </Object>
  </Frame>
</MetadataStream>


Note that not all elements are mandatory to compose a valid XML document according to the corresponding ONVIF XML Schema.
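
A sink that receives such a document could read it with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree. The embedded sample is a well-formed variant of the example above with the namespace declarations it needs. Binding the codoan prefix to the namespace named in xsi:schemaLocation is an assumption, and the function name and the shape of the result are illustrative.

```python
import xml.etree.ElementTree as ET

ONVIF_NS = "http://www.onvif.org/ver10/schema"
CODOAN_NS = "http://www.fi-ware.eu/data/cdva/codoan/schema"  # assumed binding

SAMPLE = """
<MetadataStream xmlns="http://www.onvif.org/ver10/schema"
                xmlns:codoan="http://www.fi-ware.eu/data/cdva/codoan/schema">
  <Frame UtcTime="2012-05-10T18:12:05.432Z" codoan:FrameNumber="100">
    <Object ObjectId="0">
      <BoundingBox bottom="15.0" top="5.0" right="25.0" left="15.0"/>
      <CenterOfGravity x="20.0" y="10.0"/>
    </Object>
    <Object ObjectId="1">
      <BoundingBox bottom="25.0" top="15.0" right="35.0" left="25.0"/>
      <CenterOfGravity x="30.0" y="20.0"/>
    </Object>
  </Frame>
</MetadataStream>
"""


def qualified(namespace, tag):
    """Builds the '{namespace}tag' form that ElementTree uses internally."""
    return "{%s}%s" % (namespace, tag)


def extract_objects(xml_text):
    """Returns one dictionary per detected object in the metadata stream."""
    root = ET.fromstring(xml_text)
    objects = []
    for frame in root.iter(qualified(ONVIF_NS, "Frame")):
        frame_number = frame.get(qualified(CODOAN_NS, "FrameNumber"))
        for obj in frame.iter(qualified(ONVIF_NS, "Object")):
            # iter() also finds the box if wrapper elements are present
            box = next(obj.iter(qualified(ONVIF_NS, "BoundingBox")), None)
            objects.append({
                "utc_time": frame.get("UtcTime"),
                "frame_number": int(frame_number) if frame_number else None,
                "object_id": obj.get("ObjectId"),
                "box": {k: float(v) for k, v in box.attrib.items()} if box is not None else None,
            })
    return objects
```

Calling extract_objects(SAMPLE) yields two entries, one per Object element, each carrying the frame number and the bounding box of the detection. Because not all elements are mandatory, the sketch treats a missing frame number or bounding box as None instead of failing.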

Basic Design Principles

  • Critical product attributes for the Compressed Domain Video Analysis GE are, in particular, high detection rates with only few false positives and low-complexity operation.
  • Partitioning into independent functional blocks enables the GE to support a variety of analysis methods on several media types and to be extended easily with new features. Several operations can even be combined.
  • Low-complexity algorithms and implementations enable the GE to perform very fast analyses and to be highly scalable.
  • GE implementations support performing parallel analyses using different subcomponents.


References

[ISO08] ISO/IEC 14496-12:2008, Information technology – Coding of audio-visual objects – Part 12: ISO base media file format, Oct. 2008.
[RFC2326] H. Schulzrinne, A. Rao, and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, Apr. 1998.
[RFC3550] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications", RFC 3550, Jul. 2003.
[RFC6184] Y.-K. Wang, R. Even, T. Kristensen, and R. Jesup, "RTP Payload Format for H.264 Video", RFC 6184, May 2011.
[CDA] M. Laumer, P. Amon, A. Hutter, and A. Kaup, "A Compressed Domain Change Detection Algorithm for RTP Streams in Video Surveillance Applications", MMSP 2011, Oct. 2011.
[ODA] M. Laumer, P. Amon, A. Hutter, and A. Kaup, "Compressed Domain Moving Object Detection Based on H.264/AVC Macroblock Types", VISAPP 2013, Feb. 2013.
[ONVIF] ONVIF Specifications
[rtpdump] rtpdump format specified by RTP Tools