FIWARE.OpenSpecification.Data.SocialDataAggregator - FIWARE Forge Wiki

Name: FIWARE.OpenSpecification.Data.SocialDataAggregator
Chapter: Data/Context Management
Catalogue-Link to Implementation: Social Data Aggregator (this GE is not present in the catalogue since it was discontinued)
Owner: Telecom Italia, Claudio Venezia (TI)


Preface

This document contains a self-contained open specification of a FIWARE generic enabler. Please also consult the FIWARE Product Vision, the website at http://www.fiware.org and similar pages in order to understand the complete context of the FIWARE platform.

Copyright

Copyright © 2010-2014 by TI. All Rights Reserved.

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.

Overview

Over time, all generations have come to embrace the changes that social networks have brought about. Online social media have gained astounding worldwide growth and popularity. They generate a huge amount of records about users’ activities and attract attention from a variety of researchers and companies worldwide.

Every day, data collected by social networks are gathered and fed into a variety of analytics (user behavior, habits, topic trends, etc.) capable of extracting significant patterns for further analysis.

The aim of the Social Data Aggregator (SDA) GE is to retrieve data from different social networks and provide different analytics depending on user needs. The GE relies on Apache Spark for computation on data. The SDA comes with different built-in modules, but custom modules can be added as well.

Basic Concepts

Producer/Consumer paradigm & Lambda architecture

The Social Data Aggregator follows a producer/consumer paradigm. Connector modules retrieve data from social networks and send them over a distributed internal bus. Consumer modules retrieve the data from the internal bus and “consume” them, producing analytics.
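
The producer/consumer flow can be illustrated with a minimal Python sketch. It assumes an in-process `queue.Queue` standing in for the distributed internal bus; the function names `connector` and `consumer` and the post fields are hypothetical and are not taken from the SDA code base.

```python
import queue
import threading

# Stand-in for the distributed internal bus (an assumption for illustration;
# a real deployment would use a message broker).
bus = queue.Queue()
SENTINEL = None  # marks the end of the stream


def connector(posts):
    """Producer: pushes posts retrieved from a social network onto the bus."""
    for post in posts:
        bus.put(post)
    bus.put(SENTINEL)


def consumer():
    """Consumer: reads posts from the bus and counts them per hashtag."""
    counts = {}
    while True:
        post = bus.get()
        if post is SENTINEL:
            break
        for tag in post["hashtags"]:
            counts[tag] = counts.get(tag, 0) + 1
    return counts


sample_posts = [
    {"text": "hello", "hashtags": ["fiware"]},
    {"text": "spark rocks", "hashtags": ["spark", "fiware"]},
]
producer = threading.Thread(target=connector, args=(sample_posts,))
producer.start()
hashtag_counts = consumer()
producer.join()
```

The connector and the consumer only share the bus, so either side could be replaced without touching the other.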

SDA connector modules have been developed following the lambda architecture.

Lambda architecture is a data-processing architecture designed to handle a massive load of data by taking advantage of both batch- and stream-processing methods. This architectural design helps to balance latency, throughput, and fault-tolerance both when using batch or real-time processing to achieve comprehensive and accurate views.

Lambda architecture describes a system consisting of three layers: a batch layer, a speed (or real-time processing) layer, and a serving layer for managing queries on the collected data.

Social Data Aggregator Lambda Architecture


Batch Layer

The batch layer precomputes results using a distributed processing system that can handle a very large amount of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can handle or fix errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates that entirely replace existing precomputed views.

In our implementation we use Apache Spark to compute analytics on the raw data provided by the connectors.

Speed Layer

The speed layer processes data streams in real time, without the requirements of fix-ups or completeness. This layer finds a trade-off between the amount of data processed and the latency of the results, while providing real-time views into the most recent data.

Essentially, the speed layer is responsible for filling the temporal gap introduced by the batch layer by providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer (which can count on a significantly larger information base), but they are available almost immediately after data are received.

This layer is implemented using Spark Streaming to compute data analytics in near real time. Results are provided at the end of a Spark Streaming batch or per window (the window duration is customizable).

Serving Layer

Output from the batch and speed layers is made available by means of the serving layer, which responds to ad-hoc queries by returning precomputed views or building views over the processed data.
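
How the three layers fit together can be sketched in plain Python (not Spark). The functions `batch_view`, `speed_view` and `serve` and the `keyword` field are hypothetical names chosen for this illustration; the analytic shown is a simple count per keyword.

```python
from collections import Counter


def batch_view(all_posts):
    """Batch layer: recompute the view from the complete data set."""
    return Counter(p["keyword"] for p in all_posts)


def speed_view(view, new_post):
    """Speed layer: update a real-time view incrementally, one post at a time."""
    view[new_post["keyword"]] += 1
    return view


def serve(batch, realtime, keyword):
    """Serving layer: answer a query by combining the precomputed views."""
    return batch[keyword] + realtime[keyword]


stored = [{"keyword": "fiware"}, {"keyword": "spark"}, {"keyword": "fiware"}]
batch = batch_view(stored)            # covers everything already on storage

realtime = Counter()                  # covers posts not yet seen by a batch run
speed_view(realtime, {"keyword": "fiware"})

total = serve(batch, realtime, "fiware")
```

In this sketch the real-time view would be reset once a new batch run has absorbed the recent posts, so that no post is counted twice.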

Criticism

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require their own code base, which must be maintained separately yet kept in sync so that processed data produce the same result from both paths. To reduce this coding overhead, consumer modules in the Social Data Aggregator GE are composed of three parts: a batch part, a streaming part and a core part. The core part is shared by the other two.

This approach allows batch-specific and streaming-specific code to coexist while easing maintenance, since both rely on a common core.
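
A minimal sketch of this three-part layout, in plain Python rather than Spark. The function names and the `keyword` field are hypothetical; the point is only that the analytic lives in one place and is called from both paths.

```python
def count_by_keyword(posts):
    """Core part: the analytic itself, independent of how posts arrive."""
    counts = {}
    for post in posts:
        counts[post["keyword"]] = counts.get(post["keyword"], 0) + 1
    return counts


def run_batch(stored_posts):
    """Batch part: applies the core to the full data set read from storage."""
    return count_by_keyword(stored_posts)


def run_streaming(micro_batches):
    """Streaming part: applies the same core to each micro-batch as it arrives."""
    for micro_batch in micro_batches:
        yield count_by_keyword(micro_batch)


posts = [{"keyword": "a"}, {"keyword": "b"}, {"keyword": "a"}]
batch_result = run_batch(posts)
stream_results = list(run_streaming([posts[:2], posts[2:]]))
```

A bug fix in `count_by_keyword` reaches both the batch and the streaming path at once.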

Social Data Aggregator GE Architecture

As said, the SDA implements a lambda architecture: retrieved data are stored on disk for later batch analysis and, in the meantime, sent to an internal bus for real-time analysis. The GE is composed of different modules that belong to two main categories:

  1. Connectors: connectors retrieve data from social networks, store them and send them onto the internal bus.
  2. Consumers: consumers can be of two types:
     • Batch: load data from storage and provide complex analytics at periodic time intervals;
     • Real-Time: retrieve data from the internal stream bus and provide analytics in near real time.

In both cases consumers can save analytics results on storage, or in volatile memory for temporary results. The conceptual representation of the GE architecture is shown in the following figure.

Social Data Aggregator General Architecture


Connectors

The top left side of the picture shows the connectors, which are in charge of retrieving data from social networks. Each connector is “specialized”: it is connected to a specific social network (SN) and gathers data by interacting with it. The way data are retrieved can vary from SN to SN: some SNs provide streaming APIs (e.g. Twitter, Instagram) while others need to be polled by the connector.

The connector receives live input data streams and divides the data into batches, each batch being the set of data collected during a given timeframe. The content of each batch is mapped onto an internal model and then sent to the internal bus to make it available to real-time consumers.

The internal stream bus is a communication bus providing a loosely coupled interconnection between modules. It can be implemented with different technologies (Apache Kafka, Amazon Kinesis, RabbitMQ, etc.).

Data belonging to different batches are also collected in a window. The content of each window is saved on the storage as raw data in JSON format. In this way raw data can be subsequently processed by batch consumers.
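
The batch and window steps can be sketched as follows. This is plain Python under stated assumptions: posts carry an integer `ts` field in seconds, batch and window sizes are arbitrary, and the function names are hypothetical. In the SDA these steps are handled by Spark Streaming.

```python
import json


def split_into_batches(posts, batch_seconds):
    """Group posts into batches by the timeframe their timestamp falls into."""
    batches = {}
    for post in posts:
        slot = post["ts"] // batch_seconds
        batches.setdefault(slot, []).append(post)
    # Empty timeframes produce no batch in this sketch.
    return [batches[slot] for slot in sorted(batches)]


def windows_as_json(batches, batches_per_window):
    """Collect successive batches into windows and serialize each as JSON."""
    windows = []
    for start in range(0, len(batches), batches_per_window):
        window = [p for b in batches[start:start + batches_per_window] for p in b]
        windows.append(json.dumps(window))
    return windows


posts = [{"ts": 1, "text": "a"}, {"ts": 4, "text": "b"},
         {"ts": 11, "text": "c"}, {"ts": 25, "text": "d"}]
batches = split_into_batches(posts, batch_seconds=10)
raw_windows = windows_as_json(batches, batches_per_window=2)
```

Each element of `raw_windows` is the JSON document that would be written to storage for later use by batch consumers.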

The storage has to be reachable from every node of the cluster. It can be implemented by a database (MySQL, OrientDB, MongoDB, etc.), a distributed filesystem (e.g. HDFS), an online file storage web service (e.g. Amazon S3) or a shared disk (e.g. NFS).

Each connector can expose APIs that the controller can call in order to modify the settings or the topics being monitored.

A topic can be based on:

  • keyword(s)
  • geolocation (latitude, longitude, etc.)
  • a target user (if the social network allows user tracking)
  • hashtags

Social Data Aggregator Connector Details
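
The topic criteria listed above could be modelled as in the following sketch. The `Topic` class, its field names and the post layout are assumptions made for illustration; the specification does not define a concrete topic schema.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical topic definition; every criterion is optional."""
    keywords: set = field(default_factory=set)
    hashtags: set = field(default_factory=set)
    user: str = None
    bounding_box: tuple = None  # (min_lat, min_lon, max_lat, max_lon)

    def matches(self, post):
        """Return True if the post satisfies at least one configured criterion."""
        words = set(post.get("text", "").lower().split())
        if self.keywords & words:
            return True
        if self.hashtags & set(post.get("hashtags", [])):
            return True
        if self.user is not None and post.get("user") == self.user:
            return True
        if self.bounding_box and "lat" in post and "lon" in post:
            min_lat, min_lon, max_lat, max_lon = self.bounding_box
            return (min_lat <= post["lat"] <= max_lat
                    and min_lon <= post["lon"] <= max_lon)
        return False


topic = Topic(keywords={"fiware"}, bounding_box=(45.0, 7.0, 46.0, 8.0))
by_keyword = topic.matches({"text": "Trying FIWARE today"})
by_location = topic.matches({"text": "ciao", "lat": 45.07, "lon": 7.69})
no_match = topic.matches({"text": "ciao", "lat": 41.9, "lon": 12.5})
```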

Controller

The controller is a component that manages the connectors, providing the following functionalities:

  • Define topics to be monitored by the connector: it is possible to define topics of interest to be monitored within a target social network or across different ones. A user can define them through an admin GUI.
  • Add/remove a topic from monitoring targets: the controller allows the user to add new topics of interest or remove those that are no longer useful.
  • Dynamically recognize trending sub-topics and start monitoring them automatically: the component recognizes secondary topics (topics that are not directly monitored but appear in the same posts as monitored ones) that are becoming hot, starts monitoring them as soon as they are of interest, and keeps doing so until the related social activity decreases again.
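
The specification does not say how trending sub-topics are detected. One possible approach, shown here purely as an assumption, is to count the hashtags that co-occur with monitored ones and promote those that reach a threshold. The function name and the threshold value are hypothetical.

```python
from collections import Counter


def trending_subtopics(posts, monitored, threshold):
    """Return hashtags that co-occur with monitored ones at least `threshold` times."""
    co_occurrences = Counter()
    for post in posts:
        tags = set(post["hashtags"])
        if tags & monitored:
            co_occurrences.update(tags - monitored)
    return {tag for tag, n in co_occurrences.items() if n >= threshold}


monitored = {"fiware"}
posts = [
    {"hashtags": ["fiware", "iot"]},
    {"hashtags": ["fiware", "iot", "cloud"]},
    {"hashtags": ["iot"]},            # ignored: no monitored hashtag
]
new_topics = trending_subtopics(posts, monitored, threshold=2)
monitored |= new_topics               # start monitoring the hot sub-topics
```

Dropping a sub-topic once its activity decreases again would need the same count evaluated over a recent time window; that part is not shown.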


Consumers

Consumers are modules that retrieve the raw data collected by the connectors, either from the storage or from the internal stream bus, and produce different kinds of analytics from them. Examples of analytics provided by the Social Data Aggregator are:

  • Basic Aggregations: calculation of the ppm (posts per minute) or of the number of posts in a time range, grouped by keywords or belonging to specific geolocated areas, to recognize trending topics.
  • Gender Recognition: this feature is useful for social networks that don’t provide information about the gender of the user (Twitter, for example). Recognizing a user's gender from their profile is a challenging task.
  • Sentiment Analysis: sentiment analysis aims to determine the attitude of a commenter towards a specific topic. It is used by the SDA to infer the mood of users with respect to a monitored topic.
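
As an illustration of the basic aggregations, the posts-per-minute count grouped by keyword can be sketched in plain Python. The `keyword` and `ts` fields (a Unix timestamp in seconds) are assumptions about the post model; the SDA computes such aggregations with Spark.

```python
from collections import defaultdict


def posts_per_minute(posts):
    """Count posts per (keyword, minute); `ts` is a Unix timestamp in seconds."""
    counts = defaultdict(int)
    for post in posts:
        minute = post["ts"] // 60
        counts[(post["keyword"], minute)] += 1
    return dict(counts)


posts = [
    {"keyword": "fiware", "ts": 0},
    {"keyword": "fiware", "ts": 59},
    {"keyword": "fiware", "ts": 60},
    {"keyword": "spark", "ts": 61},
]
ppm = posts_per_minute(posts)
```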

By subscribing to a target topic and looking for a particular key, consumers can retrieve only the information they really need, discarding any data not relevant to their analytics. Result data can be saved on storage or re-injected into the internal bus to be processed by other consumers capable of other types of analytics.

Social Data Aggregator Consumer Bus Interaction

Main Interactions

There are three ways to interact with the SDA GE:

  1. Customizing the installation by changing component implementations: for example, using a different bus technology (the default configuration uses Apache Kafka), or changing the way data are saved or the storage technology (by developing a custom DAO or using one different from the default).
  2. Integrating third-party components within the GE (connectors to other social networks or consumers providing other analytics).
  3. Developing a web application to expose analytics results by means of a web service.

Customizing The Installation

The only technological constraint of the SDA GE is that Apache Spark and Java 8 must be installed on the hosting system. The GE doesn’t rely on any other specific technology. There are default implementations, but they can be changed, allowing end users to choose suitable, reliable or simply highly available technologies based on their expertise and needs.

The SDA can be customized by editing the internal configuration files, without editing the main code. The following figures show an example of consumer customization with different technologies.

The first one is an example of the default configuration.

The second one is an example of a custom configuration with non-default technologies. The reader will note that the core part doesn’t vary between the two cases, meaning that no change to the internal code is needed to implement the customization.

Default DAO
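
The idea of swapping the DAO through configuration while leaving the core untouched can be sketched as follows. Everything here is hypothetical: the class names, the `dao` configuration key and the use of a Python dict in place of the SDA configuration files are assumptions for illustration only.

```python
import json


class ListDao:
    """Hypothetical default DAO keeping results in a list."""
    def __init__(self):
        self.rows = []

    def save(self, result):
        self.rows.append(result)


class JsonLinesDao:
    """Hypothetical alternative DAO serializing each result as a JSON line."""
    def __init__(self):
        self.lines = []

    def save(self, result):
        self.lines.append(json.dumps(result, sort_keys=True))


DAO_REGISTRY = {"list": ListDao, "jsonlines": JsonLinesDao}


def build_dao(config):
    """Pick the DAO implementation named in the configuration."""
    return DAO_REGISTRY[config.get("dao", "list")]()


def run_consumer(posts, dao):
    """Core part: identical whichever DAO the configuration selects."""
    dao.save({"posts": len(posts)})


default_dao = build_dao({})
custom_dao = build_dao({"dao": "jsonlines"})
for dao in (default_dao, custom_dao):
    run_consumer([{"text": "a"}, {"text": "b"}], dao)
```

Only `build_dao` reads the configuration; `run_consumer` depends on the `save` method alone, which is why the core does not change between the two setups.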

Integrating With Social Data Aggregator GE

The SDA comes with different analytics for the Twitter social network as an example. However, it is possible to extend the functionalities of the Generic Enabler. New connectors can be added, as well as new consumers:

  • Third-party connectors make it possible to monitor new social networks, retrieving posts and storing them for further analysis. A custom connector can be created from scratch following the guidelines, to make it agnostic to infrastructural technology changes or replacements (e.g. internal bus, storage and so on).
  • Third-party consumers can provide different types of analytics on existing or newly monitored social networks. It is also possible to adapt pre-existing analytics to new models if the data provided are similar, since consumers are built (where possible) on a shared core part upon which both batch and real-time specific implementations rely for analytics computation.


Consumer Stack

Creating An Application On Top Of Social Data Aggregator GE

Since the SDA is highly customizable it doesn’t come along with built-in APIs. Both model and storage by which data are saved can be defined/changed depending on user needs. In this way the user is free to choose the data/format to be exposed also based on the needs of an external application. The SDA is agnostic of the hosting web container or application server.

That said, a first prototype will be released as a Demo Web Application with the following objectives:

  • To show that the Social Data Aggregator can easily cooperate with other GEs (in this case the Stream Oriented GE).
  • To provide developers with an example of a web application that reads the results of the analytics provided by the SDA (with the default configuration) and exposes them by means of a web GUI.

Basic Design Principles

  • Application development: Developers don’t necessarily need to be aware of the SDA internals to develop custom consumers/connectors, and they can use their preferred programming language. They only need to know the message bus interfaces in order to send/retrieve data for real-time processing.
  • Scalability: The SDA is highly scalable in two ways:
    • Being based on Apache Spark, a cluster computing framework, the SDA copes with a data flow that grows beyond initial expectations (the viral effect of social activities is quite unpredictable): it is sufficient to add machines to the cluster for Spark to spread the computation over more nodes, without modifying the application.
    • The GE is composed of different modules that don’t have to run on the same node: the only constraint is that they can reach the internal bus, which is distributed among the nodes of the cluster. New modules can be deployed on different nodes without interfering with the pre-existing architecture.
  • Separation of concerns: SDA GE is divided into distinct modules with the minimum functionality overlap, which minimizes interaction points for achieving high cohesion and low coupling.
  • Single Responsibility principle: Each module is responsible for only a specific feature or functionality, or aggregates cohesive functionalities.
  • Principle of Least Knowledge: A component doesn’t need to know about internal details of other components or their implementations. Consumers are separated and can be developed with different languages or technologies.
  • Don’t repeat yourself (DRY): Specific functionalities that are in common between different consumers/connectors are implemented in only one component thus avoiding duplications. In this way every change or bug fix of a specific component is seamlessly propagated to all the modules related to it.

Detailed Specifications

Please refer to the Social Data Aggregator Wiki on GitHub.

Re-utilised Technologies/Specifications

The project makes use of:

Terms and definitions

The description should not contain specific terms that need to be explained. If it does, please contact the GE owner for clarification.
