
FIWARE.OpenSpecification.Data.UnstructuredDataAnalysis


Name FIWARE.OpenSpecification.Data.UnstructuredDataAnalysis
Chapter Data/Context Management,
Catalogue-Link to Implementation [ <N/A>]
Owner Atos Origin, Jose Maria Fuentes Lopez



Preface

Within this document you find a self-contained open specification of a FIWARE generic enabler. Please consult as well the FIWARE Product Vision, the website at http://www.fiware.org and similar pages in order to understand the complete context of the FIWARE platform.


FIWARE WIKI editorial remark:
This page corresponds to Release 3 of FIWARE. The latest version associated with the latest Release is linked from FIWARE Architecture

Copyright

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.

Overview

The Unstructured Data Analysis is the generic enabler focusing on gathering and analysing unstructured data. The target users as well as a detailed description are provided in the following.


Target usage

Target users are mainly data analysis and data visualization developers who need to analyze unstructured web resources.

Unstructured Data Analysis GE Description

The Unstructured Data Analysis GE is a composition of technologies for gathering and analysing unstructured data. Specifically, this generic enabler is responsible for acquiring unstructured data from several data sources, preparing it for analysis, and executing the linguistic and statistical analysis required for tendency detection. This GE runs continuously, polling the Web (through RSS feeds) for recent content and turning it into a stream of processed text documents.

The documents gathered by this GE are provided in a semi-structured fashion (following the RSS structure), so that titles, publication dates and other metadata are clearly indicated. Furthermore, the Web pages referenced by the RSS feeds are also gathered. Finally, trending topics are identified in the texts and article bodies using linguistic and statistical analysis. The GE provides a REST API that allows defining the data sources and obtaining the results of the analysis.

The web content contains "noise" that needs to be identified and removed before the content can be analysed. For this reason, a pipeline is executed which consists of (i) unstructured data acquisition, (ii) data cleaning, (iii) data storage, (iv) natural-language processing analysis, (v) statistical analysis and (vi) results storage.

This GE has been developed based on the experiences obtained in the FIRST [FIRST] and ALERT [ALERT] projects.

Example Scenario

Unstructured data is defined as the set of data that does not follow a predefined data model; these data come in different formats (text, documents, images and video). The amount of unstructured data is by far greater than that of structured data. According to a 2011 IDC study [IDC2011], 90 percent of all data created in the next decade will be unstructured. This scenario calls for the creation of novel tools to handle and analyze these data.

The GE provides a method to analyze text-based unstructured data, analyzing web data (RSS and web pages) in order to detect emerging tendencies (key words). A tendency is defined as a meaningful n-gram (a subsequence of n words) which has become relevant in a community during a specific time period. This functionality is useful for market surveillance and other related tasks.
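
The following minimal Java sketch illustrates the idea of emerging n-grams. It is not the GE's actual algorithm (the statistical measure used by the GE is not specified in this document); it simply compares n-gram frequencies in a recent time window against a baseline window using an assumed frequency-ratio threshold.

 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
 public class NgramTrendSketch {
 
     // Extract all n-grams of size n from a token sequence.
     static List<String> ngrams(List<String> tokens, int n) {
         List<String> grams = new ArrayList<>();
         for (int i = 0; i + n <= tokens.size(); i++) {
             grams.add(String.join(" ", tokens.subList(i, i + n)));
         }
         return grams;
     }
 
     // Count n-gram frequencies over a collection of tokenized documents.
     static Map<String, Integer> count(List<List<String>> docs, int n) {
         Map<String, Integer> freq = new HashMap<>();
         for (List<String> doc : docs) {
             for (String g : ngrams(doc, n)) {
                 freq.merge(g, 1, Integer::sum);
             }
         }
         return freq;
     }
 
     // Naive emergence heuristic (an assumption, not the GE's algorithm):
     // n-grams much more frequent in the recent window than in the baseline
     // window are reported as candidate tendencies.
     static List<String> emerging(List<List<String>> recent, List<List<String>> baseline,
                                  int n, double ratio) {
         Map<String, Integer> recentFreq = count(recent, n);
         Map<String, Integer> baseFreq = count(baseline, n);
         List<String> result = new ArrayList<>();
         for (Map.Entry<String, Integer> e : recentFreq.entrySet()) {
             int before = baseFreq.getOrDefault(e.getKey(), 0);
             if (e.getValue() >= ratio * (before + 1)) {
                 result.add(e.getKey());
             }
         }
         return result;
     }
 
     public static void main(String[] args) {
         List<List<String>> baseline = List.of(List.of("city", "council", "meeting"));
         List<List<String>> recent = List.of(
                 List.of("smart", "city", "platform"),
                 List.of("new", "smart", "city", "project"));
         System.out.println(emerging(recent, baseline, 2, 2.0)); // prints [smart city]
     }
 }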

To use these functionalities, the user (an expert in media analysis) defines the sources to be explored (RSS feeds) by creating a project and adding the data sources to it using the REST API provided by the GE. Once the user has added a new source, the GE starts to monitor the feed, obtaining the new entries, retrieving the web pages related to those entries, preprocessing the data, storing it and starting the analysis. The results of the analysis and the retrieved data can also be accessed through the API.

The ALERT project provides an example scenario of this. This project aimed to develop tools for the analysis of open source (FLOSS) communities, which are composed of developers, users, community managers and others. The results of these analyses are valuable information for community managers in the decision-making process. One of the results of this project was the OCELOt component, a tool for the analysis of unstructured data contained in FLOSS communities (forums, mailing lists, bug tracking systems, source code management) in order to detect emerging concepts (key terms) to be used to extend the ontological resources used in other analyses (see video demo [1]). This component was improved within the FI-WARE project and is included in the Unstructured Data Analysis GE.

Basic Concepts

This section introduces the basic concepts related to the Unstructured Data Analysis GE in order to facilitate the understanding of this description.

  1. Really Simple Syndication (RSS): RSS is an XML-based format used to publish new content, providing a standard way to publish and consume that content. An RSS feed is an XML file published through an HTTP server using the RSS format. The feed is composed of entries, each representing a new piece of content; an entry may have an associated web page (a minimal feed-reading sketch is shown after this list).
  2. Web Crawling: Crawling is the process of automatically detecting and retrieving new web resources so that they can be gathered and processed for a specific purpose. The software used for the crawling process is known as a web bot (or web crawler).
  3. Emerging term (Tendency): An emerging term is a meaningful n-gram (a subsequence of words) which has gained relevance in a community; emerging terms are a key element in the detection of new tendencies.
  4. Natural Language Processing (NLP): NLP is the convergence of linguistics, computer science and statistics, providing strategies to extract knowledge from data expressed in natural language (written or spoken). This GE uses strategies for the tokenization, lemmatization and Part-of-Speech tagging of the gathered data.
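
As an illustration of concept 1, the following sketch reads an RSS feed and prints the semi-structured metadata (title, publication date, link to the associated web page) of each entry. It uses the ROME library purely for illustration; the GE itself gathers feeds with a customized Apache Nutch crawler, as described in the Architecture section, and the feed URL below is hypothetical.

 import com.rometools.rome.feed.synd.SyndEntry;
 import com.rometools.rome.feed.synd.SyndFeed;
 import com.rometools.rome.io.SyndFeedInput;
 import com.rometools.rome.io.XmlReader;
 import java.net.URL;
 
 public class FeedReaderSketch {
     public static void main(String[] args) throws Exception {
         // Hypothetical feed URL; any RSS feed can be used here.
         URL feedUrl = new URL("http://example.org/news/rss.xml");
         SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));
         for (SyndEntry entry : feed.getEntries()) {
             // Each entry carries the semi-structured metadata mentioned above:
             // a title, a publication date and a link to the associated web page.
             System.out.println(entry.getTitle());
             System.out.println(entry.getPublishedDate());
             System.out.println(entry.getLink());
         }
     }
 }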


Unstructured Data Analysis GE Architecture

The objective of the Unstructured Data Analysis GE (also known as UDA GE) is to facilitate the monitoring of web sources in order to detect potential tendencies, providing tools for gathering, preprocessing, storing and analyzing the unstructured data contained in those sources.

In order to satisfy this objective, the UDA GE is composed of a set of components. The next figure presents the UDA GE architecture.

UDA architecture

The Unstructured Data Analysis GE is composed of a set of modules, each one playing a specific role in order to provide the functionalities of the GE.

The UDA REST API is the interface provided by the GE to interact with any user or software component that requires the functionalities of this GE. This API provides the operations that allow creating new unstructured data analysis projects and adding sources (RSS feeds) to be analyzed. It also allows obtaining the results of the analysis and the gathered documents.

Once the sources are included in a project, the GE begins the crawling process. This task is performed by the UDA Framework, which gets the RSS entries from the feeds included in the project in order to process their content, and also retrieves the web pages referenced in the feeds. The UDA Framework uses a customized version of Apache Nutch to execute this operation. All the retrieved content is cleaned in order to extract the plain text from the RSS/HTML raw data. The data (raw and cleaned) is stored in an HBase NoSQL database and also in a Lucene index. The HBase data is used by the tendency detection analysis and is also available for other analyses. On the other hand, the data stored in the Lucene index is used to facilitate the retrieval of the documents gathered by the GE via the REST API.
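
The following sketch illustrates, in a simplified way, the dual storage described above: the raw and cleaned content of a gathered document is written to HBase, and the cleaned text is indexed in Lucene for later retrieval. The table, column family and index field names are assumptions for illustration only; they are not taken from the GE implementation.

 import java.nio.file.Paths;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.TableName;
 import org.apache.hadoop.hbase.client.Connection;
 import org.apache.hadoop.hbase.client.ConnectionFactory;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.client.Table;
 import org.apache.hadoop.hbase.util.Bytes;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.FSDirectory;
 
 public class DocumentStoreSketch {
     public static void main(String[] args) throws Exception {
         String docId = "feed-entry-42";                          // hypothetical document key
         String rawHtml = "<html>...</html>";                     // raw page as crawled
         String cleanText = "plain text after boilerplate removal";
 
         // Store raw and cleaned content in HBase (table and column names are assumptions).
         Configuration conf = HBaseConfiguration.create();
         try (Connection connection = ConnectionFactory.createConnection(conf);
              Table table = connection.getTable(TableName.valueOf("uda_documents"))) {
             Put put = new Put(Bytes.toBytes(docId));
             put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("raw"), Bytes.toBytes(rawHtml));
             put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("clean"), Bytes.toBytes(cleanText));
             table.put(put);
         }
 
         // Index the cleaned text in Lucene so it can later be searched through the REST API.
         try (IndexWriter writer = new IndexWriter(
                 FSDirectory.open(Paths.get("/tmp/uda-index")),
                 new IndexWriterConfig(new StandardAnalyzer()))) {
             Document doc = new Document();
             doc.add(new StringField("id", docId, Field.Store.YES));
             doc.add(new TextField("body", cleanText, Field.Store.YES));
             writer.addDocument(doc);
         }
     }
 }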

Finally, the OCELOt (Online Semantic Concept Extractor based on Linked Open Data) component analyzes the unstructured data in order to detect emerging tendencies; these tendencies can be used to gain awareness of relevant challenges in a specific domain. OCELOt uses different natural language processing strategies (tokenization, lemmatization, Part-of-Speech tagging) and statistical analysis to detect the tendencies. The results of the analysis are stored in a relational database and are provided to the user through the REST API.
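
The sketch below shows the kind of NLP preprocessing mentioned above (tokenization, lemmatization and Part-of-Speech tagging), using the Stanford NLP framework that the GE relies on for English texts (see Design Principles). The concrete pipeline configuration used by OCELOt is an assumption.

 import edu.stanford.nlp.ling.CoreAnnotations;
 import edu.stanford.nlp.ling.CoreLabel;
 import edu.stanford.nlp.pipeline.Annotation;
 import edu.stanford.nlp.pipeline.StanfordCoreNLP;
 import edu.stanford.nlp.util.CoreMap;
 import java.util.Properties;
 
 public class NlpPreprocessingSketch {
     public static void main(String[] args) {
         // Assumed pipeline configuration: tokenization, sentence splitting,
         // Part-of-Speech tagging and lemmatization for English text.
         Properties props = new Properties();
         props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
         StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
 
         Annotation document = new Annotation("Smart city platforms are gaining relevance.");
         pipeline.annotate(document);
 
         for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
             for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                 String word = token.get(CoreAnnotations.TextAnnotation.class);
                 String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                 String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                 System.out.println(word + " / " + lemma + " / " + pos);
             }
         }
     }
 }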

Main Interactions

Modules and Interfaces

This section describes the main functionality of the Unstructured Data Analysis GE. The description of this functionality is based on the functionality provided by the baseline assets. The Backend Functionality section describes the functionality (methods) provided to agents in a service-like style.

Backend Functionality

Backend functionality describes the functionality provided by the GE as service invocation methods for both human and computer agents. As described in the Architecture section, this functionality is accessible by means of a REST Web Services API, which provides the following operations:

  1. Create project: Creates an unstructured data analysis project, a software abstraction that represents the monitoring and analysis of a set of data sources in a specific domain. To invoke the operation, a POST HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  2. Get project configuration: Obtains the configuration (name, description, analyzed sources) of a specific project. To invoke the operation, a GET HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  3. Delete project: Removes a project; this will stop the analysis of the sources contained in the project. To invoke the operation, a DELETE HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  4. Get project's data sources configuration: Obtains all the sources analyzed in a specific project. To invoke the operation, a GET HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources
  5. Add data source to a project: Adds a new source to be analyzed in a specific project. To invoke the operation, a POST HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  6. Get data source configuration: Obtains the configuration of a source analyzed in a specific project. To invoke the operation, a GET HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  7. Remove a data source from a project: Removes a source from a project; this will stop the data gathering and analysis for that source. To invoke the operation, a DELETE HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  8. Search in a project: Executes a textual search over the documents gathered in a specific project using the Lucene query format [LuceneQuery]. To invoke the operation, a GET HTTP request should be sent to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/search

All the methods described can be invoked by means of regular HTTP requests, either using a web browser (for those that rely on GET requests) or through a client API such as Jersey.
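
As an illustration, the following Jersey client sketch invokes three of the operations listed above: creating a project, adding an RSS source and searching the gathered documents. The base URL, the JSON payloads and the search query parameter name are assumptions; consult the Open API specification for the exact request formats.

 import javax.ws.rs.client.Client;
 import javax.ws.rs.client.ClientBuilder;
 import javax.ws.rs.client.Entity;
 import javax.ws.rs.core.MediaType;
 import javax.ws.rs.core.Response;
 
 public class UdaClientSketch {
     public static void main(String[] args) {
         // Replace with the actual URL where the GE is deployed.
         String base = "http://localhost:8080/uda-service/uda";
         Client client = ClientBuilder.newClient();
 
         // Operation 1: create a project (the JSON payload is an assumption).
         Response createProject = client.target(base + "/myproject")
                 .request()
                 .post(Entity.entity("{\"description\": \"example project\"}",
                         MediaType.APPLICATION_JSON));
         System.out.println("Create project: " + createProject.getStatus());
 
         // Operation 5: add an RSS feed as a data source (payload is an assumption).
         Response addSource = client.target(base + "/myproject/sources/examplefeed")
                 .request()
                 .post(Entity.entity("{\"url\": \"http://example.org/news/rss.xml\"}",
                         MediaType.APPLICATION_JSON));
         System.out.println("Add source: " + addSource.getStatus());
 
         // Operation 8: search the gathered documents with a Lucene-format query
         // (the query parameter name "q" is an assumption).
         String results = client.target(base + "/myproject/search")
                 .queryParam("q", "smart AND city")
                 .request(MediaType.APPLICATION_JSON)
                 .get(String.class);
         System.out.println(results);
 
         client.close();
     }
 }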

Frontend Functionality

This GE does not provide a user interface.

Design Principles

The GE design principle is to rely on well-known tools and standards in all phases of the processing:

  1. The data is gathered from standard RSS/HTML sources, which are widely recognized and used across the world to publish unstructured data.
  2. A web crawler based on the Apache Nutch tool (a widely used crawler) is used to retrieve the data and also executes a preprocessing step; the data is stored in HBase (a well-known NoSQL database).
  3. The linguistic and statistical analysis uses the Stanford NLP framework [StanfordNLP] for the analysis of English texts.
  4. The GE provides an API based on the de facto REST standard [REST2002] to interact with other components.

References

[IDC2011] J. Gantz, D. Reinsel. Extracting Value from Chaos. IDC IVIEW [2]
[StanfordNLP] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252-259.
[REST2002] R. T. Fielding, R. N. Taylor. Principled Design of the Modern Web Architecture. ACM Transactions on Internet Technology, Vol. 2, No. 2, May 2002, Pages 115–150
[LuceneQuery] Lucene Query format, [3]
[FIRST] The FIRST Project, [4]
[ALERT] The ALERT Project, [5]


Detailed Specifications

Following is a list of Open Specifications linked to this Generic Enabler. Specifications labeled as "PRELIMINARY" are considered stable but subject to minor changes derived from lessons learned during the last iterations of the development of a first reference implementation planned for the current Major Release of FI-WARE. Specifications labeled as "DRAFT" are planned for future Major Releases of FI-WARE but are provided for the sake of future users.

Open API Specifications


Re-utilised Technologies/Specifications

The Unstructured Data Analysis Generic Enabler will be based on the outcomes of the FIRST (FP7-257928) project, a European Commission FP7 funded Specific Targeted Research Project started on October 1st, 2010, specifically on the FIRST Analytical Pipeline. A high-level description of this pipeline can be found in the FIRST Analytical Pipeline high level description.

Terms and definitions

This section comprises a summary of terms and definitions introduced during the previous sections. It intends to establish a vocabulary that will help to carry out discussions internally and with third parties (e.g., Use Case projects in the EU FP7 Future Internet PPP). For a summary of terms and definitions managed at the overall FI-WARE level, please refer to FIWARE Global Terms and Definitions

  • Data refers to information that is produced, generated, collected or observed and that may be relevant for processing, carrying out further analysis and knowledge extraction. Data in FIWARE has an associated data type and a value. FIWARE will support a set of built-in basic data types similar to those existing in most programming languages. Values linked to basic data types supported in FIWARE are referred to as basic data values. As an example, basic data values like ‘2’, ‘7’ or ‘365’ belong to the integer basic data type.
  • A data element refers to data whose value is defined as consisting of a sequence of one or more <name, type, value> triplets, referred to as data element attributes, where the type and value of each attribute is either mapped to a basic data type and a basic data value or mapped to the data type and value of another data element.
  • Context in FIWARE is represented through context elements. A context element extends the concept of data element by associating an EntityId and EntityType to it, uniquely identifying the entity (which in turn may map to a group of entities) in the FIWARE system to which the context element information refers. In addition, there may be some attributes, as well as meta-data associated to attributes, that we may define as mandatory for context elements as compared to data elements. Context elements are typically created containing the value of attributes characterizing a given entity at a given moment. As an example, a context element may contain values of some of the attributes “last measured temperature”, “square meters” and “wall color” associated to a room in a building. Note that there might be many different context elements referring to the same entity in a system, each containing the value of a different set of attributes. This allows different applications to handle different context elements for the same entity, each containing only those attributes of that entity relevant to the corresponding application. It also allows representing updates on a set of attributes linked to a given entity: each of these updates can actually take the form of a context element and contain only the value of those attributes that have changed.
  • An event is an occurrence within a particular system or domain; it is something that has happened, or is contemplated as having happened, in that domain. Events typically lead to the creation of some data or context element describing or representing the events, thus allowing them to be processed. As an example, a sensor device may be measuring the temperature and pressure of a given boiler, sending a context element every five minutes associated to that entity (the boiler) that includes the value of these two attributes (temperature and pressure). The creation and sending of the context element is an event, i.e., what has occurred. Since the data/context elements that are generated linked to an event are the way events become visible in a computing system, it is common to refer to these data/context elements simply as "events".
  • A data event refers to an event leading to creation of a data element.
  • A context event refers to an event leading to creation of a context element.
  • An event object is a programming entity that represents an event in a computing system [EPIA], such as event-aware GEs. Event objects allow performing operations on events, also known as event processing. Event objects are defined as a data element (or a context element) representing an event, to which a number of standard event object properties (similar to a header) are associated internally. These standard event object properties support certain event processing functions.