FIWARE.ArchitectureDescription.Data.UnstructuredDataAnalysis R3 - FIWARE Forge Wiki

FIWARE WIKI editorial remark:
This page corresponds to Release 3 of FIWARE. The latest version associated to the latest Release is linked from FIWARE Architecture

Copyright

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.

Overview

The Unstructured Data Analysis generic enabler (GE) focuses on gathering and analysing unstructured data. Its target users and a detailed description are given below.


Target usage

Target users are mainly developers of data analysis and data visualization applications who need to analyze unstructured web resources.

Unstructured Data Analysis GE Description

The Unstructured Data Analysis GE combines several technologies to gather and analyse unstructured data. Specifically, this generic enabler is responsible for acquiring unstructured data from several data sources, preparing it for analysis, and executing the linguistic and statistical analysis required for tendency detection. The GE runs continuously, polling the Web (through RSS feeds) for recent content and turning it into a stream of processed text documents.

The documents gathered by this GE are provided in a semi-structured form (following the RSS structure), so that titles, publication dates, and other metadata are clearly indicated. The Web pages linked from the RSS feeds are also gathered. Finally, trending topics are identified in the texts and article bodies using linguistic and statistical analysis. The GE provides a REST API for defining the data sources and retrieving the results of the analysis.

Web content contains “noise” that must be identified and removed before the content can be analysed. For this reason, the GE executes a pipeline consisting of (i) unstructured data acquisition, (ii) data cleaning, (iii) data storage, (iv) natural-language processing analysis, (v) statistical analysis, and (vi) results storage.

This GE has been developed based on the experiences obtained in the FIRST [FIRST] and ALERT [ALERT] projects.

Example Scenario

Unstructured data is any data that does not follow a predefined data model; it comes in different formats (text, documents, images, and video). The amount of unstructured data is far greater than the amount of structured data. According to a 2011 IDC study [IDC2011], 90 percent of all data created in the next decade will be unstructured. This scenario motivates the creation of novel tools to handle and analyze such data.

The GE provides a method to analyze text-based unstructured data: it analyzes web data (RSS feeds and web pages) in order to detect emerging tendencies (key words). A tendency is defined as a meaningful n-gram (a contiguous sequence of n words) that has become relevant in a community during a specific time period. This functionality is useful for market surveillance and related activities.
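To make the notion of an n-gram concrete, here is a minimal sketch of n-gram extraction and counting. It assumes naive lowercase whitespace tokenization, and the function names `ngrams` and `count_ngrams` are illustrative only; the GE itself relies on proper linguistic processing rather than this simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count n-grams in a text (assumption: lowercase whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


counts = count_ngrams("open source community meets open source tools", 2)
print(counts[("open", "source")])  # prints 2
```

Counting n-grams per time period is the raw material for deciding which terms have become relevant; deciding whether an n-gram is meaningful needs the linguistic analysis described elsewhere in this document.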

To use these functionalities, the user (an expert in media analysis) defines the sources to be explored (RSS feeds) by creating a project and adding the data sources to it through the REST API provided by the GE. Once the user has added a new source, the GE starts to monitor the feed: it obtains the new entries, retrieves the web pages related to those entries, preprocesses the data, stores it, and starts the analysis. The results of the analysis and the retrieved data can also be accessed through the API.

The ALERT project provides an example of this scenario. The project aims to develop tools for the analysis of open source (FLOSS) communities, which are composed of developers, users, community managers, and others. The results of these analyses are valuable information for community managers in the decision-making process. One of the results of the project was the OCELOt component, a tool for the analysis of unstructured data contained in FLOSS communities (forums, mailing lists, bug tracking systems, source code management systems) in order to detect emerging concepts (key terms) that can be used to extend the ontological resources used in other analyses (see video demo [1]). This component was improved in the FI-WARE project and is included in the Unstructured Data Analysis GE.

Basic Concepts

This section introduces the basic concepts related to the Unstructured Data Analysis GE in order to facilitate the understanding of this description.

  1. Really Simple Syndication (RSS): RSS is an XML-based format that provides a standard way to publish and consume new content. An RSS feed is an XML file in RSS format published through an HTTP server. A feed is composed of entries, each of which represents a new piece of content. An entry can be associated with a web page.
  2. Web Crawling: Crawling is the process of automatically detecting and retrieving new web resources so that they can be gathered and processed for a specific purpose. The software that performs the crawling is known as a web bot.
  3. Emerging term (Tendency): An emerging term is a meaningful n-gram (a contiguous sequence of n words) that has gained relevance in a community; emerging terms are a key element in the detection of new tendencies.
  4. Natural Language Processing (NLP): NLP is the convergence of linguistics, computer science, and statistics to provide strategies for extracting knowledge from data expressed in natural language (written or spoken). This GE uses strategies for the tokenization, lemmatization, and Part-of-Speech tagging of the gathered data.
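As an illustration of the RSS concept above, the following sketch reads the entries of an RSS 2.0 feed with the Python standard library. The sample feed, its URLs, and the `parse_entries` helper are invented for the example; the GE itself performs this step through its crawler, not through this code.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First entry</title>
      <link>http://example.org/first</link>
      <pubDate>Mon, 06 May 2013 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second entry</title>
      <link>http://example.org/second</link>
    </item>
  </channel>
</rss>"""


def parse_entries(feed_xml):
    """Return one dict per <item>; fields missing from the feed are None."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),  # web page associated with the entry
            "published": item.findtext("pubDate"),
        })
    return entries


for entry in parse_entries(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```

The `link` of each entry is what a crawler would follow to retrieve the associated web page.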


Unstructured Data Analysis GE Architecture

The objective of the Unstructured Data Analysis GE (also known as the UDA GE) is to facilitate the monitoring of web sources in order to detect potential tendencies, by providing tools to gather, preprocess, store, and analyze the unstructured data contained in those sources.

To meet this objective, the UDA GE is composed of a set of components. The following figure presents the UDA GE infrastructure architecture.

UDA architecture

The Unstructured Data Analysis GE is composed of a set of modules, each of which plays a specific role in providing the functionalities of the GE.

The UDA REST API is the interface through which users and software components access the functionalities of this GE. The API provides operations for creating new unstructured data analysis projects and adding sources (RSS feeds) to be analyzed. It also allows obtaining the results of the analysis and the gathered documents.

Once sources are added to a project, the GE begins the crawling process. This task is performed by the UDA Framework, which gets the RSS entries from the feeds included in the project in order to process their content and to retrieve the web pages referenced in the feeds. The UDA Framework uses a customized version of Apache Nutch to execute this operation. All retrieved content is cleaned in order to extract the plain text from the raw RSS/HTML data. The data (both raw and cleaned) is stored in an HBase NoSQL database and also in a Lucene index. The HBase data is used by the tendency detection analysis and is also available for other analyses. The data stored in the Lucene index, in turn, facilitates the retrieval of the gathered documents via the REST API.

Finally, the OCELOt (Online Semantic Concept Extractor based on Linked Open Data) component analyzes the unstructured data in order to detect emerging tendencies; these tendencies can be used to gain awareness of relevant challenges in a specific domain. OCELOt uses different natural language processing strategies (tokenization, lemmatization, Part-of-Speech tagging) and statistical analysis to detect the tendencies. The results of the analysis are stored in a relational database and are made available to the user through the REST API.
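The cleaning step, extracting plain text from raw HTML, can be sketched with the Python standard library as follows. This is only an illustration of the idea: the `TextExtractor` class and `clean_html` function are hypothetical names, and the actual GE performs cleaning inside its customized Apache Nutch crawler, whose rules are not described here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw_html):
    """Return the plain text of an HTML document (simplified sketch)."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


page = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
print(clean_html(page))  # prints: Hello world
```

Real pages also carry navigation menus, advertisements, and similar noise that a tag filter like this one does not remove.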
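This description does not specify the statistical analysis used by OCELOt. Purely as an illustration of how an emerging term might be scored, the sketch below compares the relative frequency of each term in a current time window against a previous one. The function name, the thresholds, and the add-one smoothing are all assumptions made for this example, not the method implemented in the GE.

```python
from collections import Counter


def emerging_terms(previous, current, min_count=2, min_ratio=2.0):
    """Rank terms whose relative frequency grew between two time windows.

    previous, current: Counter mapping term -> occurrences in each window.
    Illustrative heuristic only, not OCELOt's actual analysis.
    """
    prev_total = sum(previous.values()) or 1
    curr_total = sum(current.values()) or 1
    scored = []
    for term, count in current.items():
        if count < min_count:
            continue  # ignore terms that are too rare to be reliable
        curr_freq = count / curr_total
        # Add-one smoothing keeps unseen terms from dividing by zero.
        prev_freq = (previous.get(term, 0) + 1) / (prev_total + 1)
        ratio = curr_freq / prev_freq
        if ratio >= min_ratio:
            scored.append((term, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


last_week = Counter({"bug report": 5, "new release": 5})
this_week = Counter({"bug report": 5, "security patch": 5})
for term, ratio in emerging_terms(last_week, this_week):
    print(term, round(ratio, 2))
```

In this toy data only "security patch" is reported, because "bug report" is equally frequent in both windows.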

Main Interactions

Modules and Interfaces

This section describes the main functionality of the Unstructured Data Analysis GE. The description is based on the functionality provided by the baseline assets. The Backend Functionality section describes the functionality (methods) provided to agents in a service-like style.

Backend Functionality

Backend functionality is the functionality that the GE provides as service invocation methods for both human and computer agents. As described in the Architecture section, this functionality is accessible through a REST Web Services API, which provides the following operations:

  1. Create project: Creates an unstructured data analysis project, a software abstraction that represents the monitoring and analysis of a set of data sources in a specific domain. To invoke the operation, send an HTTP POST request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  2. Get project configuration: Obtains the configuration (name, description, analyzed sources) of a specific project. To invoke the operation, send an HTTP GET request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  3. Delete project: Removes a project; this stops the analysis of the sources contained in the project. To invoke the operation, send an HTTP DELETE request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]
  4. Get project’s data sources configuration: Obtains all the sources analyzed in a specific project. To invoke the operation, send an HTTP GET request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources
  5. Add data source to a project: Adds a new source to be analyzed in a specific project. To invoke the operation, send an HTTP POST request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  6. Get data source configuration: Obtains the configuration of a source analyzed in a specific project. To invoke the operation, send an HTTP GET request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  7. Remove a data source from a project: Removes a source from a project; this stops the data gathering and analysis for that source. To invoke the operation, send an HTTP DELETE request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/sources/[SOURCE_NAME]
  8. Search in a project: Executes a textual search over the documents gathered in a specific project using the Lucene query format [LuceneQuery]. To invoke the operation, send an HTTP GET request to http://<ge url location>/uda-service/uda/[PROJECT_NAME]/search

All the methods described can be invoked by means of regular HTTP requests, either using a web browser (for those that rely on GET requests) or using a client API such as Jersey.
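As a rough illustration, the operations listed above could be prepared from a script as follows. The sketch only builds the requests and sends nothing. The base URL is a placeholder for the GE location, and the helper names are hypothetical. Request bodies and the query parameter of the search operation are not specified in this description, so they are deliberately left out.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: placeholder for the actual <ge url location>.
BASE_URL = "http://localhost:8080"


def project_url(base, project):
    """URL of a project: <base>/uda-service/uda/[PROJECT_NAME]."""
    return f"{base.rstrip('/')}/uda-service/uda/{quote(project, safe='')}"


def source_url(base, project, source):
    """URL of a source: .../[PROJECT_NAME]/sources/[SOURCE_NAME]."""
    return f"{project_url(base, project)}/sources/{quote(source, safe='')}"


def create_project(base, project):
    return Request(project_url(base, project), method="POST")


def delete_project(base, project):
    return Request(project_url(base, project), method="DELETE")


def list_sources(base, project):
    return Request(f"{project_url(base, project)}/sources", method="GET")


def add_source(base, project, source):
    return Request(source_url(base, project, source), method="POST")


def remove_source(base, project, source):
    return Request(source_url(base, project, source), method="DELETE")


request = add_source(BASE_URL, "floss-monitor", "dev-blog")
print(request.get_method(), request.full_url)
```

A prepared request can then be passed to `urllib.request.urlopen` once the real GE location and any required request body are known.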

Frontend Functionality

This GE does not provide a user interface.

Design Principles

The design principle of the GE is to rely on well-known tools and standards in all phases of the processing:

  1. The data is gathered from sources in the RSS and HTML standard formats, which are widely recognized and used across the world to publish unstructured data.
  2. A web crawler based on Apache Nutch (a widely used crawler) retrieves and preprocesses the data, which is then stored in HBase (a well-known NoSQL database).
  3. The linguistic and statistical analysis uses the Stanford NLP framework [StanfordNLP] for the analysis of English texts.
  4. The GE provides an API based on the REST de facto standard [REST2002] to interact with other components.

References

[IDC2011] J. Gantz, D. Reinsel. Extracting Value from Chaos. IDC IVIEW [2]
[StanfordNLP] K. Toutanova, D. Klein, C. D. Manning, Y. Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252-259.
[REST2002] R. T. Fielding, R. N. Taylor. Principled Design of the Modern Web Architecture. ACM Transactions on Internet Technology, Vol. 2, No. 2, May 2002, pages 115–150.
[LuceneQuery] Lucene Query format, [3]
[FIRST] The FIRST Project, [4]
[ALERT] The ALERT Project, [5]