
FIWARE.OpenSpecification.Data.SocialSemanticEnricher

Name: FIWARE.OpenSpecification.Data.SocialSemanticEnricher
Chapter: Data/Context Management
Catalogue-Link to Implementation: not available (the Social Semantic Enricher GE was discontinued and is not present in the catalogue)
Owner: Telecom Italia, Fabio Luciano Mondin (TI)



Preface

Within this document you find a self-contained open specification of a FIWARE generic enabler. Please also consult the FIWARE Product Vision, the website at http://www.fiware.org and related pages in order to understand the complete context of the FIWARE platform.

Copyright

Copyright © 2010-2014 by TI. All Rights Reserved.

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.

Overview

The evolution of the traditional Web into a Semantic Web and the continuous increase in the amount of data published as Linked Data open up new opportunities for annotation and categorization systems to reuse these data as semantic knowledge bases. Accordingly, Linked Data has been used by information extraction systems as a set of semantic knowledge bases which can be interconnected and structured in order to increase the precision and recall of annotation and categorization mechanisms.

The goal of the Semantic Web (or Web of Data) is to describe the meaning of the information published on the Web to allow retrieval based on an accurate understanding of its semantics. The Semantic Web adds structure to the resources accessible online in ways that are not only usable by humans, but also by software agents that can rapidly process them. Linked Data (LD) refers to a way of publishing and interlinking structured data on the Web. LD is part of the design of the Semantic Web and represents the foundation “bricks” needed to build it. Although the concept of “linked data” was already present in the theory of the Semantic Web, it only came into vogue in computer science later, because the computing capacity available at the time was too low.

However, in recent years, due to the growing number of datasets based on LD, it has become possible to exploit their implicit knowledge through text classification and annotation processes in order to build semantic applications.

The Social Semantic Enricher is a generic enabler offering different sets of functionalities, which can be made to interoperate or be used independently, depending on the end user's needs. The basic idea is to take a text as input. The text is processed by the enabler, which tries to perform a "semantic classification" by extracting concepts related to the text itself. For each of these concepts a similarity score is computed, implicitly expressing how much the concept "is about" the given text; calls to external "enrichment" providers are then made in order to obtain visual or media representations of the extracted concepts.

In order to perform the described operations, several modules have to work together: the proposed architecture allows the enabler to be used either by making separate calls to the individual modules or by making a single call that drives the modules in an ordered, coordinated way. So, while on one side the classifying API can be used on a plain text or on a file to obtain enriched concepts stating "this text is about ...", on the other side it is also possible to use the modules separately for simpler operations such as extracting text from a file, getting concepts without enrichment, enriching concepts obtained in other ways, and so on.

Basic Concepts

Technologies Involved

From a technological point of view, the Social Semantic Enricher builds on a wide set of technologies, namely:


  • Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data.[1]
  • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. [2]
  • Text Classification is the assignment of a text into one or more pre-existing classes (also known as features). This process determines the class membership of a text document given a set of distinct classes with a profile and a number of features. The criterion for the selection of relevant features for classification is essential and is determined a priori by the classifier (human or software).
  • Semantic Classification takes place when the elements of interest in the classification refer to the “meaning” of the document. Text annotation refers to the common practice of adding information to the text itself through underlining, notes, comments, tags or links. The annotation of text can also be semantic, when the text of a document is augmented with information about its meaning or the meaning of the individual elements that compose it. This is done primarily using links that connect a word, an expression or a phrase to an information resource on the Web or to an unambiguous entity present in a knowledge base.

General Information

The adoption of Linked Data best practices for exposing and connecting information on the Web has had considerable success in several areas: multimedia publishing, open government, health care. Moreover, a specific line of research explores the points of convergence of Linked Data and Natural Language Processing (NLP): DBpedia, a central interlinking hub for the Linked Data project, has proven to be a very suitable knowledge base for text classification, for both technical reasons and more theoretical considerations. Furthermore, DBpedia is directly linked to arguably the largest multilingual annotated corpus ever created, which is Wikipedia: thus, it is technically well suited for automated tasks in the field of NLP.

The Enabler intends to leverage Linked Data and NLP technologies to extract the main topics from texts in the form of DBpedia resources, retrieving new information from the Web. In order to maintain an effective classification process, in particular on texts concerning recent topics (covered, for instance, by newspapers), the DBpedia knowledge base employed for topic extraction shall be kept updated to the latest version.

The Social Semantic Enricher implements a memory-based learning approach to semantic classification, a subcategory of lazy learning. A distinctive feature of this approach, also known as instance-based learning, is that the system does not create an abstract model of the classification categories (profiles) before the process of text categorization. Instead, it assigns the target document to a class on the basis of a local comparison between the pre-classified documents and the target.
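A minimal sketch of this instance-based idea follows. It is independent of the actual SSE implementation: the target is labelled by a local comparison with the stored, pre-classified instances, without building any class profile beforehand, and a simple Jaccard overlap stands in for Lucene's similarity metric.

  import java.util.*;

  public class InstanceBasedSketch {
      // A pre-classified document: its text tokens and its class label.
      record Instance(Set<String> tokens, String label) {}

      // Jaccard overlap, used here as a stand-in for Lucene's similarity metric.
      static double similarity(Set<String> a, Set<String> b) {
          Set<String> inter = new HashSet<>(a); inter.retainAll(b);
          Set<String> union = new HashSet<>(a); union.addAll(b);
          return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
      }

      // No abstract model is trained: the target simply receives the label of its most similar instance.
      static String classify(Set<String> target, List<Instance> instances) {
          return instances.stream()
                  .max(Comparator.comparingDouble((Instance i) -> similarity(target, i.tokens())))
                  .map(Instance::label)
                  .orElse("unknown");
      }
  }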

How It Works

Dataset Generation

This is the first phase. The Core Data Updater generates Lucene indexes from the Wikipedia and DBpedia datasets, which will then be queried by the Core module.


In this phase, the whole DBpedia and Wikipedia datasets are processed so that, for each entry in DBpedia/Wikipedia (they are natively mapped 1:1), a "Lucene Document" is saved into the Lucene dataset. Each document has the following entries:

  • URI (key): identifies the DBpedia resource in a unique way.
  • A certain number of CONTEXT fields: there is one context for each paragraph in Wikipedia in which the target entry (the one being saved in the dataset) appears as a link. A context therefore corresponds to a Wikipedia paragraph containing a link to the target resource.
  • URI COUNT: the number of contexts.
  • Other structured metadata coming from DBpedia.
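As an illustration only (not the reference implementation), the following Java sketch shows how a document with these entries could be written with the Lucene API. The field names mirror the entries listed above, while the index path, the metadata field name and the sample values are assumptions.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.*;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;
  import java.nio.file.Paths;
  import java.util.List;

  public class DatasetIndexerSketch {
      // Indexes one DBpedia/Wikipedia entry as a Lucene Document.
      static void indexEntry(IndexWriter writer, String uri, List<String> contexts,
                             String dbpediaMetadata) throws Exception {
          Document doc = new Document();
          // URI acts as the unique key identifying the DBpedia resource.
          doc.add(new StringField("URI", uri, Field.Store.YES));
          // One CONTEXT field per Wikipedia paragraph that links to this resource.
          for (String paragraph : contexts) {
              doc.add(new TextField("CONTEXT", paragraph, Field.Store.YES));
          }
          // URI COUNT: how many contexts were found.
          doc.add(new StoredField("URI_COUNT", contexts.size()));
          // Other structured metadata coming from DBpedia (hypothetical field name).
          doc.add(new TextField("DBPEDIA_METADATA", dbpediaMetadata, Field.Store.YES));
          writer.addDocument(doc);
      }

      public static void main(String[] args) throws Exception {
          IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
          try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("sse-index")), config)) {
              indexEntry(writer, "http://dbpedia.org/resource/Semantic_Web",
                         List.of("The Semantic Web is an extension of the World Wide Web ..."),
                         "rdf:type dbo:Thing");
          }
      }
  }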


Text Pre-Processing

This phase is necessary when the text is contained in a web page or a file. The Article Extractor module is responsible for extracting plain text from different file formats and for scraping text from web pages.
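The specification does not say which parsing library the Article Extractor relies on; purely as an example of this kind of pre-processing, a facade such as Apache Tika can extract plain text from both local files and web pages (the file name and URL below are placeholders):

  import org.apache.tika.Tika;
  import java.io.File;
  import java.net.URL;

  public class ArticleExtractorSketch {
      public static void main(String[] args) throws Exception {
          Tika tika = new Tika();
          // Extract plain text from a local document (PDF, DOCX, HTML, ...).
          String fromFile = tika.parseToString(new File("article.pdf"));
          // Scrape the readable text of a web page.
          String fromPage = tika.parseToString(new URL("https://example.org/news/article.html"));
          System.out.println(fromFile.length() + " / " + fromPage.length());
      }
  }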

Text Classification

This phase takes text as input and, via text processing and Lucene analysis, returns a URI list.

When a text comes as input, some lemmatisation and stemming operations are performed; after that, the SSE compares the input text with all of the CONTEXT fields saved in the Lucene dataset (which means it compares the text with all the Wikipedia paragraphs containing at least one link). For each context, a similarity metric offered by the Lucene technology is computed. The set of the N URIs (where N can be specified to the system) corresponding to the contexts with the highest level of similarity is returned.
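Sticking to the index layout sketched in the Dataset Generation section, this comparison could be expressed as a plain Lucene query over the CONTEXT field, as in the following sketch (the index path and field names are the same assumptions made above, and the actual SSE scoring may differ):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.*;
  import org.apache.lucene.store.FSDirectory;
  import java.nio.file.Paths;

  public class CoreClassifierSketch {
      // Prints the URIs of the N documents whose CONTEXT fields are most similar to the input text.
      static void classify(String inputText, int n) throws Exception {
          try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("sse-index")))) {
              IndexSearcher searcher = new IndexSearcher(reader);
              // The analyzer tokenizes and normalizes the input, approximating the
              // lemmatisation/stemming step described above.
              QueryParser parser = new QueryParser("CONTEXT", new StandardAnalyzer());
              Query query = parser.parse(QueryParser.escape(inputText));
              for (ScoreDoc hit : searcher.search(query, n).scoreDocs) {
                  // Each hit carries Lucene's similarity score for the matched contexts.
                  System.out.println(searcher.doc(hit.doc).get("URI") + "  score=" + hit.score);
              }
          }
      }
  }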

Enrichment Phase

The enrichment phase takes the URI list as input and searches external data sources for videos, images and other multimedia, or even for other texts/documents with which to expand each concept.
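The specification does not list the external providers that are queried; as one possible example of enrichment, the public DBpedia SPARQL endpoint can be asked for an image depicting a given resource URI (the endpoint and property used here are just one option among many):

  import java.net.URI;
  import java.net.URLEncoder;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.charset.StandardCharsets;

  public class EnricherSketch {
      // Asks the public DBpedia SPARQL endpoint for an image depicting the given resource.
      static String findDepiction(String resourceUri) throws Exception {
          String sparql = "SELECT ?img WHERE { <" + resourceUri
                  + "> <http://xmlns.com/foaf/0.1/depiction> ?img } LIMIT 1";
          String url = "https://dbpedia.org/sparql?format=json&query="
                  + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
          HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
          HttpResponse<String> response = HttpClient.newHttpClient()
                  .send(request, HttpResponse.BodyHandlers.ofString());
          return response.body(); // JSON result containing the depiction URL, if any
      }

      public static void main(String[] args) throws Exception {
          System.out.println(findDepiction("http://dbpedia.org/resource/Semantic_Web"));
      }
  }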

Social Semantic Enricher GE Architecture

The next figure shows the Social Semantic Enricher General Architecture.


Even though most of the modules were already mentioned above, the following list provides a functional description of each software module.


  • Article Extractor: this module extracts text from different file formats, web pages and so on, in order to pass plain text to the other modules.


  • Core + Core API: the Core module contains the datasets; it receives text as input via the interface modules (Article Extractor, GUI, API) and finds the “list of related concepts”. The API part gives REST-like access to the core functionalities.
  • Enricher + Enricher API: it enriches the concepts extracted by the Core (or any other input in the same format) with external media. It accepts a specific format from the Core, but it works for any data provided in the same format via the API.
  • GUI (Classify + Enrich): a user-friendly GUI that gives easy access to the enabler's functionalities.
  • Core Updater (Offline Processing): it processes the datasets so that they are ready to be queried. The goal is to obtain an automatic processor able to regenerate the index for any new version of the dataset released by DBpedia/Wikipedia.

Main Interactions

This "Main Interactions" section will be divided into two main subsections. The "User To System Interactions" will explain the main interactions between the modules into the enabler, subsequent to a typical user-call to the overall system, while the second section, named "User To Module Interactions" will briefly describe the possible interactions with the single modules, made possible by the modular architecture of the system.

User to System Interactions

The modules inside the generic enabler are software modules capable of interacting with each other when enabled to do so. The normal usage of the enabler is in fact to be triggered by an API call or by a user interaction with the GUI, in order to obtain a list of enriched concepts from a text in some format. In this sense, the typical interaction, involving all modules, is described by the next sequence diagram:

This is the most common interaction flow between the modules. Everything starts with an external call to the API or GUI, which sends the file or the text to be processed to the enabler. If needed, the text is processed by the Article Extractor module in order to extract plain text. The plain text is then passed to the Core, which extracts the concepts in the form of a URI list. The URI list is then passed to the enrichment module which, leveraging external data sources, enriches each URI with external media and sends the response back to the API/GUI, which is responsible for returning a JSON or a visual representation to the caller.
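A rough sketch of this flow is given below; the module interfaces are hypothetical and only meant to mirror the sequence diagram, not the GE's actual internal API.

  import java.util.List;

  // Hypothetical module interfaces; the real internal APIs of the GE may differ.
  interface ArticleExtractor { String extractPlainText(byte[] fileOrPage); }
  interface Core { List<String> classify(String plainText, int topN); }
  interface Enricher { String enrich(List<String> uris); } // returns JSON

  public class ClassifyAndEnrichFlow {
      private final ArticleExtractor extractor;
      private final Core core;
      private final Enricher enricher;

      public ClassifyAndEnrichFlow(ArticleExtractor e, Core c, Enricher en) {
          this.extractor = e; this.core = c; this.enricher = en;
      }

      // Mirrors the sequence diagram: extract plain text, classify it into a URI list,
      // enrich the URIs with external media and return the JSON response to the caller.
      public String process(byte[] fileOrPage, int topN) {
          String plainText = extractor.extractPlainText(fileOrPage);
          List<String> uris = core.classify(plainText, topN);
          return enricher.enrich(uris);
      }
  }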

User to Module Interactions

Due to the modular architecture, each of the modules performing operations is provided with a set of APIs, so that every single module can be used as a free-standing component rather than as part of the overall system. So in this case:

  • The Core module can be called directly, sending plain text to the Core API and obtaining a URI list as output.
  • The Article Extractor can be called directly, sending a file or a web URL via the API and obtaining plain text as output.
  • The Enrichment module can be called directly, sending an input coherent with the URI list coming out of the Core module; it will give an enriched list as output. The enrichment is performed according to the parameters sent to the API.

The Core Updater Module already works as a freestanding module.
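The concrete endpoint paths and payloads are defined by the GE's API specification and are not reproduced here; purely as an illustration of a direct call to the Core API, a client could look like the following (host, port, path and query parameter are hypothetical):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class DirectCoreCallSketch {
      public static void main(String[] args) throws Exception {
          // Hypothetical endpoint: POST plain text to the Core API and read back the URI list.
          HttpRequest request = HttpRequest
                  .newBuilder(URI.create("http://localhost:8080/sse/core/classify?n=5"))
                  .header("Content-Type", "text/plain")
                  .POST(HttpRequest.BodyPublishers.ofString("The Semantic Web extends the World Wide Web ..."))
                  .build();
          HttpResponse<String> response = HttpClient.newHttpClient()
                  .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body()); // expected: a list of DBpedia URIs with similarity scores
      }
  }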

Basic Design Principles

The Social Semantic Enricher GE was designed with the following design principles:

  • Modular Architecture: The GE can be used both as an overall system performing a wide set of tasks by following a specific operation path or as a set of single modules performing specific tasks.
  • Scalability: The GE offers a set of APIs, which can be used both for direct interaction with the single modules and for interaction with the overall system. Both the APIs and the overall system are stateless (except for an optional caching stage, which can in any case improve performance when the load increases), so the system scales natively.
  • Suitable for Cloud Environments: The choice to remain stateless makes the GE suitable for cloud environments.
  • Application Development: Application developers do not need to know much about the internals of the GE; they can simply explore the set of APIs and understand their I/O.
  • Extensible System: Almost every module can be easily extended: the set of external sources for the enricher can be extended, and the set of file formats handled by the Article Extractor can be extended too. Even the Core, which is harder to extend, can be completely replaced with another module providing the same I/O format.
  • Updatable System: Effort is being spent to keep the system's core data easily updatable.

Detailed Specifications

Re-utilised Technologies/Specifications

The project makes use of:

Terms and definitions

The description should not contain specific terms that need to be explained. However, should this happen, please contact the GE owner for any clarification.
