FIWARE.OpenSpecification.Data.SemanticAnnotation - FIWARE Forge Wiki

FIWARE.OpenSpecification.Data.SemanticAnnotation

From FIWARE Forge Wiki

Name: FIWARE.OpenSpecification.Data.SemanticAnnotation
Chapter: Data/Context Management
Catalogue-Link to Implementation: N/A
Owner: Telecom Italia, Mondin Fabio Luciano



Preface

This document contains a self-contained open specification of a FIWARE generic enabler. Please also consult the FIWARE Product Vision, the website at http://www.fiware.org and similar pages in order to understand the complete context of the FIWARE platform.


FIWARE WIKI editorial remark:
This page corresponds to Release 3 of FIWARE. The version associated with the latest Release is linked from FIWARE Architecture.

Copyright

Legal Notice

Please check the following FI-WARE Open Specification Legal Notice (essential patents license) to understand the rights to use these specifications.

Overview

The principle behind the Semantic Web is to evolve the "link" concept from an unspecified element describing the relationship between two elements into a "named relationship". This makes explicit which relationship (or relationships) exists between those elements.

That is the main reason why RDF (Resource Description Framework), the language of Linked Open Data, was invented. RDF is based on triples of the form <SUBJECT> <PREDICATE> <OBJECT>.


The subject is a URI that uniquely identifies the resource being described; the predicate names a relationship or property, and the object is either another resource or a literal value. The Semantic Annotator is basically a tool that tries to identify important entities (places, persons, organizations) in a text and to describe them with Linked Open Data.
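
As a rough illustration of the triple form described above, the sketch below models one RDF statement in plain Python and serializes it as an N-Triples line. The `Triple` type and the `to_ntriples` helper are hypothetical names introduced here for illustration; they are not part of the enabler. The subject URI is the DBpedia resource that appears later in this page, and the predicate is the standard `rdfs:label` property.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type, not part of the GE)."""
    subject: str             # URI of the resource being described
    predicate: str           # URI naming the relationship or property
    obj: str                 # URI of another resource, or a literal value
    obj_is_uri: bool = True  # False when the object is a plain literal


def to_ntriples(t: Triple) -> str:
    """Serialize a triple as a single N-Triples line."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


label = Triple(
    "http://dbpedia.org/resource/Mario_Monti",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Mario Monti",
    obj_is_uri=False,
)
print(to_ntriples(label))
```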


This GE provides a general-purpose text analyzer to identify and disambiguate LOD (Linked Open Data) resources related to the entities in the text. It is built following a modular (plug-in) approach so that text processing and LOD sources can be optimized and distributed. It also generates RDF triples that link directly to LOD resources.

The main conceptual idea of the Semantic Annotation GE is shown in the Figure below.

Conceptual Model of Semantic Annotation GE


Target usage

This GE may be used to augment content (news, books, etc.) with additional information and links to LOD. It provides filtering and search based on LOD resources used as categories/tags.


Target users are all stakeholders that want to enrich textual data (tags or text) with meaningful and external content.

In the media era of the web, much content is text-based or partially contains text, either as media itself or as metadata (e.g. title, description, tags, etc.). Such text is typically used for searching and classifying content, either through folksonomies (tag-based search), predefined categories, or through full-text based queries. To limit information overload with meaningless results, there is a clear need to assist this searching process with semantic knowledge, thus helping to clarify the intention of the user. This knowledge can be further exploited not only to provide the requested content, but also to enrich results with additional, yet meaningful, content that can further satisfy the user's needs.

Semantics, and in particular Linked Open Data (LOD), is helpful not only in annotating and categorizing content, but also in providing additional rich information that can improve the user experience.

As end-user content can be of any type and in any language, such an enabler requires a general-purpose and multilingual approach to the annotation task.

Typical users or applications can thus be found in the areas of eTourism or eReading, where content can benefit from such functionality when visiting a place or reading a book, for example by being provided with additional information about the location or the characters cited.

The pure semantic annotation capabilities can help editors to categorize content in a meaningful manner, thus limiting ambiguous search results (e.g. an article would not simply be tagged with "apple", but with its exact concept, i.e. the fruit, New York City or the brand).


Basic Design Principles

The Enabler has been designed following a modular approach, as shown in the figure above. This way, each component in the enabler can be developed or replaced, provided that it keeps the same input/output format.

The Semantic Annotation reasoner (SANr) communicates with a full-text based resolver to identify entities in the text, and with Semantic Data Storages to link these entities with candidates.


This leaves the way open to replacing the data sources, so that sources other than DBpedia [1] or Geonames [2] can be used, or to changing the process by which a candidate is chosen for each entity.

Basic Concepts

The GE has a web API, supports multilingual texts (Italian, English, Spanish, Portuguese), retrieves "candidate" LOD resources and performs disambiguation. As a result, the GE creates external links and HTML snippets that present LOD information in a user-friendly way.

The API processes the input text with a language processor in order to identify the entities in the text, which are basically persons, places and organizations. This is done by combining grammatical and syntactic information.

Once the entities are identified, the system tries to associate a list of candidates with each entity. Candidates are entries coming from DBpedia and Geonames, which are the most widely used general-purpose semantic databases. Candidate association is performed by comparing each entity with the DBpedia labels; the most similar ones are chosen as candidates.

For each candidate, the system computes a score based on a syntactic similarity metric (e.g. if the entity is “foo”, a candidate with label “foo” will have a higher score than another one with label “foo bar”). This score is then combined with another score coming from an algorithm that tries to evaluate how well each candidate semantically fits the context. To better understand the structure of a candidate, see the example in the “Main Interactions” section.
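
The scoring idea above can be sketched as follows. This is only an illustration under stated assumptions: the specification does not name the similarity metric or the way the two scores are mixed, so the normalized Levenshtein similarity, the `combined_score` function and its `weight` parameter are all hypothetical choices made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def syntactic_similarity(entity: str, label: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (case-insensitive)."""
    longest = max(len(entity), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(entity.lower(), label.lower()) / longest


def combined_score(syntactic: float, contextual: float, weight: float = 0.5) -> float:
    """Assumed linear mix of the syntactic and contextual scores."""
    return weight * syntactic + (1.0 - weight) * contextual


# The exact label scores higher than the longer one, as in the "foo" example.
print(syntactic_similarity("foo", "foo"))
print(syntactic_similarity("foo", "foo bar"))
```

With these definitions, a candidate whose label matches the entity exactly gets a syntactic similarity of 1.0, while "foo bar" is penalized for its four extra characters. How strongly the contextual score should weigh in is left open here.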

External Modules (such as Semantic Data Repositories) are parametric, so one can decide to replicate semantic datasets (such as DBPedia) locally, in order to improve performance. A typical usage, with Semantic Annotation used jointly with a local semantic data storage and a Relational-to-Semantic Converter, is shown in the Figure below.


Semantic Annotation Typical Usage

Main Interactions

The enabler essentially consists of an API that is called with a simple HTTP GET request to the following URL, so the interaction is a simple CALL->RESPONSE.

http://semantican.lab.fi-ware.eu/ajax/extract_words.php?text=

The text to analyze must be passed as the "text" parameter, as shown in the URL above.
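
A minimal sketch of how a client might build such a request URL is given below. The endpoint and the "text" parameter come from this page; the `build_request_url` helper is a hypothetical name. The sketch only constructs the URL and leaves sending the GET request to the caller.

```python
from urllib.parse import urlencode

# Endpoint as given in this specification.
ENDPOINT = "http://semantican.lab.fi-ware.eu/ajax/extract_words.php"


def build_request_url(text: str) -> str:
    """Return the GET URL with the text to analyze URL-encoded."""
    return f"{ENDPOINT}?{urlencode({'text': text})}"


url = build_request_url("Mario Monti")
print(url)
```

`urlencode` takes care of escaping spaces and non-ASCII characters, which matters because the input is free text in several languages.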

This system will:


1. Identify Text Language

2. Identify Entities (People, Places, Organizations) in the Text

3. For each entity found, search the Semantic Data Sources (DBpedia and Geonames) for related Linked Open Data objects.

4. The found LOD objects for each entity are returned in JSON Format (since it is more versatile than XML) as "candidates". Each candidate has a score. The candidate with the highest score is flagged as "preferred".

5. The query is logged into a Database with an ID.


Here's an example of the return result in JSON format.

{
    "queryId": "12143",
    "lang": "it",
    "keywords": "Mario+Monti",
    "extags": "Mario Monti",
    "freeling": "Mario_Monti",
    "proc_time": "13",
    "terms": [
        {
            "id": "tc-Mario+Monti",
            "term": "Mario Monti",
            "candidates": [
                {
                    "id": "tag--Mario_Monti--http://dbpedia.org/resource/Mario_Monti",
                    "label": "Mario Monti",
                    "uri": "http://dbpedia.org/resource/Mario_Monti",
                    "type": "user",
                    "ext": "Mario Monti",
                    "extra": [],
                    "wrapper": "dbpedia",
                    "lev": "2",
                    "sim": "0.909090909091",
                    "sis": "1",
                    "jw": "0.963636363636",
                    "sc": "1",
                    "class": "empty",
                    "preferred": "true"
                }
            ],
            "html": "<fieldset><div class=panel><div class=header>A proposito di <b>Mario Monti</b></div><div class=panel_body></div></div><div class=panel><div class=panel_body><img src='http://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Il_Presidente_del_Consiglio_incaricato_Mario_Monti_(cropped).jpg/200px-Il_Presidente_del_Consiglio_incaricato_Mario_Monti_(cropped).jpg' height=160 /><br><div class=info>È senatore a vita dal 9 novembre 2011 e dal successivo 16 novembre assume, per la prima volta, l'incarico di Presidente del Consiglio dei Ministri della Repubblica Italiana e allo stesso tempo di Ministro dell'Economia e delle Finanze dello stesso governo. Presidente dell'Università  Bocconi dal 1994, Monti è stato c...<ul><li><a href='http://www.guardian.co.uk/world/mario-monti' target='_blank'>Link utile</a></li></ul></div></div></div></fieldset><fieldset><legend>Concetti associati a <strong>Mario Monti</strong></legend><ul><li><img src='img/user.png' alt='user' title='user'> <a href='http://dbpedia.org/resource/Mario_Monti' target='_blank' title='[2-0.909090909091-0.963636363636/1]' >Mario Monti</a> (dbpedia)</li></ul></fieldset>",
            "class": "empty"
        }
    ]
}
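
The response above can be consumed along these lines. This is a sketch rather than a reference client: `preferred_candidates` is a hypothetical helper, and the embedded payload is a trimmed copy of the example response. Note that, in the example, "preferred" is the string "true" rather than a JSON boolean, so the comparison is done against a string.

```python
import json

# Trimmed copy of the example response shown above.
response_body = """
{
    "queryId": "12143",
    "lang": "it",
    "terms": [
        {
            "id": "tc-Mario+Monti",
            "term": "Mario Monti",
            "candidates": [
                {
                    "label": "Mario Monti",
                    "uri": "http://dbpedia.org/resource/Mario_Monti",
                    "wrapper": "dbpedia",
                    "preferred": "true"
                }
            ]
        }
    ]
}
"""


def preferred_candidates(payload: dict) -> dict:
    """Map each term to the URI of its candidate flagged as preferred."""
    result = {}
    for term in payload.get("terms", []):
        for candidate in term.get("candidates", []):
            # The flag is serialized as the string "true" in the example.
            if candidate.get("preferred") == "true":
                result[term["term"]] = candidate["uri"]
                break
    return result


print(preferred_candidates(json.loads(response_body)))
```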



Moreover, by setting the 'html_snippet=on' parameter in the request URL, an HTML snippet for the preferred DBPedia entry is returned if possible. The HTML Snippet contains a Picture and Short Abstract for the resource.



Re-utilised Technologies/Specifications

Here is a list of Re-utilised Technologies for the enabler:

- Freeling 2.2: the enabler uses Freeling as a language processing tool to perform Named Entity Recognition. [3]

- DBpedia: one of the most important general data sources used by the enabler. [4]

- Geonames: the reference data source for places. [5]

Terms and definitions

This section comprises a summary of terms and definitions introduced during the previous sections. It intends to establish a vocabulary that will help to carry out discussions internally and with third parties (e.g., Use Case projects in the EU FP7 Future Internet PPP). For a summary of terms and definitions managed at overall FI-WARE level, please refer to FIWARE Global Terms and Definitions.

  • Data refers to information that is produced, generated, collected or observed that may be relevant for processing, carrying out further analysis and knowledge extraction. Data in FIWARE has an associated data type and a value. FIWARE will support a set of built-in basic data types similar to those existing in most programming languages. Values linked to basic data types supported in FIWARE are referred to as basic data values. As an example, basic data values like ‘2’, ‘7’ or ‘365’ belong to the integer basic data type.
  • A data element refers to data whose value is defined as consisting of a sequence of one or more <name, type, value> triplets referred to as data element attributes, where the type and value of each attribute is either mapped to a basic data type and a basic data value or mapped to the data type and value of another data element.
  • Context in FIWARE is represented through context elements. A context element extends the concept of data element by associating an EntityId and EntityType to it, uniquely identifying the entity (which in turn may map to a group of entities) in the FIWARE system to which the context element information refers. In addition, there may be some attributes as well as meta-data associated to attributes that we may define as mandatory for context elements as compared to data elements. Context elements are typically created containing the value of attributes characterizing a given entity at a given moment. As an example, a context element may contain values of some of the attributes “last measured temperature”, “square meters” and “wall color” associated to a room in a building. Note that there might be many different context elements referring to the same entity in a system, each containing the value of a different set of attributes. This allows that different applications handle different context elements for the same entity, each containing only those attributes of that entity relevant to the corresponding application. It will also allow representing updates on set of attributes linked to a given entity: each of these updates can actually take the form of a context element and contain only the value of those attributes that have changed.
  • An event is an occurrence within a particular system or domain; it is something that has happened, or is contemplated as having happened in that domain. Events typically lead to the creation of some data or context element describing or representing the events, thus allowing them to be processed. As an example, a sensor device may be measuring the temperature and pressure of a given boiler, sending a context element every five minutes associated to that entity (the boiler) that includes the value of these two attributes (temperature and pressure). The creation and sending of the context element is an event, i.e., what has occurred. Since the data/context elements that are generated linked to an event are the way events become visible in a computing system, it is common to refer to these data/context elements simply as "events".
  • A data event refers to an event leading to creation of a data element.
  • A context event refers to an event leading to creation of a context element.
  • An event object is used to mean a programming entity that represents an event in a computing system [EPIA], such as event-aware GEs. Event objects allow operations to be performed on events, also known as event processing. Event objects are defined as a data element (or a context element) representing an event to which a number of standard event object properties (similar to a header) are associated internally. These standard event object properties support certain event processing functions.
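
To make the data element and context element definitions above more concrete, here is a small sketch that models them as Python dataclasses. The class and field names (`Attribute`, `DataElement`, `ContextElement`, `entity_id`, `entity_type`) and the sample values are assumptions chosen for illustration; FIWARE does not prescribe this representation.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Attribute:
    """One <name, type, value> triplet of a data element."""
    name: str
    type: str
    value: Any


@dataclass
class DataElement:
    """A sequence of one or more attributes."""
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class ContextElement(DataElement):
    """A data element tied to an entity through its id and type."""
    entity_id: str = ""
    entity_type: str = ""


# Hypothetical room entity using the attribute names from the text.
room = ContextElement(
    attributes=[
        Attribute("last measured temperature", "float", 21.5),
        Attribute("square meters", "integer", 30),
        Attribute("wall color", "string", "white"),
    ],
    entity_id="room-1",
    entity_type="Room",
)

# An update can be another context element carrying only what changed.
update = ContextElement(
    attributes=[Attribute("last measured temperature", "float", 22.0)],
    entity_id="room-1",
    entity_type="Room",
)
print(len(room.attributes), len(update.attributes))
```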