FIWARE.OpenSpecification.Security.Optional Security Enablers.DBAnonymizer

Name: FIWARE.OpenSpecification.Security.Optional Security Enablers.DBAnonymizer
Chapter: Security
Catalogue-Link to Implementation: N/A
Owner: SAP, Francesco Di Cerbo

Preface

This document contains a self-contained open specification of a FIWARE generic enabler. To understand the complete context of the FIWARE platform, please also consult the FIWARE Product Vision, the website at http://www.fiware.org and similar pages.

Copyright

  • Copyright © 2014 by SAP

Legal Notice

Please check the following FI-WARE Open Specification Legal Notice (essential patents license) to understand the rights to use this open specification. Like all other FI-WARE members, SAP has chosen one of the two FI-WARE license schemes for open specifications.

To illustrate this open specification license from our SAP perspective:

  • SAP makes the specifications of this Generic Enabler available under IPR rules that allow for exploitation and sustainable usage both in Open Source and in proprietary, closed source products, to maximize adoption.
  • This Open Specification is exploitable for proprietary 3rd party products and is exploitable for open source 3rd party products, including open source licenses that require patent pledges.
  • If the owner (SAP) of this GE spec holds a patent that is essential to create a conforming implementation of the GE spec (i.e. it is impossible to write a conforming implementation without violating the patent) then a license to that patent is deemed granted to the implementation.

Overview

Large organizations hold thousands of terabytes of data about their customers or their activities. They often have to release data files containing private information to third parties for data analysis, application testing or support. To preserve individuals’ privacy and comply with privacy regulations, parts of the released datasets have to be hidden or anonymized using various anonymization techniques.

However, two different problems may arise: first, deciding whether a piece of data has to be considered private, and second, assessing whether exposed non-private data could be used by correlation algorithms to infer hidden private data. The second task is particularly challenging and cannot be handled manually for large datasets, where the number of potential combinations of different fields is extremely large. In fact, disclosure policies are typically written by human users (security experts and others) who cannot predict all the possible combinations of data that could make it easier to guess private data contained in the dataset. In other cases, policy authors are not security experts and could expose sensitive data without being aware of the impact of such exposure.

DB Anonymizer is a database re-identification risk evaluation and anonymization service; it can be used as a support tool for dataset disclosure operations. DB Anonymizer estimates the re-identification risk associated with information disclosures, which is the risk that an attacker can exactly reconstruct a dataset's content. This estimation is then used to provide DB Anonymizer users with a number of functionalities connected to dataset anonymization. For instance, the service exposes a function that calculates a value representing the likelihood (from 0, impossibility, to 1, certainty) that an attacker can exactly reconstruct the content of a dataset that is anonymized using a certain obfuscation policy.

Although privacy risk estimators have already been developed in some specific contexts (statistical databases), they have had limited impact, since they are often too specific to a given context and do not provide the user with the feedback necessary to mitigate the risk. In addition, they can be computationally expensive on large datasets. DB Anonymizer is specifically designed to address all these issues, exposing a simple RESTful API that can be easily integrated into any application.

DB Anonymizer uses a special algorithm to estimate the re-identification risk. Details on this algorithm can be found in the following article:

  • Trabelsi, S.; Salzgeber, V.; Bezzi, M.; Montagnon, G., "Data disclosure risk evaluation," 2009 Fourth International Conference on Risks and Security of Internet and Systems (CRiSIS), pp. 35-72, 19-22 Oct. 2009. DOI: 10.1109/CRISIS.2009.5411979


Target usage

The service can be used by information owners or responsible persons to evaluate the re-identification risk associated with a disclosure operation on their data, to obtain suggestions for the safest configurations according to a specified upper bound and, finally, to perform the dataset anonymization operation according to a disclosure policy. More precisely, through the methods of its API, the service provides the user with an estimation of the re-identification risk when disclosing certain information, and proposes safe combinations in order to minimize the risk that an attacker can reconstruct the original dataset. For instance, the service can estimate the re-identification risk associated with all attributes of a dataset (i.e., its columns); this functionality helps users define the anonymization policies that best suit their business needs and minimize the re-identification risk.

At this stage, DB Anonymizer supports DB dumps in basic SQL syntax. It is, however, recommended to use MySQL SQL statements.

Basic Concepts

Relevant Concepts and Ideas

To operate, DB Anonymizer needs as input from users:

  1. a dump of a MySQL table, containing the full dataset to disclose, together with
  2. a disclosure policy (also known as obfuscation policy).

Both inputs are mandatory: the service's algorithm needs them to evaluate the effectiveness of the disclosure policy and to perform any other supported operation. In fact, the disclosure policy defines the structure of the dataset and, in particular, the sensitivity of each dataset element type (i.e., each column of the input table). Once the policy is evaluated, the table is dropped from the DB and the dump file is erased. The application server encapsulation model permits complete isolation of each request's data, and any intermediate result that is created during the algorithm's execution and associated with real dataset contents is deleted immediately at the end of the computation.

Input Format

Generally, DB Anonymizer functions have two parameters:

  1. an SQL table dump (e.g., using MySQL dialect), containing all information to be disclosed: this file shall contain only a table definition and a set of elements to populate it;
  2. a policy file encoded in XML, that describes which information of the previously specified table is going to be disclosed: the policy file is described by the following XML Schema directives:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="Type">
        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:enumeration value="identifier" />
                <xs:enumeration value="sensitive" />
            </xs:restriction>
        </xs:simpleType>
    </xs:element>

    <xs:element name="Column">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Name" type="xs:string" />
                <xs:element ref="Type" />
                <xs:element name="Hide" type="xs:boolean" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:element name="Policy">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="Column" maxOccurs="unbounded" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>
 
</xs:schema>
 

A sample policy file is the following:

<Policy>
  <Column>
    <Name>Gender</Name>
    <Type>identifier</Type>
    <Hide>false</Hide>
  </Column>
  <Column>
    <Name>Wine</Name>
    <Type>sensitive</Type>
    <Hide>false</Hide>
  </Column>
</Policy>

The “Type” information shall be either “identifier” or “sensitive”, so that the service algorithm can distinguish between the two kinds of column. Please refer to the Glossary (FIWARE.Glossary.Security.Optional Security Enablers.DBAnonymizer) for an explanation of the two terms.
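As an illustration of the policy format, the following is a minimal Python sketch that builds and reads back a policy document with the element names from the schema above. The helper names `build_policy` and `parse_policy` are hypothetical and are not part of the DB Anonymizer API; the sketch checks only the Type enumeration and the boolean form of Hide, and does not perform full XML Schema validation.

```python
import xml.etree.ElementTree as ET

# The two values allowed by the schema's Type enumeration.
ALLOWED_TYPES = {"identifier", "sensitive"}


def build_policy(columns):
    """Build a <Policy> document from (name, type, hide) tuples (hypothetical helper)."""
    policy = ET.Element("Policy")
    for name, col_type, hide in columns:
        if col_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported Type {col_type!r} for column {name!r}")
        column = ET.SubElement(policy, "Column")
        # Child order follows the xs:sequence in the schema: Name, Type, Hide.
        ET.SubElement(column, "Name").text = name
        ET.SubElement(column, "Type").text = col_type
        ET.SubElement(column, "Hide").text = "true" if hide else "false"
    return ET.tostring(policy, encoding="unicode")


def parse_policy(xml_text):
    """Return (name, type, hide) tuples, rejecting malformed columns (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag != "Policy":
        raise ValueError("root element must be <Policy>")
    columns = []
    for column in root.findall("Column"):
        name = column.findtext("Name")
        col_type = column.findtext("Type")
        hide_text = (column.findtext("Hide") or "").strip()
        # xs:boolean accepts "true"/"1" and "false"/"0".
        if name is None or col_type not in ALLOWED_TYPES or hide_text not in ("true", "false", "1", "0"):
            raise ValueError("malformed <Column> entry")
        columns.append((name, col_type, hide_text in ("true", "1")))
    return columns


# Reproduces the two-column sample policy shown above.
policy_xml = build_policy([("Gender", "identifier", False), ("Wine", "sensitive", False)])
print(policy_xml)
```

Rejecting unknown Type values on the client side is a design choice of this sketch; it simply surfaces a malformed policy before it is sent to the service.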

Use Case

A company holds information about people structured in dataset records. Each record has many attributes, such as birthday, address, marital status and occupation, that are useful for the company's purposes but are usually not sensitive if considered in isolation. Other attributes related to the connection between an individual and the company, such as customer purchases, debts and credit rating, may be sensitive. Suppose that one such dataset has to be released to a third party: it has to be modified in order to protect the privacy of the subjects described in the dataset, according to privacy protection regulations. Therefore, certain elements will be omitted, for instance obvious identifiers such as social security number, name and address; other attributes, such as occupation and marital status, can be left intact; and other key and sensitive attributes will be modified to preserve confidentiality. For example, salaries might be truncated, ages grouped more coarsely, and zip codes swapped between pairs of records. Furthermore, some attributes on some records might be missing or intentionally removed. However, if this anonymization process is not carefully designed, attackers could use techniques to reconstruct the original dataset, as a whole or in part, also by cross-comparing it with other datasets (e.g., a similar dataset of a competitor). The DB Anonymizer allows evaluating an anonymization policy, in order to measure its robustness to dataset reconstruction techniques.
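The modifications mentioned in this use case (truncating salaries, grouping ages more coarsely, swapping zip codes between pairs of records) can be sketched in a few lines of Python. This is only an illustration of those generic techniques: the function names, the record layout and the step sizes are assumptions made for the example, and nothing here is taken from the DB Anonymizer implementation.

```python
import random


def truncate_salary(salary, step=1000):
    """Round a salary down to a multiple of `step` (step size is an assumption)."""
    return (salary // step) * step


def age_band(age, width=10):
    """Replace an exact age with a coarse band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap_zip_codes(records, rng):
    """Swap zip codes between randomly chosen pairs of records, working on copies."""
    result = [dict(r) for r in records]
    order = list(range(len(result)))
    rng.shuffle(order)
    # Walk the shuffled indices two at a time; an odd record out keeps its zip code.
    for i, j in zip(order[0::2], order[1::2]):
        result[i]["zip"], result[j]["zip"] = result[j]["zip"], result[i]["zip"]
    return result


# Hypothetical two-record dataset used only for this example.
records = [
    {"age": 34, "salary": 53750, "zip": "06560"},
    {"age": 47, "salary": 61200, "zip": "75001"},
]
coarse = [
    {"age": age_band(r["age"]), "salary": truncate_salary(r["salary"]), "zip": r["zip"]}
    for r in records
]
swapped = swap_zip_codes(coarse, random.Random(0))
print(swapped)
```

Swapping keeps the overall distribution of zip codes unchanged while breaking the link between a zip code and the rest of its record, which is why it is often preferred to deleting the column outright.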

Let us consider the following example.

Use Case for DB Anonymizer
  1. The IT-Security Expert, on behalf of the Dataset Owner, creates the Disclosure Policy.
  2. The IT-Security Expert, on behalf of the Dataset Owner, creates the DB Dump.
  3. The DB Dump and the Disclosure Policy are sent to DB Anonymizer using the evaluatePolicy operation.
  4. The DB Anonymizer sends back the Result Identifier (GID).
  5. The Dataset Owner asks DB Anonymizer for the evaluation result, using the GID.
  6. The DB Anonymizer sends back the evaluation result.
  7. The Dataset Owner modifies the DB data, according to the accepted policy.
  8. The modified DB dump is sent to the Consulting Company.

Main Interactions

DB Anonymizer Architecture

FMC Block Diagram of DB Anonymizer: User System (on the left) and DB Anonymizer service (on the right side)

The previous block diagram shows the different elements that compose the DB Anonymizer service. Starting from the DB Anonymizer block (on the right side of the diagram), the core of DB Anonymizer is the Anonymization Algorithm[1] component, which interacts closely with an internal MySQL database. The Anonymization Algorithm interacts with users through a RESTful interface. More precisely, the RESTful interface component is responsible for invoking the Anonymization Algorithm operations and providing them with user inputs.

In the left part of the block diagram, a user is depicted together with a RESTful client component, for interacting with the DB Anonymizer RESTful interface. The RESTful client can also be implemented by a traditional web browser.


UML Use Case Diagram of main DB Anonymizer functionalities

The previous use case diagram represents the main functionalities of DB Anonymizer. They can be used for analysing and reviewing a dataset's disclosure policies and, finally, for performing the anonymization operation on a dataset. More details on each functionality can be found in the FIWARE.OpenSpecification.Security.DBAnonymizer.Open_RESTful_API_Specification page.


UML Sequence Diagram of two main DB Anonymizer operations: evaluatePolicy and getPolicyResult


The previous sequence diagram shows the order in which the main DB Anonymizer operations should be invoked; the entities depicted are the same as in the previous block diagram.

The DB Anonymizer API encloses a number of methods; the core functionalities have to be invoked by users in the following order:

  1. evaluate<Function>;
  2. get<Function>Result.

Example:

  1. evaluatePolicy;
  2. getPolicyResult.

The first method starts the analysis of an anonymization policy together with the associated dataset. The RESTful interface component exposes this method, and any incoming request is routed to and served by the Anonymization Algorithm component, which creates a new computing process. The Anonymization Algorithm component immediately returns a request identifier (GID) to the RESTful component and thus to the user; the GID can be used to retrieve the analysis result. Each computation process performs its analysis on the received policy and dataset, and then writes a result to the DB. At that point, the process terminates, deleting any data it used. The second method can be invoked by users to retrieve the result of a computation, identified by a GID. The result of getPolicyResult is either the analysis result when available (from 0, impossibility, to 1, certainty), or an error code (result not ready, error in receiving parameters and so on; please refer to the RESTful API documentation for a detailed list and explanation of error codes).
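The submit-then-retrieve pattern described above can be sketched in Python as a small polling helper. The sketch is transport-agnostic on purpose: the actual URLs, parameters and error codes are defined in the RESTful API specification and are not reproduced here, so `evaluate_and_wait`, `ResultNotReady` and the in-memory `FakeService` are all hypothetical names introduced only to exercise the control flow.

```python
import time


class ResultNotReady(Exception):
    """Raised by a result fetcher while the computation is still running (hypothetical)."""


def evaluate_and_wait(submit, fetch_result, dump, policy, attempts=10, delay=0.5):
    """Submit a dump and a policy, then poll with the returned GID until a result arrives."""
    gid = submit(dump, policy)  # step 1: the evaluate call returns a GID immediately
    for _ in range(attempts):
        try:
            risk = fetch_result(gid)  # step 2: the get...Result call, keyed by the GID
        except ResultNotReady:
            time.sleep(delay)
            continue
        if not 0.0 <= risk <= 1.0:
            raise ValueError(f"risk {risk} is outside the documented 0..1 range")
        return gid, risk
    raise TimeoutError(f"no result for GID {gid} after {attempts} attempts")


class FakeService:
    """In-memory stand-in for the service, used only to exercise the polling logic."""

    def __init__(self, risk, ready_after):
        self._risk = risk
        self._ready_after = ready_after
        self._polls = {}

    def evaluate_policy(self, dump, policy):
        gid = f"gid-{len(self._polls) + 1}"
        self._polls[gid] = 0
        return gid

    def get_policy_result(self, gid):
        self._polls[gid] += 1
        if self._polls[gid] <= self._ready_after:
            raise ResultNotReady(gid)
        return self._risk


service = FakeService(risk=0.25, ready_after=2)
gid, risk = evaluate_and_wait(
    service.evaluate_policy,
    service.get_policy_result,
    dump="-- table dump placeholder --",
    policy="<Policy/>",
    delay=0.0,
)
print(gid, risk)
```

In a real client, `submit` and `fetch_result` would wrap the HTTP calls to the service and translate its "result not ready" error code into `ResultNotReady`; bounding the number of attempts keeps a client from polling forever if a computation never completes.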


Other DB Anonymizer operations follow the same structure. Please refer to FIWARE.OpenSpecification.Security.DBAnonymizer.Open RESTful API Specification for a complete list of supported operations.

Basic Design Principles

The service manipulates user data in a secure way: datasets and policies are deleted immediately after use, to keep any transmitted information confidential. A temporary MySQL table is created at the beginning of the operations and destroyed just before the final result is returned to the caller. A new process is created for each user request, to ensure data isolation during the computation phases.
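The create-use-destroy lifecycle of the temporary table can be illustrated with a context manager that always drops the table, even when the computation fails. This is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the MySQL database named in this specification, and the names `scratch_table` and `count_rows` are hypothetical. It does not show the per-request process isolation.

```python
import sqlite3
import uuid
from contextlib import contextmanager


@contextmanager
def scratch_table(conn, create_sql_template):
    """Create a uniquely named table for one request and always drop it afterwards."""
    table = f"req_{uuid.uuid4().hex}"  # a unique name keeps concurrent requests apart
    conn.execute(create_sql_template.format(table=table))
    try:
        yield table
    finally:
        # Runs on success and on failure, so no dataset content outlives the request.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.commit()


def count_rows(conn, rows):
    """Toy computation: load rows into the scratch table and count them."""
    with scratch_table(conn, 'CREATE TABLE "{table}" (gender TEXT, wine TEXT)') as table:
        conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count


conn = sqlite3.connect(":memory:")  # SQLite stand-in; the specification names MySQL
row_count = count_rows(conn, [("F", "red"), ("M", "white")])
remaining = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(row_count, remaining)
```

Placing the drop in a `finally` clause is the point of the sketch: the result is computed while the table exists, and the table is gone before the caller sees the result.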

Detailed Specifications

The following is a link to the Open Specification for this Generic Enabler. Specifications labelled as "PRELIMINARY" are considered stable but subject to minor changes derived from lessons learned during last interactions of the development of a first reference implementation planned for the current Major Release of FI-WARE.

Open API Specifications

References

  1. Trabelsi, S.; Salzgeber, V.; Bezzi, M.; Montagnon, G., "Data disclosure risk evaluation," 2009 Fourth International Conference on Risks and Security of Internet and Systems (CRiSIS), pp. 35-72, 19-22 Oct. 2009. DOI: 10.1109/CRISIS.2009.5411979

Re-utilised Technologies/Specifications

The DB Anonymizer GE is based on RESTful Design Principles. The technologies and specifications used in this GE are:

  • RESTful web services
  • HTTP/1.1
  • JSON and XML data serialization formats


Terms and definitions

This section comprises a summary of terms and definitions introduced in the previous sections. It intends to establish a vocabulary that will help to carry out discussions internally and with third parties (e.g., Use Case projects in the EU FP7 Future Internet PPP). For a summary of terms and definitions managed at overall FI-WARE level, please refer to FIWARE Global Terms and Definitions.


  • Data Disclosure: A release of information to a third party or to the public. Data can be confidential, so the operation might require the adoption of techniques that aim at preserving the confidentiality and privacy of the subjects involved, for instance hiding part of the dataset, such as names, surnames, social security numbers and so on, generally referred to as "identifiers".
  • Identifier: A piece of information that can unambiguously identify a person or an entity: for instance, names, surnames, social security and passport numbers, and so on.
  • Quasi-Identifier: Attributes such as birth date, gender and postal code that cannot unambiguously identify a person or an entity if considered in isolation, but could do so if aggregated with enough similar attributes.
  • Re-identification risk: An estimation of the risk that an attacker can reconstruct the contents of a dataset, disclosed without identifier information, by linking it with other external data sources whose attributes overlap with those of the released dataset.
  • Sensitive information: A piece of information that reveals "racial or ethnic origin, political opinions, religious or philosophical beliefs or trade union membership, data concerning health or sex life, and data relating to offences, criminal convictions or security measures.", from Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, Nov. 23, 1995).
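To make the notions of quasi-identifier and re-identification risk more concrete, the following Python sketch computes a simple uniqueness-based proxy: records are grouped by the values of the disclosed columns, and each record contributes 1 divided by the size of its group. This is a generic textbook-style illustration and is explicitly not the algorithm used by DB Anonymizer, which is described in the article cited in the References; the function name `uniqueness_risk` and the sample records are assumptions made for the example.

```python
from collections import Counter


def uniqueness_risk(records, disclosed_columns):
    """Average of 1/group_size over records grouped by their disclosed values.

    A value of 1.0 means every record is unique on the disclosed columns;
    lower values mean records hide in larger groups. Illustrative proxy only.
    """
    if not records:
        return 0.0
    keys = [tuple(r[c] for c in disclosed_columns) for r in records]
    group_sizes = Counter(keys)
    return sum(1.0 / group_sizes[k] for k in keys) / len(records)


# Hypothetical records: no column is an identifier on its own.
people = [
    {"gender": "F", "zip": "06560", "year": 1980},
    {"gender": "F", "zip": "06560", "year": 1980},
    {"gender": "M", "zip": "06560", "year": 1975},
    {"gender": "M", "zip": "75001", "year": 1975},
]

risk_gender_only = uniqueness_risk(people, ["gender"])
risk_all_columns = uniqueness_risk(people, ["gender", "zip", "year"])
print(risk_gender_only, risk_all_columns)
```

Disclosing gender alone leaves each person in a group of two, whereas combining gender, postal code and birth year isolates two of the four records, which is the aggregation effect described in the Quasi-Identifier definition.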