FIWARE.OpenSpecification.Data.CKAN R5

From FIWARE Forge Wiki

Name: FIWARE.OpenSpecification.Data.CKAN
Chapter: Data/Context Management
Catalogue-Link to Implementation: CKAN
Owner: CKAN, Jo Barratt

Preface

This document contains a self-contained open specification of a FIWARE generic enabler. Please also consult the FIWARE Product Vision, the website at http://www.fiware.org and similar pages in order to understand the complete context of the FIWARE platform.

Copyright

Copyright © 2016 by Open Knowledge International

Legal Notice

Please check the following Legal Notice to understand the rights to use these specifications.

Overview

Data is a collection of values, both qualitative and quantitative, that represent or code existing information or knowledge in a form that is suitable for better usage or processing, most often with the aid of computers or other technology (although they are not necessary). The suitable processing form can be anything from simple text files, to spreadsheets, to large databases. The processing forms would generally be referred to as data resources. Data processing allows humans to make better informed decisions about the future, and monitor and measure the past.

Datasets (collections of data resources and associated information) are protected by a set of rights comparable to copyright, called sui generis database rights. These rights restrict use and distribution of the data and allow only a limited number of people or entities access to the monitoring, measuring and decision making. Governments, who represent and serve in the name of their constituents, measure and collect large amounts of data. In many countries, this would grant them explicit rights, the sui generis database rights, over that data. This makes it possible to legally restrict access to the data, so citizens, various organisations, societies, and companies cannot process or make use of the data.

Open Data

It is easy to imagine the corruption and centralisation of power that restricting access to data enables; it can become legally impossible to monitor those who work in our name and analyse their decisions. However, it is even more important to focus on the lost opportunities of restricting access to data. Data creates value. Governments do not have to provide all services themselves. They can empower citizens, organisations, and companies to make use of the data in new ways and unlock a lot of value for everyone. A McKinsey Global Institute Report estimates that access and re-usability of data across seven sectors of government can unlock between $3 trillion and $5 trillion in economic value annually.

That's where Open Data comes in. It's a way of unlocking all that value by removing the default restrictions of the sui generis database rights. In short, Open Data means that:

“[A]nyone can freely access, use, modify, and share the data for any purpose (subject, at most, to requirements that preserve provenance and openness).”
See more about the definition of open at the Open Definition

Data Management Systems

Merely opening up the data is not enough. It is important to make it accessible and findable by those who can or want to use and process it. Data Management Systems are designed to do exactly that: make data discoverable. Normal Data Management Systems most often target internal use of the data. Open Data Management Systems, on the other hand, target and connect two audiences: publishers of data (as Open Data) and users of the data (who create value).

As such, in addition to providing publishing, search and storage features, Open Data Management Systems usually also provide means of licensing the data (i.e. putting a license on the data that ensures that users are not legally restricted by the sui generis database rights) and of connecting users and governments in various ways (both for questions and to show what the data is being used for).

Basic Concepts

Data management platforms manage datasets and resources, just like content management platforms manage web pages and blog posts. In content management platforms one has publishers of content and consumers of content. In the same way, Open Data Management platforms have publishers of data and consumers of data (open data). Open Data Management platforms exist to make the lives of both groups easier and their work more interoperable. They do that by organising data into datasets, which consist of metadata and resources. These datasets can then be found by searching for them or by browsing collections such as organisations and groups, among a few other ways.

Dataset

A dataset is a collection of data related to the same piece of information. There can be a dataset for traffic lights in Madrid, which can combine traffic light data from all roads in the city. Another example would be a dataset on Slovenian national parks, which could provide geographic map data of national parks in Slovenia. It might also have other data related to the national parks, such as an animal registry, budget data, etc.

A dataset is the entry point for users who are trying to access data. Dataset titles are commonly used as the first point of contact and should describe the dataset in such a way that the potential user will understand that this is where you get the data about something they're interested in.

Metadata

A dataset title is just one piece of data about the dataset. This data about data is usually referred to as metadata. Other metadata for a dataset are things like description, maintainer (of dataset), version, tags (list of keywords to help users find the dataset or similar datasets) and more. Basically metadata helps users understand better what the dataset is about. It's a description of the data without having to dig through the data.

Resources

The other part of a dataset is the data itself, or more precisely the data resources. Even though the data files themselves are the most important part of the dataset, a dataset can include other resources as well, such as documentation about the data, websites, source repositories for software used to handle the data, and more.

Each dataset can have many data resources; many data files and many extra resources to help make the data more usable to users. It is a good common practice to break resources up into files based on how updates are handled. Each dataset can be updated multiple times over its lifetime.

The traffic lights dataset example could break up its data files showing locations of traffic lights, by areas in Madrid so when one area gets new traffic lights, only that data file (resource) has to be updated and all the other ones are left unchanged. This means those who consume this data don't have to re-fetch all the data they already have about the unchanged areas. Budgets are usually split into fiscal years. The most recent fiscal year is likely to change while older fiscal years will probably not receive any updates.
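The structure described above can be sketched in a few lines of Python: a dataset carries metadata and a list of resources, and the traffic lights data is split per area so that one area can be updated on its own. This is only an illustration; the class names, field names, area names and URLs are assumptions and do not come from any specific platform.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One data file or supporting document belonging to a dataset."""
    name: str
    url: str
    version: int = 1


@dataclass
class Dataset:
    """Metadata (title, description, tags) plus the resources it describes."""
    title: str
    description: str = ""
    tags: list[str] = field(default_factory=list)
    resources: list[Resource] = field(default_factory=list)

    def update_resource(self, name: str, new_url: str) -> Resource:
        """Replace a single resource, leaving all others untouched."""
        for resource in self.resources:
            if resource.name == name:
                resource.url = new_url
                resource.version += 1
                return resource
        raise KeyError(name)


# Hypothetical example: one resource per area of Madrid.
traffic_lights = Dataset(
    title="Traffic lights in Madrid",
    tags=["traffic", "madrid"],
    resources=[
        Resource("centro", "https://example.org/lights/centro-v1.csv"),
        Resource("retiro", "https://example.org/lights/retiro-v1.csv"),
    ],
)

# Only the area that changed gets a new file; "retiro" stays at version 1.
traffic_lights.update_resource("centro", "https://example.org/lights/centro-v2.csv")
```

A consumer that tracks the version of each resource would only need to re-fetch the "centro" file after this update.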

Collections

Datasets can be made accessible via different means but the most common ways to access them are via organisations and groups, especially for governments, as that reflects the government structure quite nicely.

Organisations

An organisation is an entity responsible for datasets. This could be a government entity such as a public institution or agency. If we take a government as an example, a local user very likely knows what organisation is responsible for a specific dataset that this user is interested in, e.g. crime rates are very likely managed by the Police (the police could be split into national, regional and local police but we use Police here as a general term to describe a police organisation the local user knows about). This local user could then go directly to datasets managed by the Police, filter a search by this organisation, or do other things to find the right dataset about crime rates.

Groups

Sometimes many different organisations manage different parts of related datasets. Crime rates might be handled by the Police, but the judiciary system might manage a dataset about criminal convictions. These two datasets are distinct but they're still connected on a different level. Groups are a way to connect distinct datasets together under a common label (the group name). So crime rates and criminal convictions might be grouped together in a group called Crime. This again makes it very easy to find relevant datasets and make connections between them for someone who is interested in crimes in general.
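As a rough sketch of the two kinds of collection, the Python below assumes that each dataset has exactly one owning organisation and may belong to any number of groups. The dataset, organisation and group names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical catalogue entries: each dataset has one owning organisation
# and may belong to any number of groups.
datasets = [
    {"name": "crime-rates", "organisation": "police", "groups": ["crime"]},
    {"name": "criminal-convictions", "organisation": "judiciary", "groups": ["crime"]},
    {"name": "court-budgets", "organisation": "judiciary", "groups": ["budgets"]},
]


def index_by_collection(entries):
    """Build lookup tables from organisation and group name to dataset names."""
    by_organisation = defaultdict(list)
    by_group = defaultdict(list)
    for entry in entries:
        by_organisation[entry["organisation"]].append(entry["name"])
        for group in entry["groups"]:
            by_group[group].append(entry["name"])
    return dict(by_organisation), dict(by_group)


by_organisation, by_group = index_by_collection(datasets)
print(by_organisation["police"])  # ['crime-rates']
print(by_group["crime"])          # ['crime-rates', 'criminal-convictions']
```

The group lookup crosses organisational boundaries, which is what lets a user interested in crime in general find both datasets in one place.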

Generic Architecture

High level architecture

A web based open data management platform is designed to sit between a web server/proxy and data systems. The web server or proxy is set up just as for a normal website, and the right setup depends on how the site is used. A simple web server might be enough to serve the platform. However, high traffic sites need things like load balancers, caching, and a thorough security setup. Open data management platforms may be able to run on their own (with a built-in internal web server) and use nginx as a proxy to make them accessible.

When it comes to the open data management platform itself, as opposed to the server setup to make it accessible, it can generally be split into two layers:

  • Data layer
  • Application layer

Data layer

The data layer is responsible for organising the data resources in a structured way. It consists of various backend services to achieve that. The most likely backend service is a primary data store (e.g. a database) where data and associated metadata are kept; the data files can also be stored in a file system. Another likely backend service is a search engine, which allows random access to the data without going through the presupposed structure.

It's also possible to have various workers, i.e. software that runs in the background to perform long running jobs. This could be things like a service that makes copies of data files from external websites, preview creators etc.

Solutions

Database

Open Data Management platforms can use any database system. Since these platforms manage structured data, it is very likely that they use a relational database rather than, for example, a document-oriented database. Which specific relational database is used varies, and the platform may be database agnostic if database access is abstracted out. However, optimising or fine-tuning the database is often achieved by using specific features of a specific database solution. It is therefore also likely that one specific solution is officially supported or that the platform is heavily coupled with a specific database solution.

Search engine

Open Data Management platforms very likely require a specific search engine. This can be abstracted out, as with databases, but the platform is still very likely tied to a specific solution. Sometimes the engine is separated from the search platform (which may, for example, expose the search index via HTTP calls). Search engines can be tricky to use and require a lot of computer memory to operate. The reason is that the whole search index needs to fit in memory in order to achieve search speeds that surpass database queries.

Structured data storage

Some Open Data Management platforms may provide a way to store structured data in an ad hoc database. This allows, for example, appending new rows and updating existing ones without re-uploading and overwriting files. It may be implemented as a separate service that exposes an HTTP endpoint. This endpoint can then be used by the Open Data Management platform (or an extension) to add CSV and Excel files to the structured data storage.
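The idea of an ad hoc table that accepts appended and updated rows can be sketched with Python's built-in sqlite3 module. SQLite is used here only to keep the example self-contained; it is not the store a real platform would necessarily use. The table name, CSV content, helper names and the choice of the first column as primary key are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content for a resource.
CSV_TEXT = "id,street,working\n1,Gran Via,yes\n2,Calle Mayor,no\n"


def load_csv(conn, table, text):
    """Create an ad hoc table from the CSV header and insert all rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Assumption: the first column is the primary key. Identifiers are
    # interpolated directly, so this sketch must not see untrusted names.
    conn.execute(f'CREATE TABLE "{table}" ({columns}, PRIMARY KEY ("{header[0]}"))')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def upsert_row(conn, table, row):
    """Insert a new row, or replace the row with the same primary key."""
    placeholders = ", ".join("?" for _ in row)
    conn.execute(f'INSERT OR REPLACE INTO "{table}" VALUES ({placeholders})', row)


conn = sqlite3.connect(":memory:")
load_csv(conn, "traffic_lights", CSV_TEXT)
upsert_row(conn, "traffic_lights", ("2", "Calle Mayor", "yes"))   # update a row
upsert_row(conn, "traffic_lights", ("3", "Calle Alcala", "yes"))  # append a row
rows = conn.execute('SELECT * FROM "traffic_lights" ORDER BY "id"').fetchall()
```

Neither change required re-uploading the original file, which is the point of keeping the data in a table rather than only as a file.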

Application layer

The application layer is responsible for making the data from the data layer accessible to users. The application would have ways of accessing dataset collections, e.g. through organisations and groups, or expose a search interface. The application usually also exposes an API that can be used to access the data programmatically. That way software, not only humans, can use the data and update itself, e.g. when new data is released.

At its core the open data management platform is a very simple system, but it can be extended in many ways, e.g. to handle a specific file format, to expose the data in a different way or allow for user accounts on other systems.

Solutions

Web framework

Web based Open Data Management platforms can be developed in any programming language that supports network connections. The programming language used (there could even be several) is likely to have many web frameworks, each with its own style of configuring and running the application. Another choice is the architecture design approach, which could be a monolithic application or a set of microservices that deal with separate domains. There are a lot of these implementation details, which vary in as many ways as there are platforms. The key thing is that they should support HTTP and HTML.

API

Open Data Management platforms are also very likely to expose an API. There are two common approaches to APIs. One is based on RPC (Remote Procedure Calls) and exposes logical functions from within the platform, like creating datasets, adding resources etc., via the web in a secure way. The other approach is a RESTful API, which exposes resources and uses HTTP verbs (GET, POST, PUT, DELETE, PATCH etc.). The API is very likely maintained as part of the platform's codebase and is an integral part of it.
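The difference between the two styles is easiest to see side by side. The sketch below only builds request descriptions and sends nothing over the network. The base URL, the URL paths, and the action and collection names are hypothetical and are not taken from any particular platform's API.

```python
import json

BASE_URL = "https://data.example.org"  # hypothetical site


def rpc_request(action, params):
    """RPC style: the function name is in the URL, the arguments in the body."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/api/action/{action}",
        "body": json.dumps(params, sort_keys=True),
    }


def rest_request(verb, collection, item_id=None, payload=None):
    """REST style: the URL names a resource, the HTTP verb names the operation."""
    url = f"{BASE_URL}/api/{collection}"
    if item_id is not None:
        url += f"/{item_id}"
    body = json.dumps(payload, sort_keys=True) if payload is not None else None
    return {"method": verb, "url": url, "body": body}


# Creating a dataset in both styles, then deleting it the RESTful way.
create_rpc = rpc_request("dataset_create", {"name": "warandpeace"})
create_rest = rest_request("POST", "datasets", payload={"name": "warandpeace"})
delete_rest = rest_request("DELETE", "datasets", item_id="warandpeace")
```

In the RPC style every call looks the same on the wire and only the action name changes, while the RESTful style relies on the combination of verb and resource URL.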

Extensions

Open Data Management platforms need to be able to support various use cases which cannot all be tackled by the core codebase. So it's very likely that the core application is a relatively small web application that includes hooks in various places to allow developers and site maintainers to develop extensions and seamlessly integrate them with the application. For example, the previously mentioned structured data store might be implemented as an extension that creates an ad hoc database for structured data on top of the platform. It then hooks into dataset creation to push data to the structured data store behind the scenes, just as if it were a part of the dataset creation process.

Open Data Management platforms that allow extensions are likely to have some extensions maintained by the core project and others maintained outside of it.
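A hook mechanism of this kind can be sketched as a small core that fires named events and an extension that subscribes to them. The class names, the event name "after_dataset_create" and the URLs below are invented for this illustration and do not describe any real platform's plugin interface.

```python
from collections import defaultdict


class Core:
    """A deliberately small core that only stores datasets and fires hooks."""

    def __init__(self):
        self.datasets = {}
        self._hooks = defaultdict(list)

    def register(self, event, callback):
        """Let an extension subscribe to a named event."""
        self._hooks[event].append(callback)

    def create_dataset(self, name, resources):
        dataset = {"name": name, "resources": list(resources)}
        self.datasets[name] = dataset
        # Extensions run after the core has done its own work.
        for callback in self._hooks["after_dataset_create"]:
            callback(dataset)
        return dataset


class StructuredStoreExtension:
    """Hypothetical extension that picks up CSV resources for a structured store."""

    def __init__(self, core):
        self.pushed = []
        core.register("after_dataset_create", self.on_create)

    def on_create(self, dataset):
        for url in dataset["resources"]:
            if url.endswith(".csv"):
                self.pushed.append(url)


core = Core()
store = StructuredStoreExtension(core)
core.create_dataset(
    "budget",
    ["https://example.org/2016.csv", "https://example.org/readme.html"],
)
```

The core never refers to the extension by name, so extensions can be added or removed without changing the core code.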

Figure: CKAN architecture overview

Main Interactions

Main interactions with the platform go through the application layer, either via the web interface or the API. The web interface is designed for humans while the API is designed for systems. The interactions revolve around dataset management (creating, reading, updating and deleting) and accessibility (searching and browsing). The exact details of how this is implemented will depend on the open data management platform. For creating and updating datasets certain metadata is required (which is made accessible when reading). Because the implementation varies, the metadata might vary, but a platform is very likely to require at least the metadata described below.

Dataset Management

Creating/updating datasets

The most common parameters required for dataset creation/editing are:

  • name (string): The name of the new dataset, must be between 2 and 100 characters long and contain only lowercase alphanumeric characters, - and _, e.g. 'warandpeace'.
  • title (string): The title of the dataset (optional, default: same as name)
  • author (string): The name of the dataset’s author
  • author_email (string): The email address of the dataset’s author
  • license_id (license id string): The id of the dataset’s license (required for open data)
  • notes (string): A description of the dataset
  • url (string): A URL for the dataset’s source
  • resources (list): The dataset’s resources. These require parameters of their own (see parameters for resources below)
  • tags (list): The dataset’s tags to make it more easily findable (usually optional)
  • groups (list): The groups to which the dataset belongs (usually optional).
  • owner_org (string): The id of the dataset’s owning organization (usually optional)

As mentioned, each resource will have its own set of metadata:

  • dataset_id (string): The id of the dataset that the resource should be added to
  • url (string): The URL of the resource
  • description (string): A description of the resource
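The naming rule and the default for the title can be checked before a dataset is created. The following is a minimal sketch under a few assumptions: "lowercase alphanumeric" is read as ASCII letters and digits, every resource is expected to have a url, and the function names and the license id "cc-by" are made up for the example.

```python
import re

# 2 to 100 characters: lowercase ASCII letters, digits, "-" and "_".
NAME_PATTERN = re.compile(r"[a-z0-9_-]{2,100}")


def validate_dataset(params):
    """Return a list of problems found in a dataset creation payload."""
    problems = []
    name = params.get("name", "")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be 2-100 lowercase alphanumeric characters, - or _")
    if not params.get("license_id"):
        problems.append("license_id is required for open data")
    # Assumption: a resource without a url is treated as invalid.
    for index, resource in enumerate(params.get("resources", [])):
        if not resource.get("url"):
            problems.append(f"resource {index} has no url")
    return problems


def with_defaults(params):
    """Fill in the optional title, which defaults to the name."""
    filled = dict(params)
    filled.setdefault("title", filled.get("name", ""))
    return filled


good = {
    "name": "warandpeace",
    "license_id": "cc-by",  # hypothetical license id
    "resources": [{"url": "https://example.org/warandpeace.csv"}],
}
bad = {"name": "War And Peace"}
```

Here `validate_dataset(good)` returns an empty list, while `validate_dataset(bad)` reports both the invalid name and the missing license.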

Deleting datasets

Deleting a dataset requires 'edit' permissions on the dataset. This can be implemented in different ways; most often the user who created the dataset can also delete it, but it's also possible that the user deleting the dataset can or must be an organization editor/administrator or a system administrator, or that the dataset must not belong to an organization.

It is very likely that the “deleted” datasets are not completely deleted. These would just be hidden, so they do not show up in any searches, etc. However, by visiting the URL for the dataset’s page, it can still be seen (by users with appropriate authorization), and “undeleted” if necessary.
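This "soft delete" behaviour can be sketched with a state flag on each dataset. The class, method and state names below are assumptions chosen for the illustration, and the authorisation check is reduced to a single boolean.

```python
class Catalogue:
    """Toy catalogue where deleting only flips a state flag."""

    def __init__(self):
        self._datasets = {}

    def add(self, name):
        self._datasets[name] = {"name": name, "state": "active"}

    def delete(self, name):
        # Soft delete: the record stays in storage.
        self._datasets[name]["state"] = "deleted"

    def undelete(self, name):
        self._datasets[name]["state"] = "active"

    def search(self):
        """Deleted datasets never appear in listings or searches."""
        return [d["name"] for d in self._datasets.values() if d["state"] == "active"]

    def get(self, name, authorised=False):
        """Direct access still works, but only for authorised users."""
        dataset = self._datasets[name]
        if dataset["state"] == "deleted" and not authorised:
            raise PermissionError(name)
        return dataset


catalogue = Catalogue()
catalogue.add("crime-rates")
catalogue.delete("crime-rates")
hidden = catalogue.search()                                     # no longer listed
still_there = catalogue.get("crime-rates", authorised=True)     # but still stored
catalogue.undelete("crime-rates")
```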

Accessibility

Reading datasets

Datasets can be public or private (e.g. within an organisation a dataset might be private until a responsible person within the organisation approves it and thereby makes it public). Therefore sometimes when reading datasets, user authentication is necessary (to access the private datasets). Irrespective of whether a public or a private dataset is accessed, all of the metadata listed above will be accessible.

Searching for datasets

Finding datasets is one of the key features of a data management platform, and thus search is a core service it provides. Searching for any combination of search words (e.g. “health”, “transport”, etc) in a search box should present relevant results from which it is possible to:

  • View more pages of results
  • Repeat the search, altering some terms
  • Restrict the search to datasets with particular tags, data formats, etc using the filters (also known as facets)

If there are a large number of results, the filters can be very helpful. It should be possible to combine filters, selectively adding and removing them, and modify and repeat the search with existing filters still in place.
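Combining a text query with filters that can be added and removed independently might look like the sketch below. It is a toy linear scan over a hard-coded list rather than a real search engine, and the datasets, facet names and matching rules are assumptions made for the example.

```python
from collections import Counter

# Hypothetical catalogue entries.
DATASETS = [
    {"title": "Health spending", "tags": {"health", "budget"}, "format": "CSV"},
    {"title": "Hospital locations", "tags": {"health", "geo"}, "format": "GeoJSON"},
    {"title": "Transport budget", "tags": {"transport", "budget"}, "format": "CSV"},
]


def search(query="", filters=None):
    """Match every query word against the title, then apply all active filters."""
    filters = filters or {}
    words = query.lower().split()
    results = []
    for dataset in DATASETS:
        title = dataset["title"].lower()
        if not all(word in title for word in words):
            continue
        if "tag" in filters and filters["tag"] not in dataset["tags"]:
            continue
        if "format" in filters and filters["format"] != dataset["format"]:
            continue
        results.append(dataset)
    return results


def facet_counts(results):
    """Count how many results fall under each value of the format facet."""
    return Counter(dataset["format"] for dataset in results)


filters = {"tag": "budget"}
first = search(filters=filters)       # filter only
filters["format"] = "CSV"             # add a second filter
second = search("health", filters)    # new query, existing filters kept
del filters["tag"]                    # remove the first filter again
third = search(filters=filters)
```

The facet counts are what a user interface would typically show next to each filter value, so users can see how many results a filter would leave.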

Listing Datasets

Searching is very likely to be the main way for users to find datasets, but sometimes users just want to browse. A listing of datasets therefore needs to be accessible, where users can browse datasets, order them in different ways, or apply facets, just like in search, to restrict the browsing. The most common facets would be organisation and groups (cross-organisational categories of data).

Basic Design Principles

Basic design principles will vary based on the open data management platform, but for a web based platform that can be installed by anyone and needs to work for different use cases, these are the main principles one is likely to encounter:

Small core application

The main application should only focus on allowing users to create, read, update and delete datasets, along with browsing and searching for them.

Extensions provide more functionality

It should be possible to extend the main application and provide various support based on the data needs of each user/maintainer.

De-centralized deployment

Each user/maintainer is different and needs different ways of accessing the data. Instead of having one instance for everyone, users with similar needs and data requirements (most often the same government region) set up their own. This means there will be many different instances but it's possible to have them all talk to each other and exchange data.

Low-level API makes web site interactions programmable

Even though many users will at first come via the web interface to find the data, they are very likely to build software around that data. For that reason they need a powerful API in order to have the software access the data, update itself and more.

Detailed Specifications

Official CKAN documentation

Re-utilised Technologies/Specifications

Reusable technology components for the data layer would be ready-made software for the different solutions:

  • PostgreSQL for the database as well as the structured data storage
  • Solr for the search engine (on top of Lucene)

Components of the application layer are harder to replace with reusable technology, except for HTTP server solutions, for which ready-made software exists.

Terms and definitions

This section comprises a summary of terms and definitions introduced during the previous sections. It intends to establish a vocabulary that will help to carry out discussions internally and with third parties (e.g., Use Case projects in the EU FP7 Future Internet PPP). For a summary of terms and definitions managed at overall FIWARE level, please refer to FIWARE Global Terms and Definitions

  • Data refers to information that is produced, generated, collected or observed that may be relevant for processing, carrying out further analysis and knowledge extraction. Data in FIWARE has an associated data type and a value. FIWARE will support a set of built-in basic data types similar to those existing in most programming languages. Values linked to basic data types supported in FIWARE are referred to as basic data values. As an example, basic data values like ‘2’, ‘7’ or ‘365’ belong to the integer basic data type.
  • A data element refers to data whose value is defined as consisting of a sequence of one or more <name, type, value> triplets referred to as data element attributes, where the type and value of each attribute is either mapped to a basic data type and a basic data value or mapped to the data type and value of another data element.
  • Context in FIWARE is represented through context elements. A context element extends the concept of data element by associating an EntityId and EntityType to it, uniquely identifying the entity (which in turn may map to a group of entities) in the FIWARE system to which the context element information refers. In addition, there may be some attributes, as well as metadata associated to attributes, that we may define as mandatory for context elements as compared to data elements. Context elements are typically created containing the value of attributes characterizing a given entity at a given moment. As an example, a context element may contain values of some of the attributes “last measured temperature”, “square meters” and “wall color” associated to a room in a building. Note that there might be many different context elements referring to the same entity in a system, each containing the value of a different set of attributes. This allows different applications to handle different context elements for the same entity, each containing only those attributes of that entity relevant to the corresponding application. It also allows representing updates on a set of attributes linked to a given entity: each of these updates can take the form of a context element and contain only the value of those attributes that have changed.
  • An event is an occurrence within a particular system or domain; it is something that has happened, or is contemplated as having happened in that domain. Events typically lead to creation of some data or context element describing or representing the events, thus allowing them to be processed. As an example, a sensor device may be measuring the temperature and pressure of a given boiler, sending a context element every five minutes associated to that entity (the boiler) that includes the value of these two attributes (temperature and pressure). The creation and sending of the context element is an event, i.e., what has occurred. Since the data/context elements that are generated linked to an event are the way events become visible in a computing system, it is common to refer to these data/context elements simply as "events".
  • A data event refers to an event leading to creation of a data element.
  • A context event refers to an event leading to creation of a context element.
  • An event object is a programming entity that represents an event in a computing system [EPIA], such as event-aware GEs. Event objects make it possible to perform operations on events, also known as event processing. Event objects are defined as a data element (or a context element) representing an event to which a number of standard event object properties (similar to a header) are associated internally. These standard event object properties support certain event processing functions.
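The relationship between data elements and context elements defined above can be sketched as Python data classes. This is an illustration of the definitions only, not FIWARE's actual data model or API; the class names, field names and the room example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Attribute:
    """One <name, type, value> triplet."""
    name: str
    type: str
    value: Any


@dataclass
class DataElement:
    """A sequence of one or more attributes."""
    attributes: list[Attribute] = field(default_factory=list)


@dataclass
class ContextElement(DataElement):
    """A data element tied to a specific entity by id and type."""
    entity_id: str = ""
    entity_type: str = ""


def changed_attributes(previous, current):
    """Build an update that contains only attributes that are new or changed."""
    old = {a.name: a.value for a in previous.attributes}
    changed = [
        a for a in current.attributes
        if a.name not in old or old[a.name] != a.value
    ]
    return ContextElement(changed, current.entity_id, current.entity_type)


# Hypothetical room entity measured at two moments in time.
before = ContextElement(
    [Attribute("temperature", "float", 21.5), Attribute("wall_color", "string", "white")],
    entity_id="room-1",
    entity_type="Room",
)
after = ContextElement(
    [Attribute("temperature", "float", 23.0), Attribute("wall_color", "string", "white")],
    entity_id="room-1",
    entity_type="Room",
)
update = changed_attributes(before, after)
```

The resulting update is itself a context element for the same entity that carries only the temperature attribute, matching the description of updates that contain only the attributes that have changed.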