
BigData Analysis Open RESTful API Specification


Introduction to the Big Data API

This document is intended to describe the Open RESTful API Specification of the Big Data GE. Nevertheless, please notice that not all the APIs planned for the GE are RESTful.

The Big Data GE is mainly based on the Hadoop ecosystem, where several tools, APIs and other interfaces have been specified and widely used over time by analysts and data scientists. This so-called Hadoop Open API Specification does not always follow a REST style, and thus, by extension, not all the Big Data GE APIs are RESTful.

Other interfaces, especially those related to proprietary modules added to the Hadoop ecosystem, are, however, fully RESTful APIs.

Big Data API Core

The Big Data GE API is a set of sub-APIs, each one addressing one of the different aspects of a Big Data platform:

  • Admin API:
    RESTful, resource-oriented API accessed via HTTP that uses JSON-based representations for information interchange. This API is used for administration purposes, allowing the management of users, storage or computing clusters.
  • Hadoop commands:
    Command-line interface used both to administer HDFS and to launch MapReduce jobs.
  • Hadoop API:
    Java-based programming API used both to administer HDFS and to launch MapReduce jobs.
  • HDFS RESTful APIs:
    RESTful, resource-oriented API accessed via HTTP that uses JSON-based representations for information interchange. This API is used for HDFS administration. It includes WebHDFS, HttpFS and the Infinity Protocol.
  • Querying systems:
    SQL-like languages such as HiveQL, executed locally in a CLI interpreter or remotely through a server.
  • Data injectors:
    A mix of technologies used to inject data into the storage mechanisms of the GE.
  • Oozie API:
    Java-based programming API used to orchestrate the execution of MapReduce jobs, Hive tasks, scripts and user-defined applications.
  • Oozie Webservices API:
    RESTful-like, resource-oriented API accessed via HTTP that uses JSON-based representations for information interchange. This API is used for Oozie-based job and task orchestration.

Intended Audience

This specification is intended for both software developers and reimplementers of this API. For the former, this document describes how to interact with the GE in order to perform Big Data operations. For the latter, it specifies how to write their own implementations of the APIs.

In order to use this specification, the reader should first have a general understanding of the Big Data GE architecture supporting the API.

API Change History

This version of the Big Data API replaces and obsoletes all previous versions. The most recent changes are described in the table below:

Revision Date Changes Summary
Apr 11, 2014
  • First version

How to Read This Document

All FI-WARE RESTful API specifications will follow the same list of conventions and will support certain common aspects. Please check Common aspects in FI-WARE Open Restful API Specifications.

For a description of some terms used along this document, see the Basic Concepts section in the Big Data GE architecture.

Additional Resources

N.A.

General Big Data GE API Information

Resources Summary

The following applies to the Admin API of the Big Data GE. For other legacy RESTful APIs, please refer to the appropriate link. For non-RESTful APIs, nothing has to be provided.

Representation Format

The Big Data GE API, in its RESTful parts, supports JSON-based representations for information interchange.

Resource Identification

  • Task identifiers follow the TaskID specification in the Hadoop API.
  • Cluster identifiers are strings without a fixed format.
  • Realm identifiers are strings without a fixed format.
  • User identifiers are strings without a fixed format.

Links and References

N.A.

Limits

N.A.

Versions

There is a single version of this API, so nothing has to be said about deprecated operations.

Extensions

N.A.

Faults

Synchronous Faults

N.A.

Asynchronous Faults

N.A.

API Operations

Admin API

This administration API, which is mostly wrapped by the CLI, is aimed at managing the platform in terms of user and cluster creation/modification/deletion and task and storage management. It is served by the Master node of the platform, and is therefore common to all users.

/info

Verb URI Description Request payload Response payload
GET /cosmos/v1/info Get general information about the user whose credentials are used -
 {
   "resources": {
     "availableForUser": "int",
     "available": "int",
     "groupConsumption": "int",
     "availableForGroup": "int",
     "individualConsumption": "int"
   },
   "individualQuota": "Object",
   "handle": "string",
   "clusters": {
     "accessible": [
       {
         "id": "string"
       }
     ],
     "owned": [
       {
         "id": "string"
       }
     ]
   },
   "profileId": "long",
   "group": {
     "guaranteedQuota": "Object",
     "name": "string"
   }
 }
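
As an illustration, the following is a minimal sketch of calling this endpoint with Python and the requests library. The base URL and the use of the user's API key and secret as HTTP basic auth credentials are assumptions made for the example, not part of this specification; adapt them to your deployment.

# Minimal sketch: fetch the caller's general information from the Admin API.
# BASE_URL, API_KEY and API_SECRET are hypothetical, deployment-specific values.
import requests

BASE_URL = "https://cosmos-master.example.com"
API_KEY = "my-api-key"
API_SECRET = "my-api-secret"

response = requests.get(BASE_URL + "/cosmos/v1/info", auth=(API_KEY, API_SECRET))
response.raise_for_status()

info = response.json()
print(info["handle"], info["resources"]["availableForUser"])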

/profile

Verb URI Description Request payload Response payload
GET /cosmos/v1/profile Get the user profile details -
 {
   "handle": "string",
   "email": "string",
   "keys": [
     {
       "name": "string",
       "signature": "string"
     }
   ]
 }
PUT /cosmos/v1/profile Update the user profile details
Request payload:
 {
   "handle": "string",
   "email": "string",
   "keys": [
     {
       "name": "string",
       "signature": "string"
     }
   ]
 }
Response payload:
 {
   "handle": "string",
   "email": "string",
   "keys": [
     {
       "name": "string",
       "signature": "string"
     }
   ]
 }
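
A similarly hedged sketch of the read-modify-write cycle for the profile (here, registering an extra SSH key) could look as follows; again, the base URL and credentials are hypothetical.

# Sketch: read the current profile, append a public key and write it back.
# BASE_URL and AUTH are hypothetical, deployment-specific values.
import requests

BASE_URL = "https://cosmos-master.example.com"
AUTH = ("my-api-key", "my-api-secret")

profile = requests.get(BASE_URL + "/cosmos/v1/profile", auth=AUTH).json()
profile["keys"].append({"name": "laptop", "signature": "ssh-rsa AAAA..."})

updated = requests.put(BASE_URL + "/cosmos/v1/profile", json=profile, auth=AUTH)
updated.raise_for_status()
print(updated.json())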

/storage

Verb URI Description Request payload Response payload
GET /cosmos/v1/storage Persistent storage connection details -
 {
   "location": "URI",
   "user": "string"
 }

/task

Verb URI Description Request payload Response payload
GET /cosmos/v1/task/{id} Get task details -
 {
   "id": "int",
   "status": "TaskStatus",
   "resource": "string"
 }

/cluster

Verb URI Description Request payload Response payload
GET /cosmos/v1/cluster List clusters -
 {
   "clusters": [
     {
       "clusterReference": {
         "assignment": {
           "creationDate": "Date",
           "ownerId": "long",
           "clusterId": {
             "id": "string"
           }
         },
         "description": "ClusterDescription"
       },
       "href": "string"
     }
   ]
 }
POST /cosmos/v1/cluster Create a new cluster
 {
   "optionalServices": [
     "string"
   ],
   "name": {
     "underlying": "string"
   },
   "size": "int"
 }
-
POST /cosmos/v1/cluster/{id}/add_user Add users to existing cluster
 {
   "user": "string"
 }
-
GET /cosmos/v1/cluster/{id} Get cluster machines -
 {
   "id": "string",
   "users": [
     {
       "publicKey": "string",
       "enabled": "boolean",
       "sudoer": "boolean",
       "username": "string",
       "hdfsEnabled": "boolean",
       "sshEnabled": "boolean"
     }
   ],
   "services": [
     "string"
   ],
   "stateDescription": "string",
   "name": {
     "underlying": "string"
   },
   "slaves": [
     {
       "hostname": "string",
       "ipAddress": "string"
     }
   ],
   "state": "string",
   "href": "string",
   "master": {
     "hostname": "string",
     "ipAddress": "string"
   },
   "size": "int"
 }
POST /cosmos/v1/cluster/{id}/remove_user Remove a user from an existing cluster
 {
   "user": "string"
 }
-
POST /cosmos/v1/cluster/{id}/terminate Terminate an existing cluster
 {
   "message": "string"
 }
-
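
Putting these operations together, a hedged sketch of the cluster lifecycle (create, share, terminate) might read as follows. The base URL, credentials, service name and user handle are placeholders, and the way a cluster id is picked from the listing is only illustrative.

# Sketch of the cluster lifecycle using the endpoints listed above.
# BASE, AUTH, the service name and the user handle are placeholders.
import requests

BASE = "https://cosmos-master.example.com"
AUTH = ("my-api-key", "my-api-secret")

# Create a 4-node cluster with one optional service.
spec = {"name": {"underlying": "test-cluster"},
        "size": 4,
        "optionalServices": ["HIVE"]}
requests.post(BASE + "/cosmos/v1/cluster", json=spec, auth=AUTH).raise_for_status()

# List clusters and take the id of one of them (the ordering is not specified).
clusters = requests.get(BASE + "/cosmos/v1/cluster", auth=AUTH).json()["clusters"]
cluster_id = clusters[-1]["clusterReference"]["assignment"]["clusterId"]["id"]

# Grant another user access to the cluster, then terminate it.
requests.post(BASE + "/cosmos/v1/cluster/" + cluster_id + "/add_user",
              json={"user": "jsmith"}, auth=AUTH)
requests.post(BASE + "/cosmos/v1/cluster/" + cluster_id + "/terminate",
              json={"message": "no longer needed"}, auth=AUTH)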

/services

Verb URI Description Request payload Response payload
GET /cosmos/v1/services List services that can be deployed in the private cluster - The response is a bare array of strings (e.g. ["service1", "service2"])

/maintenance

Verb URI Description Request payload Response payload
GET /cosmos/v1/maintenance Check if some maintenance is being done on the platform - boolean
PUT /cosmos/v1/maintenance Change the maintenance status boolean boolean
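
For completeness, a sketch of checking and toggling the maintenance flag; note that both payloads are bare JSON booleans. The base URL and credentials are placeholders, and changing the flag is presumably restricted to operators.

# Sketch: check and change the platform maintenance status.
# BASE and AUTH are hypothetical, deployment-specific values.
import requests

BASE = "https://cosmos-master.example.com"
AUTH = ("my-api-key", "my-api-secret")

in_maintenance = requests.get(BASE + "/cosmos/v1/maintenance", auth=AUTH).json()
print("maintenance mode:", in_maintenance)

# Enable maintenance mode; the request body is the bare boolean 'true'.
requests.put(BASE + "/cosmos/v1/maintenance", json=True, auth=AUTH)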

/user

Verb URI Description Request payload Response payload
POST /admin/v1/user Create a new user account
Request payload:
 {
   "sshPublicKey": "string",
   "authRealm": "string",
   "handle": "string",
   "email": "string",
   "authId": "string"
 }
Response payload:
 {
   "handle": "string",
   "apiSecret": "string",
   "apiKey": "string"
 }
DELETE /admin/v1/user/{realm}/{id} Delete an existing user account -
 {
   "message": "string"
 }
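
A hedged sketch of provisioning and removing an account through these admin operations; the base URL, admin credentials, realm and identifiers are placeholders.

# Sketch: create and later delete a user account via the admin API.
# BASE, ADMIN_AUTH, the realm and the identifiers are placeholders.
import requests

BASE = "https://cosmos-master.example.com"
ADMIN_AUTH = ("admin-api-key", "admin-api-secret")

new_user = {
    "authRealm": "FIWARE",
    "authId": "jsmith-idm-id",
    "handle": "jsmith",
    "email": "jsmith@example.com",
    "sshPublicKey": "ssh-rsa AAAA... jsmith@laptop",
}
created = requests.post(BASE + "/admin/v1/user", json=new_user, auth=ADMIN_AUTH)
created.raise_for_status()
credentials = created.json()
print("API key/secret:", credentials["apiKey"], credentials["apiSecret"])

# Later on, remove the account; the realm and auth id identify the user.
requests.delete(BASE + "/admin/v1/user/FIWARE/jsmith-idm-id", auth=ADMIN_AUTH)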

Hadoop commands

Hadoop can be completely controlled by issuing commands through a shell. The complete list of Hadoop commands can be found in the official documentation.

It is worth mentioning the file system subcommand (hadoop fs). This command is used for HDFS administration (file or directory creation, deletion, modification, etc.) and offers a complete suite of options that can be found in the official documentation as well.
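
Although these commands are normally typed directly in a shell, they can also be scripted. Below is a small sketch that wraps the hadoop fs subcommand from Python; it assumes the hadoop CLI is installed and on the PATH of a machine with access to the cluster, and the paths and file names are placeholders.

# Sketch: drive the 'hadoop fs' subcommand from a script.
# Assumes the hadoop CLI is installed and reachable on the PATH.
import subprocess

def hadoop_fs(*args):
    """Run 'hadoop fs <args>' and return its standard output as text."""
    return subprocess.check_output(["hadoop", "fs"] + list(args), text=True)

# Create a directory, upload a local file into it and list the result.
hadoop_fs("-mkdir", "-p", "/user/jsmith/input")
hadoop_fs("-put", "local_data.txt", "/user/jsmith/input/")
print(hadoop_fs("-ls", "/user/jsmith/input"))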

Hadoop API

The Hadoop API is a Java-based programming API addressing several Hadoop topics, but it is mainly used for:

  • Programming custom HDFS clients.
  • Programming custom MapReduce applications.

You can explore this API in the official Javadoc.

HDFS RESTful APIs

These API specifications are given by the Hadoop ecosystem, since WebHDFS and HttpFS have already been defined by Apache and the Infinity Protocol is an augmented version of WebHDFS.

WebHDFS

While the Hadoop commands are useful for users with authorized access to the machines within the cluster, or for applications running within it, sometimes it is necessary for an external entity to access the HDFS resources. To achieve this, Hadoop provides an HTTP REST API supporting a complete interface for HDFS. This interface allows for creating, renaming, reading, etc. files and folders, in addition to many other common file system operations (permissions, ownership, etc.). It is available on TCP port 50070 of the NameNode.

Full details for this API can be found in the official documentation [1].

Please observe that some actions, such as CREATE or APPEND, are two-step operations: a first request is sent to the NameNode of the cluster, and the response contains a redirection URL to the final DataNode where the data must be stored.

Please notice as well that there is no operation for uploading and running jobs.
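
As a concrete illustration of the two-step flow, here is a minimal Python sketch of a CREATE operation; the NameNode host, user name and file paths are placeholders.

# Sketch of the two-step WebHDFS CREATE operation described above.
# The NameNode host, user name and paths are placeholders.
import requests

NAMENODE = "http://namenode:50070"
params = {"op": "CREATE", "user.name": "turing", "overwrite": "true"}

# Step 1: ask the NameNode where to write. It answers with a redirection
# to a DataNode, which we capture instead of following automatically.
step1 = requests.put(NAMENODE + "/webhdfs/v1/user/turing/new_file.txt",
                     params=params, allow_redirects=False)
datanode_url = step1.headers["Location"]

# Step 2: send the actual file content to the DataNode URL.
with open("local_file.txt", "rb") as f:
    step2 = requests.put(datanode_url, data=f)
step2.raise_for_status()  # 201 Created on success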

HttpFS

HttpFS is another implementation of the WebHDFS API (on TCP port 14000 of the server, which usually runs on the NameNode). While the WebHDFS implementation implies full knowledge of, and access to, the whole cluster, HttpFS hides these details by behaving as a gateway between the HTTP client and WebHDFS. Two-step operations continue to work in the same way, but now the redirection URL points to the same HttpFS server; the first and second requests are distinguished by a new parameter, 'data=true', added in the redirection.

HttpFS is not natively available for CDH3 (it is distributed from CDH4 onwards), but there is a backport for CDH3 on GitHub [2].

Infinity Protocol

The Infinity Server is an Ambari service that runs on all nodes in the Infinity cluster. Deployment of this service installs the corresponding executables and blocks all traffic (except traffic from any Infinity server) to the following ports:

  • Namenode metadata service: 8020 and 9000
  • Datanode data transfer: 50010
  • Datanode metadata operations: 50020
  • WebHDFS ports: 50070 (namenode), 50075 (datanode)

The Infinity Server listens on a port of your choice and provides an augmented WebHDFS REST API that uses HTTPS to prevent data sniffing, together with an authentication mechanism. The authentication mechanism can be one of these two (both are illustrated in the sketch after this list):

  • A query parameter called secret to authenticate the user through its cluster secret. The cluster secret is a long ASCII string which authenticates the user in the context of a given cluster:
 http://infinity:123/webhdfs/v1/user?user.name=turing&op=LISTSTATUS&secret=bac69680bafb11e19fd7c2b027b06d18
  • The API key and API secret of the user through an HTTP basic auth header:
 Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
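
Both options are easy to exercise from a script. In the sketch below the host, port, user name and credentials are placeholders; everything else follows the augmented WebHDFS API described above.

# Sketch of the two Infinity Protocol authentication options shown above.
# Host, port, user name and credentials are placeholders.
import requests

INFINITY = "https://infinity:8443"

# Option 1: the cluster secret passed as the 'secret' query parameter.
r1 = requests.get(INFINITY + "/webhdfs/v1/user",
                  params={"user.name": "turing", "op": "LISTSTATUS",
                          "secret": "bac69680bafb11e19fd7c2b027b06d18"})

# Option 2: the user's API key and secret as HTTP basic auth, which
# requests encodes into an 'Authorization: Basic ...' header.
r2 = requests.get(INFINITY + "/webhdfs/v1/user",
                  params={"user.name": "turing", "op": "LISTSTATUS"},
                  auth=("my-api-key", "my-api-secret"))

print(r1.status_code, r2.status_code)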

Querying systems

These API specifications are given by the Hadoop ecosystem, since Hive and Pig have already been defined by Apache.

Hive

The clusters created with the Big Data platform provide the Hive querying system. As stated in the Hive home page [3], the Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The Apache Hive wiki provides full information about how to use this tool.
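
As an example of remote querying, the sketch below runs a HiveQL statement through a Hive server using the third-party PyHive client; neither the client library nor the host, port, user or table are part of the GE specification, they are just illustrative assumptions (the Hive CLI or a JDBC client would work equally well).

# Sketch: run a HiveQL query remotely through a Hive server using the
# third-party PyHive client. Host, port, user and table are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="turing", database="default")
cursor = conn.cursor()

# A typical HiveQL query projecting structure onto data stored in HDFS.
cursor.execute("SELECT origin, COUNT(*) AS flights "
               "FROM airline_data GROUP BY origin LIMIT 10")
for row in cursor.fetchall():
    print(row)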

Pig

Pig is another querying tool, similar to Hive. According to its home page [4], Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

The getting started guide explains how to run Pig, how to write Pig scripts, etc.

Data injectors

Sqoop

According to its user guide [5], Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Full details can be found, as said, in the Sqoop user guide.
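
Sqoop is driven from the command line; the sketch below simply wraps a typical import invocation from a script. The JDBC connection string, credentials, table and target directory are placeholders, and the sqoop CLI must be installed on the machine.

# Sketch: import a relational table into HDFS with Sqoop.
# Connection string, credentials, table and target directory are placeholders.
import subprocess

subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",
    "--username", "reporting",
    "--password", "secret",          # for real use, prefer --password-file
    "--table", "orders",
    "--target-dir", "/user/jsmith/orders",
])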

Cygnus

Cygnus is the connector allowing Publish Subscribe Context Broker GE context data to be persisted in the Big Data GE. It is a component designed to receive notifications about certain context data to which Cygnus has previously subscribed. Thus, the interface for this connector is just a simple HTTP server listening for REST-based notifications coming from the Publish Subscribe Context Broker GE.

Orchestration

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts) [6].

Writing Oozie applications is as simple as packaging the MapReduce jobs, Hive/Pig scripts, etc. to be run and defining a Workflow. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
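
Besides the Java client API, workflows can also be submitted through the Oozie Web Services API mentioned earlier. The sketch below follows the usual v1 endpoints of that API; the Oozie host and port, the HDFS paths and the user name are placeholders, and the exact configuration properties should be checked against the Oozie documentation for your version.

# Sketch: submit and start a workflow through the Oozie Web Services API.
# The Oozie URL, HDFS paths and user name are placeholders; verify the
# configuration property names against your Oozie version's documentation.
import requests

OOZIE = "http://oozie-host:11000/oozie"

config = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property><name>user.name</name><value>turing</value></property>
  <property><name>oozie.wf.application.path</name>
            <value>hdfs://namenode:8020/user/turing/my-wf</value></property>
</configuration>"""

# Submitting with action=start creates the workflow job and runs it.
submitted = requests.post(OOZIE + "/v1/jobs", params={"action": "start"},
                          data=config,
                          headers={"Content-Type": "application/xml;charset=UTF-8"})
job_id = submitted.json()["id"]

# Poll the job for its current status.
info = requests.get(OOZIE + "/v1/job/" + job_id, params={"show": "info"}).json()
print(job_id, info["status"])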

Full documentation about Oozie can be found here.
