ThingSpan Technical Features

ThingSpan Components

There are four ThingSpan components:

  • ThingSpan Metadata Store
  • ThingSpan for HDFS
  • ThingSpan for Spark
  • ThingSpan REST

The diagram shows how the components interact with each other and with the Spark framework and storage.

The ThingSpan platform is a graph of objects and connections that typically indexes external data stored in HDFS files or other repositories. Its schema can be defined:

  • Via Spark DataFrames
  • Using the ThingSpan REST server and API

The objects can be distributed among ThingSpan databases using a Placement Management Directives file (in XML format). The databases belong to a federated database (federation) that provides a Single Logical View of all of their contents. Objects can be created, read, updated, deleted, linked, unlinked, indexed, and queried via any of the above interfaces, though the preferred ones are DataFrames and the REST API.

The ThingSpan Metadata Store is hosted on HDFS using ThingSpan for HDFS. It is a repository, not a software component in its own right, so creating and accessing it is described in the following three sections of this document.

ThingSpan for HDFS

COMPONENTS

The ThingSpan for HDFS component allows ThingSpan files to be stored in HDFS. This increases data availability and means that ThingSpan can be installed in a standard Hadoop/Spark environment. Files can also be stored on a standard POSIX filesystem, e.g. on a SAN attached to the Hadoop/Spark equipment.

The diagram below shows the high-level components of ThingSpan for HDFS. The user specifies HDFS in the ThingSpan bootfile and when placing database or container files. The components are loaded into the ThingSpan kernel via the plugin module.

There are also two unit-testing interfaces that are used to verify functionality and for performance benchmarking. Users will only need these interfaces if Objectivity Customer Support or Systems Engineering have to investigate an issue.

File To Block Mapping

FILE MAPPINGS

 

Background

HDFS supports append-only writes and random-access reads, whereas ThingSpan also needs random-access writes. At the lowest level, when the ThingSpan Storage Manager writes an updated page to a disk controlled by a POSIX filesystem, the page is either appended to a container or database file or written to an unused physical page in the appropriate file.

 

Using HDFS

In HDFS, each ThingSpan or Objectivity/DB file is divided into Segments. Each Segment is written to an HDFS file, and each HDFS file consists of multiple fixed-size Blocks.


When a Segment is modified, it is read from disk, modified in memory, and written back. Several caching techniques are used to improve performance. Segments are actually written to disk when:

  • A file is closed
  • File synchronization (fsync) is requested
  • The system is running out of cache space
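
The following Python sketch illustrates this write-back scheme. The Segment size, cache limit, and all names here are illustrative assumptions made for this document, not ThingSpan's actual internals:

    SEGMENT_SIZE = 64 * 1024 * 1024   # assumed fixed Segment size (64 MB)

    class SegmentCache:
        """Illustrative write-back cache for Segments (not ThingSpan code)."""

        def __init__(self, max_cached_segments=16):
            self.dirty = {}                       # Segment index -> page bytes
            self.max_cached = max_cached_segments

        def write_page(self, file_offset, page):
            # Map a random-access file offset to its enclosing Segment.
            segment_index = file_offset // SEGMENT_SIZE
            self.dirty[segment_index] = page      # modify the cached copy
            if len(self.dirty) > self.max_cached:
                self.flush()                      # cache space exhausted

        def flush(self):
            # Also triggered by file close and fsync: each modified Segment
            # is rewritten as a whole HDFS file (append-only friendly).
            for segment_index in sorted(self.dirty):
                pass  # write the Segment back to its HDFS file here
            self.dirty.clear()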

File Placement

The location of a ThingSpan file is specified using one of the following formats:

  • hdfs://<host>/<path>      //  Standard HDFS syntax, or
  • hdfs://<host>::<path>     //  Standard Objectivity/DB use of ::

For example, to create a ThingSpan federation on an HDFS system:

  • objy createFD -fdDirHost hdfs://172.16.132.147:9000 -fdname hello_hdfs -fdDirPath /data -jnlDirHost 127.0.0.1 -jnlDirPath /tmp -lockServerHost 127.0.0.1

ThingSpan for Spark

PURPOSE

ThingSpan for Spark enables users that have built databases using either the ThingSpan Metadata Store or Objectivity/DB to access their federations from Spark components, such as:

  • Spark MLlib - machine learning and analytics
  • Spark GraphX - graph queries
  • Spark SQL - data mining

 

DEFINITIONS

RDD - A Resilient Distributed Dataset, the main abstraction for a dataset in Spark. The RDD presents itself to the Spark user as a single collection of objects; however, it may be transparently distributed across a large computing cluster in memory. Complex filtering, transformation, and grouping operations can be performed on the RDD and may be parallelized by Spark where possible. Spark also keeps track of an RDD's lineage so that, in the event of a node failure in the cluster, individual partitions can be recomputed to allow processing to complete.

DataFrame - A construct that is conceptually equivalent to a table in a relational database. It describes its contents using a schema and materializes data as Rows, each comprising a value for every column of the table. The DataFrame is a core concept of Spark SQL; it is an abstraction that allows a database backend to interact with the wider Spark environment without exposing details of its own data model or implementation. Several DataFrames (potentially backed by different data sources) may be combined in a SQL statement to produce virtual DataFrames, which may then be converted and used as an RDD in typical Spark workflows. RDDs may also be transformed into DataFrames and subsequently bound and saved to a physical data store.
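
To make these definitions concrete, the following generic PySpark snippet (standard Spark, not ThingSpan-specific) builds a DataFrame, queries it with SQL, and drops down to the underlying RDD:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("definitions-demo").getOrCreate()

    # A DataFrame: schema-described rows, conceptually a relational table.
    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # SQL over the DataFrame produces another (virtual) DataFrame.
    adults = spark.sql("SELECT name FROM people WHERE age > 30")

    # The underlying RDD: a distributed collection of Row objects whose
    # lineage Spark tracks so lost partitions can be recomputed.
    print(adults.rdd.map(lambda row: row.name).collect())   # ['alice']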

 

ThingSpan for Spark Features

ThingSpan for Spark provides:

  • Wrapped Objectivity/DB servers that can be run as Spark batch tasks.
  • Tools for extracting the schema from an existing federation (ThingSpan or Objectivity/DB) and creating DataFrames that can be accessed by Spark components.

 

Advantages

ThingSpan for Spark makes it possible to run advanced analytics and data-mining tasks in parallel on the ThingSpan Metadata Store or an Objectivity/DB federation. This combines the power of the many libraries provided by the open source Spark community with the scalability, flexibility, and navigational performance of ThingSpan and Objectivity/DB.

Spark, running with YARN, can monitor the health of application tasks and Objectivity/DB processes, such as lock, data and query servers. If a process fails it can be automatically restarted. If the data is stored via the ThingSpan HDFS layer then both the service availability and data availability of a system will be significantly improved.

ThingSpan for Spark Conceptual Model

The Spark Adapter is delivered as one or more jar files that can be loaded into an existing Spark environment. No Objectivity/DB or ThingSpan APIs are exposed directly to the user. Users interact with the feature via the Spark DataFrame API and use a small set of configuration options on this API to configure it for Objectivity/DB.

At a high level, the following diagram depicts the basic functional components of the Spark Adapter and the software layers involved. Note that the user only interacts with the Spark and Spark SQL APIs and configures the adapter via the DataFrame options:

Spark Adapter Components

The main goals of the adapter are to:

  • Enable a Data Scientist (Spark user) to interact with Objectivity without requiring an in-depth knowledge of the database
  • Allow DataFrames to be created from Objectivity classes and queried in SQL via the Spark SQL API
  • Discover partitioning of the underlying Objectivity dataset to allow parallel read/write operations
  • Allow resulting datasets (as RDDs or DataFrames) to be written back to Objectivity/DB

Spark provides a set of APIs for constructing DataFrames and binding them to a data source. The adapter hides the details of the underlying database from the Spark user and performs the following functions:

  • Query Translation - When the user specifies a SQL statement that involves a DataFrame representing the database, Spark provides a set of filters and column requirements to the adapter. The adapter translates these into a native query, in this case ThingSpan Predicate Query Language statements.
  • Schema Mapping - The Spark user views schema in the form of a StructType (see also DataTypes). A mapping is provided between this representation and the underlying schema system in Objectivity.
  • Connection Management - Since Spark is a distributed data framework, reading and writing occur from multiple nodes. Objectivity connections and sessions are provided to the Spark framework as required by each of the workers.
  • Partitioning - Translates physical data partitions (such as a Container in Objectivity) into logical partitions Spark can use to parallelize workloads on the dataset. Location "hints" are also provided to assist Spark in assigning work to nodes containing the physical data for that partition.
  • Options - Any functionality specific to ThingSpan that requires information from the user can be conveyed via standard options on the SparkSQL Reader and Writer interfaces.
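
A minimal usage sketch in PySpark follows. The data source name and option keys are assumptions made for illustration; the actual names are defined in the ThingSpan release documentation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("thingspan-demo").getOrCreate()

    # Build a DataFrame backed by an Objectivity class. The format name and
    # option keys below are placeholders, not the documented ones.
    persons = (spark.read
               .format("com.objy.spark.sql")                          # hypothetical
               .option("objy.bootFilePath", "/data/hello_hdfs.boot")  # hypothetical
               .option("objy.dataClassName", "com.example.Person")    # hypothetical
               .load())
    persons.createOrReplaceTempView("persons")

    # The filter below is pushed down to the adapter, which translates it
    # into a ThingSpan Predicate Query Language statement.
    spark.sql("SELECT name FROM persons WHERE age > 30").show()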

 

ThingSpan for Spark Setup and Configuration

Spark makes integration of third-party software relatively simple. The Spark Adapter is packaged as a jar file, which in turn depends on the Objectivity/Java jar and an Objectivity runtime being installed on each machine in the Spark cluster. The ThingSpan installer performs the following tasks:

  • Installing and configuring an Objectivity/Java runtime
  • Optionally installing a service for the master node (the lock server)
  • Requesting the user's Spark installation path
  • Installing dependency jars in the {spark_home}/lib directory
  • Uninstalling a specific version of ThingSpan

Federated Database Deployment

Spark does not directly place constraints on the deployment of the Federated Database. The Spark Adapter requires only a standard bootfile and uses it to make connections in the same way a standard client would (i.e. the lock server and data nodes for the federation must be contactable via the network).

There are, however, significant performance benefits to co-locating the federation within the Spark cluster to allow the worker nodes local access to databases within the federation. The Spark Adapter is capable of recognizing partitions of the target data and providing hints to Spark in a way that allows the assignment of worker nodes to local partitions. For example, if a DataFrame is constructed to read all instances of a given type in Objectivity, Spark can do this in parallel across the cluster. Furthermore, it will prefer to assign tasks in a way that ensures each worker node reads data from its local storage.

ThingSpan REST

ThingSpan REST is a new language binding that makes it possible to dynamically manipulate ThingSpan Metadata Store object definitions and data via a REST interface. ThingSpan REST has two components:

  • The REST Server - accesses a ThingSpan federation.
  • The REST API - can be called from web clients.

 

The ThingSpan REST API

The REST API provides these resources:

  • REST
  • Schema
  • Index
  • Object
  • Placement
  • Query
  • Transaction
  • Tool Runner

 

The REST Resource

The REST resource provides generic information about the current REST implementation. This resource will typically be the first one accessed, as it contains the server and API versions, the supported resources, and other such information. It is unique in that, unlike most other operations, it does not require a transaction.
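
A minimal sketch of accessing this resource from Python; the base URL and endpoint path are hypothetical placeholders, not the documented API:

    import requests

    # Hypothetical server address and endpoint path, for illustration only.
    BASE = "http://localhost:8080/v1"

    info = requests.get(BASE + "/rest").json()   # no transaction required
    print(info)   # e.g. server version, API version, supported resources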

 

The Schema Resource

This resource is responsible for CRUD schema operations: adding types, adding and removing members, looking up types and their members, and deleting types and/or members.
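
A hypothetical sketch of adding a type; the endpoint path and payload shape are assumptions made for illustration only:

    import requests

    BASE = "http://localhost:8080/v1"   # hypothetical base URL, as above

    # Add a "Person" type with two members (payload shape is assumed).
    new_type = {
        "name": "Person",
        "members": [
            {"name": "name", "type": "String"},
            {"name": "age", "type": "Integer"},
        ],
    }
    requests.post(BASE + "/schema/types", json=new_type).raise_for_status()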

 

The Index Resource

The index resource is responsible for CRUD index operations.

 

The Object Resource

This resource is responsible for CRUD object-related operations. This includes creating, updating, looking up, and deleting objects.
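
A hypothetical object life cycle in Python; the endpoint paths and response fields are assumptions made for illustration only:

    import requests

    BASE = "http://localhost:8080/v1"   # hypothetical base URL, as above

    # Create, look up, and delete an object (paths and fields are assumed).
    created = requests.post(
        BASE + "/objects",
        json={"type": "Person", "attributes": {"name": "alice", "age": 34}},
    ).json()
    oid = created["id"]                                    # assumed field
    obj = requests.get(BASE + "/objects/" + oid).json()    # look it up
    requests.delete(BASE + "/objects/" + oid)              # delete it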

 

The Placement Resource

This resource is responsible for read and update (RU) operations related to placement. There is no deletion operation on placement, and no creation operation, since a default placement always exists. Every update creates a new version of the placement document.

 

The Query Resource

This resource is responsible for query operations. The result of a query is a list of objects that match the given criteria, or an empty list if there are no matches. To protect the user, there will be two versions of the query resource: one meant for read-only operations (it will return an error if a write operation is requested) and one capable of read-only, write-only, or read-write operations.
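
A hypothetical read-only query call; the endpoint path and predicate syntax shown are assumptions made for illustration only:

    import requests

    BASE = "http://localhost:8080/v1"   # hypothetical base URL, as above

    # Read-only query; returns matching objects or an empty list.
    matches = requests.post(
        BASE + "/query/read",
        json={"type": "Person", "predicate": "age > 30"},
    ).json()
    print(matches)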

 

The Transaction Resource

The transaction resource allows several Objectivity/DB REST operations to be run together in a single transaction.
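
A hypothetical sketch of wrapping two calls in one transaction; the begin/commit endpoints and the transaction header are assumptions made for illustration only:

    import requests

    BASE = "http://localhost:8080/v1"   # hypothetical base URL, as above

    # Begin a transaction, run two operations inside it, then commit;
    # roll back if anything fails. The whole protocol is assumed.
    tx = requests.post(BASE + "/transactions", json={"mode": "read_write"}).json()
    tx_id = tx["id"]
    try:
        for name, age in [("bob", 29), ("carol", 41)]:
            requests.post(
                BASE + "/objects",
                json={"type": "Person", "attributes": {"name": name, "age": age}},
                headers={"X-Transaction-Id": tx_id},   # assumed header
            ).raise_for_status()
        requests.post(BASE + "/transactions/" + tx_id + "/commit")
    except requests.HTTPError:
        requests.post(BASE + "/transactions/" + tx_id + "/rollback")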

 

The Tool Runner Resource

This resource allows the user to execute ThingSpan tool runner commands, e.g. to create a new Metadata Store.