METADATA CONNECT TECHNICAL OVERVIEW
Most database applications are built to store, manage and search large amounts of structured (usually tabular) information. Objectivity Metadata Connect is different in that it focuses on the connections between items and leverages advanced distributed database and processing technology. This White Paper describes the goals for Metadata Connect and outlines the solutions deployed within the Metadata Connect cloud-based application.
Goals and Constraints
The primary goals of Metadata Connect are to:
- Be very easy to use, particularly for users whose main task is to query the data.
- Perform consistently as the total volume and arrival rates of data scale.
- Support simultaneous updates arriving in fast flowing streams, batch updates and online queries.
- Provide a REST API to allow updates by external tools.
- Support advanced connection navigation and pathfinding queries.
- Provide flexible deployment options, e.g. elasticity as demands fluctuate.
- Provide a safe place to store both metadata and data.
- Be easily extensible, even by relatively novice users.
- Leverage standard technologies, particularly platform choices and analytic capabilities.
The primary constraint is the performance of the underlying database and infrastructure. Metadata Connect is deployable in-house, but most users will access one of the cloud-based versions, with the first one being on the Microsoft Azure Marketplace. The cloud infrastructure is distributed, so it is important to use a distributed database that meshes seamlessly with the underlying software and hardware.
Handling Complex Connections
Any database can store data. The most common kinds are relational (RDBMS) and NoSQL databases, most of the latter distinguishing themselves by not being constrained by a rigid table, row and column model. Connections and information type variety are major challenges for RDBMSs. Handling one-to-many and many-to-many connections requires complex “Join” operations, optimally supported by Join Tables. This is inefficient because of the number of operations involved.
Consider the simple object model below, with a Customer information type connected to a Product information type with a many-to-many connection type, i.e. a Customer can be connected to many Products and vice versa.
The connection can be handled by adding a “Product_Code” column to the Customer table and a “Customer_Code” column to the Product table. However, this requires the dynamic instantiation of working structures, known as Join Tables, so it is normal to add an actual Join Table to handle the connections. It need only have two columns for the Customer_Code and Product_Code. All of the tables will also have B-Tree indices, which have to be maintained and looked up.
If we want to find the names of all of the Customers that buy a particular Product we must:
- Look up the B-Tree index entries in the Join Table that contain the Product_Code.
- Look up the actual rows in the Join Table that correspond to those index entries.
- For each match, obtain the corresponding Customer_Code and look up the B-Tree index of the Customer table.
- Read the Customer row from the Customer table.
If there are N customers that buy the product we will have performed 2*N B-Tree index lookups, N Join table row reads and N Customer table row reads, i.e. a total of 2*N B-Tree lookups and 2*N table row reads.
If the time to look up a B-Tree index entry is Tb and the time to read a row is Tr, then the total time taken is 2*N * (Tb + Tr).
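The relational steps above can be sketched as a toy model that counts operations. This is only an illustration of the cost argument, not how any particular RDBMS executes a join: plain Python dicts stand in for B-Tree indices, lists stand in for tables, and every name and sample row is made up for the example.

```python
# Toy model of the relational lookup: dicts stand in for B-Tree indices,
# lists stand in for tables. All names and data are illustrative.

def customers_for_product_relational(product_code, join_index, join_rows,
                                     customer_index, customer_rows):
    """Return (customer names, index lookups, row reads) for one product."""
    index_lookups = 0
    row_reads = 0
    names = []
    # The join-table index maps a Product_Code to the positions of its rows.
    for join_pos in join_index.get(product_code, []):
        index_lookups += 1                            # join-table index entry
        customer_code, _ = join_rows[join_pos]        # read the join-table row
        row_reads += 1
        customer_pos = customer_index[customer_code]  # Customer index lookup
        index_lookups += 1
        names.append(customer_rows[customer_pos]["name"])  # read Customer row
        row_reads += 1
    return names, index_lookups, row_reads


customer_rows = [
    {"code": "C1", "name": "Customer A"},
    {"code": "C2", "name": "Customer B"},
    {"code": "C3", "name": "Customer C"},
]
customer_index = {row["code"]: pos for pos, row in enumerate(customer_rows)}

# Join table rows are (Customer_Code, Product_Code) pairs.
join_rows = [("C1", "P10"), ("C2", "P10"), ("C1", "P11"), ("C3", "P10")]
join_index = {}
for pos, (_, code) in enumerate(join_rows):
    join_index.setdefault(code, []).append(pos)

names, index_lookups, row_reads = customers_for_product_relational(
    "P10", join_index, join_rows, customer_index, customer_rows)
print(names, index_lookups, row_reads)
```

With three customers for product "P10" (N = 3), the counters come out as 2*N index lookups and 2*N row reads, matching the totals in the text.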
Metadata Connect is based on a general purpose object/graph database that directly represents connections. Each item in the database has a unique identifier. Each item also holds a list of the identifiers of all of the items that it is connected to, together with the connection type name for each. For simplicity’s sake we only show a single item and its connections in the diagram below. We can use a B-Tree index or other, faster, mechanisms to look up a Customer or Product item by name.
We can see that Customer A, which has identifier #1, is linked to items #10, #11 and #12. Likewise, each of those items is linked back to item #1, i.e. Customer A. If we want to perform the same query as before the steps are:
- Look up the Product B-Tree index entry for the Product name. It will give us the item identifier for that product.
- Read the item with that identifier. This also obtains the list of the identifiers of connected items.
- Read each of the items using their identifiers and obtain the Customer names.
Again, if there are N customers for a given product we will perform 1 B-Tree index lookup, 1 Product item read and N Customer item reads. Note that there is no need to look up anything in the Customer B-Tree index as we have the exact identifiers of the connected Customer items. So, the total number of operations is 1 B-Tree index lookup and (1 + N) item reads. Using the same terminology as above, the total time taken for Metadata Connect to find details for the connected items is Tb + (1 + N) * Tr.
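The same query against items that carry their own connection lists can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual storage layout of the underlying database: a dict keyed by item identifier stands in for the object store, and the connection type names ("buys", "bought_by"), product names and the second customer are invented for the example.

```python
# Toy model of items that store their own connections. A dict keyed by item
# identifier stands in for the object store; names and data are illustrative.

items = {
    1: {"name": "Customer A",
        "links": [("buys", 10), ("buys", 11), ("buys", 12)]},
    2: {"name": "Customer B", "links": [("buys", 10)]},
    10: {"name": "Product X", "links": [("bought_by", 1), ("bought_by", 2)]},
    11: {"name": "Product Y", "links": [("bought_by", 1)]},
    12: {"name": "Product Z", "links": [("bought_by", 1)]},
}

# Stand-in for the Product name index (a B-Tree in the text).
product_index = {"Product X": 10, "Product Y": 11, "Product Z": 12}


def customers_for_product_graph(product_name):
    """Return (customer names, index lookups, item reads) for one product."""
    product_id = product_index[product_name]   # the single index lookup
    index_lookups = 1
    product = items[product_id]                # read the Product item
    item_reads = 1
    names = []
    for link_type, target_id in product["links"]:
        if link_type == "bought_by":
            # Follow the stored identifier directly; no Customer index needed.
            names.append(items[target_id]["name"])
            item_reads += 1
    return names, index_lookups, item_reads


graph_names, graph_index_lookups, graph_item_reads = \
    customers_for_product_graph("Product X")
print(graph_names, graph_index_lookups, graph_item_reads)
```

For "Product X" there are two connected customers (N = 2), so the sketch performs 1 index lookup and 1 + N = 3 item reads.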
So, the relational technology takes 2 * N * (Tb + Tr) whereas Metadata Connect takes Tb + (1 + N) * Tr, i.e. Metadata Connect takes (2*N-1) * Tb + (N-1) * Tr less time. For simplicity’s sake, if Tb and Tr are both 1 second (about 300x more than real life) and there are 10,000 customers for the product then the RDBMS would take 40,000 seconds versus 10,002 seconds for Metadata Connect.
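The arithmetic can be checked with a few lines of Python. The function names below are ours; the formulas are the ones stated in the text, using the same deliberately inflated values of Tb = Tr = 1 second.

```python
# Cost formulas from the text; n is the number of customers for the product,
# tb the time per B-Tree index lookup, tr the time per row or item read.

def relational_time(n, tb, tr):
    return 2 * n * (tb + tr)


def metadata_connect_time(n, tb, tr):
    return tb + (1 + n) * tr


def time_saved(n, tb, tr):
    return (2 * n - 1) * tb + (n - 1) * tr


n, tb, tr = 10_000, 1, 1
rdbms_seconds = relational_time(n, tb, tr)
graph_seconds = metadata_connect_time(n, tb, tr)
print(rdbms_seconds, graph_seconds, rdbms_seconds - graph_seconds)
# prints: 40000 10002 29998

# The difference matches the closed-form saving quoted in the text.
assert rdbms_seconds - graph_seconds == time_saved(n, tb, tr)
```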
This effect compounds when navigating complex chains of connections, which is what Metadata Connect is all about, because the saving applies at every step along the chain. There have been benchmarks that require traversing billions of connections per query to find the shortest path between items, totally defeating conventional technologies.
Handling Multiple Variants Of An Information Type
Similar arguments can be used to prove that Metadata Connect is more efficient and faster than conventional databases when information types with multiple variants are involved. As an example, suppose that the items in the Digital Chain of Custody were individual lifeform species. There might be a basic “Living Thing” information type with associated details, such as Latin_Name_Classification. We could derive “Mammals” from “Living Thing”, “Primates” from “Mammals” and “Humans” from “Primates”, along with their additional detail types.
There are multiple ways to handle storing these kinds of information in relational databases, none of them particularly efficient in terms of storage handling or query performance. This is mainly because Join Tables are involved in most of them.
The database within Metadata Connect handles queries by storing only the information required for an information type and by looking for all of the types that might be involved in a query. For example, a query looking for “Humans” will ignore “Living Thing”, “Primates” and “Mammals”, whereas a query for “Living Thing” will look for that and all of its derived types. When you combine this capability with the ultra-fast navigation and pathfinding query performance, it is clear that Metadata Connect is much better suited to handling large scale Digital Chain of Custody problems than conventional technologies.
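The query semantics described above can be sketched in a few lines. This shows only which types a query covers, not how the database stores or scans items; the parent table, the sample items and the function names are assumptions made for the illustration.

```python
# Each information type names the type it is derived from (None for the root).
parent_of = {
    "Living Thing": None,
    "Mammals": "Living Thing",
    "Primates": "Mammals",
    "Humans": "Primates",
}


def type_and_derived(type_name):
    """Return type_name plus every type derived from it, at any depth."""
    result = {type_name}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


# Illustrative items, one per information type.
sample_items = [
    {"type": "Living Thing", "Latin_Name_Classification": "Quercus robur"},
    {"type": "Mammals", "Latin_Name_Classification": "Balaenoptera musculus"},
    {"type": "Primates", "Latin_Name_Classification": "Gorilla gorilla"},
    {"type": "Humans", "Latin_Name_Classification": "Homo sapiens"},
]


def query(type_name):
    """Return the Latin names of items of type_name or any derived type."""
    wanted = type_and_derived(type_name)
    return [item["Latin_Name_Classification"]
            for item in sample_items if item["type"] in wanted]


print(query("Humans"))        # only the most derived type
print(query("Living Thing"))  # the base type and everything derived from it
```

A query for "Humans" touches one type, while a query for "Living Thing" expands to all four types in the hierarchy.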
Scaling To Handle Larger Volumes Of Data and More Users
The Issues And Solutions
Handling larger volumes of data requires at least two things:
1) A large enough address space to be able to label and find things without resorting to searching multiple potential sources, i.e. a “Single Logical View” of the information.
2) A storage infrastructure that can rapidly expand and contract dynamically according to needs.
Metadata Connect achieves 1) by using a 64-bit identifier for each individual information item. The number of items that it can handle is larger than any realistic Digital Chain of Custody would ever require. It achieves 2) by leveraging cloud storage infrastructure, which can be allocated and released on demand, or have portions backed off to lower cost storage options (which may incur an access time penalty).
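To give a sense of scale, the short calculation below works out the size of a 64-bit identifier space. It assumes, for illustration only, that every bit pattern is usable as an item identifier and that items are created at a hypothetical sustained rate of one million per second; neither figure comes from the product itself.

```python
ID_BITS = 64
max_ids = 2 ** ID_BITS
print(f"{max_ids:,}")  # 18,446,744,073,709,551,616

# Hypothetical ingest rate, chosen only to illustrate the scale.
items_per_second = 1_000_000
seconds_per_year = 365 * 24 * 60 * 60
years_to_exhaust = max_ids / (items_per_second * seconds_per_year)
print(f"about {years_to_exhaust:,.0f} years to use every identifier")
```

Even at that rate, exhausting the identifier space would take hundreds of thousands of years.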
Metadata Connect is based on distributed processing infrastructure, namely the underlying ThingSpan distributed analytics and database platform and the Apache Spark distributed processing platform. Many operations, including data loading and query handling, can be performed in parallel, generally invisibly to the user. The amount of Microsoft Azure cloud processing power and networking bandwidth can be quickly varied in response to rising, peak/average or emergency demands.
The choice of Apache Spark as the distributed processing platform was largely driven by scalability, elasticity and analytics requirements. Besides the built-in Metadata Connect queries it will also be possible to deploy open source Spark components, such as the Machine Learning Library.
Interfacing External Tools With Metadata Connect
Many small Metadata Connect deployments will be handled interactively, or with occasional bulk data imports. However, if an organization requires greater control of digital assets it is possible to use the Metadata Connect REST API to link tools and other applications directly to Metadata Connect, e.g. to record the creation or updating of a document using Microsoft Office. These tools have to be built by qualified software developers at the moment, but we anticipate a wide variety of adaptors eventually becoming available via the Azure Marketplace.
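As a rough idea of what such an adaptor might do, the sketch below builds (but does not send) an HTTP request recording a document event. This paper does not describe the REST API's endpoints or payloads, so the base URL, the "/items" path and every field name here are placeholders rather than the real interface; only the Python standard library calls are real.

```python
import json
import urllib.request

# Placeholder address: NOT a real Metadata Connect endpoint.
BASE_URL = "https://metadata-connect.example.com/api"


def build_document_event(document_name, action, user):
    """Build a POST request describing a document event (hypothetical schema)."""
    payload = {
        "item_type": "Document",   # hypothetical field names throughout
        "name": document_name,
        "action": action,
        "user": user,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/items",   # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_document_event("report.docx", "created", "alice")
print(request.get_method(), request.full_url)
# A real adaptor would send it with urllib.request.urlopen(request), after
# substituting the documented URL, payload schema and authentication.
```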
Metadata Connect is currently available on the Microsoft Azure Cloud Marketplace running on Windows. Users can choose from several licensing configurations or contact us for specific options. The Metadata Connect application is in a container along with ThingSpan and its servers, an HTTP server and an interface to Active Directory security mechanisms.
Objectivity ThingSpan also runs on Linux, macOS and UNIX, so future Metadata Connect releases may support those platforms across Amazon AWS, Azure, Google Cloud and IBM Bluemix, depending upon demand. Users who need to run Metadata Connect on their own infrastructure should contact us for more information.
We have shown how Objectivity Metadata Connect leverages mature distributed database, analytics and processing platforms to achieve its goals. Its performance and scalability are unmatched in its domain. The user interface is simple and efficient and it is easy to extend the kinds of information that can be handled or to interface it to external tools.
Metadata Connect is based on Objectivity ThingSpan running on the highly scalable Microsoft Azure Cloud platform. It has the scalability to handle Digital Chain of Custody problems of any known magnitude efficiently and with ease. Metadata Connect has the performance and throughput to handle batch updates, interactive tasks and fast flowing data feeds concurrently with complex queries.