DATABASE DESIGN MANTRAS vs PHYSICS – Part 2

The Role of Data Clustering in High Performance, Scalable Database Systems

Background

In the first article in this series we looked at the need to reduce the number of I/Os required to perform database operations and the role of smart caching, including cross-transaction caching. Reducing I/Os improves response times and increases system throughput. In this article we’ll look at the role of data clustering, which aims to reduce I/Os by physically grouping together data that is frequently accessed together.

Fine Grain Clustering

Caching can dramatically reduce the number of I/Os needed to support some operations. The next line of attack is clustering objects that are generally used together, such as Word objects with the Sentence object that they are a logical part of. You can only cluster things once, unless some of the objects are replicated, but in the document example that we used, the number of I/Os will not be much more than the size of the document divided by the I/O size. In ThingSpan the default page size is 16 KB, so reading a 40 KB document would require 3 I/Os. If the same document were stored in a classically normalized relational database there would be 1 I/O for the Document table, 1 for each Chapter, 1 for each Paragraph, (#Sentences / IO_Size) for the sentences, and so on. It would clearly be more than 3.

Objectivity’s ThingSpan allows the application or database designer to cluster objects in two ways: explicitly or using placement directives. The two mechanisms should generally not be mixed, because the placement directives can narrow the scope of...
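The I/O arithmetic in the clustering discussion above can be sketched in a few lines. This is only an illustration: the function names and the chapter, paragraph and sentence-page counts are hypothetical, and the only figures taken from the text are the 16 KB page size and the 40 KB document.

```python
import math

PAGE_SIZE_KB = 16  # default ThingSpan page size quoted in the text


def clustered_ios(document_kb, page_kb=PAGE_SIZE_KB):
    """I/Os to read a document whose objects are clustered on consecutive pages."""
    return math.ceil(document_kb / page_kb)


def normalized_ios(chapters, paragraphs, sentence_pages):
    """Rough count for a normalized layout: 1 I/O for the Document row,
    1 per Chapter, 1 per Paragraph, plus the pages holding the sentences."""
    return 1 + chapters + paragraphs + sentence_pages


# 40 KB document on 16 KB pages, as in the text
print(clustered_ios(40))  # 3

# Hypothetical document shape: 2 chapters, 10 paragraphs, 2 pages of sentences
print(normalized_ios(chapters=2, paragraphs=10, sentence_pages=2))  # 15
```

Even with these small, made-up counts the normalized layout needs several times as many I/Os, because each level of the document hierarchy costs at least one read of its own.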
DATABASE DESIGN MANTRAS vs PHYSICS – Part 1

The Role of Smart Caching in High Performance, Scalable Database Systems

Background

One of the mantras of most mainstream Database Management Systems (DBMSs) is that the physical organization of the data should be hidden under the hood. The argument is that the user should not have to know anything about the underlying mechanics in order to store and retrieve data. This is a worthy goal, but, in reality, this task is often delegated to a trained Database Administrator.

When we architected ThingSpan, powered by Objectivity/DB, we took a different approach. We gave the application engineer the power to decide how best to cache, cluster and distribute data. However, once the data has been placed, ThingSpan presents a “Single Logical View” of it. The kernel works out where data is stored and communicates with remote data servers if it isn’t on the client machine. Part 1 of this blog series describes the advantages of this approach.

The Logical and Physical Environments

The key thing about distributed database environments is to make disparate physical resources look like a single logical environment to clients. This can be done in multiple ways, such as hiding the databases behind a single server interface, using a federation layer that sends appropriate requests to multiple database servers, sometimes in parallel, or making the physical resources appear to be a part of a single address space. Objectivity uses the last of these models, termed a “Single Logical View”. We will return to this topic later.

Figure 1 - The Single Logical View

Smart Caching Background

One of the primary goals of any database system is to make...
Why Graph Databases Built On The Wrong Foundation Cannot Compete

Overview

As graph databases become more widely adopted it is inevitable that other databases will add some kind of graph capability to their APIs. In this article I explain why that approach is never going to produce a system that performs as well as a true graph database, such as the one within Objectivity’s ThingSpan. I’ll explain the main requirements, then look at the number of logical and physical operations needed to perform a simple navigation query using relational, NoSQL and graph database technologies.

Graph Database Requirements

Almost every graph query starts with a single node (a Vertex) and then navigates through relationship objects called Edges to a connected Vertex. This process is repeated until the tree or graph of objects has been traversed. There is also a more complex kind of query termed pathfinding, which finds the shortest path or all paths between two or more objects. All current databases use combinations of three basic mechanisms:

Scanning
Link traversal
Lookups sped up by hash keys or indices [The indices consist of linked entries.]

The graph queries described above start by using a key or index to find the origin vertex(es), then use link traversal to navigate the graph. Any DBMS can perform these operations, but as the majority of the query is taken up with link traversal, the inefficiencies of this underlying mechanism dominate performance numbers.

Building a Graph Layer on an RDBMS

Any RDBMS is going to be reasonably fast at performing the initial lookup(s) to find the origin Vertex(es), or, more likely, the correct row of the join table. Traversing to the N connected...
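The navigation pattern described under Graph Database Requirements above (one keyed lookup to find the origin Vertex, then repeated link traversal through Edges) can be sketched with plain in-memory objects. This is a minimal illustration, not ThingSpan's API: the Vertex and Edge classes, the `navigate` function and the sample data are all hypothetical.

```python
from collections import deque


class Vertex:
    def __init__(self, name):
        self.name = name
        self.edges = []  # direct references to outgoing Edge objects


class Edge:
    def __init__(self, label, target):
        self.label = label
        self.target = target  # direct reference to the connected Vertex


def connect(src, label, dst):
    src.edges.append(Edge(label, dst))


def navigate(index, start_key, max_hops):
    """Find the origin with one keyed lookup, then follow links breadth-first."""
    origin = index[start_key]  # the only lookup in the whole query
    seen = {origin.name}
    order = [origin.name]
    queue = deque([(origin, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for edge in vertex.edges:  # pure link traversal, no index access
            target = edge.target
            if target.name not in seen:
                seen.add(target.name)
                order.append(target.name)
                queue.append((target, hops + 1))
    return order


# Hypothetical sample graph: A -> B, A -> C, B -> D, C -> D
a, b, c, d = (Vertex(n) for n in "ABCD")
connect(a, "calls", b)
connect(a, "calls", c)
connect(b, "calls", d)
connect(c, "calls", d)
index = {v.name: v for v in (a, b, c, d)}

print(navigate(index, "A", 2))  # ['A', 'B', 'C', 'D']
```

After the first line of `navigate`, every step follows an object reference, which is why the cost of link traversal dominates the query.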
How Smart Are Your Connected Devices? Using Spark and ThingSpan to Provide IIoT Predictive Analytics for Smart Homes.

The Industrial Internet of Things covers a very wide range of devices and systems that interact with one another or dedicated services over the Internet. Although such systems have been deployed by specialist companies, such as building control system suppliers, there has been a recent upsurge in interest in developing unified protocols and standards for IIoT infrastructure. IIoT covers a wide range of disciplines, but they can be grouped as follows:

Infrastructure:
IIoT Cloud Platforms
Network Infrastructure & Sensors
Configuration Management
IIoT Cybersecurity
Techniques:
Big Data Learning
Machine Analytics
Application Sectors:
Manufacturing & Supply Chain
Extraction & Heavy Industry
Utilities and Smart Grid/City/Home
Transportation & Fleet

The infrastructure and techniques have a lot in common with the consumer/retail IoT domain, so in this first look at applying Spark and ThingSpan in IIoT applications we will use a simple Smart Home application, as the techniques employed are applicable to both domains.

Using Spark and ThingSpan for Intelligence Analytics

Human Intelligence (HUMINT) consists of a huge graph of connected snippets of information about criminals and terrorists, plus analyst reports and a wealth of background information. In this example, we will deal with data that is primarily telephone metadata, which includes Call Detail Records and the people involved in the calls.

We will look for suspicious patterns of calls, and, if we find any, we will try to determine whether any of the people involved have been sighted near a potential target, such as an important government facility.