DATABASE DESIGN MANTRAS vs PHYSICS – Part 2

The Role of Data Clustering in High Performance, Scalable Database Systems

Background

In the first article in this series we looked at the need to reduce the number of I/Os required to perform database operations and the role of smart caching, including cross-transaction caching. Reducing I/Os improves response times and increases system throughput. In this article we'll look at the role of data clustering, which aims to reduce I/Os by physically grouping together data that is frequently accessed together.

Fine Grain Clustering

Caching can dramatically reduce the number of I/Os needed to support some operations. The next line of attack is clustering objects that are generally used together, such as Word objects with the Sentence object that they are a logical part of. You can only cluster things once, unless some of the objects are replicated, but in the document example that we used, the number of I/Os will not be much more than the size of the document divided by the I/O size. In ThingSpan the default page size is 16 KB, so reading a 40 KB document would require 3 I/Os. If the same document were stored in a classically normalized relational database there would be 1 I/O for the Document table, 1 for each Chapter, 1 for each Paragraph and (#Sentences / IO_Size) for the Sentences, etc. It would clearly be more than 3.

Objectivity's ThingSpan allows the application or database designer to cluster objects in two ways: explicitly, or using placement directives. The two mechanisms should generally not be mixed, because the placement directives can narrow the scope of...
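The I/O arithmetic above is easy to check for yourself. Below is a minimal sketch of both counts; the 16 KB page size comes from the text, while the chapter, paragraph and sentence counts are hypothetical stand-ins for a real document:

```python
# Minimal sketch of the I/O arithmetic above. The page size matches the
# ThingSpan default mentioned in the text; the document shape (chapters,
# paragraphs, sentences) is a made-up example, not real data.
import math

PAGE_SIZE_KB = 16

def clustered_ios(doc_size_kb: float) -> int:
    """Document stored contiguously: one read per page it spans."""
    return math.ceil(doc_size_kb / PAGE_SIZE_KB)

def normalized_ios(chapters: int, paragraphs: int, sentences: int,
                   sentences_per_page: int) -> int:
    """Classically normalized layout: roughly one read per parent row,
    plus the sentence rows packed sentences_per_page to a page."""
    return (1                    # Document table row
            + chapters           # one read per Chapter row
            + paragraphs         # one read per Paragraph row
            + math.ceil(sentences / sentences_per_page))

print(clustered_ios(40))               # -> 3, as stated in the text
print(normalized_ios(2, 10, 200, 50))  # -> 17 with these hypothetical counts
```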
DATABASE DESIGN MANTRAS vs PHYSICS – Part 1

The Role of Smart Caching in High Performance, Scalable Database Systems

Background

One of the mantras of most mainstream Database Management Systems (DBMSs) is that the physical organization of the data should be hidden under the hood. The argument is that the user should not have to know anything about the underlying mechanics in order to store and retrieve data. This is a worthy goal, but, in reality, this task is often delegated to a trained Database Administrator.

When we architected ThingSpan, powered by Objectivity/DB, we took a different approach. We gave the application engineer the power to decide how best to cache, cluster and distribute data. However, once placed, ThingSpan presents a "Single Logical View" of the data. The kernel works out where data is stored and communicates with remote data servers if it isn't on the client machine. Part 1 of this blog series describes the advantages of this approach.

The Logical and Physical Environments

The key task in distributed database environments is to make disparate physical resources look like a single logical environment to clients. This can be done in multiple ways: hiding the databases behind a single server interface, using a federation layer that sends appropriate requests to multiple database servers, sometimes in parallel, or making the physical resources appear to be part of a single address space. Objectivity uses the latter model, termed a "Single Logical View". We will return to this topic later.

Figure 1 - The Single Logical View

Smart Caching

Background

One of the primary goals of any database system is to make...
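To make the "single address space" idea concrete, here is a minimal, hypothetical sketch of how a client-side kernel might resolve a logical object identifier, serving repeat reads from a local cache and contacting a remote data server only on a miss. The ObjectId layout and all class names are illustrative assumptions, not the Objectivity/DB API:

```python
# Hypothetical sketch of a "Single Logical View": the client addresses
# objects by a logical ID and the kernel resolves it, checking a local
# page cache before asking a remote data server. The ID layout and names
# are illustrative assumptions, not Objectivity's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectId:
    database: int  # which physical database holds the object
    page: int      # page within that database
    slot: int      # slot within the page

class DataServer:
    """Stands in for a remote data server; here just an in-memory dict."""
    def __init__(self, pages):
        self.pages = pages  # {page_number: [object payloads]}

    def fetch_page(self, page):
        return self.pages[page]

class ClientKernel:
    def __init__(self, servers):
        self.servers = servers  # {database_number: DataServer}
        self.cache = {}         # {(database, page): [object payloads]}

    def read(self, oid: ObjectId):
        key = (oid.database, oid.page)
        if key not in self.cache:  # miss: one round trip to the right server
            self.cache[key] = self.servers[oid.database].fetch_page(oid.page)
        return self.cache[key][oid.slot]  # hit: served locally, no I/O

kernel = ClientKernel({1: DataServer({7: ["doc-a", "doc-b"]})})
print(kernel.read(ObjectId(database=1, page=7, slot=0)))  # fetches page 7
print(kernel.read(ObjectId(database=1, page=7, slot=1)))  # served from cache
```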
Why Graph Databases Built On The Wrong Foundation Cannot Compete

Overview

As graph databases become more widely adopted it is inevitable that other databases will add some kind of graph capability to their APIs. In this article I explain why that approach is never going to produce a system that performs as well as a true graph database, such as the one within Objectivity's ThingSpan. I'll explain the main requirements, then look at the number of logical and physical operations needed to perform a simple navigation query using relational, NoSQL and graph database technologies.

Graph Database Requirements

Almost every graph query starts with a single node (a Vertex) and then navigates through relationship objects, called Edges, to a connected Vertex. This process is repeated until the tree or graph of objects has been traversed. There is also a more complex kind of query, termed pathfinding, which finds the shortest or all paths between two or more objects. All current databases use combinations of three basic mechanisms:

- Scanning
- Link traversal
- Lookups sped up by hash keys or indices [The indices consist of linked entries.]

The graph queries described above start by using a key or index to find the origin vertex(es), then use link traversal to navigate the graph. Any DBMS can perform these operations, but as the majority of the query is taken up with link traversal, the inefficiencies of this underlying mechanism dominate performance numbers; the sketch below makes this cost difference concrete.

Building a Graph Layer on an RDBMS

Any RDBMS is going to be reasonably fast at performing the initial lookup(s) to find the origin Vertex(es), or, more likely, the correct row of the join table. Traversing to the N connected...
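Picking up the traversal-cost argument, the sketch below counts logical operations per hop under two simplified storage models: a join table probed through a B-tree index, versus directly stored edge references as a native graph store keeps them. The logarithmic index-probe cost and the example numbers are modelling assumptions, not measurements of any particular product:

```python
# Simplified cost model for an N-hop navigation query. The log2 index
# descent and the example fan-out/row counts are assumptions chosen to
# illustrate the argument, not benchmarks of any real database.
import math

def rdbms_hop_cost(fanout: int, hops: int, join_table_rows: int) -> float:
    """Each hop probes the join-table index (~log2(rows) node touches per
    probe) once per frontier vertex, then fetches each neighbour row."""
    cost, frontier = 0.0, 1
    for _ in range(hops):
        cost += frontier * math.log2(join_table_rows)  # index descents
        cost += frontier * fanout                      # neighbour row reads
        frontier *= fanout
    return cost

def graph_hop_cost(fanout: int, hops: int) -> float:
    """Each hop follows stored edge references directly: one touch per
    edge object and one per neighbour vertex, with no index descent."""
    cost, frontier = 0.0, 1
    for _ in range(hops):
        cost += frontier * fanout * 2
        frontier *= fanout
    return cost

# 3 hops, fan-out 10, a 100-million-row join table (hypothetical sizes):
print(rdbms_hop_cost(fanout=10, hops=3, join_table_rows=10**8))  # ~4,059
print(graph_hop_cost(fanout=10, hops=3))                         # 2,220
```

Under this model the index descents are paid again on every hop in the relational case, which is why the gap widens as queries navigate deeper.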
The Fastest Platform on the Planet for Connections and Intelligence

Introduction

Successful Big and Fast Data projects require a platform that can:

- Scale massively
- Perform in near real-time

The Objectivity ThingSpan platform is unique in its ability to meet those needs. No other platform comes close. As an example, this blog highlights one use case that far exceeds anything else available.

This example also illustrates ThingSpan's ability to grow linearly as computing resources are added. Further, it shows the value of ThingSpan's user language, DO, in helping the user analyze and visualize the desired results. Finally, this example is available in the Amazon Cloud, so that you can confirm ThingSpan's performance and scale for yourself.

Do today's massive quantities of data from multiple sources fit neatly into rows and tables for enterprises and governments to analyze and get the answers needed to make critical operational decisions in near real-time? Have open source technologies proven that they can handle the performance and scale demanded of mission-critical operational systems? Can critical operations afford to wait hours or days for after-the-fact analytics?

Objectivity's ThingSpan is the only data management platform that can ingest billions of complex, interrelated records while simultaneously building the graph, in order to gain new insights in near real-time for operational systems. The use case described below is a very simple one for ThingSpan, yet the customer's existing relational-based system was not even able to return an answer. ThingSpan not only ingested the data 5 times faster while also connecting it, but was able to perform complex queries that their existing relational system could not...
ThingSpan Performance Blog Series – Part III

Graph Analytics with Billions of Daily Financial Transaction Events

Introduction

In this third blog of the series we have updated the performance results to include the numbers from running on a 64-node Amazon EC2 cluster. We also continue to look for ways to optimize the performance of ingest and query. In this update, we further optimized the way ThingSpan stores relationships for large-degree fan-out connectivity. As a reminder, the requirement for this use case was to ingest one billion financial transaction events in under 24 hours (about 12,000 transactions a second) while being able to perform complex query and graph operations simultaneously. ThingSpan surpassed this requirement, ingesting at speed and scale while performing these complex queries at the same time. It is important to note that the graph in this case is incrementally added to as new events are ingested. In this way the graph is always up to date with the latest data (nodes) and connections (edges), and is always available for query in a transactionally consistent state, what we call a living graph.

We used Amazon EC2 nodes for the tests and found that "m4.2xlarge" instances gave the best resources for scalability, performance, and throughput. We initially surpassed the performance requirements using a configuration consisting of 16 "m4.2xlarge" nodes. For this test we used the native POSIX file system. Full details of the proof of concept can be downloaded from: ThingSpan in Financial Services White Paper

Updated Results

Below are the updated results when running on a 64-node Amazon EC2 cluster, with the performance optimization described above. We...
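As a quick sanity check, the arithmetic behind "about 12,000 transactions a second" uses only the one-billion-event target and the 24-hour window stated above; the per-node figure additionally assumes ingest divides evenly across the 16 nodes, which is an assumption on our part:

```python
# Arithmetic behind the stated ingest requirement.
events = 1_000_000_000
window_seconds = 24 * 60 * 60  # 24 hours = 86,400 seconds

required_rate = events / window_seconds
per_node_rate = required_rate / 16  # assumes an even split across 16 nodes

print(f"{required_rate:,.0f} events/s overall")   # -> 11,574 events/s
print(f"{per_node_rate:,.0f} events/s per node")  # -> 723 events/s
```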
ThingSpan Performance Blog Series – Part II

Graph Analytics with Billions of Daily Financial Transaction Events

Introduction

In this second blog of the series we have updated the performance results to include the numbers from running on a 32-node Amazon EC2 cluster. As a reminder, the requirement for this use case was to ingest one billion financial transaction events in under 24 hours (about 12,000 transactions a second) while being able to perform complex query and graph operations simultaneously. ThingSpan has proven able to handle this requirement, ingesting at speed and scale while performing these complex queries at the same time. It is important to note that the graph in this case is incrementally added to as new events are ingested. In this way the graph is always up to date with the latest data (nodes) and connections (edges), and is always available for query in a transactionally consistent state, what we call a living graph.

We used Amazon EC2 nodes for the tests and found that "m4.2xlarge" instances gave the best resources for scalability, performance, and throughput. We surpassed the performance requirements using a configuration consisting of 16 "m4.2xlarge" nodes. For this test we used the native POSIX file system. Full details of the proof of concept can be downloaded from: ThingSpan in Financial Services White Paper

Results

Below are the updated results when running on a 32-node Amazon EC2 cluster. We can use these results to show near-linear scale-up as the volume of data increases and scale-out to achieve better throughput performance. The blue bar represents the time to ingest 250 million events. Therefore, by doubling the number of...
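For readers who want the near-linear scaling claim in formula form, here is an idealized model in which ingest time grows linearly with event count and shrinks linearly with node count. The baseline time below is a placeholder, not a measured result; the measured numbers are in the charts and the white paper:

```python
# Idealized near-linear scaling model for ingest time. The baseline
# figure is a placeholder; it is NOT a measured result from the tests.
def ingest_time(events: float, nodes: int,
                base_events: float = 250e6, base_nodes: int = 16,
                base_hours: float = 1.0) -> float:
    """Time scales up with data volume and down with node count."""
    return base_hours * (events / base_events) * (base_nodes / nodes)

# Doubling the data and the nodes together keeps the time roughly flat:
print(ingest_time(250e6, 16))  # baseline configuration
print(ingest_time(500e6, 32))  # same time under the linear model
```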