Objectivity’s New Way to Help Build Advanced Data Governance Systems – Part 2

In Part 2 of this blog series we continue to look at the mechanisms that influence the items under control and at the workflows that must be enforced so that data quality is not lost and changes are correctly rippled through an advanced data governance system.

Representing Organizational Structures in D*ChoC

D*ChoC has a set of pre-defined User information types (User, Person, Team, Organization and Role) plus an associated connection type called Org_Structure. They can be used to represent just about any kind of organization within D*ChoC. The diagram below shows a publishing organization.

Note how Arthur Grump is employed within the Editorial Team but is also involved in another project that might involve whole teams in addition to other individuals. This kind of matrix organization is hard to represent in conventional databases, but D*ChoC is all about connections, so it is easy to construct a model of the organization and adapt it over time.

Capabilities and Actions

A Capability can be represented by an information type derived from Item. A User must be granted a Capability in order to perform certain actions on all or selected information items. Typical Actions might include the ability to:

- Create a new [ Item => Publication ].
- Delete an abandoned [ User => Team => Project ].
- View an unpublished [ Item => Audit_Report ].
- Connect a [ User => Person ] to a [ User => Team ].

Create, Delete, Edit, View, Connect and Disconnect are known as Actions. Most governance systems...
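The Capability-and-Action idea above can be sketched in a few lines of Python. The action names and the [ Type => Subtype ] examples come from the post; the classes and the `may` method are illustrative assumptions, not the real D*ChoC API.

```python
# Illustrative sketch: Capabilities gate which Actions a User may perform
# on which information item types. Hypothetical API, not D*ChoC's.

ACTIONS = {"Create", "Delete", "Edit", "View", "Connect", "Disconnect"}

class Capability:
    def __init__(self, action, item_type):
        assert action in ACTIONS
        self.action, self.item_type = action, item_type

class User:
    def __init__(self, name):
        self.name = name
        self.capabilities = []

    def grant(self, capability):
        self.capabilities.append(capability)

    def may(self, action, item_type):
        # A User may act only if a matching Capability was granted.
        return any(c.action == action and c.item_type == item_type
                   for c in self.capabilities)

arthur = User("Arthur Grump")
arthur.grant(Capability("View", "Audit_Report"))

print(arthur.may("View", "Audit_Report"))   # True  - capability granted
print(arthur.may("Delete", "Project"))      # False - never granted
```

A real system would also resolve Capabilities granted indirectly, e.g. through Team or Role connections, before denying an action.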
Objectivity’s New Way to Help Build Advanced Data Governance Systems – Part 1

“Data governance is the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data.” - Wikipedia

Background

There are many aspects of data governance that are outside the scope of a Digital Chain of Custody application. However, Objectivity’s latest product, D*ChoC, has powerful features that make it easier to record, audit and control many procedures that could otherwise jeopardize the quality of data. It provides Digital Chain of Custody (D*ChoC) lineage and provenance mechanisms. In this article we will focus on the mechanisms that influence the items under control and on the workflows that must be enforced so that quality is not lost and changes are correctly rippled through a system.

Standard D*ChoC Information and Connection Types

The main D*ChoC Information Types are shown in the diagram below. “Item” represents anything in the Digital Chain of Custody. We will take a quick look at Sequences, as making sure that the right mechanisms are applied in the right order is essential to any data governance regime. Sequences are lists or networks of connected Mechanisms (Applications, Tools, Procedures and Workflows).

Mechanisms

D*ChoC Mechanisms are standard Information Item Types. There are four derived types (see diagram above). Sequences link individual Items together. One kind of Mechanism, Workflow, is special in that it can “include” other Workflows, as well as the other three Mechanisms, when it appears in a Sequence. Mechanisms operate on Items and are generally used in conjunction with one or more User items, e.g. [User => Person] “Author X” used [Mechanism => Application] “Office” to...
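The Workflow-includes-Mechanisms rule described above is essentially a composite structure, which can be sketched as follows. The four Mechanism names and the “Office” Application come from the post; the classes and the `flatten` helper are hypothetical illustrations, not D*ChoC code.

```python
# Sketch of Mechanisms and Sequences. A Workflow may include other
# Workflows as well as the other three Mechanism types; expanding a
# Sequence yields the ordered primitive steps. Illustrative only.

class Mechanism:
    def __init__(self, name):
        self.name = name

class Application(Mechanism): pass
class Tool(Mechanism): pass
class Procedure(Mechanism): pass

class Workflow(Mechanism):
    def __init__(self, name, steps=()):
        super().__init__(name)
        self.steps = list(steps)   # nested Mechanisms, Workflows included

def flatten(mechanism):
    """Expand a Mechanism into the ordered primitive steps it runs."""
    if isinstance(mechanism, Workflow):
        result = []
        for step in mechanism.steps:
            result.extend(flatten(step))   # recurse into nested Workflows
        return result
    return [mechanism]

review = Workflow("Review", [Tool("SpellCheck"), Procedure("SignOff")])
publish = Workflow("Publish", [Application("Office"), review])

print([m.name for m in flatten(publish)])
# ['Office', 'SpellCheck', 'SignOff']
```

Enforcing a governance regime then reduces to checking that the flattened sequence was executed in order, which is exactly the audit trail a chain-of-custody system records.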
DATABASE DESIGN MANTRAS vs PHYSICS – Part 2

The Role of Data Clustering in High Performance, Scalable Database Systems

Background

In the first article in this series we looked at the need to reduce the number of I/Os required to perform database operations and at the role of smart caching, including cross-transaction caching. Reducing I/Os improves response times and increases system throughput. In this article we’ll look at the role of data clustering, which aims to reduce I/Os by physically grouping together data that is frequently accessed together.

Fine Grain Clustering

Caching can dramatically reduce the number of I/Os needed to support some operations. The next line of attack is clustering objects that are generally used together, such as Word objects with the Sentence object that they are logically part of. You can only cluster things once, unless some of the objects are replicated, but in the document example that we used the number of I/Os will be not much more than the size of the document divided by the I/O size. In ThingSpan the default page size is 16 KB, so reading a 40 KB document would require 3 I/Os. If the same document were stored in a classically normalized relational database there would be 1 I/O for the Document table, 1 for each Chapter, 1 for each Paragraph and (#Sentences / IO_Size) I/Os for the Sentences, etc. That is clearly more than 3.

Objectivity’s ThingSpan allows the application or database designer to cluster objects in two ways: explicitly, or using placement directives. The two mechanisms should generally not be mixed, because the placement directives can narrow the scope of...
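The arithmetic above is easy to check. The 16 KB page size and 40 KB document come from the post; the chapter, paragraph and sentence counts for the relational layout are assumptions chosen purely to make the comparison concrete.

```python
# Back-of-the-envelope I/O counts for the clustered vs. normalized layouts.
from math import ceil

PAGE_KB = 16                 # ThingSpan default page size (from the post)
DOC_KB = 40                  # document size (from the post)

# Clustered layout: the document occupies contiguous pages.
clustered_ios = ceil(DOC_KB / PAGE_KB)
print(clustered_ios)         # 3

# Normalized relational layout: roughly one I/O per table row touched.
# Document shape below is a hypothetical example, not from the post.
chapters, paragraphs, sentences = 3, 12, 120
sentences_per_io = 20        # assumed sentence rows fetched per I/O
relational_ios = (1                                  # Document table
                  + chapters                         # one per Chapter row
                  + paragraphs                       # one per Paragraph row
                  + ceil(sentences / sentences_per_io))
print(relational_ios)        # 22 - clearly more than 3
```

Even with generous assumptions about rows per I/O, the normalized layout pays roughly an order of magnitude more I/Os for the same logical read.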
DATABASE DESIGN MANTRAS vs PHYSICS – Part 1

The Role of Smart Caching in High Performance, Scalable Database Systems

Background

One of the mantras of most mainstream Database Management Systems (DBMSs) is that the physical organization of the data should be hidden under the hood. The argument is that the user should not have to know anything about the underlying mechanics in order to store and retrieve data. This is a worthy goal but, in reality, the task is often delegated to a trained Database Administrator.

When we architected ThingSpan, powered by Objectivity/DB, we took a different approach. We gave the application engineer the power to decide how best to cache, cluster and distribute data. However, once the data is placed, ThingSpan presents a “Single Logical View” of it. The kernel works out where data is stored and communicates with remote data servers if it isn’t on the client machine. Part 1 of this blog series describes the advantages of this approach.

The Logical and Physical Environments

The key requirement in a distributed database environment is to make disparate physical resources look like a single logical environment to clients. This can be done in multiple ways, such as hiding the databases behind a single server interface, using a federation layer that sends appropriate requests to multiple database servers, sometimes in parallel, or making the physical resources appear to be part of a single address space. Objectivity uses the latter model, termed a “Single Logical View”. We will return to this topic later.

Figure 1 - The Single Logical View

Smart Caching Background

One of the primary goals of any database system is to make...
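The benefit of caching that the series keeps returning to, i.e. fewer physical I/Os for the same logical reads, can be shown with a minimal page cache. This is a generic LRU sketch to illustrate the idea; it is not ThingSpan's caching implementation, and the capacity and access pattern are made up.

```python
# Minimal page-cache sketch: repeated reads of a hot page are served from
# memory, so logical reads >> physical reads. Illustrative only.
from collections import OrderedDict

class PageCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page data, in LRU order
        self.physical_reads = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # cache hit: no I/O
            return self.pages[page_id]
        self.physical_reads += 1                 # cache miss: real disk I/O
        page = f"data-for-{page_id}"             # stand-in for the disk read
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return page

cache = PageCache(capacity=2)
for page_id in [1, 2, 1, 1, 2]:                  # five logical reads
    cache.read(page_id)
print(cache.physical_reads)                      # 2 - only two touched disk
```

Cross-transaction caching extends the same idea across transaction boundaries: pages kept warm by one transaction spare the next one its misses.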
Why Graph Databases Built On The Wrong Foundation Cannot Compete

Overview

As graph databases become more widely adopted, it is inevitable that other databases will add some kind of graph capability to their APIs. In this article I explain why that approach is never going to produce a system that performs as well as a true graph database, such as the one within Objectivity’s ThingSpan. I’ll explain the main requirements, then look at the number of logical and physical operations needed to perform a simple navigation query using relational, NoSQL and graph database technologies.

Graph Database Requirements

Almost every graph query starts with a single node (a Vertex) and then navigates through relationship objects called Edges to a connected Vertex. This process is repeated until the tree or graph of objects has been traversed. There is also a more complex kind of query, termed pathfinding, which finds the shortest path, or all paths, between two or more objects. All current databases use combinations of three basic mechanisms:

- Scanning
- Link traversal
- Lookups sped up by hash keys or indices [The indices consist of linked entries.]

The graph queries described above start by using a key or index to find the origin vertex(es), then use link traversal to navigate the graph. Any DBMS can perform these operations, but as the majority of the query is taken up with link traversal, the inefficiencies of this underlying mechanism dominate performance numbers.

Building a Graph Layer on an RDBMS

Any RDBMS is going to be reasonably fast at performing the initial lookup(s) to find the origin Vertex(es) or, more likely, the correct row of the join table. Traversing to the N connected...
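A rough operation count makes the traversal argument concrete. The model below is a simplifying assumption, not a benchmark: it charges an RDBMS one B-tree probe (about log2 of the join-table size) per row touched in a hop, while a graph store that keeps edges as direct links pays one traversal per connected vertex.

```python
# Illustrative per-hop operation counts: join-table navigation vs.
# direct link traversal. The cost model is a deliberate simplification.
from math import ceil, log2

def rdbms_hop_ops(rows_in_join_table, fanout):
    # One B-tree probe to locate the join rows for this vertex, then
    # one probe per connected row to fetch the target vertex.
    btree_probe = ceil(log2(rows_in_join_table))
    return btree_probe + fanout * btree_probe

def graph_hop_ops(fanout):
    # Edges stored as direct links: one link traversal per neighbour.
    return fanout

print(rdbms_hop_ops(rows_in_join_table=1_000_000, fanout=10))  # 220
print(graph_hop_ops(fanout=10))                                # 10
```

The gap also compounds: a three-hop query repeats the per-hop cost at every frontier vertex, so the join-based layer falls further behind as the traversal deepens.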
The Fastest Platform on the Planet for Connections and Intelligence

Introduction

Successful Big and Fast Data projects require a platform that can:

- Scale massively
- Perform in near real-time

The Objectivity ThingSpan platform is unique in its ability to meet those needs. No other platform comes close. As an example, this blog highlights one use case that far exceeds anything else available.

This example also illustrates ThingSpan’s ability to grow linearly as computing resources are added. Further, it shows the value of ThingSpan’s user language, DO, in helping the user analyze and visualize the desired results. Finally, this example is available in the Amazon Cloud, so that you can confirm ThingSpan’s performance and scale for yourself.

Do today’s massive quantities of data from multiple sources fit neatly into rows and tables for enterprises and governments to analyze and get the answers needed to make critical operational decisions in near real-time? Have open source technologies proven that they can handle the performance and scale demanded of mission-critical operational systems? Can critical operations afford to wait hours or days for after-the-fact analytics?

Objectivity’s ThingSpan is the only data management platform that can ingest billions of complex, interrelated records while simultaneously building the graph, in order to gain new insights in near real-time for operational systems. The use case described below is a very simple one for ThingSpan, yet the customer’s existing relational-based system was not even able to return an answer. ThingSpan not only ingested the data 5 times faster while also connecting it, but was able to perform complex queries that their existing relational system could not...