In Part 1 of this blog series, I looked at the fundamental principles behind all database technologies and the evolution of DBMSs as system requirements changed. In this concluding article, I’ll address the enormous changes in requirements that Objectivity is seeing and suggest some ways of attacking the problems that they are introducing.
The Rise of Big Data
Dramatically increased and still growing use of the WWW has made it necessary for companies to gather and analyze a much wider variety of data types than ever before. They also need to store and process more data in order to garner business intelligence and improve operations. This introduces an additional data generator, the data center and communications infrastructure, which can produce voluminous logs from multiple sources.
Many of these “Big Data” systems operate on huge volumes of relatively simple data, much of it requiring conversion, filtering or consolidation before it can be used for analytical purposes. In the early days, much of this new data was stored in structured files. Hadoop, with its MapReduce parallel processing component and the scalable, robust Hadoop Distributed File System (HDFS), rapidly gained momentum, making it the most widely used framework for Big Data systems.
As with the engineering and scientific systems, simply handling the raw data isn’t enough. Companies found that extracting metadata as the incoming data was being processed, or supplementing the individual data objects with indices or interlinked metadata objects, made the data more valuable for data-mining queries and analytics.
The Polyglot Approach Reworked
Today, it is accepted that Hadoop was designed to handle batched file operations in parallel, not to process streaming data rapidly or handle more complicated interactive and mixed-mode workflows.
This deficiency led to the development of the Apache Spark framework, which was designed to handle streaming, interactive and batch processes. The streams are divided into micro-batches to allow sub-second responses to incoming data. Hadoop is also notoriously weak at handling exceptions in the workflow, but Spark can be used to monitor processes and control recovery gracefully while reporting the problem to users or other systems. Spark can improve the service availability of a system while HDFS improves the data availability.
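To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It is not Spark's API or its implementation; the function name `micro_batches`, the millisecond timestamps and the 500 ms interval are all assumptions chosen for illustration. It simply shows how a continuous stream of events can be cut into small fixed-width windows that are then processed one batch at a time.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, payload) events into fixed-width time windows.

    Illustrative only: a hypothetical helper, not part of Spark.
    Returns a list of (window_start_ms, [payloads]) sorted by window start.
    """
    windows = defaultdict(list)
    for timestamp_ms, payload in events:
        # Every event falls into the window that starts at the nearest
        # lower multiple of the interval.
        window_start = (timestamp_ms // interval_ms) * interval_ms
        windows[window_start].append(payload)
    return sorted(windows.items())


# Hypothetical stream: four events arriving over roughly 1.2 seconds.
events = [(100, "a"), (400, "b"), (600, "c"), (1200, "d")]
batches = micro_batches(events, 500)

for window_start, payloads in batches:
    print(window_start, payloads)
```

The shorter the interval, the sooner each batch can be handed to downstream processing, which is the trade-off behind sub-second responses: smaller batches mean lower latency but more scheduling overhead per event.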
As the amount of data and the number of different kinds of data increased, it became clear that storing the data wasn’t enough. Hadoop framework providers have started adding simple file-level metadata components to their offerings. However, there is often more value in the kinds of connections between objects than in the objects themselves. This is prime territory for object and graph databases.
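A small example shows why the connections carry the value. The sketch below is plain Python, not the API of any particular graph database, and the object names (`log_file_1`, `server_a`, `customer_x`, `order_42`) are invented for illustration. It stores the links between metadata objects as an adjacency map and uses a breadth-first search to find the shortest chain of relationships between two of them, which is the kind of question that is awkward to express as a series of relational joins.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes linking start to goal, or None.

    Illustrative breadth-first search over an adjacency map; a
    hypothetical helper, not a real database call.
    """
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None


# Hypothetical metadata objects and the links between them.
links = {
    "log_file_1": ["server_a"],
    "server_a": ["log_file_1", "customer_x"],
    "customer_x": ["server_a", "order_42"],
    "order_42": ["customer_x"],
}

path = shortest_path(links, "log_file_1", "order_42")
print(path)
```

None of the individual objects says that a log file is related to an order; only the chain of links does. A graph database keeps those links as first-class data so that traversals like this one do not have to be reassembled by the application.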
Although users can write applications that run under Spark and access multiple kinds of databases or files (the polyglot approach), they still need something better than a relational or column-store database for handling graph problems. Objectivity’s ThingSpan makes it much easier to do this by leveraging object data modeling and providing simple adaptors that present Spark DataFrames to Spark components.
ThingSpan allows Spark Streaming or Big Data applications to store highly interconnected metadata about information stored in HDFS or conventional file systems in its Metadata Store. The metadata can be accessed by Spark SQL, Spark MLlib and Spark GraphX, making it possible to do advanced data-mining and analytics that touch both the metadata and the original data stored in Hadoop or other databases.
The Metadata Store can also be used via the ThingSpan REST Server, which services a REST API. It supports schema, object, indexing and graph data operations. It also supports CRUD and special path-finding operations that are serviced by the underlying database kernel. ThingSpan isn’t just a bolt-on over conventional technology. It uses both mature and newly developed distributed DBMS capabilities in conjunction with popular and robust open-source components.
ThingSpan is the missing link between Big Data and Fast Data. It integrates all of the database components necessary to quickly build complex systems that can be controlled by YARN and seamlessly accessed by other Spark components. It can even form a bridge between existing Objectivity/DB databases and the data-mining and machine-learning capabilities of Spark. ThingSpan can be deployed in private clusters or on cloud infrastructure.
I’ve covered the basic principles that underlie all databases and the evolution of DBMSs as system requirements and architectures changed. To solve the most challenging information fusion problems facing us today, it isn’t enough to hack new software layers over old technologies. Organizations must continue to invent and innovate for the future, and Objectivity is excited to pave the way in tackling the challenges of Big and Fast Data.
CTMO and Founder