In the year 2000, it seemed that database technology had matured to the point where changes were incremental, at least in the enterprise. Today, there is such a wide choice of database management systems (DBMSs) and other storage technology that it is hard to determine the best fit for a particular problem. Some wonder whether we have a surfeit of databases. In this blog series, I’ll look back at how we arrived where we are today, examine the new demands and challenges that we are facing, and suggest some lines of attack using Objectivity’s suite of database platforms.
It’s worth remembering that there are only four main ways to find data that matches particular criteria:
1. Scan all of the places where data is stored and return the items that match the criteria.
2. Use a supplementary structure, such as an index or hash table, to narrow the scope of the search to items that may match all of the criteria.
3. Follow relevant links between data items until all matching items are found and there are no more links to follow.
4. Use content-addressable hardware that can “prompt” matching items to identify themselves.
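The contrast between the first two techniques can be sketched in a few lines of Python. The records and the city index below are purely hypothetical illustrations, not anything from a real DBMS:

```python
# Technique 1 (scan) versus technique 2 (supplementary hash structure).

records = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Oslo"},
]

# Technique 1: scan every stored item and return the matches.
scan_hits = [r for r in records if r["city"] == "Oslo"]

# Technique 2: build a supplementary hash index once, then narrow
# each search to the bucket of candidates for the given key.
index = {}
for r in records:
    index.setdefault(r["city"], []).append(r)
index_hits = index.get("Oslo", [])

print([r["id"] for r in scan_hits])   # [1, 3]
print([r["id"] for r in index_hits])  # [1, 3]
```

Both approaches return the same items; the index simply trades extra storage and maintenance cost for avoiding the full scan on every query.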
These techniques can be combined in many ways at multiple levels to query complex data structures. Below is a list of the most common kinds of DBMS and the techniques they rely on:
Structured file variants were the earliest form of searchable datastore. They combined scanning with hash tables or indices to locate candidate records within a file, often using file offsets or individually identified, variable-sized slots within fixed-size pages to home in on a particular record. Scanning is still heavily used today to process log data and semi-structured files.
Hierarchical DBMSs, such as IBM’s IMS, relied on navigating typed links between records, supplemented with hash key access and indexing.
Relational databases are heavily dependent on B-tree indices and join tables that logically link rows within a table or across tables.
Column stores, such as Google’s Bigtable, use the same kind of indices as relational databases but physically store columns in a compact format instead of individual rows. The rows can also be accessed via hash keys.
Object databases, such as Objectivity/DB, and graph databases, such as InfiniteGraph, use navigation as the prime search technique, but also use hash tables and indices.
Key-value stores use a hash function or hash table to get directly to the target data, or close to it, and then follow links or scan a hash bucket.
Document databases optimize the storage of text, supplemented with indices, hash tables and links.
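Technique 3, navigation, is the one that distinguishes the object and graph databases in the list above. A minimal sketch of link-following, using a hypothetical in-memory adjacency list rather than any real product's API:

```python
# Technique 3: follow relevant links between data items until all
# matching items are found and there are no more links to follow.

links = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": [],
    "dave":  [],
}

def reachable(start):
    """Collect every item reachable from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links.get(node, []))
    return seen

print(sorted(reachable("alice")))  # ['alice', 'bob', 'carol', 'dave']
```

The key point is that the cost of such a traversal depends on the number of links actually followed, not on the total size of the database, which is why navigation wins when relationships are the dominant access pattern.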
Early computing systems were single-user systems, with multiple jobs being processed concurrently to make better use of resources, such as tape drives and printers. True real-time systems used to monitor and control complex equipment, such as power stations or metalworking machines, were, and mostly still are, serviced by local, dedicated processors, often coordinated via a central processing node. Today’s mobile devices are essentially single-user, but complex applications may be running many threads in parallel to accomplish a task, often in background mode.
In the 1970s, banks, airlines, and large retailers were amongst the first to provide multiple remote users access to centralized systems that ran on mainframes. The earliest paradigm was the client-server model, where specialized applications serviced requests from multiple clients, generally within tight time constraints. A one- or two-second response was generally tolerable. DBMS technology evolved to suit the transactional input. Batch processing systems mined the data, generally overnight, for marketing or operational purposes. The DBMSs evolved from hierarchical to relational, which is still the most widely deployed database technology.
The advent of cheap, standardized networking in the mid-1980s caused a surge in the deployment of workgroup systems, often linked to corporate servers for database access. Many of these systems were used for scientific and engineering projects, generally running complex algorithms that were more CPU-intensive than data-intensive. Most of these systems were built by specialized software suppliers that used proprietary data structures instead of commercially available DBMSs. Queries were generally point-and-click driven, and conventional query capabilities were mostly limited to finding candidate files using fairly limited metadata, though some vendors and establishments started using relational databases to store the metadata.
In the late 1980s, there was a surge of interest in object-oriented languages. Relational databases could not efficiently handle engineering and scientific applications, represent inheritance, or manage complex tree and graph structures without cumbersome mapping layers. Removing this “impedance mismatch” led to the development of object databases (ODBMSs). Our main product, Objectivity/DB, has consistently won benchmarks against RDBMSs when access is primarily navigational or there are many variants on prevalent data types, which is easily handled using inheritance. Storage overheads tend to be around 20%, versus 200% or more for RDBMSs. These advantages have been carried forward into InfiniteGraph, our distributed graph database, and ThingSpan, our advanced Information Fusion platform.
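The kind of variant-rich model that inheritance handles naturally can be sketched as follows. The classes and field names are hypothetical, chosen only to illustrate the mismatch:

```python
# Many variants of one prevalent type, expressed via inheritance.
# An object database can store and query all subtypes through the
# common base class; a relational mapping typically needs extra
# join tables or nullable columns for each variant.

class Sensor:
    def __init__(self, sensor_id):
        self.sensor_id = sensor_id

class TemperatureSensor(Sensor):
    def __init__(self, sensor_id, celsius):
        super().__init__(sensor_id)
        self.celsius = celsius

class PressureSensor(Sensor):
    def __init__(self, sensor_id, pascals):
        super().__init__(sensor_id)
        self.pascals = pascals

fleet = [TemperatureSensor("t1", 21.5), PressureSensor("p1", 101325)]

# Query the whole fleet, or filter to one variant, with no mapping layer.
temps = [s for s in fleet if isinstance(s, TemperatureSensor)]
print([s.sensor_id for s in temps])  # ['t1']
```

In an RDBMS, each new subtype generally forces a schema change or a sparse, wide table; in an object model, adding a variant is just another subclass.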
Where Are We Headed?
In Part 2 of this blog series, I’ll look at the dramatic changes in technical requirements that we are seeing and suggest some ways of attacking the problems that they are introducing.
CTMO and Founder