Welcome to Objectivity, Inc. -- makers of the industry leading Objectivity/DB object-oriented database management platform, Grid Certified (Levels 1 through 6), and SOA compliant Twitter LinkedIn YouTube RSS Feed

.

Saturday, November 7, 2009

DBMS Design Decisions

As with any complex software, designing and implementing a database management solution involves setting goals and making decisions and tradeoffs to meet those goals as well as you can. When we designed Objectivity/DB we set down ten major goals. The first four are relevant to this series of posts, namely - Reliability, Performance, Scalability and Distributability.

Transaction management, particularly controlling the in-memory state of data and the persistent data on disk, is fundamental to achieving reliability. However, writing every operation or database change to a separate log file can degrade performance. That becomes increasingly important with high throughput systems in constrained environments. Imagine a one Terabyte database in a satellite. If I start a transaction that scans through every object, updating a single field, then creating new objects and linking them to the original objects I may have to create a journal that is a Terabyte or more in size. What if I want to do even more in the transaction, such as using the old and new objects to generate even more objects using an iterative algorithm that may cycle through the objects many times? I may not have enough disk space in my constrained hardware environment to complete the transaction.

So, we decided to use a form of distributed, hybrid, shadow paging instead of a conventional log. The technique has actually been used in quite a few DBMSs as you can always predict the maximum amount of scratch space that a transaction will require. Updated database pages are written to a different physical location than the original logical page. The transaction commit operation safeguards the database by writing the old page map and the new page map to a separate log file. If all of the data remaining in memory is successfully flushed to disk and made persistent by calling fsync, or something similar, the new page map can also be made persistent and the log file can then be deleted. If the transaction is interrupted at any point, right up to the final removal of the log file, it can be rolled back by the owner process or another one that has sufficient permissions. The old physical pages can then be returned to a free file space pool. In the scenario I described above, we can write and rewrite the Terabyte of data many times and we'll never need more than two Terabytes plus a small log file.

Dealing with distributed databases slightly complicates the mechanism, as you may successfully update several database files on various machines and then fail to reach another one, so the whole transaction must be rolled back. We decided to use lock servers to help control transactions. If a client process dies while it is holding locks you can set a lock server to automatically rollback the transactions that the failed process owned.

Labels: , , ,

Lock Granularity

Performance is always an issue and perhaps the most visible trade-off was the decision to lock bunches of objects rather than individual objects. In an Objectivity/DB federation that translates to locking at the container level rather than the object level. Doing that dramatically reduces the number of lock requests. However, there's a potential downside - a long running update transaction could prevent other clients from reading the objects in the containers it is updating.

The solution was quite simple and it exploits the underlying shadow page recovery mechanism. When you start updating a transaction the Objectivity/DB Storage Manager, a major component of the kernel, creates a new version of the container. Updates and new objects are written to new physical pages. The old ones are untouched until the commit occurs. The introduction of the Multi-reader, One-Writer (MROW) transaction mode makes it possible for existing or new readers to access the previous version of the container while the new one is being written. In some cases the new version will be rolled back, freeing the old versions for updates by subsequent transactions from any client.

The picture is made slightly more complex because of the need to maintain a consistent view for long running readers. Suppose that process A opens version 1 of a container. Then process B starts updating it, creating version 2. Then process B commits the transaction while A is still running. Process C then opens the container and it sees version 2. That's not a problem unless processes A and C have to collaborate in any way that involves this container, as they are looking at different versions. When process A eventually finishes its transaction there will be two versions of the container, but only the latest one will be picked up by new readers. This mechanism can cause some version creep, with multiple versions of a container existing in the database file. To prevent that being a problem, the kernel deletes the old, not currently accessed, versions as soon as a new updater comes along.

Object server DBMSs generally log individual operations or page updates and lock at the object level. The problem with that approach is that you not only have more lock server traffic, you also run a big risk of deadlocks as clients navigate through complex networks of objects. The subsequent queueing can significantly degrade performance. I was very amused at the recent ICOODB conference to hear an engineer from another ODBMS supplier describing that object level locking within their system data had been causing bottlenecks and limiting server throughput. He then proudly described the system of latches (transient, within transaction locks) and data redesign that had allowed them to defer, but not eliminate the problem. What he was saying, out of the earshot of their sales and marketing people, was that their highly touted object level locking was actually hurting the performance of their server.

Every time that we assembled a list of desired features for new releases in our early days, the sales people would request "Object Level Locking." However, our users were finding that the performance of our distributed MROW implementation was much better than centralized servers with object level locking.

Labels: , ,

Thursday, October 15, 2009

What's Old is New Again - Database Sharding

There has been a lot of interest lately, from both the relational database and the "NoSQL" communities, in a technique known as sharding, which involves splitting data across multiple devices or servers to improve performance and scalability. Systems such as BigTable, Amazon SimpleDB, Hadoop, MapReduce and Cassandra employ sharding.

Why is it used?

As the number of rows in a relational table grows, performance suffers. This is primarily due to additional index node accesses, but it can also be related to file sizes plus server and disk I/O loads. Likewise, in a distributed file system, the index to particular file can grow to unmanageable sizes.

One solution for Relational Database Management System (RDBMS) users is to split a table into horizontal logical partitions, e.g. use a Country-Name key. If the tables are all stored in a single central server, the tables may need unique names. That complicates queries, as the specific table name has to appear in the SELECT statement. Some RDBMSs support transparent horizontal partitioning, generally allowing segmentation across disk drives. However, partitioning is relatively crude in most RDBMSs. You usually can't group rows from one table with more than one other table in an ad hoc fashion. A Product-Order record might be stored with the Product table, but you couldn't specify the Customer table as the clustering host instead. Sharding can be used to overcome this limitation. Hibernate, the Object-Relational mapping product, has an extension that can make this easier (but still tedious) to manage.

With file-based systems, the client, or a layer of middleware, needs to deal with the sharding, directing file requests to the appropriate node, or to a dedicated service. There's a nice description of the advantages and techniques involved in this article by Curtis Jackon.

The Old New Way

Fortunately, Objectivity/DB has always used sharding as a fundamental component of its distributed, federated, "Single Logical View" architecture. Users can cluster objects into pages in any way that suits the application and data, thereby overcoming the partitioning limitations of RDBMSs. The pages can be clustered into Containers, which can themselves be clustered into Databases. The placement (sharding) strategy can be determined dynamically or be coded into policies. Objectivity/DB's Release 10 will provide enhanced placement management capabilities. Even better, the "Single Logical View" that the federation provides allows transparent navigation of links between objects in different containers or databases, something that is complicated in RDBMS and file-based sharding schemes.

The inherent sharding can be exploited by Objectivity/DB's Parallel Query Engine (PQE). Imagine a federation that has a database for each country in the world. If I want to query all of the databases for countries in Europe, I can build a range splitter that translates "Europe" to the names (or OIDs) of the 49 corresponding databases. PQE will then direct up to 49 parallel queries to the appropriate query agents. If I decide to query the whole federation, PQE can query all of the databases simultaneously and marshal the results.

Flashback/FlashForward

I'm always amused at the way that the computing community continues to reinvent things. Dividing databases up into logical partitions (equivalent to sharding) was a core part of the CODASYL DBTG network database standard way back in 1971. I implemented that standard in the ICL 2900 IDMS product in 1975. The same principle (divide and conquer) influenced the architecture of Objectivity/DB in 1988. Twenty years later, the technique is "hot" again, rechristened "sharding". That reminds me. I should talk about time sharing... I mean "cloud computing"... in a later article.

Labels: , , , ,