Welcome to Objectivity, Inc. -- makers of the industry-leading Objectivity/DB object-oriented database management platform, Grid Certified (Levels 1 through 6), and SOA compliant. We are the leader in scalable database management solutions for mission-critical, real-time and distributed applications.
Object Oriented Database Learning Center

Architecture Issues - Object Oriented Databases vs Relational Databases


Next we'll turn to architectural differences between ODBMS and RDBMS. Technically, the terms "ODBMS" and "RDBMS" say nothing about architecture, but refer only to the information model (based on tables only, or supporting arbitrary structures and operations). However, in practice, the ODBMS and RDBMS products differ significantly in architecture, and these differences have a major impact on users. We'll see differences in clustering, caching, distribution, scalability, safety, integrity, and dynamic flexibility and tuning.

Figure: Customer Table Bottleneck

Since the RDBMS model is based entirely on tables, virtually all implementations are based on tables, too. For example, if there's a customer table, then all applications that need to access customers will collide, or bottleneck, on that one table. This bottleneck forces all such applications to wait for each other. The more users, the more applications, the longer the wait, defeating scalability. Instead, in an ODBMS, the customer objects may be separated and stored as desired; e.g., the customers in the USA may be placed in one database, on one server, while those in Asia may be placed in another database, on another server. Though access remains unchanged (see the discussion of distribution architecture, below), now those applications that access only USA customers (or only Asian customers), will not conflict at all, allowing them to run in parallel.

Figure: Caching is Up to 10^5x Faster

In addition to the above clustering flexibility, ODBMSs add the ability to cache. In an RDBMS, all operations are executed on the server, requiring interprocess (IPC, and usually network) messages to invoke such operations. The time for such IPC is measured in milliseconds. With an ODBMS, the objects can be brought from the server to wherever the application is executing, or cached directly in the application's address space. Operations within address spaces occur at native hardware speeds; with 100 MIPS, that means they're measured in hundredths of microseconds, fully 100,000x faster than IPCs, a gain of five orders of magnitude. Such overwhelming performance advantages are a big part of why ODBMSs can be much faster. Although the first operation, to cache the object, requires IPC and runs in milliseconds, any later operations on that object occur directly, 10^5x faster. This is typical of other operations, too, in the ODBMS, where the overhead is incurred in the first use, but then no extra overhead at all is suffered in later operations.
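The arithmetic behind this figure can be checked with a short sketch. It assumes the round numbers used above (roughly 1 ms per IPC round trip, and 100 MIPS, or about 10 ns per in-memory operation); the function and constant names are illustrative only.

```python
IPC_SECONDS = 1e-3                   # assumed: one IPC/network round trip, about 1 ms
MIPS = 100                           # assumed: 100 million instructions per second
LOCAL_SECONDS = 1 / (MIPS * 1e6)     # about 10 ns, one hundredth of a microsecond


def average_access_seconds(accesses: int) -> float:
    """Average cost per access when only the first access pays for IPC."""
    if accesses < 1:
        raise ValueError("accesses must be at least 1")
    total = IPC_SECONDS + (accesses - 1) * LOCAL_SECONDS
    return total / accesses


# Ratio between one IPC and one in-memory operation.
speedup = IPC_SECONDS / LOCAL_SECONDS
```

Under these assumptions the ratio comes out at about 10^5, and the average cost per access falls toward the in-memory cost as the same cached object is used repeatedly.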

Figure: Combined Clustering and Caching

Combining caching and clustering produces even greater benefits. The clustering discussed above can be applied across different types. While the RDBMS stores all the tuples of a given type together (in tables), the ODBMS adds the flexibility to cluster different types together. For example, together with a customer object, we might cluster the objects representing the products, services, or securities that customer has purchased. Then, when the customer object is accessed, the related objects come along into the cache for free. Later access to them occurs immediately at native hardware speeds.

In general, performance in a DBMS is limited by network transfers and disk I/Os, both occurring in the range of milliseconds. All other operations, measured in machine cycles, 10^5x faster, are a distant second-order effect. Caching and clustering, together, allow avoiding such slow network and disk operations, and can thereby dramatically, by orders of magnitude, increase performance.

The basic architectures of ODBMSs and RDBMSs differ, too, which brings us to distribution. RDBMSs are built around central servers, with all data, buffering, indexing, joining, projecting, and selecting occurring on the server. The user simply sends in a request and gets back an answer. This architecture is, in fact, the same architecture as the mainframes of the 1960s, where the user, on a 3270 IBM terminal, sends requests to the mainframe, which processes all requests, and then returns the answers. DBMS technology grew up on the mainframes and almost all DBMS architects have continued to follow that same, basic architecture.

Meanwhile, the world of computing has changed. Today, with powerful workstations, PCs, and high-speed networks, a typical corporate environment has far more computing power (MIPS) spread around on the desktops than it has in the central glass house. The mainframe architecture is unable to take advantage of all these computing resources. Some ODBMSs, however, have been built with a new generation of distributed architecture that can take advantage of all computing resources, by moving objects to whatever machines are desired and executing there.

Figure: RDBMS Central Server Bottleneck

To examine this in more detail, consider the central server architecture of virtually all RDBMSs. All users send requests into the server queue, and all requests must first be serialized through this queue. Once through the queue, the server may well be able to spawn off multiple threads, which can be quite useful, but first, to achieve serialization and avoid conflicts, all requests must go through the server queue. This means that, as new users are added, they wait longer and longer at that central queue. It becomes a bottleneck (different from the table bottleneck described above) that limits multiuser scalability. When the RDBMS starts to get too slow, all the user can do is buy a bigger server and hope that it can move requests through that queue faster. This is why RDBMS companies often encourage high-end users to buy very high-performance (and very expensive) parallel computers. Their mainframe architecture pushes them back towards mainframe machines.

Figure: Distributed Servers Architecture

Contrast this to a distributed ODBMS architecture (see figure). There are two main differences here to understand. First, the DBMS functionality is split between the client and server, allowing computing resources to be used, and, as we'll see, allowing scalability. Second, the DBMS automatically establishes direct, independent, parallel communication paths between each client and each server, allowing clients to be added without slowing down others, and servers to be added to incrementally increase performance without limit.

To see the client/server DBMS functionality split, consider first the server. Here, a page server moves a page at a time from the server to the client. This is advantageous because of the nature of network transport overhead; viz., the cost to send one byte differs little from the cost to send 1000 bytes. So, if there is any significant chance that any of the other 999 bytes will be used, it is a net gain to send the larger amount. A detailed measurement shows that any clustering overlap of greater than 10% yields better performance for page transport. Similarly, a measurement based on simulation of different architectures, by David DeWitt, Univ. of Wisconsin, Madison, shows the same advantage to page transport. In this picture, the page size may be set (tuned) by the user, different for each database, to achieve the optimum performance. Such pages often contain 100s of objects. Therefore, compared to traditional tuple-servers or object-servers that send one tuple or object at a time, this page server can be up to 100s of times faster.
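A rough cost model makes the page-transport argument concrete. The sketch below is illustrative only: the overhead and per-byte figures are assumed round numbers, not measurements, and it assumes perfect clustering (every object on a fetched page is one the application wants).

```python
import math


def messages_needed(objects_wanted: int, objects_per_message: int) -> int:
    """Number of server round trips needed to fetch the wanted objects."""
    return math.ceil(objects_wanted / objects_per_message)


def transfer_seconds(objects_wanted, objects_per_message, object_bytes=10,
                     per_message_overhead=1e-3, per_byte=1e-8):
    """Total time: a fixed overhead per message plus a small per-byte cost."""
    n = messages_needed(objects_wanted, objects_per_message)
    payload_bytes = n * objects_per_message * object_bytes
    return n * per_message_overhead + payload_bytes * per_byte


# Fetch 500 objects one at a time, then in pages of 100 objects.
object_server = transfer_seconds(500, objects_per_message=1)
page_server = transfer_seconds(500, objects_per_message=100)
```

Because the fixed per-message overhead dominates, the page server in this model needs 5 round trips instead of 500. With poorer clustering the advantage shrinks, which is why the overlap threshold mentioned above matters.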

On the client side is the Object Manager. This allows application access (and method invocation) to objects within their own address space. As described above, such operations are in hundredths of microseconds, fully 10^5x faster than the millisecond range of cross-process operations, such as those that must communicate to the server. This Object Manager component of the ODBMS also supports object-level granularity, supporting a means to add functionality, guarantee integrity, and control access at the level of every object. (A detailed description of this mechanism appears below, under the Integrity discussion.) For example, if a page with 1000 objects is moved to the client, but the client uses only 10 of those objects, then overhead is incurred for only those 10, a savings of up to 100x in this example. Such overhead includes swizzling and heterogeneity translations.

Also, because access is controlled at the object level, various operations may apply to one object and not to another; e.g., one may version one object, but not another.

The client-to-server communication is also significantly different in this architecture. Instead of all users communicating via a single, shared central-server bottleneck, independent parallel communication paths are provided between each client and each server. (These are set up transparently, on-demand, and dynamically managed, cached, released, and reused, just as any other dynamic resource such as memory or open files.) The effect is that a server may move pages to multiple clients simultaneously. If a new client is added, he will get his own, independent link, and the server will serve him and the previous clients entirely in parallel. With the central server bottleneck, adding clients necessarily forces other clients to wait. Here, however, adding new clients causes no extra waiting in other clients. The only time clients wait for each other is when they're attempting to access the same objects, in which case the usual locking protocols serialize them (see more, below, on concurrency). Otherwise, the result is scalability in clients; i.e., clients may be added freely without slowing down others.

This scalability continues until (one of) the server(s) begins to approach its performance limits (in cycles per second, or I/O, etc.). Even in this case, however, scalability can be maintained. The distributed ODBMS approach provides a single logical view of objects, independent of their location, database, server, etc. This logical view is transparently mapped to physical views. In particular, for full support of distribution, six criteria must be supported:

  1. Single Logical View of all Objects
  2. All operations work transparently across that view (e.g., atomic transactions, propagating methods, many-to-many relationships, etc.)
  3. Dynamic support for the single logical view (e.g., as objects are moved, applications continue to run)
  4. Heterogeneous support for the single logical view (e.g., objects and databases may be located on different hardware, networks, operating systems, and access via different compilers, interpreters, languages, and tools)
  5. Fault Tolerance (e.g., when faults occur, such as network or node failures, the remaining parts of the system continue to function and provide services normally, within the physical limitations of unavailable resources; this requires servers to automatically take over responsibility for missing servers, including recovery, locking, catalogues, and schema)
  6. Replication and Continuous Availability (which adds to fault tolerance by supporting replication of user objects, thereby allowing continued availability of that information even during fault conditions)

Note that this separation of logical view from physical requires that the DBMS make no assumptions about what resides or executes on the client or the server or any intermediate tier, but rather support object access and execution anywhere. For example, if the DBMS asserts that certain operations always take place on "the server," it is certainly not distributed, because the addition of a second server makes such a statement meaningless.

There are two benefits of this distribution that immediately apply. First, consider scalability. In traditional server-centered DBMSs (including RDBMSs and even some ODBMSs), when the server capacity limit is reached, all the user can do is replace the server with a more powerful one. This holdover from the mainframe architecture tends to push users back into mainframe-like, massive, expensive, and proprietary servers. Instead, in a distributed architecture, when a server's capacity is reached, the user may simply add another server, and move some of the objects from the first, overloaded server to the second. The new server will dynamically set up its own independent, parallel communication paths. Due to the transparent single logical view, clients and users see no difference at all. The entire system continues to run, and continues to scale. Further, this can be achieved with commodity servers (e.g., inexpensive NTs, rather than massively parallel super-computers). The result is scalability in servers as well as clients.

A second benefit of this distribution is flexibility. In the traditional server approach, each user (or application) connects to the DBMS by first attaching to a specific server, then requesting objects or tuples or operations. If, however, it is later desired to move some objects from one server to another, all such applications (or users) break, because the objects are no longer on the server they expected. Instead, the distributed single logical view approach insulates the users and applications from any such changes or redistribution of objects. All applications continue to run normally while online changes are made. This provides a new level of flexibility.
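The insulation described here comes down to resolving a logical object identifier to its current physical location on every access. The following is a minimal sketch of that idea, not any product's actual API; the `Catalog` class, its methods, and the server names are all hypothetical.

```python
class Catalog:
    """Hypothetical catalog mapping logical object ids to the server holding them."""

    def __init__(self):
        self._location = {}      # object id -> server name
        self._servers = {}       # server name -> {object id: value}

    def add_server(self, name):
        self._servers.setdefault(name, {})

    def store(self, oid, value, server):
        self._servers[server][oid] = value
        self._location[oid] = server

    def move(self, oid, new_server):
        """Relocate an object; callers keep using the same logical id."""
        old_server = self._location[oid]
        self._servers[new_server][oid] = self._servers[old_server].pop(oid)
        self._location[oid] = new_server

    def fetch(self, oid):
        """Resolve the logical id to its current location, then read."""
        return self._servers[self._location[oid]][oid]

    def where(self, oid):
        return self._location[oid]


catalog = Catalog()
catalog.add_server("server-1")
catalog.store("customer:42", {"name": "Acme"}, "server-1")
before = catalog.fetch("customer:42")

catalog.add_server("server-2")             # capacity added later
catalog.move("customer:42", "server-2")    # rebalancing while applications run
after = catalog.fetch("customer:42")       # same call, same result
```

The application code that calls `fetch` never names a server, so moving the object changes nothing from its point of view.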

The above-described caching in the client's address space raises an issue of integrity maintenance. In the traditional server architecture, all DBMS operations take place in a different process (and often a different physical machine). The operating system inter-process protection mechanism acts as a firewall, preventing any application bugs from inadvertently accessing and damaging the DBMS structures and buffers. With caching, along with the 10^5x performance advantage comes the potential that an application pointer bug could now directly access such DBMS structures, damage objects, and corrupt the database. In fact, with stored procedures, found in most RDBMSs, the same danger exists because those user-written procedures run directly in the DBMS process, and thus can freely destroy the DBMS.

Figure: Cache References (or Stored Procedures) can Corrupt Database

Most other systems (whether stored procedures on the server, or ODBMS methods on the server or client) allow user code to directly access object internals via raw pointers. Eventually, any such pointer will become invalid; e.g., if the object moves, is swapped out, and certainly after commit, because commit semantics require that the object be available for other users, to move to other address spaces and caches. Any programmer attempt to use such pointers when invalid will result in either a crash (segment violation), or, worse, will de-reference into the middle of another object, reading wrong values, writing into the middle of the wrong object, and corrupting the database. Any programmers familiar with pointer programming will recognize such pointer bugs; they're very common and very difficult to avoid. With a single-user program, they're not so serious. The programmer simply brings up the debugger, finds the bad pointer, and fixes the code. However, in a database shared among 100s or 1000s of users, the result can be a disastrous loss of critical, shared information. The loss may not even be noticed for days or weeks, after backups have been overwritten, and information is irretrievably lost.

Figure: Reference Indirection Guarantees Integrity

The Object Manager, however, avoids this problem at least for the most common cases of pointer errors, using the mechanism mentioned above, for object-level granularity. This is done by providing, transparently, one level of indirection. The application pointer de-reference looks exactly the same (e.g., objectRef -> operation() ), but underneath, automatically, the Object Manager traps the reference, does a test, and a second indirection. Instead of a raw pointer, this is what's often called a "smart pointer." A raw pointer given to the application belongs to the application, and may be stored in local variables, passed to functions, etc. In other words, once such a raw pointer is given out, it cannot be retrieved, and the DBMS has given up all control (and functionality). With a smart pointer, however, no matter what the application does with it, it always remains valid, because it always points to the intermediate (second) pointer (often called a handle). Since the application never sees that second pointer, it can automatically be maintained by the Object Manager, which automatically updates it whenever the object moves. Even after commit, the Object Manager can trap the reference at the handle, and as necessary re-fetch the object, lock it, cache it, perform heterogeneity transformations, then set up the second pointer, allowing operation to continue correctly and transparently.
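The test-and-second-indirection step can be sketched in a few lines. This is only an illustration of the general handle technique under simplifying assumptions (no locking, no heterogeneity translation, a dictionary standing in for the server); the class names `Handle`, `Ref`, and `ObjectManager` are hypothetical and not taken from any product.

```python
class Handle:
    """Intermediate (second) pointer, owned and updated by the object manager."""

    def __init__(self, oid):
        self.oid = oid
        self.target = None        # None means "not currently in the cache"


class ObjectManager:
    def __init__(self, server):
        self.server = server      # stands in for the database server: oid -> object
        self.handles = {}
        self.fetches = 0

    def ref(self, oid):
        handle = self.handles.setdefault(oid, Handle(oid))
        return Ref(self, handle)

    def resolve(self, handle):
        """The test and second indirection performed on each de-reference."""
        if handle.target is None:                       # test
            handle.target = self.server[handle.oid]     # re-fetch on demand
            self.fetches += 1
        return handle.target                            # second indirection

    def commit(self):
        """After commit, cached objects are released; handles stay valid."""
        for handle in self.handles.values():
            handle.target = None


class Ref:
    """What the application holds instead of a raw pointer."""

    def __init__(self, manager, handle):
        self._manager = manager
        self._handle = handle

    def __getattr__(self, name):
        # Invoked only for names not found on Ref itself, i.e. the object's own.
        return getattr(self._manager.resolve(self._handle), name)


class Customer:
    def __init__(self, name):
        self.name = name


manager = ObjectManager({"c1": Customer("Acme")})
customer = manager.ref("c1")
first = customer.name       # first use triggers a fetch
manager.commit()            # cached object is released
second = customer.name      # the reference is still valid; the object is re-fetched
```

The application keeps using `customer` across the commit without knowing that the underlying object was released and fetched again.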

The result of this Object Manager-supported handle indirection is to ensure integrity of every object reference, avoiding corrupt databases. The cost of this indirection is a test and pointer indirection, typically a couple of cycles on a modern RISC machine, a couple hundredths of microseconds. Measurements of actual programs have shown this to be insignificant except in cases artificially constructed to measure it, because object de-reference is usually followed by other operations (comparing values, calculation, method invocation) which are much larger than a cycle or two. In any case, the benefit is always there, guaranteed integrity of object references.

Returning to scalability, we see a second advantage to this handle indirection. In other approaches, with application-owned raw pointers directly into the object buffers, the DBMS is unable to swap objects out of virtual memory (VM). Once objects are brought into the cache, since the DBMS knows nothing of where the application has retained such pointers, it simply cannot swap them. As VM fills, thrashing occurs. Worse, every operating system has a hard limit to VM. In some it's 1 GB, in others as small as .25 MB, and in practice it's the size specified for the swap file. In swizzling approaches, the swap file actually fills up, not just from objects actually accessed, but also from all the objects referenced by accessed objects; i.e., it fills according to the fanout of the actual objects accessed. Once that swap file fills up, no more objects may be accessed, at least within that transaction.

Figure: Swapping: 2-way Cache Management Enables Security

Instead, with the handle indirection, the Object Manager may dynamically support cache management or swapping in both directions. It swaps out the objects no longer being referenced, making room in the cache for new objects the application wishes to access. This effective reuse of the cache is the key enabler for scalability in objects, and can determine the difference between linear (with the Object Manager) versus exponentially slow scaling.
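A bounded cache with swap-out can be sketched as follows. The least-recently-used eviction policy here is an assumption chosen for illustration (the text above says only that objects no longer being referenced are swapped out), and the sketch omits writing modified objects back to the server.

```python
from collections import OrderedDict


class SwappingCache:
    """Bounded client cache; least recently used objects are swapped out."""

    def __init__(self, server, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.server = server            # stands in for the database: oid -> object
        self.capacity = capacity
        self.cached = OrderedDict()     # oid -> object, oldest first
        self.fetches = 0

    def get(self, oid):
        if oid in self.cached:
            self.cached.move_to_end(oid)         # mark as recently used
            return self.cached[oid]
        if len(self.cached) >= self.capacity:
            self.cached.popitem(last=False)      # swap out the oldest entry
        self.cached[oid] = self.server[oid]      # swap in on demand
        self.fetches += 1
        return self.cached[oid]


server = {i: f"object-{i}" for i in range(1000)}
cache = SwappingCache(server, capacity=10)
values = [cache.get(i) for i in range(1000)]     # touch far more than fits
```

The application walks through 1000 objects while the cache never holds more than 10, so memory use stays flat instead of growing with the number of objects visited.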

Scalability is also supported by other capabilities that fit nicely with this distributed architecture, including dynamic tuning of lock granularity and a variety of concurrency modes. To examine the first, note that most DBMSs pre-define and fix the locking granularity. In some it's an object, in others a page, but the choice is forced upon all users. In some RDBMSs, the user may choose row or table locking, but row is almost always too slow, and table is almost always too coarse, creating too much blocking. This distributed architecture provides a means to dynamically tune the locking granularity. In the single logical view, locking is always (logically) at the object level. However, at runtime the ODBMS translates this into a physical lock at the level of a container, or a user-defined cluster of objects. When there are high-contention objects, they may be isolated into their own containers, to minimize blocking. On the other hand, when there are many objects often used together, then all of them (100s, 1000s, more...) may be placed into the same container, where they may all be accessed with a single lock. Since locks involve inter-process communication (milliseconds, 10^5x slower than normal memory operations), this can save 100s to 1000s of such slow messages and dramatically improve performance. In a production multiuser system, the optimum performance is achieved by tuning the granularity of such locks, somewhere between finer granularity (to minimize blocking) and coarser granularity (for higher performance). Such tuning is not restricted, as in RDBMSs, to table (or type). Instead, objects of different types may be freely clustered together into locking containers to maximize throughput and performance. Also, the distributed architecture, because of its single logical view, allows such tuning to be done dynamically, without changing applications, while the system is online, simply by measuring statistics and moving objects among containers.
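The translation from logical object locks to physical container locks can be illustrated with a small sketch. It models exclusive locks only and counts lock requests as a stand-in for slow inter-process messages; the `ContainerLocks` class and the container names are hypothetical.

```python
class ContainerLocks:
    """Maps logical object-level lock requests to physical container-level locks."""

    def __init__(self):
        self.container_of = {}     # object id -> container name
        self.owner = {}            # container name -> transaction holding the lock
        self.lock_messages = 0     # count of (slow) lock requests sent to the server

    def place(self, oid, container):
        self.container_of[oid] = container

    def lock(self, txn, oid):
        """Return True if txn holds the lock covering oid after this call."""
        container = self.container_of[oid]
        holder = self.owner.get(container)
        if holder == txn:
            return True                      # already covered, no message needed
        self.lock_messages += 1
        if holder is None:
            self.owner[container] = txn
            return True
        return False                         # blocked by another transaction


locks = ContainerLocks()
for i in range(1000):
    locks.place(f"part-{i}", "assembly")     # objects used together share a container
locks.place("hot-counter", "hot")            # high-contention object isolated

granted = all(locks.lock("T1", f"part-{i}") for i in range(1000))
other = locks.lock("T2", "hot-counter")      # different container, not blocked by T1
blocked = locks.lock("T2", "part-0")         # same container as T1's objects
```

Transaction T1 covers all 1000 parts with a single lock request, while the isolated high-contention object remains available to T2. Moving an object to a different container changes the trade-off without touching application code.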

Concurrency modes may also dramatically affect performance. While some RDBMSs allow these, not all do, and even those that do are limited to applying them to tables. With the availability of all locking primitives, including locks with and without wait, with timed waits, etc., sophisticated users may customize their concurrency to meet their needs. Prepackaged concurrency modes help users in common situations, the most popular of these being MROW, or Multiple-Readers-One-Writer. In most systems, whenever one user is writing an object (or tuple or record), all other users are waiting, taking turns, which slows down everyone. With MROW, up to one writer and any number of readers may run concurrently, simultaneously accessing the same object, with no blocking, no waiting, and hence much higher multiuser performance. Dirty reads, which allow readers to read what the writer is in the process of changing, give similar concurrency, but at the expense of lost integrity, because the readers may see partially changed, inconsistent information, dangling references, etc. Instead, MROW automatically keeps a pre-image of the object that the writer is writing, and all readers see this pre-image. Therefore, readers see a completely consistent image as of the previous commit. If the new writer commits, the readers have the choice of updating or not. For applications that find these semantics acceptable, and do more reads than writes, which describes by far most applications, MROW can be a huge performance boost.
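The pre-image idea behind MROW can be sketched for a single object. This is a single-threaded illustration of the visibility rule only, assuming a deep copy as the writer's private working image; real implementations also handle threads, multiple objects per transaction, and notifying readers of new versions. The `MrowObject` class is hypothetical.

```python
import copy


class MrowObject:
    """One object under MROW: readers see the last committed image."""

    def __init__(self, state):
        self.committed = state     # pre-image visible to readers
        self.working = None        # the single writer's private copy
        self.writer = None

    def begin_write(self, txn):
        if self.writer is not None:
            raise RuntimeError("only one writer at a time")
        self.writer = txn
        self.working = copy.deepcopy(self.committed)
        return self.working

    def read(self):
        """Readers never block and never see partial changes."""
        return self.committed

    def commit(self, txn):
        if self.writer != txn:
            raise RuntimeError("transaction is not the writer")
        self.committed, self.working, self.writer = self.working, None, None


account = MrowObject({"balance": 100, "history": []})
draft = account.begin_write("W1")
draft["balance"] = 70                            # half-finished update
seen_during_write = account.read()["balance"]    # reader still sees the pre-image
draft["history"].append(-30)
account.commit("W1")
seen_after_commit = account.read()
```

A reader arriving mid-update sees the consistent state from the previous commit rather than a balance that has changed without its matching history entry, which is the difference from a dirty read.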

Another capability, cross-transaction caching, can affect performance significantly. In most systems, at commit time the cache is flushed to server disk. Then, if some of the same objects are accessed in a later transaction, the DBMS must again fetch them from the server, costing (again) milliseconds. Instead, the Object Manager preserves the client-based cache, even after commit. If the application attempts to access any of those objects, the Object Manager does a timestamp check with the server(s), and if no one else has modified the objects in the meantime, the application is allowed to continue access to those objects directly in the cache, taking advantage of the 10^5x performance boost.
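A minimal sketch of this revalidation step follows. It uses a per-object version counter as a stand-in for the timestamp mentioned above, and the `Server` and `ClientCache` classes are hypothetical; the point is only that a cheap check replaces a full re-fetch when nothing has changed.

```python
class Server:
    def __init__(self):
        self.objects = {}       # oid -> (version, value)
        self.fetches = 0        # full object transfers
        self.checks = 0         # lightweight version checks

    def write(self, oid, value):
        version = self.objects.get(oid, (0, None))[0] + 1
        self.objects[oid] = (version, value)

    def fetch(self, oid):
        self.fetches += 1
        return self.objects[oid]

    def version(self, oid):
        self.checks += 1
        return self.objects[oid][0]


class ClientCache:
    """Cache that survives commit and is revalidated on the next access."""

    def __init__(self, server):
        self.server = server
        self.entries = {}        # oid -> (version, value)
        self.validated = set()   # oids already checked in the current transaction

    def read(self, oid):
        if oid in self.entries and oid not in self.validated:
            if self.server.version(oid) != self.entries[oid][0]:
                del self.entries[oid]        # stale: someone else modified it
        if oid not in self.entries:
            self.entries[oid] = self.server.fetch(oid)
        self.validated.add(oid)
        return self.entries[oid][1]

    def commit(self):
        self.validated.clear()               # keep entries, re-check next time


server = Server()
server.write("doc", "v1")
cache = ClientCache(server)
a = cache.read("doc")
cache.commit()
b = cache.read("doc")          # revalidated, not re-fetched
cache.commit()
server.write("doc", "v2")      # another client's update
c = cache.read("doc")          # stale copy detected and re-fetched
```

Only two full transfers occur across three transactions; the unchanged object in the second transaction is served from the client cache after a version check.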

Finally, replication can be a major boost in performance, not to mention reliability and availability, of large, multiuser systems. While RDBMSs can, in principle, provide some replication, they are hampered in two ways. First, their old central-server architecture makes implementation of replication difficult and slow (similar to the way it makes implementation of referential integrity difficult and slow). All user access is directly addressed to the server. Therefore, any replication must first go to that server, which then might be able to pass the request on to another server managing a replica, etc. Such multiple server interaction, required by the lack of a single logical view, turns a single access into multiple, slow (millisecond) inter-process communications. Second, the RDBMS lacks any knowledge of the appropriate units for replication. All that's available to the RDBMS server is tables of flat data, so that's all it can replicate, either full tables or full databases. The distributed ODBMS architecture, instead, solves both those problems. The distributed single logical view makes it transparent where the objects are located, so the system can directly access whatever replicas are available. Also, the definition of objects at multiple levels allows the system to replicate those objects the user wishes to replicate, rather than all of the database or all of a given type. Finally, unlike most RDBMS approaches, all replicas are automatically kept in sync dynamically, at each transaction commit, with guaranteed integrity even under faults (node and network failure), and with improved read performance and typically no slower write performance.

Many of the issues we've discussed have related to performance (avoiding the server bottleneck, 2-way cache management, dynamic tuning of locking granularity, etc.). We'll turn now to a brief discussion of integrity, flexibility, and extensibility.

Integrity, aside from basic mechanisms (such as the handle indirection, above), arises from users defining rules or constraints that must be observed. In an RDBMS, the ability to do this is limited to the level of flat, primitive data. Further, invocation of such user-defined rules is limited to the small set of predefined system-provided operations (delete, update, etc.), and requires writing stored procedures, which can be tricky for typical programmers. In the ODBMS, such integrity rules are directly implemented as methods (or operations) on any objects, at any levels. Such methods may implement any level of operation, from simple update to application-level (e.g., route cell-phone message among satellites, manufacturing process, etc.), with all the necessary interdependencies. Users may write (and change) such methods in whatever languages they like, from C++, to Java, Smalltalk, or even Visual Basic. The result is that a much higher level of integrity is supported. Rather than being limited to primitive data level integrity, the ODBMS supports integrity all the way to the application level.

We've seen such examples of flexibility and extensibility in terms of physically moving objects, adding servers, etc., all dynamically and transparently. It's significant, also, to note that the ODBMS also supports, with a large amount of transparency, logical changes, or changes to the schema. No matter how proficient the designer, systems almost always undergo change as designers find ways to improve the system and to add functionality. Such changes often change the application's data structures. In the RDBMS, the concept of a schema (user-defined data types) is limited to only one data structure, the table, with no operations or relationships. Applications must build their own structures and operations out of these primitives. When these change, the changes at the low level multiply, requiring manually re-doing the mapping. Often these changes will make existing databases in the field invalid, and require painful changes, such as dump and reload, not to mention rewriting of applications.

ODBMSs support the concept of schema evolution with instance migration. The user may define any structures (with associated operations and relationships). Each such type definition is captured in the ODBMS as a type-defining object. When the user changes a type, the ODBMS automatically recognizes this, analyzes the change by comparing to the in-database schema, and automatically, in most cases, generates code to migrate old instances of those types to the new type. The user may then decide whether to convert all such instances, only some (e.g., one database), none, or only convert objects on demand, as they're accessed via the new type. Application and system designers making such changes may customize such automatic migration by plugging in callback functions (e.g., to initialize new fields to some value calculated based on values of other fields). This may all be done dynamically, online, providing the ability to extend systems, add functionality, without breaking existing operations.
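The convert-on-demand option with a user callback can be sketched as follows. This illustration covers only the manual callback path; it does not attempt the automatic schema comparison and code generation described above. The `EvolvingStore` class and the field names are hypothetical.

```python
class EvolvingStore:
    """Stores each instance with the schema version it was written under."""

    def __init__(self):
        self.current_version = 1
        self.migrations = {}       # from_version -> function(dict) -> dict
        self.instances = {}        # oid -> (version, fields)

    def put(self, oid, fields):
        self.instances[oid] = (self.current_version, dict(fields))

    def evolve(self, migrate):
        """Register a schema change; `migrate` plays the role of the callback."""
        self.migrations[self.current_version] = migrate
        self.current_version += 1

    def get(self, oid):
        """Convert on demand: migrate an old instance when it is accessed."""
        version, fields = self.instances[oid]
        while version < self.current_version:
            fields = self.migrations[version](dict(fields))
            version += 1
        self.instances[oid] = (version, fields)
        return fields


def add_full_name(fields):
    # New field initialized from the values of existing fields.
    fields["full_name"] = f'{fields["first"]} {fields["last"]}'
    return fields


store = EvolvingStore()
store.put("c1", {"first": "Ada", "last": "Lovelace"})
store.evolve(add_full_name)                 # schema changes while data stays put
untouched = store.instances["c1"][0]        # still stored as version 1
migrated = store.get("c1")                  # migrated when first accessed
```

Existing instances are left alone until they are read through the new type, so the schema change itself requires no dump and reload.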

Finally, we'll close with a discussion of legacy integration. Few users have the luxury of throwing away all preexisting (legacy) systems and rebuilding from scratch. Instead, even as they wish to add new capabilities, new databases, and objects, they must continue to support and interoperate with these legacy systems. There are two main approaches to achieving this, the first based on SQL/ODBC, and the second on surrogate objects.

Figure: Legacy Integration with SQL and ODBC

Since the ODBMS now supports SQL and ODBC, user applications, tools, and even direct, interactive access may simultaneously access old, legacy RDBMSs alongside newer, distributed ODBMSs. SQL supports a common language already familiar to many, in use by many applications and many tools. The ODBMS support for SQL is automatic. A class or type appears in SQL as a table; an instance as a row in the table; an attribute, operation, and relationship all appear as columns. Support of existing applications and tools requires ODBMS support for full SQL, not just predicate query, including DDL (create-table creates a new object type), DML (insert, update, delete, via methods), and security (grant and revoke user or group access at the level of every attribute and every method). This last, in fact, allows the ODBMS to selectively enforce encapsulation by requiring certain users to access objects only via certain operations. ODBC support allows off-the-shelf use of any of the popular GUI tools, including Visual Basic, PowerSoft, Microsoft Access, Crystal Reports, SQL Windows, Impromptu, Forest and Trees, etc. All such access, via ODBC, can interoperate simultaneously against both the legacy RDBMSs and the new ODBMSs. This also allows leverage of existing user training. Users familiar with any of these tools or legacy systems may immediately access the new ODBMSs, and, over time, they can learn more about and gain more benefit from the objects.

Figure: Surrogate Objects Integrate Legacy Systems

While this last, SQL/ODBC, approach leverages existing systems and user training, the other common approach, surrogate objects, provides a simple, consistent object view for users who prefer objects. Both approaches may be used simultaneously. First, the designer of the new, object system creates a surrogate object that stands for legacy data. It's up to the user to decide how those objects appear. Although it's very easy to map a row to an object, etc., it may be preferable to design the future, ideal object view of all the systems, and then let the surrogate hide any complexity of mapping these to the legacy systems. This is achieved by implementing methods for the surrogate object type that can read and write the legacy database, sometimes doing so via legacy systems in order to continue the application-level integrity mechanism. Third-party products can help in doing this mapping from objects to RDBMSs. Even proprietary flat files or mainframe files (ISAM, etc.) can be supported in this way. Once someone has written and installed these surrogate object methods, the ODBMS makes these objects appear exactly as any other ODBMS objects. They become part of the single logical view, so users and applications can access them just as they do any other objects. As they're accessed, they go off and read or write the legacy systems, but that's all transparent to the object users. The result is that users in the new, object world see the simple, object view only, yet they have full access to the legacy systems. The legacy systems remain functioning unchanged. Over time, if and when it makes business sense, some of that legacy information can be migrated into native objects. Of course, doing so requires changing (rewriting) the legacy systems, but it may be done incrementally, as necessary, and all object users see no difference at all.
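A surrogate can be as simple as an object whose attribute accessors read and write the legacy store. The sketch below uses an in-memory SQLite database from Python's standard library as a stand-in for the legacy RDBMS; the `customers` table, its columns, and the `CustomerSurrogate` class are made up for illustration.

```python
import sqlite3


class CustomerSurrogate:
    """Object view of one row in a legacy table (table and columns are made up)."""

    def __init__(self, connection, customer_id):
        self._db = connection
        self._id = customer_id

    @property
    def name(self):
        row = self._db.execute(
            "SELECT name FROM customers WHERE id = ?", (self._id,)
        ).fetchone()
        return row[0]

    @name.setter
    def name(self, value):
        self._db.execute(
            "UPDATE customers SET name = ? WHERE id = ?", (value, self._id)
        )
        self._db.commit()


legacy = sqlite3.connect(":memory:")        # stands in for the legacy RDBMS
legacy.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
legacy.execute("INSERT INTO customers VALUES (1, 'Acme')")
legacy.commit()

customer = CustomerSurrogate(legacy, 1)
original = customer.name                    # read goes to the legacy table
customer.name = "Acme Corp"                 # write goes through to the legacy table
stored = legacy.execute("SELECT name FROM customers WHERE id = 1").fetchone()[0]
```

Code that uses `customer.name` does not change if the data is later migrated into native objects; only the surrogate's accessors would be replaced.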

