What's Old is New Again - Database Sharding
There has been a lot of interest lately, from both the relational database and the "NoSQL" communities, in a technique known as sharding, which involves splitting data across multiple devices or servers to improve performance and scalability. Systems such as BigTable, Amazon SimpleDB, Hadoop, MapReduce and Cassandra employ sharding.
Why is it used?
As the number of rows in a relational table grows, performance suffers. This is primarily due to additional index node accesses, but it can also be related to file sizes plus server and disk I/O loads. Likewise, in a distributed file system, the index to a particular file can grow to an unmanageable size.
One solution for Relational Database Management System (RDBMS) users is to split a table into horizontal logical partitions, for example by using Country-Name as the partitioning key. If the partitions are all stored on a single central server, each partition table may need a unique name. That complicates queries, as the specific table name has to appear in the SELECT statement. Some RDBMSs support transparent horizontal partitioning, generally allowing segmentation across disk drives. However, partitioning is relatively crude in most RDBMSs. You usually can't group rows from one table with more than one other table in an ad hoc fashion. A Product-Order record might be stored with the Product table, but you couldn't specify the Customer table as the clustering host instead. Sharding can be used to overcome this limitation. Hibernate, the Object-Relational mapping product, has an extension that can make this easier (but still tedious) to manage.
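To make the table-naming problem concrete, here is a minimal sketch of manual horizontal partitioning by a Country-Name key. It uses Python's built-in sqlite3 module purely for illustration. The `customer_<country>` naming scheme and the helper names `partition_table` and `build_select` are assumptions of mine, not part of any particular RDBMS.

```python
import sqlite3


def partition_table(country: str) -> str:
    """Map a Country-Name key to a (hypothetical) per-country table name."""
    # Keep only alphanumerics so the key is safe to embed in an identifier.
    safe = "".join(ch for ch in country.lower() if ch.isalnum())
    return f"customer_{safe}"


def build_select(country: str) -> str:
    """Build a query; note that the partition's table name leaks into the SQL."""
    return f"SELECT id, name FROM {partition_table(country)} WHERE name = ?"


conn = sqlite3.connect(":memory:")

# One uniquely named table per partition, all on the same server.
for country in ("France", "Germany"):
    conn.execute(
        f"CREATE TABLE {partition_table(country)} "
        "(id INTEGER PRIMARY KEY, name TEXT)"
    )

conn.execute(
    f"INSERT INTO {partition_table('France')} (name) VALUES (?)", ("Dupont",)
)

# The caller has to know the country before it can even name the table.
rows = conn.execute(build_select("France"), ("Dupont",)).fetchall()
print(rows)
```

Every query path has to go through something like `build_select`, which is exactly the bookkeeping that transparent partitioning is meant to hide.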
With file-based systems, the client, or a layer of middleware, needs to deal with the sharding, directing file requests to the appropriate node, or to a dedicated service. There's a nice description of the advantages and techniques involved in this article by Curtis Jackon.
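The routing step described above might look something like the following sketch, in which a thin client-side layer hashes the file path to choose a node. The node names and the `node_for` function are hypothetical, and simple hash-modulo placement is only one of several possible strategies.

```python
import hashlib

# Hypothetical storage nodes; a real deployment would load these from config.
NODES = ["node-a", "node-b", "node-c"]


def node_for(path: str, nodes=NODES) -> str:
    """Pick the node responsible for a file by hashing its path."""
    # A cryptographic digest gives a stable result across processes,
    # unlike Python's built-in hash(), which is salted per run for strings.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# The same path always routes to the same node.
print(node_for("/logs/2009/01/access.log"))
```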
The Old New Way
Fortunately, Objectivity/DB has always used sharding as a fundamental component of its distributed, federated, "Single Logical View" architecture. Users can cluster objects into pages in any way that suits the application and data, thereby overcoming the partitioning limitations of RDBMSs. The pages can be clustered into Containers, which can themselves be clustered into Databases. The placement (sharding) strategy can be determined dynamically or be coded into policies. Objectivity/DB's Release 10 will provide enhanced placement management capabilities. Even better, the "Single Logical View" that the federation provides allows transparent navigation of links between objects in different containers or databases, something that is complicated in RDBMS and file-based sharding schemes.
The inherent sharding can be exploited by Objectivity/DB's Parallel Query Engine (PQE). Imagine a federation that has a database for each country in the world. If I want to query all of the databases for countries in Europe, I can build a range splitter that translates "Europe" to the names (or OIDs) of the 49 corresponding databases. PQE will then direct up to 49 parallel queries to the appropriate query agents. If I decide to query the whole federation, PQE can query all of the databases simultaneously and marshal the results.
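The splitter idea can be sketched in a few lines of Python. This is not the Objectivity/DB PQE API. The `REGIONS` map, the in-memory `SHARDS`, and the functions `split`, `query_shard` and `parallel_query` are all stand-ins I have made up to show the shape of the technique: translate a region to a list of databases, query each one in parallel, then marshal the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical region-to-database map; a real one would list every
# European country database, not just three.
REGIONS = {
    "Europe": ["France", "Germany", "Spain"],
    "Asia": ["Japan", "India"],
}

# Stand-in shards: each "database" is just a list of records.
SHARDS = {
    "France": [{"name": "Dupont", "spend": 120}],
    "Germany": [{"name": "Schmidt", "spend": 80}],
    "Spain": [{"name": "Garcia", "spend": 200}],
    "Japan": [{"name": "Sato", "spend": 150}],
    "India": [{"name": "Rao", "spend": 90}],
}


def split(region):
    """The 'splitter': translate a region name to the databases to query."""
    return REGIONS[region]


def query_shard(db_name, predicate):
    """Run the predicate against a single shard (a stand-in for a query agent)."""
    return [row for row in SHARDS[db_name] if predicate(row)]


def parallel_query(region, predicate):
    """Fan the query out to every target shard and merge the results."""
    targets = split(region)
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        partials = pool.map(lambda db: query_shard(db, predicate), targets)
        # pool.map yields results in the same order as `targets`.
        return [row for part in partials for row in part]


big_spenders = parallel_query("Europe", lambda row: row["spend"] >= 100)
print([row["name"] for row in big_spenders])
```

Querying the whole federation is the same call with a splitter that returns every database name.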
Flashback/FlashForward
I'm always amused at the way that the computing community continues to reinvent things. Dividing databases up into logical partitions (equivalent to sharding) was a core part of the CODASYL DBTG network database standard way back in 1971. I implemented that standard in the ICL 2900 IDMS product in 1975. The same principle (divide and conquer) influenced the architecture of Objectivity/DB in 1988. Twenty years later, the technique is "hot" again, rechristened "sharding". That reminds me. I should talk about time sharing... I mean "cloud computing"... in a later article.
Labels: Distribution, performance, Reliability, Scalability, Sharding
