Objectivity/InfiniteGraph™ is a schema-ed database solution. Objectivity/InfiniteGraph users define their schema through their Java class definitions. The class fields are introspected when an object is persisted, and if the type is unknown to the database, the class and its fields are persisted as schema, along with the field values of that instance. In today’s NoSQL market, schema is so foreign that this reads like a confession. Schema can be restrictive, and at times it does make it difficult to write an application if:
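As a rough sketch of how this works, consider a plain Java class and the field introspection a persistence layer might perform on first encountering it. This uses only standard Java reflection; the class name and fields are my assumptions for illustration, and the actual InfiniteGraph persistence base classes and API are not shown here:

```java
import java.lang.reflect.Field;

public class SchemaIntrospection {
    // Hypothetical persistable type; in a real InfiniteGraph application
    // this would typically extend one of the product's base classes.
    static class City {
        String name;
        double latitude;
        double longitude;
    }

    public static void main(String[] args) {
        // Enumerate the declared fields, much as a persistence layer
        // might when it first sees this type and records its schema.
        for (Field f : City.class.getDeclaredFields()) {
            System.out.println(f.getType().getSimpleName() + " " + f.getName());
        }
    }
}
```

Once captured this way, the type and its fields become schema the database can consult on every subsequent read and write.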
- the schema is not well defined when the application code is developed, or
- the data is not uniform, where each “type” has a varying number of fields
But let’s be honest: there is an implicit schema even when an explicit schema is not required. This is forcing many schema-less NoSQL technologies to concoct a semblance of types through labels or groupings. Also, a schema-ed solution does not mean that an application’s schema is invariant and can never be changed; most schema-ed database technologies, including Objectivity/InfiniteGraph, support schema evolution. Flexible schema can be very useful, but as I will argue here, explicit schema, even though binding, can have many more advantages.
Disadvantages of Implicit Schema
In the book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod Sadalage and Martin Fowler write, “Schemaless is appealing, and it certainly avoids many problems that exist with fixed-schema databases, but it brings some problems of its own.” As they explain, if the schema is not explicitly defined, the schema can only be found in the application code, and there is no guarantee that it can be deduced from the code unless the code is well structured. For sufficiently complex applications, the burden is still on the application developer to clearly define the schema. Granted, the schema can evolve over time as the application code evolves, but the burden remains on the developer to clearly define the critical changes to the schema over time. Without a clear definition, the application risks languishing in its ineffectiveness.
Here is an example of an explicitly defined user schema for a graph database application that finds optimal routes for supply-chain trucking companies, taking variable weather and traffic conditions into account. Building an explicit schema model like the one below makes writing the route-recommendation application tremendously easier.
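A minimal sketch of such a model in plain Java might look like the following. The class and field names are my assumptions for illustration, not the actual application’s model:

```java
public class RouteSchema {
    // Vertex type: a city on the route network.
    static class City {
        String name;
        double latitude;
        double longitude;
        City(String name, double latitude, double longitude) {
            this.name = name;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    // Variable conditions, updated as new readings arrive.
    static class Weather { double precipitation; double temperature; }
    static class Traffic { double speed; int accidents; }

    // Edge type: a highway segment connecting two cities, carrying the
    // weather and traffic conditions used to score candidate routes.
    static class Highway {
        City from;
        City to;
        double lengthMiles;
        Weather weather = new Weather();
        Traffic traffic = new Traffic();
        Highway(City from, City to, double lengthMiles) {
            this.from = from;
            this.to = to;
            this.lengthMiles = lengthMiles;
        }
    }

    public static void main(String[] args) {
        City reno = new City("Reno", 39.53, -119.81);
        City boise = new City("Boise", 43.62, -116.20);
        Highway i80 = new Highway(reno, boise, 422.0);
        i80.traffic.speed = 62.5;
        System.out.println(i80.from.name + " -> " + i80.to.name);
    }
}
```

With the vertex and edge types fixed up front, the route-recommendation code knows exactly which properties every segment carries.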
Another disadvantage to implicit schema is that the schema, only being known to the application developer, would not be available to the database. Martin Fowler explains that without a schema, a database cannot use it to read and write more efficiently and perform “validations” on the data to ensure data integrity. On top of this, there are many more functions that a database can perform when the schema is explicitly defined.
Advantages of Explicit Schema
Along with maintaining data integrity (knowing what properties to expect when reading records of data), there are many more advantages to having an explicitly defined schema in a database.
Writing Data: Custom Placement
For big data applications, placing your data in the right way is the key to optimized performance for your application. If all of the data is clustered together in one location, then not only will your application be limited by the storage size of your machine, but high I/O and a contentious battle over locks may also put a stranglehold on the speed of your application. On the other hand, if the data is randomly distributed over a cluster of N storage hosts, simply looking up a group of logically related data (by type or by collection) would involve scanning each of the N hosts, which can be significantly hampered by network speed. This usually leads developers to create indexes to group related data together.
Indexes are secondary data structures that hold the value of a field and a pointer to the data record. Reusable indexes are typically persisted and can make reading data very fast. Apache Lucene™ is a very popular text-based index, and there are many types of collection-based indexes that are appropriate in different contexts. An index is only effective when it can perform lookups quickly, which may not be the case as it grows larger. Also, if related data gets flattened out when it is placed in indexes, you can lose metadata that exists (like how the indexed records are related); this can be mitigated by using multidimensional indexes.
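For illustration, an index of this kind can be sketched in a few lines of plain Java as a map from field values to record identifiers. The record type and indexed field are made up for the example:

```java
import java.util.*;

public class FieldIndex {
    // Illustrative record type; "depot" is the indexed field.
    record Truck(long id, String depot) {}

    public static void main(String[] args) {
        List<Truck> trucks = List.of(
            new Truck(1, "Reno"), new Truck(2, "Boise"), new Truck(3, "Reno"));

        // The index is a secondary structure: field value -> record ids
        // ("pointers" back into the primary data).
        Map<String, List<Long>> byDepot = new HashMap<>();
        for (Truck t : trucks) {
            byDepot.computeIfAbsent(t.depot(), k -> new ArrayList<>()).add(t.id());
        }

        // A lookup now touches only the matching records instead of
        // scanning the whole collection.
        System.out.println(byDepot.get("Reno")); // prints [1, 3]
    }
}
```

Note that the map duplicates the depot values and must be kept in sync with the records, which is exactly the maintenance cost the next section avoids.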
Alternatively, read performance can be optimized without creating index structures by placing the data in a custom way. In other words, custom placement can mimic external indexes by grouping related data together. Placement doesn’t have to be updated when you add or remove objects, and metadata is preserved and used by the placement management system. Also, no secondary data structures are created, and therefore no data is duplicated. With indexes, it becomes unmanageable and inefficient to create too many of them, but with custom placement no index creation is required, and you can create as many groups as makes sense to optimize read/write performance. Finally, by using custom placement, you can layer native indexes on top of logical placement to get even more optimized read behavior.
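The contrast with an index can be sketched the same way: rather than building a secondary structure, placement chooses a container for each object at write time, so a later read of a logical group opens one container instead of scanning every host. The region-based container naming is an assumption for the example:

```java
import java.util.*;

public class CustomPlacement {
    record Stop(String region, String city) {}

    public static void main(String[] args) {
        // At write time, each object is routed to a container chosen
        // from its own data; nothing is duplicated into a secondary index.
        Map<String, List<Stop>> containers = new HashMap<>();
        for (Stop s : List.of(new Stop("west", "Reno"),
                              new Stop("east", "Albany"),
                              new Stop("west", "Boise"))) {
            containers.computeIfAbsent(s.region(), k -> new ArrayList<>()).add(s);
        }

        // Reading one logical group touches a single container.
        System.out.println(containers.get("west").size()); // prints 2
    }
}
```

The objects themselves live in the container; the grouping is a property of where they were written, not of an extra structure that must be maintained alongside them.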
Using explicitly defined schema, Objectivity/InfiniteGraph’s placement system can not only place data in a custom way but also, having placed the data, let the query engine use the custom placement model to read the data efficiently. Also, by defining your placement model as you develop your data model, you take the burden of managing placement from the application developer and give it to a system administrator, where it seems most logical. We also support native indexing that works with our managed placement system, which allows for the highest degree of read and write performance on our data. Finally, we also support versioning of schema models, which allows the system administrator to tweak the placement of the data over time, along with tooling to migrate the data from the old placement model to the new one. For more information about customizing the placement of your data, see the page on Customizing Storage on the InfiniteGraph Developer’s Site.
Reading Data: Life is like a box of chocolates…
When the schema is known to the database, the data record is more clearly understood. Instead of being just a collection of properties, the data record is an instance of a class, or an object, which is clear to any object-oriented language developer. When using an object database technology like Objectivity/InfiniteGraph, metadata like class hierarchies is preserved through the persistence of the schema. Finally, there are also more complex schema designs you can build, including various collections and relationships.
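A quick illustration of that hierarchy metadata in plain Java (the types here are hypothetical): a subtype’s relationship to its base type travels with the class itself, so a persistence layer that stores the class can recover it.

```java
public class HierarchyDemo {
    static class Road { double lengthMiles; }
    static class Highway extends Road { int lanes; } // hypothetical subtype

    public static void main(String[] args) {
        // The base-type relationship is part of the class metadata.
        System.out.println(Highway.class.getSuperclass().getSimpleName()); // prints Road
        // An instance still answers for its base type.
        Road r = new Highway();
        System.out.println(r instanceof Highway); // prints true
    }
}
```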
Reading Data: Query Management
When schema is defined through the database, queries are more intuitive to write, and finding the data is at its most optimal when the placement model has been developed alongside the data model. Imagine SQL, the query language of relational databases, without schema. For many newer NoSQL technologies, query languages are what is driving them to adopt a kind of pseudo-type, so that they can more closely align with what developers expect when they write a query. With schema and a custom placement model, a query engine with knowledge of the placement model can find the data quickly because it knows where the data has been placed. This may be the best argument for explicit schema, because it leads to the most optimal read performance and the flexibility to write queries.
Reading Data: Graph Data
Distributing data can be highly impractical if you are distributing it uniformly at random and navigating across multiple degrees of connectedness. This is because you are bouncing back and forth across systems, limited again by slow network speeds and unable to use a single cache.
This can be largely mitigated by using custom placement and putting highly connected subgraphs in one location, but the speed of distributed navigation can still be improved, because subgraphs can never be completely isolated in highly connected databases. One way that Objectivity/InfiniteGraph exploits the explicit schema is by letting users define filtered views, called GraphViews, that can be used in the context of a navigation.
Graph views can be used to disqualify paths that contain undesired types, and they can infer, using knowledge of schema relationships, that certain paths cannot lead to any valid results. This makes it possible for the navigation engine to bypass a lot of work. Because type is held as a first-class property, the navigation engine does not need to open an object to determine whether it qualifies. Graph views are configured by excluding types, or by excluding a base type and including a derived type (excluding/including types can be done with an optional predicate as well). Graph views are a simple and intuitive way to perform path qualification when navigating graph databases.
GraphView myView = new GraphView();
myView.excludeClass(myGraphDb.getTypeId(Highway.class.getName()),
    "(weather.precipitation > precipitationX && weather.temperature < temperatureX)"
    + " || traffic.speed < speedY || traffic.accidents > accidentsY");
myView.excludeClass(myGraphDb.getTypeId(City.class.getName()), "latitude >= Z");
Schema: What’s Next?
What could be better than the flexibility to add or remove optional properties at runtime while still getting the added benefits of having a base explicit schema available for the database to use? This is called a hybrid schema solution, and it is what Objectivity/InfiniteGraph is working towards.
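One way to picture such a hybrid, sketched in plain Java (this is my illustration, not the planned InfiniteGraph design): fixed fields form the explicit base schema the database can rely on, while a property map carries optional attributes added at runtime.

```java
import java.util.*;

public class HybridRecord {
    // Explicit base schema: fields the database always knows about.
    final String name;
    final double latitude;
    // Flexible part: optional properties attached at runtime.
    final Map<String, Object> extras = new HashMap<>();

    HybridRecord(String name, double latitude) {
        this.name = name;
        this.latitude = latitude;
    }

    public static void main(String[] args) {
        HybridRecord city = new HybridRecord("Reno", 39.53);
        city.extras.put("timezone", "PT"); // added without a schema change
        System.out.println(city.name + " " + city.extras.get("timezone")); // prints Reno PT
    }
}
```

The database can still place, index, and validate on the fixed fields, while applications remain free to decorate records with properties the schema never anticipated.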
For more information about custom placement, see an earlier blog post on writing a custom placement model. For more information about Objectivity or InfiniteGraph, feel free to visit our website or contact Objectivity support at firstname.lastname@example.org. Happy Trails!