Traversing a large highly connected dataset can cause you major headaches. This is a problem for anyone who wants to analyze graph data. Consider the Internet Movie Database (IMDB) dataset. Imagine that we want to find all connections between two actors. Consider, for example, actors Luke Wilson and Kevin Bacon. An example of a path with the shortest length (1) between these two actors would be [Luke Wilson]-[Telling Lies in America]->[Kevin Bacon]. We might only be interested in the shortest path, but to find it, we might have to traverse a massive amount of data. When considering these kind of queries, it would be prudent to ponder a few questions:

  1. How large is my search space and what can I do to make it as small as possible?
  2. How do I define what qualifies as a result within the search space?

By considering these questions, you begin to frame what the requirements for navigational performance might be. Demanding real-time results requires that you find ways to filter large search spaces. If nothing is done to limit the set of available paths, then your navigation engine must traverse them all. For large connected datasets, this mean that you will attempt to qualify every single data point especially if your qualification algorithm is complex. This is not only inefficient but is unnecessary. In order to reduce the search space, you use ‘path qualification’ techniques. Path qualification involves examining a path and either continuing along it for further traversal or filtering/pruning it.

Consider a massive set of CDRs (Call Data Records). Maybe you don’t care to search through any phone records and instead choose to focus only on SMSEdges and EmailEdges. By filtering out edges of type PhoneRecordEdge, you avoid traversing a huge amount of data. Here is a visualization of all paths 3 degrees deep from a given mobile number with all of the edges of type PhoneRecordEdge highlighted. As you can see, many subgraphs would be avoided by disqualifying these edges as part of the navigation. Note: Because many of these subgraphs are connected by two different edges to the start vertex, you won’t exclude the subgraphs, but you will avoid traversing all of them twice for both the SMSEdge and PhoneRecordEdge and instead traverse them only once.

Path Pruning: Filtering out edge types

Path Pruning

Some might think that large-scale navigation performance is not a problem for graph technologies, but this is not uniquely a database problem, it is a data problem. There is no way to traverse a large number of paths in a highly-connected dataset without taking time. The larger the number of paths, the longer it will take. If the database provider is interested in performance AND usability, they will often offer ways to subvert the massive amount of work necessary to execute this query.


Consider the path described here: [Luke Wilson]-[Old School]->[Will Ferrell]-[Step Brothers]->[John C. Reily]-[The River Wild]->[Kevin Bacon] in which Luke Wilson is separated from Kevin Bacon by 3 degrees of separation. Since, on average, every actor in Hollywood is separated from Kevin Bacon by about 3 degrees, a query that traverses to that depth would hit virtually every actor in the entire IMDB dataset. If you wanted to search the entire dataset, this is no problem. InfiniteGraph even has a policy called MaximumMemoryUsePolicy that can limit the amount of memory that you use on any particular navigation call, so that you exit the navigation cleanly instead of overwhelming your system by consuming all available RAM.

On the other hand, if you want to find paths from Kevin Bacon to Luke Wilson and just want the shortest path or shortest five paths, but neglected to filter your search space, you are going to suffer some unnecessary grief. To avoid the pain associated with traversing this unending data maze, you need to mitigate this problem. InfiniteGraph provides tools specifically designed to perform path qualification so that you do not have to traverse data that is irrelevant or out of the bounds of your query, while at the same time managing to avoid bringing a massive set of data into memory. In summary, here are a few ways (not exhaustive) to reduce the search space associated with your navigational query:

  • Graph Views: Filter out vertex and edge types, as well as in IG 3.2, exclude subpaths,
  • Policies: Use navigation configuration settings to tune navigation such as by setting maximum path depth,
  • Custom Path Qualification: Usr custom logic written in Java to determine whether vertices should be qualified.
  • Distributed Query: Divide the work across multiple processors.

Note: For best performance, many of these techniques can be used in combination. For example, GraphView and Policies are commonly used together.

Path Qualification: GraphViews

Graph views are user-defined constructs that identify filtered types. These graph views are used in the context of a navigation to perform path qualification. You can choose to do any or all of the following:

  1. Exclude vertex or edge types, optionally with a predicate.
           GraphView myGraphView = myGraphDb.createGraphView();

    The navigator that uses this GraphView will discontinue traversal down a path when it discovers an instance of a Person vertex.

  2. Exclude a base vertex or edge type, then include one or more derived vertex or edge type(s), optionally with a predicate.
          GraphView myGraphView = myGraphDb.createGraphView();

    The navigator that uses this GraphView will discontinue traversal down a path when it discovers an instance of any vertex types except Person.

  3. Exclude a sub-path composed of a sequence of vertex and edge types using the Path Match Syntax.
          GraphView myGraphView = myGraphDb.createGraphView();

    The navigator that uses this GraphView will discontinue traversal down a path that contains any instances of this sub-path.

Because a GraphView disqualifies paths, the use of a GraphView in a navigation lets you completely avoid doing “unnecessary” work. The navigator will perform faster and leave a smaller memory footprint.

Path Qualification: Policies

Policies are an easy way to globally configure the behavior of the navigation. Several policies are useful for performing different versions of path qualification, including limiting the depth of the navigation (MaximumPathDepthPolicy) or limiting the number of results that can be retrieved (MaximumResultCountPolicy). Policies like these make it easy to perform a very powerful and rich navigational query without suffering the burden of waiting for it to return. The following list outlines the various policies that perform path qualification:

  • MaximumPathDepthPolicy: limits the depth of navigation from the starting vertex
  • MaximumResultCountPolicy: limits the number of results returned by a Navigator
  • MaximumBreadthPolicy: limits the maximum size of any level for a breadth-first navigation
  • FanoutLimitPolicy: controls the maximum number of outgoing edges that may be traversed from any node during a navigation
  • EdgeTraversalPolicy: controls the kinds of edges that may be traversed during a navigation (OUTGOING and BIDIRECTIONAL are the defaults)
  • MaximumMemoryUsePolicy: limits the memory used for navigation processing
  • PathDepthBeyondShortestPolicy: dynamically limits the depth of navigation from the shortest known result path at any point during the navigation

These can each be used to customize and improve the performance of your navigation by reducing the search space or pre-emptively completing the navigation based on a condition. The following example shows the before and after view of a large dataset pruned from a maximum path depth of 5 to a maximum path depth of 3. The results are signficant!

Policies: MaximumPathDepthPolicy set to 5

Max Result Count of 5

Policies: MaximumPathDepthPolicy set to 3

Max Result Count of 3

Path Qualification: Built-In Qualifiers

There are many default implementations of a Qualifier that you can use for path qualification. These built-in qualifiers are compatible with the native engine, so they will be more performant, in general, than using a custom qualifier in Java to qualify paths. Here is a list:

  • VertexPredicate: Built-in qualifier type that accepts an optional predicate string for qualifying vertices
  • EdgePredicate: Built-in qualifier type that accepts an optional predicate string for qualifying edges
  • VertexTypes: Built-in qualifier type that qualifies paths that contain vertices of the specified types
  • EdgeTypes: Built-in qualifier type that qualifies paths that contain edges of the specified types
  • AND, OR, and NOT: An expression evaluator for combining multiple qualifiers (built-in OR custom) into a boolean AND, OR, or NOT expression, respectively

Path Qualification: Custom Path Qualification

InfiniteGraph lets you customize the qualification of paths, conditionally deciding whether to keep traversing the path based on your own logic. This can be done by implementing the Qualifier interface and passing it into a call to create a Navigator. The Qualifier requires the user to implement only one method: Given the current path, return true if you want to qualify it or false otherwise.

public boolean qualify(Path currentPath);

There are many things that you can do with a custom path qualifier. (Many things that some used to do with a custom path qualifier can now be done with a policy or graph view. For example, limiting the path depth using the MaximumPathDepthPolicy, or excluding types or sequences of types (a.k.a. a path) using a GraphView.) The following shows an example of a simple custom path qualifier implementation that qualifies the path based on whether each vertex in the given path has at least some given number of edges of a certain given type.

public class MinEdgeTypeQualifier implements Qualifier
    public int mEdgeCount;
    private long mTypeId;

    public MinEdgeTypeQualifier(int minEdgeCount, long targetTypeId)
        mEdgeCount = minEdgeCount;
        mTypeId = targetTypeId;
    public boolean qualify(Path currentPath)
       Hop hop = currentPath.getFinalHop();
       VertexHandle vertexH = hop.getVertexHandle();
       // if total edges are less than required target count, return false
       if(vertexH.getEdgeCount() >= mEdgeCount)
          int targetCount = 0;
          for(EdgeHandle edgeH : vertex.getEdgeHandles())
              // if number of target edges found >= reqd count, return true
              if(targetCount >= mEdgeCount)
                  return true;
              // if edge has reqd type, iterate count
              if(edgeH.getTypeId() == mTypeId)
       return false;

This implementation will only traverse paths where each vertex in the path has at least the given instances of the given type of edges connected to it. This is a pretty specific example of a custom qualifier, but that is why one would use a custom implementation because path qualification for most of the common use cases will be covered by the built-in qualifiers, policies, and/or the graph view.

To use your custom qualifier as a path qualifier, you need to pass it in as a parameter in the creation of the navigator. Using the following method:

    public Navigator navigate(GraphView view, Guide guide, Qualifier pathQualifier,
	    Qualifier resultQualifier, PolicyChain policies, NavigationResultHandler resultHandler);

Note that most of the built-in qualifiers, policies or graph views will be more performant than the custom implementation of the path qualifier because they can be passed down and executed within our native C++ navigation engine rather than in Java. Also, if you need to use a custom qualification implementation, you can optimize the performance of the navigation by executing it in Java using the UseJavaOnlyNavigator policy.

Path Qualification: Distributed Query

If the size of your search space cannot be pruned using the previous approaches and/or the size of your search space is still too massive for one processor to handle, there is a way to distribute the load of navigating this large search space using InfiniteGraph’s distributed query server.

First, it is assumed that the data is distributed. To learn more about custom placement strategies, you can visit the InfiniteGraph Developer Site or read other blogs on customizing placement or distributing your data. Without distributing the data, the query cannot be distributed.

Second, the query server, ooqs.exe, must be running on each host in the cluster. The query servers can communicate with each other. A query server on the host from which the navigation was started can initiate work to be done on a remote host. If all of the data that is to be searched is on one machine, using distributed navigation will not help (but it doesn’t hurt either).

Third and finally, you must use the UseDistributedNavigation policy so the particular navigation that you are executing is managed by the query server. The query server is multi-threaded so that pieces of the navigation are executed in parallel. Small amounts of work can be duplicated, but during the gathering stage, all of the data is coalesced and the duplicate results are eliminated. Distributing the navigation to multiple hosts will significantly improve the overall performance of your query.


Our navigation engine is one of our finest features, and we hope that you get a chance to use it to perform some dynamic analysis of your data. As always if you have any feedback or questions, feel free to check out our google group and for more information about Objectivity or InfiniteGraph, feel free to visit our website or contact Objectivity support at Happy Trails!

Share on FacebookTweet about this on TwitterShare on Google+Share on LinkedIn