Graph Analytics with Billions of Daily Financial Transaction Events
In this first blog of the series we will explain how ThingSpan can deliver solutions to meet the demanding needs of today’s complex analytics systems by combining big and fast data sources. In particular, we will focus on ThingSpan’s capabilities to scale up as data volumes increase, and scale out to meet response time requirements.
The requirement for this use case was to ingest one billion financial transaction events in under 24 hours (about 12,000 transactions a second) while performing complex query and graph operations simultaneously. ThingSpan proved able to meet this ingest requirement while running these complex queries concurrently, at speed and scale. It is important to note that the graph in this case is incrementally extended as new events are ingested. In this way the graph is always up to date with the latest data (nodes) and connections (edges), and is always available for query in a transactionally consistent state, what we call a living graph.
We used Amazon EC2 nodes for the tests and found that "m4.2xlarge" instances gave the best balance of scalability, performance, and throughput.
We surpassed the performance requirements using a configuration of 16 "m4.2xlarge" nodes. For this test we used the native POSIX file system.
Full details of the proof of concept can be downloaded from:
In our example use case, approximately one billion financial transaction events occur each day. Each event produces a subgraph that creates or merges 8-9 vertices and creates 19 new edges. The relationships between the entities in each transaction event are created instantly as edges between the vertices.
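As a rough sketch, each event can be mapped to a small subgraph of vertices and edges. The field names and entity types below are illustrative assumptions, not ThingSpan's actual schema, and the subgraph is a simplified subset of the 8-9 vertices and 19 edges described above:

```python
def event_to_subgraph(event):
    """Map one transaction event to (vertices, edges) for the graph.

    Vertices are (type, key) pairs; edges are (src, label, dst) triples.
    """
    vertices = {
        ("Account", event["sender"]),
        ("Account", event["receiver"]),
        ("Bank", event["sender_bank"]),
        ("Bank", event["receiver_bank"]),
        ("Transaction", event["tx_id"]),
    }
    edges = [
        (event["sender"], "SENT", event["tx_id"]),
        (event["tx_id"], "RECEIVED_BY", event["receiver"]),
        (event["sender"], "BANKS_WITH", event["sender_bank"]),
        (event["receiver"], "BANKS_WITH", event["receiver_bank"]),
    ]
    return vertices, edges

event = {"tx_id": "T1", "sender": "A1", "receiver": "A2",
         "sender_bank": "B1", "receiver_bank": "B2"}
vertices, edges = event_to_subgraph(event)
```

Because vertices are keyed (an account seen in a later event merges with the existing vertex), replaying events grows the edge set while deduplicating the entities.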
With ThingSpan, the data is available to be queried immediately after the subgraph produced by the financial transaction event is added to or merged with the existing graph. This allows real-time discovery of insight, or signal, to be divined from the apparent noise in the flood of events.
To achieve the ingest performance requirements we leveraged ThingSpan's distributed architecture to process multiple streams of input data in parallel. We also leveraged open source tools: Kafka and Samza for the queuing and processing of events, and YARN for distributed resource management.
Transaction events are read from a Kafka topic, but they could just as easily have been read from other streaming technologies such as Spark Streaming, Flume, or StreamSets. Each event is formed into vertices and edges. The edges are further decomposed into triples to reduce lock contention and allow the edges to be processed in parallel. This lowers the latency of each operation and increases throughput.
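The decomposition step can be sketched as splitting each edge into two endpoint-keyed records, so the writes to the two vertices never contend for the same lock. This half-edge representation is an assumption for illustration, not ThingSpan's internal format:

```python
def decompose_edge(src, label, dst, edge_id):
    """Split one edge into two triples, each keyed by a single vertex.

    Because each triple touches only one vertex, the two writes can be
    routed to different parallel tasks without sharing a lock.
    """
    return [
        (src, ("out", label, edge_id)),  # written under src's key
        (dst, ("in", label, edge_id)),   # written under dst's key
    ]

triples = decompose_edge("A1", "SENT", "T1", "e42")
```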
The upsert of vertices and the insert of edges (decomposed to triples) are funneled into Samza tasks running on the cluster and managed by YARN. These upserts are consistent and idempotent.
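Idempotency matters because a Samza task may replay an event after a failure; applying the same upsert twice must leave the graph unchanged. A minimal sketch, with a plain dict standing in for the distributed store (real tasks would write through ThingSpan's API):

```python
store = {}

def upsert_vertex(store, key, attrs):
    """Create the vertex if absent, then merge in the attributes."""
    vertex = store.setdefault(key, {})
    vertex.update(attrs)
    return vertex

upsert_vertex(store, "A1", {"type": "Account"})
snapshot = dict(store["A1"])
upsert_vertex(store, "A1", {"type": "Account"})  # replayed event
assert store["A1"] == snapshot  # idempotent: replay changes nothing
```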
ThingSpan runs queries in parallel. Each query is partitioned into parts, and a part, or partition, of the query is sent to each machine where it is executed as a YARN job. The query returns multiple paths from each partition, and these are collated into a single result. In the Spark world, this process can be described as transforming (mapPartition) each input partition into an RDD or DataFrame.
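The partition-and-collate flow above can be sketched in plain Python, standing in for Spark's mapPartitions; the query predicate and data here are hypothetical:

```python
def run_partition(partition):
    """Per-machine query part: return the matching paths in this partition."""
    return [path for path in partition if path[-1] == "T1"]

# Each inner list plays the role of one partition on one machine.
partitions = [
    [("A1", "SENT", "T1"), ("A2", "SENT", "T9")],
    [("A3", "SENT", "T1")],
]

# Collate the per-partition path lists into a single result.
result = [p for part in partitions for p in run_partition(part)]
```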
Using Spark DataFrames allows results from ThingSpan to be processed even further. Spark SQL statements can join, aggregate, and select from multiple tables. DataFrame operations are processed in parallel across the cluster.
The parallelism of queries allows near linear scaling of query throughput by “scaling out” the cluster.
For the initial results we focused on reaching the required throughput with the most efficient configuration possible. Ingesting one billion events was achieved with 16 m4.2xlarge nodes in 17 hours 30 minutes.
We can use these results to show near linear scale up as the volume of data increases and scale out to achieve better throughput performance.
The blue bars represent the time to ingest 250 million events: doubling the number of compute nodes halves the ingest time. You can see the same effect with the red bars for 500 million events.
Also, if you compare the red and blue bars at each of 4, 8, and 16 nodes, you will see that for the same number of nodes, doubling the number of events roughly doubles the time to process them.
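The scaling claim amounts to ingest time being roughly proportional to events divided by nodes. A quick worked check, calibrated only on the measured 16-node, one-billion-event run (the other values are extrapolations under the near-linear assumption, not measurements):

```python
# Per-node rate derived from the measured run: 1e9 events on 16 nodes
# in 17.5 hours.
RATE_PER_NODE = 1e9 / (16 * 17.5)  # events per node-hour

def ingest_hours(events, nodes, rate=RATE_PER_NODE):
    """Predicted ingest time assuming near-linear scaling."""
    return events / (nodes * rate)

t16 = ingest_hours(1e9, 16)  # reproduces the measured 17.5 hours
t32 = ingest_hours(1e9, 32)  # doubling nodes -> about 8.75 hours
t_half = ingest_hours(5e8, 16)  # halving events -> about 8.75 hours
```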
The results show that ThingSpan scales near linearly as the volume of data increases and as the number of compute nodes is increased.
ThingSpan exceeded the requirement of ingesting one billion transaction events in under 24 hours (actually 17 hours and 30 minutes) while simultaneously querying the graph.
Stay tuned for future blogs that will highlight ThingSpan’s performance as we scale out the number of compute nodes.