There are many reasons why Spark is fast becoming the de facto standard for large-scale data processing. While its clever use of in-memory computing, optimized execution, and built-in machine learning libraries are often cited as reasons for its meteoric rise in popularity, it’s the way that Spark has embraced structured data and external sources that I find particularly impressive.
No matter what your reason for using Spark, it will almost certainly involve reading data from external sources. A common use case is to consume large quantities of unstructured operational data dumped into HDFS and fuse it with structured historical metadata that represents the system’s learning or knowledge over time. Typically, this knowledge repository is maintained in a database that can also be leveraged and updated by other applications, business systems, etc.
Over the past few releases of Spark, SparkSQL and the DataFrame API have evolved into powerful ways to interact with structured data. At the lowest level, they allow an external datastore to be represented as a set of DataFrames, which behave like virtual SQL tables. This makes it possible to use SQL to access data from disparate data sources, even joining across tables that derive from totally separate physical data stores.
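To make the idea of joining across separate physical stores concrete in miniature, here is a small sketch using Python’s built-in sqlite3 rather than Spark itself. It is only an analogy: two independent database files stand in for disparate sources, and all of the table and column names are invented for illustration.

```python
import os
import sqlite3
import tempfile

# Two physically separate database files stand in for two data sources,
# e.g. an operational event dump and a metadata store.
tmp = tempfile.mkdtemp()
events_db = os.path.join(tmp, "events.db")
meta_db = os.path.join(tmp, "metadata.db")

with sqlite3.connect(events_db) as con:
    con.execute("CREATE TABLE events (device_id INTEGER, reading REAL)")
    con.executemany("INSERT INTO events VALUES (?, ?)",
                    [(1, 0.5), (2, 1.5), (1, 2.0)])

with sqlite3.connect(meta_db) as con:
    con.execute("CREATE TABLE devices (device_id INTEGER, label TEXT)")
    con.executemany("INSERT INTO devices VALUES (?, ?)",
                    [(1, "sensor-a"), (2, "sensor-b")])

# One SQL statement joining tables that live in separate physical stores.
con = sqlite3.connect(events_db)
con.execute("ATTACH DATABASE ? AS meta", (meta_db,))
rows = con.execute("""
    SELECT d.label, COUNT(*) AS n
    FROM events e JOIN meta.devices d ON e.device_id = d.device_id
    GROUP BY d.label ORDER BY d.label
""").fetchall()
print(rows)  # [('sensor-a', 2), ('sensor-b', 1)]
```

SparkSQL generalizes this pattern: each external source is exposed as a DataFrame (a virtual table), and the SQL engine plans joins across them regardless of where the underlying bytes live.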
Unfortunately, not all databases are created equal. More importantly, very few can scale horizontally to meet the throughput demands of a large Spark cluster. Consider the following architecture where a centralized database is being used to maintain knowledge or metadata over time for a batch or streaming Spark application.
In this scenario, Spark is free to distribute the job across all worker nodes in the cluster and execute it in parallel. When deployed in a Hadoop cluster, it is also able to read HDFS files in parallel, and it is even smart enough to prefer reading local data partitions where possible. The obvious bottleneck is the centralized database used to read and write metadata or update the knowledge repository. Any fusion platform or data processing architecture built on a data store that cannot cooperate with Spark to distribute the flow of data to and from the processing nodes will ultimately limit the throughput available.
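A back-of-envelope model illustrates the bottleneck. The numbers and the helper function below are purely hypothetical, not measurements of any real system; they just apply the observation that a job can go no faster than the stage that funnels through one centralized store.

```python
def effective_throughput(workers, per_worker_mb_s, central_db_mb_s):
    """Illustrative aggregate throughput of a job whose bulk reads
    scale out with the number of workers, but whose metadata traffic
    all passes through a single centralized database."""
    parallel = workers * per_worker_mb_s  # HDFS reads scale with the cluster
    return min(parallel, central_db_mb_s)  # the central store caps the job

# Hypothetical numbers: each worker streams 100 MB/s from local HDFS
# partitions, but the central database tops out at 400 MB/s overall.
for workers in (2, 4, 8, 16):
    print(workers, effective_throughput(workers, 100, 400))
```

Past four workers in this toy model, adding nodes buys nothing: the centralized store, not the cluster, sets the ceiling.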
In comparison, ThingSpan is underpinned by an advanced storage layer that is not only optimized for storing and querying knowledge and metadata relationships, but also fully integrated with SparkSQL to optimize the movement of data in and out of the processing layer.
As part of its Spark DataFrame implementation, ThingSpan communicates partitioning information for a given query before fully executing it, allowing the scheduler to prefer placing each task on a worker node that holds the relevant data locally. The result is a SparkSQL query that executes fully in parallel while moving as little data across the network as possible.
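The locality idea can be sketched in a few lines of plain Python. This is not ThingSpan’s or Spark’s actual scheduler, just a hypothetical greedy illustration: given the hosts on which each partition’s data resides, assign the partition to a worker that holds it locally whenever one is free, and fall back to a remote read only when none is.

```python
def assign_partitions(partition_hosts, workers):
    """Greedy, illustrative partition-to-worker assignment that
    prefers workers holding the partition's data locally.

    partition_hosts: dict mapping partition id -> set of hosts
                     where that partition's data is stored.
    workers: list of worker host names (one task slot each).
    """
    free = list(workers)
    assignment, remote_reads = {}, 0
    for pid, hosts in partition_hosts.items():
        local = [w for w in free if w in hosts]
        if local:
            chosen = local[0]   # local read: no network transfer needed
        else:
            chosen = free[0]    # remote read: data must cross the network
            remote_reads += 1
        free.remove(chosen)
        assignment[pid] = chosen
    return assignment, remote_reads

# Hypothetical cluster: three partitions whose replicas live on known hosts.
parts = {"p0": {"node1"}, "p1": {"node2"}, "p2": {"node1", "node3"}}
assignment, remote = assign_partitions(parts, ["node1", "node2", "node3"])
print(assignment, remote)  # every partition lands on a local host
```

Because the storage layer reports where each partition lives *before* execution, every task in this example reads locally and nothing crosses the network; without that information, placement is blind and remote reads become the norm.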
This type of tight integration with SparkSQL is only possible when the underlying database can convey precise partitioning information from a partial query plan, a capability few distributed databases offer. That clear advantage is just one reason why ThingSpan is able to accelerate the overall performance of analytics in Spark.
I hope this provides a high-level overview of how ThingSpan integrates tightly with Spark to accelerate the movement of data to and from Spark workers. My next blog in this series will get more hands-on, with a detailed walkthrough of setting up ThingSpan on a Spark cluster in AWS.