There are many reasons Spark is fast becoming the de facto standard for large-scale data processing. While clever use of in-memory computing, optimized execution, and built-in machine learning libraries are often cited as reasons for its meteoric rise in popularity, it’s the way Spark has embraced structured data and external sources that I find particularly impressive.
No matter what your reason for using Spark, it will almost certainly involve reading data from external sources. A common use case is to consume large quantities of unstructured operational data dumped into HDFS and fuse it with structured historical metadata that represents the system’s learning or knowledge over time. Typically, this knowledge repository is maintained in a database that can also be queried and updated by other applications and business systems.
Over the past few releases of Spark, Spark SQL and the DataFrames API have evolved into a powerful way to interact with structured data. At its core, this lets an external datastore be represented as a set of DataFrames, which behave much like virtual SQL tables. This makes it possible to query disparate datasources with plain SQL, even joining across tables that derive from totally separate physical datastores.