Powerful Persistence: ThingSpan & SparkSQL

There are many reasons Spark is fast becoming the de facto standard for large-scale data processing. While clever use of in-memory computing, optimized execution, and built-in machine learning libraries are often cited as reasons for its meteoric rise in popularity, it is the way Spark has embraced structured data and external data sources that I find particularly impressive.

No matter what your reason for using Spark, it will almost certainly involve reading data from external sources. A common use case is to consume large quantities of unstructured operational data dumped into HDFS and fuse it with structured historical metadata that represents the system's accumulated learning or knowledge over time. Typically, this knowledge repository is maintained in a database that can also be queried and updated by other applications and business systems.
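
To make that pattern concrete, here is a minimal sketch in Scala, assuming the Spark 2.x SparkSession API. The HDFS path, JDBC connection details, table name, and the deviceId join column are all hypothetical; the sketch simply shows raw events being enriched with knowledge held in an external database.

import org.apache.spark.sql.SparkSession

object FuseOperationalData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FuseOperationalData")
      .getOrCreate()

    // Raw operational events dumped into HDFS as JSON (hypothetical path).
    val events = spark.read.json("hdfs:///data/operational/events")

    // Historical knowledge maintained in an external relational database,
    // read over JDBC (connection details are placeholders).
    val knowledge = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/knowledge")
      .option("dbtable", "device_profiles")
      .option("user", "analyst")
      .option("password", "secret")
      .load()

    // Fuse the two: enrich each raw event with what the system
    // already knows about the device that produced it.
    val enriched = events.join(knowledge, Seq("deviceId"))
    enriched.show()

    spark.stop()
  }
}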

Over the past few releases of Spark, SparkSQL and the DataFrames API have evolved into a powerful way to interact with structured data. At the lowest level, they allow an external data store to be represented as a set of DataFrames, which are akin to virtual SQL tables. This makes it possible to use SQL to access data from disparate sources, even joining across tables that derive from totally separate physical data stores.
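
The sketch below illustrates that idea in spark-shell style Scala: two DataFrames backed by physically separate stores, Parquet files on HDFS and a JDBC database, are registered as temporary views and joined in a single SQL statement. Paths, connection strings, and column names are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CrossSourceJoin").getOrCreate()

// One virtual table backed by Parquet files on HDFS...
spark.read.parquet("hdfs:///warehouse/sensor_readings")
  .createOrReplaceTempView("readings")

// ...and another backed by a completely separate JDBC database.
spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/assets")
  .option("dbtable", "sensors")
  .load()
  .createOrReplaceTempView("sensors")

// A single SQL query joins across the two physical stores.
val result = spark.sql(
  """SELECT s.location, AVG(r.value) AS avg_value
    |FROM readings r JOIN sensors s ON r.sensorId = s.sensorId
    |GROUP BY s.location""".stripMargin)
result.show()

Spark's Catalyst optimizer plans the join across both sources, pushing column pruning and filters down to each store where the underlying connector supports it.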

Making Spark Work for Next Generation Workflows

Introduction

You know that you are dealing with “Big” data when you can no longer use general-purpose, off-the-shelf solutions for your problems. Big data technologies are specialized, each targeting a particular class of problems. Apache Spark™ is uniquely designed for in-memory processing of data workflows at scale, and it is currently the most active open-source project for big data processing. One key strategy for extracting the most value from large connected datasets is the use of graph analytics to derive business insights.

Distributed graph databases also support analytics at scale, but they are specifically designed to store complex graph data and perform fine-grained, real-time graph analytics. By pairing Spark, which expresses and executes your workflows, with a distributed graph database, you can build the next generation of applications that exploit insights derived from complex graph analytics.
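
As a hedged sketch of that combined pattern, the Scala snippet below loads vertex and edge data through Spark's data source API and hands it to GraphX for a whole-graph computation. The "graphstore" format name and the column names are placeholders, not ThingSpan's actual connector API.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GraphWorkflow").getOrCreate()

// Hypothetical DataFrame-based reads from a distributed graph store;
// "graphstore" is a placeholder format name, not a real connector.
val vertexDF = spark.read.format("graphstore").option("type", "Person").load()
val edgeDF   = spark.read.format("graphstore").option("type", "Knows").load()

// Convert to GraphX RDDs; the id/src/dst column names are illustrative.
val vertices = vertexDF.rdd.map(r => (r.getAs[Long]("id"), r.getAs[String]("name")))
val edges    = edgeDF.rdd.map(r => Edge(r.getAs[Long]("src"), r.getAs[Long]("dst"), 1))

// Coarse-grained, whole-graph analytics run in Spark.
val ranks = Graph(vertices, edges).pageRank(0.0001).vertices
ranks.take(10).foreach(println)

The division of labor follows the point above: Spark handles the coarse-grained, whole-graph analytics, while fine-grained, real-time traversals remain the job of the graph database itself.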