You know that you are dealing with “Big” data when you can no longer use general-purpose, off-the-shelf solutions for your problems. Big data technologies are all specialized for specific sets of problems. Apache Spark™ is uniquely designed for in-memory processing of data workflows at scale. Currently, it is the most active open-source project for big data processing. One key strategy for extracting the most value from large connected datasets is the use of graph analytics to derive business insights.
Distributed graph databases also support analytics at scale, but they are specifically designed to store complex graph data and perform fine-grained real-time graph analytics. By using Spark for expressing and executing your workflows in conjunction with a distributed graph database, you can design and execute workflows suitable for the next generation of applications that exploit insights derived from complex graph analytics.
What is Spark?
Apache Spark is a relatively new tool for data processing. The project was started at UC Berkeley's AMPLab in 2009 and became an official Apache project in 2013. Essentially, it is an in-memory cluster computing platform that uses a data collection abstraction called a Resilient Distributed Dataset, or RDD. An RDD can express data that is either:
- Read from external storage (database, file, stream)
- Derived from a coarse-grained Spark transformation (GroupBy, Join, Filter, Map, etc.)
RDDs are fault tolerant, which means that the system can recover lost data using the lineage graph of the RDD. RDDs are also lazily evaluated, which means that Spark can examine the full chain of steps in a workflow and optimize its execution before any work is performed. For example, to get the word count of a particular text file in Spark, you might do the following:
Figure 1: Word Count Example
1 text_file = sc.textFile("hdfs://...")
2 word_counts = text_file.flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda a, b: a + b)
3 count = word_counts.count()
Due to lazy evaluation, the workflow above will not be executed until the user asks for a result back (in line 3). This enables the read of the data in line 1 and the chained transformations in line 2 to be pipelined and executed in parallel on the workers.
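The same deferred-execution idea can be illustrated outside of Spark with plain Python generators, which, like RDD transformations, do no work until a terminal operation pulls results through them (an analogy only, not the Spark API):

```python
# Plain-Python analogy for lazy evaluation: generator expressions,
# like RDD transformations, build a pipeline but do no work yet.
lines = ["to be or not to be", "that is the question"]

words = (word for line in lines for word in line.split())  # like flatMap
pairs = ((word, 1) for word in words)                      # like map

# Nothing has been read or split yet; evaluation happens only when a
# terminal operation (like count() in line 3 above) consumes the pipeline.
counts = {}
for word, n in pairs:  # the "action" that forces evaluation
    counts[word] = counts.get(word, 0) + n

print(counts["be"])  # -> 2
```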
The power of lazy evaluation is clearly demonstrated by the SparkSQL project which uses an extension of an RDD to express a dataset with structure called a DataFrame. The DataFrame uses the structure of the data to optimize the operations performed when generating the data through SQL statements and other transformations. DataFrames perform predicate pushdown, which pushes the predicate execution to the backend, and column pruning, which only retrieves field values that are requested. This enables the execution of the DataFrame to be optimized in almost every context.
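The effect of these two optimizations can be sketched with a toy "data source" in plain Python (the scan() helper and records are hypothetical, not the SparkSQL API): instead of shipping whole rows to the engine and filtering there, the predicate and the column list are handed to the scan itself.

```python
# Toy illustration of predicate pushdown and column pruning
# (hypothetical scan() helper and records, not the SparkSQL API).
TABLE = [
    {"id": 1, "name": "Ann",  "country": "US", "balance": 100},
    {"id": 2, "name": "Bob",  "country": "DE", "balance": 250},
    {"id": 3, "name": "Cara", "country": "US", "balance": 300},
]

def scan(table, columns, predicate):
    """Apply the predicate and projection at the source, so only
    matching rows and requested columns ever leave the backend."""
    return [{c: row[c] for c in columns} for row in table if predicate(row)]

# Equivalent of: SELECT name FROM table WHERE country = 'US'
result = scan(TABLE, ["name"], lambda row: row["country"] == "US")
print(result)  # -> [{'name': 'Ann'}, {'name': 'Cara'}]
```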
Popularity of Spark
Spark is popular, because it is fast and easy to use with familiar language bindings such as Java, Scala, Python and R. Also, it comes with a monitoring UI and a Scala interpreter for interactive scripting. There is also strong support from the open-source community (over 630 contributors) and industry leaders. Finally, there are third party libraries for global graph analytics (GraphX), SQL support (SparkSQL), machine learning (MLlib) and stream processing (SparkStreaming).
On the other hand, Spark is not a storage solution. It has no transaction support and delegates to the backend for data persistence. Spark is ideal for fusing data from multiple data sources, which makes it perfect for doing analytics across many different kinds of data. In many cases (e.g. JDBC, HDFS, JSON, CSV, custom database connectors, etc.), the loading of the data is handled by the underlying Spark engine interacting with the data source. For SparkSQL, the schema, if present, is also read, so no further data movement is required. Here is an example of a join using data fused from two sources: Objectivity/DB and a JSON-formatted file.
Figure 2: Fusing Data in SparkSQL
From this example, you can see how easy it is to fuse data from multiple sources. In this case, the two DataFrames (customers and purchases) are merged into one, which also merges the schemas. This shows how simple it can be to perform data fusion in Spark.
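As a rough sketch of what such a fusion produces, the join in Figure 2 can be mimicked with two plain-Python row sets (made-up records and field names; a real workflow would join two DataFrames instead):

```python
# Plain-Python sketch of the DataFrame join in Figure 2 (made-up records;
# Spark would express this as customers.join(purchases, "customer_id")).
customers = [  # e.g. rows read from Objectivity/DB
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
purchases = [  # e.g. rows read from a JSON-formatted file
    {"customer_id": 1, "item": "laptop"},
    {"customer_id": 2, "item": "phone"},
    {"customer_id": 1, "item": "mouse"},
]

# Inner join on customer_id; the "schema" of the result merges both inputs.
by_id = {c["customer_id"]: c for c in customers}
joined = [{**by_id[p["customer_id"]], **p}
          for p in purchases if p["customer_id"] in by_id]

print(joined[0])  # -> {'customer_id': 1, 'name': 'Ann', 'item': 'laptop'}
```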
The entire pipeline, from accessing the data sources, to transforming the data, to saving results back to the data source, can be expressed through the Spark API. This covers everything from the table representation (as an RDD) to the graph representation (as a GraphRDD), as well as the design and execution of graph analytics. Accordingly, it is a good design to use a data fusion backend, like Objectivity/DB, for storing metadata about data from different sources, or for storing data derived from processing performed on this source data.
Spark vs. Hadoop
In many ways, Spark is different from Apache Hadoop™, but in some ways, the two are complementary. Both are big-data processing engines, but Spark is not built around map-reduce jobs. The RDD paradigm is a lot cleaner, with a simple compute() method, so it requires less code. Unlike Hadoop, Spark doesn't dump to disk between jobs, which makes Spark orders of magnitude faster and much better suited to iterative analytics jobs. Finally, Spark is an in-memory computing engine, which means that it will use memory whenever it can.
Spark is fault-tolerant like Hadoop. Unlike Hadoop, though, which writes all intermediate results to HDFS, Spark uses the lineage of the RDD to re-compute a lost result. The lineage is held as a directed acyclic graph, or DAG. This gives Spark a huge performance edge because it allows Spark to execute all operations in memory. This is critical for the execution of complex graph analytics like page-rank centrality and machine-learning algorithms like logistic regression. These kinds of analytics require repeated execution and can optimize their speed by leveraging the intermediate results that are held in memory.
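A minimal sketch of lineage-based recovery (a toy class, not Spark internals): each dataset remembers its parent and the function that derived it, so a lost result can be rebuilt by re-running that chain rather than being reloaded from disk.

```python
# Toy lineage: each node remembers how it was derived from its parent,
# so lost data can always be recomputed instead of being persisted.
class Lineage:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent, self.fn, self.source = parent, fn, source

    def map(self, fn):
        # Record the derivation; do not compute anything yet.
        return Lineage(parent=self, fn=fn)

    def compute(self):
        # Recompute from the root of the DAG -- nothing is written to disk.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

base = Lineage(source=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)

# Even if 'derived' was never materialized (or its partition was lost),
# its lineage can rebuild it on demand:
print(derived.compute())  # -> [11, 21, 31]
```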
At the same time, Spark can also read and write to the Hadoop Distributed File System, or HDFS. Since the Hadoop paradigm of dumping to disk between jobs is not ideal for iterative jobs, it may be preferable to use Hadoop only for one-off ETL-type jobs. Hadoop writes to HDFS, which is a unique file system optimized for the write-once, read-many-times scenario. Also, Hadoop is a much more mature project, and its integration with other big-data tools like Apache YARN™ is proven.
Spark and Graph Analytics
Spark is optimized for processing the entire workflow including graph analytics. Tables are not ideal data structures for doing graph analytics, which is why the Spark ecosystem provides libraries like GraphX and MLlib that can be set up to run in the same workflow. Because Spark supports graph analytics, there is no need to move data from one paradigm to the other. Instead, you can have the entire pipeline expressed and executed through the same framework.
Figure 3: Graph Pipeline
GraphX supports a property graph model, which means that the graph is represented as a pair of vertex and edge property collections. Spark is ideal for repeatable algorithms that traverse the entire distributed graph dataset, which makes it well suited for machine learning model training or algorithms that are required to converge. GraphX also has many built-in algorithms, including Page-Rank, Connected Components, and Triangle Counting. Finally, you can write custom graph-parallel algorithms to be executed through GraphX's Pregel API, modeled after Google Pregel™.
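To make the built-in algorithms concrete, here is a small pure-Python version of the iterative Page-Rank computation on a toy three-page graph (an illustration of the algorithm itself, not the distributed GraphX implementation):

```python
# Minimal iterative PageRank on a toy directed graph
# (illustrative only; GraphX distributes this across a cluster).
links = {  # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping, iterations = 0.85, 30
rank = {page: 1.0 for page in links}

for _ in range(iterations):
    new_rank = {page: 1 - damping for page in links}
    for page, outgoing in links.items():
        # Each page splits its rank evenly among its outgoing links.
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# "c" receives links from both "a" and "b", so it ranks highest.
best = max(rank, key=rank.get)
print(best)  # -> c
```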
Data Fusion Workflows
If data is being stored in flat files or in a relational database, it is likely that the connections between the data are not being exploited for their value. By pulling the data out with Spark and doing some simple analysis on it, you can draw out the connections and store the resulting metadata in a graph metadata repository like the one that Objectivity offers. This enables large-scale graph analytics to be done across large swamps of data without moving the entire dataset from its source.
Similarly, if you have potentially valuable streams of data that are not being processed, you can use SparkStreaming to process the data in batches and derive relevant graph metadata for graph analytics. A good example is event identification, such as anomaly detection in security applications. An event can be identified when the data falls within certain parameters. You can then associate particular events with each other and store them in a backend that is suited for graphs. The types of metadata that could be stored include timeseries and geospatial data.
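A minimal sketch of that batch-oriented event identification (plain Python with made-up sensor records and thresholds; SparkStreaming would apply the same per-batch logic to each micro-batch of the stream):

```python
# Toy micro-batch event identification: flag readings that fall outside
# expected parameters (made-up thresholds, not a real security model).
def detect_events(batch, low=10.0, high=90.0):
    """Return the readings in one micro-batch outside the range [low, high]."""
    return [r for r in batch if not (low <= r["value"] <= high)]

stream = [  # a stream processed as two micro-batches
    [{"sensor": "s1", "value": 42.0}, {"sensor": "s2", "value": 97.5}],
    [{"sensor": "s1", "value": 3.2},  {"sensor": "s2", "value": 55.0}],
]

events = []
for batch in stream:
    # Each detected event could then be linked to related events and
    # persisted, with timestamp and location, in a graph-oriented backend.
    events.extend(detect_events(batch))

print([e["sensor"] for e in events])  # -> ['s2', 's1']
```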
Next Generation Workflows
Graph analytics should be supported by data fusion solutions because this type of analytics provides valuable business insights, such as:
- Identifying influencers or related groups
- Identifying patterns in the data
- Providing valuable recommendations based on network clusters
With Spark, you can perform many of these large-scale analytics through GraphX. However, in order to store graph data and do local analytics in real time, a graph database is ideal. For example, operations like vertex discovery, shortest path, or path pattern matching are often only practical, and certainly preferable, to run on a graph database, because the database can avoid traversing the entire dataset. For large-scale graph analytics, Spark/GraphX is ideal.
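A shortest-path query, for instance, only needs to touch the vertices reachable from the start point. A breadth-first search over a small in-memory adjacency map sketches the idea (toy data; a graph database would traverse its stored, indexed graph the same way without a full scan):

```python
from collections import deque

# Toy adjacency map; a graph database traverses only the relevant
# neighborhood instead of scanning the whole dataset.
graph = {
    "ann":  ["bob", "cara"],
    "bob":  ["dave"],
    "cara": ["dave"],
    "dave": ["erin"],
    "erin": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: visits only vertices reachable from start."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal not reachable from start

print(shortest_path(graph, "ann", "erin"))  # -> ['ann', 'bob', 'dave', 'erin']
```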
Figure 4: Spark vs. Graph Repository
The size of the graph backend can grow massive while the data in Spark can be limited to improve the performance of the analytics. The underlying graph repository is distributed so that the workers can write locally to data partitions. This graph repository should also be scalable to handle the speed and volume of streaming data. Objectivity provides a distributed database system that has been successfully used to scale to massive sizes and perform large-scale distributed graph analytics. An installation that includes Spark reading from and writing to a graph repository backend would be ideal for big-data graph-analytic workflows.
Social Trending Data Example
Imagine a social media platform that identifies popular articles and posts to include in a daily email to all subscribers. Subscribers have profile preferences which can help to identify what categories or genres of posts they are interested in. In this case, there may be an activity stream of posts and user-preference updates to a Spark installation in the cloud. Spark and Objectivity would be configured and installed on a cluster of machines in the cloud.
Figure 5: Social Activity Stream
When the data stream is received by Spark, the Spark engine can filter out irrelevant posts and flag inappropriate posts. The graph metadata can be derived from the remaining posts. Later, a centrality algorithm like page rank can be executed on the persisted graph metadata to determine the most popular articles and posts. Finally, a local query can filter out posts from irrelevant genres and the email that includes the trending activity can be generated and sent to the subscriber.
GeoTracking Data Example
Imagine a location-tracking application on a mobile phone that also displays relevant marketing campaigns based on preferences inferred from user movements. In this example, the stream would be composed of cell phone GPS locations and times, along with updates to marketing campaigns and their target categories. A web service would correlate each GPS coordinate to a particular mapped location (retail, commercial, restaurant, etc.) and category (groceries, sporting goods, fast food, etc.). Relevant metadata (for example, basic user information and a timestamp) would then be persisted with the visit and linked in the database to both the category and the user.
Figure 6: GeoTracking Workflow
A scoring algorithm would then be executed on the data in the graph backend to calculate the best marketing campaign to display to the user in real time. A marketing campaign based on users' actual movements and implied preferences would be much more likely to be effective.
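One very simple form such a scoring algorithm could take (invented visits, campaigns, and categories, purely illustrative): count the user's recent visits per category and pick the campaign whose target category scores highest.

```python
from collections import Counter

# Toy campaign scoring (invented data): rank campaigns by how often
# the user actually visited locations in each target category.
visits = ["groceries", "fast food", "groceries", "sporting goods", "groceries"]
campaigns = {
    "10% off running shoes": "sporting goods",
    "2-for-1 produce":       "groceries",
    "free fries":            "fast food",
}

category_score = Counter(visits)  # groceries: 3, fast food: 1, sporting goods: 1

# Pick the campaign whose target category the user visits most.
best = max(campaigns, key=lambda name: category_score[campaigns[name]])
print(best)  # -> 2-for-1 produce
```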
From these examples, we can see that the next generation workflow involves pulling fast data from streams and loading large datasets from disk, then deriving a graph from the metadata and saving it to a graph repository.
Figure 7: Workflow Pattern
Once the connected data is persisted, then Spark can process it through the graph algorithms for global analytics and the graph backend can run local analytics natively.
Take a moment to consider the Google Search™ capability. Here, the valuable insights gained from a simple page-rank algorithm executed over a large distributed dataset have revolutionized our world. Of course, it is no longer as simple as page-rank, but the results of a simple application that exploits graph analytics on the right dataset can bring about tremendous change. The next generation workflow, implemented with Spark and an Objectivity graph repository, is poised to fully exploit graph analytics for connected big data.
Figure 1: Word Count Example from http://spark.apache.org/
Principal Architect of InfiniteGraph