Retailers have used advanced business intelligence tools for decades to determine what to sell, to whom, when, where and at what price. Much of the transactional data was too voluminous for smaller retailers to keep for long, which put them at a disadvantage against the industry giants and more agile web-based retailers. Falling prices for commodity storage and processors now make it possible to keep data longer. This data can also be combined with external sources, such as information gathered from social networks, and then analyzed with more powerful machine learning technologies and other tools.

In this blog post, we will look at how any retailer, traditional or online, might identify slow-moving products and then combine its own sales transaction data with social media information about bloggers who have mentioned or bought a product, in order to identify and target potential buyers.

We start by loading essential details of the products, orders, customers and social network data into ThingSpan to create the graph structure shown in Figure 1 below.



Fig. 1: Retail graph structure

The next step depends upon the sophistication of the retailer’s algorithms for deciding when a product has been in stock for too long. These might range from a simple SQL query based upon profit and time to a complex machine learning and predictive analytics algorithm. Either way, this step can be completed by using Spark SQL and MLlib (or proprietary rules) to identify the products that need attention. Running this step might produce the result shown in Figure 2 below, where Pr2 is identified as slow-moving.



Fig. 2: Slow-moving product identified
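To make the "simple rule" end of that range concrete, here is a minimal sketch in plain Python, standing in for the Spark SQL query a retailer might actually run. The inventory figures, the field names, the `slow_movers` function and both thresholds are assumptions invented for illustration; they are not part of ThingSpan or Spark.

```python
from datetime import date

# Hypothetical inventory snapshot. All figures are invented for illustration.
inventory = {
    "Pr1": {"stocked_on": date(2016, 1, 10), "on_hand": 40, "sold_90d": 120},
    "Pr2": {"stocked_on": date(2015, 6, 1), "on_hand": 300, "sold_90d": 6},
    "Pr3": {"stocked_on": date(2016, 2, 1), "on_hand": 50, "sold_90d": 2},
}


def slow_movers(inventory, today, min_days_in_stock=180, max_sell_through=0.10):
    """Return ids of products that have been stocked for a long time
    and have sold only a small share of their available units.

    Both thresholds are assumed values, not recommendations.
    """
    flagged = []
    for product_id, record in inventory.items():
        days_in_stock = (today - record["stocked_on"]).days
        available = record["sold_90d"] + record["on_hand"]
        if available == 0:
            continue  # nothing stocked or sold, so nothing to judge
        sell_through = record["sold_90d"] / available
        if days_in_stock >= min_days_in_stock and sell_through < max_sell_through:
            flagged.append(product_id)
    return sorted(flagged)


flagged_products = slow_movers(inventory, today=date(2016, 3, 1))
```

With this toy data only Pr2 is both old enough and slow enough to be flagged. Pr3 also sells slowly but has not been in stock long enough to count.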

Although it is possible to complete the next step in one query with ThingSpan’s advanced navigational queries, we’ll approach it in two steps for clarity. The first is to find all other products that tend to be bought at the same time as Pr2. This involves traversing from Pr2 to connected Sales objects, then traversing to the other connected Product objects. In this case we find that Pr2 is often, but not always, sold alongside Pr1 (Figure 3).



Fig. 3: Connection between products and sales identified
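The Product-to-Sales-to-Product traversal described above can be sketched in a few lines of plain Python. This is not ThingSpan's query language; the `sales` data and the `co_purchased` helper are hypothetical and only mimic the two hops over a small in-memory structure.

```python
from collections import Counter

# Hypothetical Sales objects: sale id -> set of products in that sale.
sales = {
    "S1": {"Pr1", "Pr2"},
    "S2": {"Pr1", "Pr2"},
    "S3": {"Pr2"},          # Pr2 sold on its own
    "S4": {"Pr1", "Pr3"},
}


def co_purchased(sales, product):
    """Count how often each other product appears in a sale with `product`.

    Hop 1: from the product to every sale that contains it.
    Hop 2: from each of those sales to the other products in it.
    """
    counts = Counter()
    for items in sales.values():
        if product in items:
            counts.update(items - {product})
    return counts


companions = co_purchased(sales, "Pr2")
```

In this toy data Pr2 appears in three sales and shares two of them with Pr1, which matches the "often, but not always" relationship in the text.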

We now look at the people who bought products Pr1 and Pr2 to see if they are bloggers and have any followers. The search can be extended to their followers too, to any depth. This, again, is a simple navigational query in ThingSpan. It turns out (see Figure 4 below) that Fred bought both products and that Mary follows Fred’s blog. Jane and Bill, in turn, follow Mary’s blog.



Fig. 4: Social network identified
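Extending the search "to any depth" is a bounded breadth-first walk over the follower relationships. The sketch below uses the names from the example; the `followers` mapping and the `audience` function are assumptions made for illustration, not a ThingSpan API.

```python
from collections import deque

# Hypothetical follower edges: blog author -> people who follow that blog.
followers = {
    "Fred": ["Mary"],
    "Mary": ["Jane", "Bill"],
}


def audience(followers, seeds, max_depth):
    """Breadth-first walk from the seed buyers along follower edges.

    Returns {person: depth} for everyone reachable within `max_depth`
    hops, excluding the seeds themselves.
    """
    depth_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        person = queue.popleft()
        depth = depth_of[person]
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for follower in followers.get(person, []):
            if follower not in depth_of:
                depth_of[follower] = depth + 1
                queue.append(follower)
    return {person: depth for person, depth in depth_of.items() if depth > 0}


reach = audience(followers, ["Fred"], max_depth=2)
```

Starting from Fred, a depth of 1 reaches only Mary, while a depth of 2 also reaches Jane and Bill.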

We can now check what Mary, Jane and Bill have already bought and offer them Pr1, Pr2 or both. By offering Pr2 to Jane and Bill (see Figure 5 below), we help clear the inventory of Pr2, which was the original aim, and we also gain the chance to sell more of Pr1 to Mary, Jane and Bill.



Fig. 5: Product offer identified
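The final step is a set difference: for each person in the audience, offer whichever of the target products they have not yet bought. The purchase history below is an assumption chosen to be consistent with the outcome described above (Mary is assumed to own Pr2 already), and `build_offers` is a hypothetical helper.

```python
# Hypothetical purchase history: person -> products already bought.
purchases = {
    "Fred": {"Pr1", "Pr2"},
    "Mary": {"Pr2"},   # assumed, so that Mary is only offered Pr1
    "Jane": set(),
    "Bill": set(),
}


def build_offers(purchases, people, products):
    """Map each person to the target products they do not own yet.

    People who already own every target product are left out.
    """
    offers = {}
    for person in people:
        missing = set(products) - purchases.get(person, set())
        if missing:
            offers[person] = sorted(missing)
    return offers


offers = build_offers(purchases, ["Mary", "Jane", "Bill"], ["Pr1", "Pr2"])
```

Under these assumptions Jane and Bill are offered both products and Mary is offered Pr1 only.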

We have shown that the combination of Apache Spark and Objectivity’s ThingSpan makes it easy to tackle some of the kinds of problems that retailers face on a daily basis. Figure 6 below shows how the two technologies are tightly but flexibly integrated to solve the problem.



Fig. 6: Objectivity’s ThingSpan architecture

The Spark analytics ecosystem is very powerful and has a huge community that keeps adding algorithms. However, to fully harness the power of real-time graph analytics, it’s essential to integrate Spark with a massively scalable distributed graph platform like ThingSpan.

Spark GraphX accesses two kinds of Resilient Distributed Dataset (RDD), which represent the vertices (nodes) and edges (connections) that form the graph. This is the traditional way of handling a graph in relational tables, but it has the disadvantage that join tables and B-Tree indices must be created to access the rows in the RDDs. ThingSpan’s underlying DBMS doesn’t need join tables, making navigation and path-finding much faster than in an RDBMS or GraphX.
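The difference between the two approaches can be illustrated with a toy example. This is a conceptual sketch only and reflects neither GraphX's nor ThingSpan's actual implementation; the data and function names are hypothetical. The first representation keeps edges in a separate collection that must be searched (or indexed) to find a node's neighbours, while the second stores each node's connections with the node itself.

```python
# Representation 1: an edge collection kept separately from the nodes,
# as in a relational layout. All identifiers are invented.
edges = [("S1", "Pr1"), ("S1", "Pr2"), ("S2", "Pr2")]


def neighbours_by_scan(edges, node):
    """Find neighbours by examining every edge (no index available)."""
    found = []
    for source, target in edges:
        if source == node:
            found.append(target)
        elif target == node:
            found.append(source)
    return sorted(found)


# Representation 2: each node holds direct references to its neighbours.
def build_adjacency(edges):
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def neighbours_by_reference(adjacency, node):
    """Follow the node's own references; no search over all edges."""
    return sorted(adjacency.get(node, ()))


adjacency = build_adjacency(edges)
```

Both lookups return the same neighbours. The scan touches every edge in the collection, whereas the adjacency lookup touches only the node being navigated from, which is the intuition behind navigating without join tables.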

ThingSpan loads only the parts of the graph required for a particular query and can reuse its cache as the graph is traversed. This enables ThingSpan systems to handle hundreds of billions, even trillions, of nodes and edges, whereas GraphX may run out of memory when handling huge graphs. ThingSpan has the performance and scalability required by retail organizations that are managing Fast and Big Data. ThingSpan can also handle high-speed parallel ingest while analytic queries run in parallel.

With ThingSpan’s next release, there will be several demonstration suites, but in the meantime, you can see a sample of graph navigation being used in an online shopping system.

To learn more about how ThingSpan can leverage Spark for graph analytics and relationship discovery in your organization, please contact us.



Leon Guzenda

CTMO and Founder
