Information fusion has its foundation in data fusion as used by military and intelligence agencies. It is generally defined as the use of techniques that combine data from multiple sources and gather that information in order to achieve inferences that are more efficient, and potentially more accurate, than those achieved by means of a single source.

Depending on the model used, there are several levels of assessment or refinement. As the fusion process goes through these different levels, the information is refined as more value is added. Information fusion can be defined as the process of merging information from disparate sources despite differences in conceptual, contextual and typographical representations, typically combining data from structured, unstructured and semi-structured resources.

The world is full of real-world objects (people, places, things) and relationships (knows, likes). Information fusion works with these objects and relationships and, in the fusion process, discovers new ones. The best way to represent them is with an object model.

Objectivity’s ThingSpan deals with objects and relationships natively and dynamically: no decomposition into rows and columns of a table, and no pre-defined schema. Moreover, the process of information fusion generates metadata, or data about the data, such as what processing (transformations) took place, when, by whom, and from which source. The information may contain geo-location or time-series data that can be extracted for further analytics.

The ThingSpan demo described here has its roots in call detail records (CDRs). The FROM and TO subscriber numbers build a graph of who calls whom, through many degrees of separation, which can be used to find the ‘bad guys.’ Credit card and bank transactions were added later.

The key element is being able to represent a network (or graph) of many-to-many relationships very efficiently and being able to query (or navigate) the graph very quickly.
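To make the idea concrete, here is a minimal in-memory sketch of a call graph built from FROM and TO numbers, with a breadth-first search that reports the degrees of separation between two phones. The class and method names (CallGraph, addCall, degreesOfSeparation) are hypothetical and chosen for illustration only; this is not the ThingSpan API, and a real graph database would not hold the graph in a Java map.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory call graph for illustration; not the ThingSpan API. */
final class CallGraph {
    // Adjacency sets: caller number -> numbers it has called.
    private final Map<String, Set<String>> calls = new HashMap<>();

    /** Records one call detail record as a directed edge FROM -> TO. */
    void addCall(String from, String to) {
        calls.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        calls.computeIfAbsent(to, k -> new HashSet<>()); // make sure the callee is a known node
    }

    /** Returns the fewest hops from source to target, or -1 if no path exists. */
    int degreesOfSeparation(String source, String target) {
        if (!calls.containsKey(source) || !calls.containsKey(target)) {
            return -1;
        }
        Map<String, Integer> hops = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        hops.put(source, 0);
        frontier.add(source);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (current.equals(target)) {
                return hops.get(current);
            }
            for (String next : calls.get(current)) {
                if (!hops.containsKey(next)) { // visit each number only once
                    hops.put(next, hops.get(current) + 1);
                    frontier.add(next);
                }
            }
        }
        return -1;
    }
}
```

For example, after addCall("A", "B") and addCall("B", "C"), degreesOfSeparation("A", "C") returns 2, while the reverse direction returns -1 because the edges are directed.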

Demo Introduction

The demonstration simulates three streams of data: telephone calls, bank transactions and credit card transactions (Figure 1). The key fusion element is the person who owns the phones in the calls, owns the credit card, and owns the bank account. The fusion happens during the data ingest phase.

 

Figure 1 – Data streams

 

Data Ingest

The streams are simulated by reading data from CSV files into DataFrames. The DataFrames are then written to the ThingSpan metadata store using the ThingSpan DataFrame writer. Note that the format parameter identifies the ThingSpan data source, and that two options are set: the boot file path and the data class name.

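// Requires: import org.apache.spark.sql.SaveMode;
System.out.println("Write out DataFrame to ThingSpan");
personDF.write()
        .mode(SaveMode.Overwrite)              // replace any existing data
        .format("com.thingspan.spark.sql")     // select the ThingSpan data source
        .option("bootFilePath", bootFile)      // first option: boot file path
        .option("dataClassName", "com.thingspan.spark.java.demo.Person") // second option: data class
        .save();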

The object types are Person, CreditCard, Business, BankAccount, Address, Vehicle, and Phone (Figure 2). The relationships (or events) are Knows, LivesAt, Owns, Payment, PhoneCall, and WorksFor.

During ingest, there are many object lookups to check whether an object already exists. If it does not, it must be created before connections can be made. A recent enhancement to the database added “upsert” and “targetfinder” operations, which run in the database kernel and are therefore very fast.
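The lookup-then-create pattern can be sketched in a few lines of plain Java. This is only an illustration of the idea under stated assumptions: FusionStore, Node, upsert and ingestOwnership are hypothetical names, the store is an ordinary in-memory map, and nothing here reflects how the ThingSpan kernel implements its upsert. It does show why the person is the fusion key: rows from any of the three streams resolve to the same Person object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the metadata store; not the ThingSpan API. */
final class FusionStore {
    /** A fused object such as a Person, Phone, CreditCard or BankAccount. */
    static final class Node {
        final String type;
        final String key;
        final List<String> owns = new ArrayList<>(); // "Type:key" references to owned objects
        Node(String type, String key) {
            this.type = type;
            this.key = key;
        }
    }

    private final Map<String, Node> nodes = new LinkedHashMap<>();
    private int created = 0;

    /** Returns the existing node for (type, key), creating it only if it is absent. */
    Node upsert(String type, String key) {
        return nodes.computeIfAbsent(type + ":" + key, k -> {
            created++;
            return new Node(type, key);
        });
    }

    /** Ingests one "person owns thing" row from any stream and connects the two objects. */
    void ingestOwnership(String personId, String thingType, String thingKey) {
        Node person = upsert("Person", personId);
        Node thing = upsert(thingType, thingKey);
        String ref = thing.type + ":" + thing.key;
        if (!person.owns.contains(ref)) { // avoid duplicate Owns connections
            person.owns.add(ref);
        }
    }

    /** Number of objects that had to be created rather than found. */
    int createdCount() {
        return created;
    }
}
```

Ingesting a phone, a credit card and a bank account for the same person ID creates four objects in total (one Person plus three owned things); repeating any of those rows creates nothing new.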

 

Figure 2 - Objects and relationships (events)

 

Queries

There are three types of queries, depending on the data set being queried.

 

Query #1: Paths between two phones (Figure 3)

Given two phone numbers, find all call paths between them. The depth (degree of separation, or number of hops) can be set, as well as the number of results to return. The query also finds the owners of the source and target phone numbers. In the screenshot below, the number in red is the source and the number in green is the target. The curved arc indicates direction (from – to) in a clockwise manner. The query can filter on time and geo-location.
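The two query parameters described above, depth and number of results, map naturally onto a bounded path search. The following is a minimal sketch that assumes a plain depth-first enumeration of simple paths; PathFinder, addEdge and findPaths are hypothetical names, and this is not how ThingSpan executes the query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical bounded path search for illustration; not the ThingSpan query engine. */
final class PathFinder {
    // Directed edges: caller number -> numbers called, in insertion order.
    private final Map<String, List<String>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Finds simple paths from source to target with at most maxHops edges, up to maxResults paths. */
    List<List<String>> findPaths(String source, String target, int maxHops, int maxResults) {
        List<List<String>> results = new ArrayList<>();
        LinkedHashSet<String> path = new LinkedHashSet<>(); // keeps order and detects cycles
        path.add(source);
        search(source, target, maxHops, maxResults, path, results);
        return results;
    }

    private void search(String current, String target, int hopsLeft, int maxResults,
                        LinkedHashSet<String> path, List<List<String>> results) {
        if (results.size() >= maxResults) {
            return; // result limit reached
        }
        if (current.equals(target)) {
            results.add(new ArrayList<>(path));
            return;
        }
        if (hopsLeft == 0) {
            return; // depth limit reached
        }
        for (String next : edges.getOrDefault(current, Collections.emptyList())) {
            if (path.add(next)) { // add() returns false if next is already on the path
                search(next, target, hopsLeft - 1, maxResults, path, results);
                path.remove(next);
            }
        }
    }
}
```

With edges A to B, B to C and A to C, a search from A to C with a depth of 1 returns only the direct call, while a depth of 2 also returns the path through B.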

 

Figure 3 - Paths Between PhoneCalls

 

Query #2: All paths between two credit card numbers (Figure 4)

Given any two credit card numbers, find all paths between the two cards, including businesses, banks and persons. The depth (degree of separation) and number of results to return can be set. The view chosen for this query is called time arc, where the results are laid out such that the arc represents the time when the event between two objects occurred. The x-axis represents time. The query can filter on time and geo-location.
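As a rough illustration of the time and geo-location filters, and of ordering events along a time axis as the time arc view does, here is a small sketch. It assumes each event carries a timestamp and a latitude/longitude, and it uses a simple bounding box that does not handle boxes crossing the antimeridian. EventFilter and Event are hypothetical names, not ThingSpan types.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical time and bounding-box filter for illustration; not the ThingSpan API. */
final class EventFilter {
    /** One event (for example a payment) between two objects. */
    static final class Event {
        final String from;
        final String to;
        final Instant time;
        final double lat;
        final double lon;
        Event(String from, String to, Instant time, double lat, double lon) {
            this.from = from;
            this.to = to;
            this.time = time;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Keeps events in [start, end) that fall inside the box, sorted by time for a time-axis layout. */
    static List<Event> filter(List<Event> events, Instant start, Instant end,
                              double minLat, double maxLat, double minLon, double maxLon) {
        List<Event> kept = new ArrayList<>();
        for (Event e : events) {
            boolean inTime = !e.time.isBefore(start) && e.time.isBefore(end);
            boolean inBox = e.lat >= minLat && e.lat <= maxLat
                    && e.lon >= minLon && e.lon <= maxLon;
            if (inTime && inBox) {
                kept.add(e);
            }
        }
        kept.sort((a, b) -> a.time.compareTo(b.time)); // earliest event first
        return kept;
    }
}
```

The half-open time window is a design choice: adjacent windows (for example, consecutive months) never return the same event twice.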

 

Figure 4 - Paths Between Credit Cards time line

 

Query #3: All paths between two persons (Figure 5)

Given any two persons, find all paths between them, including bank transfers, credit card transactions, bank accounts, credit cards, and persons. The depth of this money trail (degree of separation) and number of results to return can be set. The view chosen for this query overlays the results on a map using the geo-location of each event.

 

Figure 5 - Paths Between transactions geo-location

 

Future Work

We have a path-finding query feature where you can build a query by specifying objects and relationships. Next, we plan to combine the query builder into one screen and add more filters.

 

Conclusion

Hopefully, you now have a better understanding of what information fusion is and how it has been applied to a particular problem domain involving many-to-many relationships, in this case how people are connected through telephone calls.

In today’s world, people communicate in many different ways other than telephone calls, including text messages, e-mails and social media. All this additional information can be fused into the graph as additional objects (nodes) and relationships (edges). You can still ask the database how these two people are connected, through any number of degrees of separation, through any combination of object types and relationship types.
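A small sketch shows how little changes when new relationship types are fused in. It assumes relationships are stored as typed edges that can be navigated from either end, and that a query names the relationship types it is willing to follow. TypedGraph, relate and connected are hypothetical names, not the ThingSpan API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical graph with typed relationships for illustration; not the ThingSpan API. */
final class TypedGraph {
    private static final class Edge {
        final String type;
        final String other;
        Edge(String type, String other) {
            this.type = type;
            this.other = other;
        }
    }

    private final Map<String, List<Edge>> adjacency = new HashMap<>();

    /** Adds a relationship of the given type, navigable from both ends. */
    void relate(String a, String type, String b) {
        adjacency.computeIfAbsent(a, k -> new ArrayList<>()).add(new Edge(type, b));
        adjacency.computeIfAbsent(b, k -> new ArrayList<>()).add(new Edge(type, a));
    }

    /** True if a and b are linked within maxHops using only the allowed relationship types. */
    boolean connected(String a, String b, int maxHops, Set<String> allowedTypes) {
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        depth.put(a, 0);
        frontier.add(a);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (current.equals(b)) {
                return true;
            }
            int d = depth.get(current);
            if (d == maxHops) {
                continue; // do not expand beyond the hop limit
            }
            for (Edge e : adjacency.getOrDefault(current, Collections.emptyList())) {
                if (allowedTypes.contains(e.type) && !depth.containsKey(e.other)) {
                    depth.put(e.other, d + 1);
                    frontier.add(e.other);
                }
            }
        }
        return false;
    }
}
```

If Alice phoned Bob and Bob e-mailed Carol, Alice and Carol are not connected when only PhoneCall edges are followed, but they are connected within two hops once Email edges are allowed as well.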

So anywhere you have many-to-many data and relationships, you should be able to apply a ‘just imagine’ scenario: for example, network routing, cyber security (as in cyber attacks spreading through a network), the spread of diseases, or transportation and logistics.

 

 

 

Brian Clark

Corporate VP of Product
