Welcome to Objectivity, Inc., makers of the industry-leading Objectivity/DB object-oriented database management platform, Grid Certified (Levels 1 through 6) and SOA compliant.

Object Oriented Database Learning Center

Data Fusion - The Technical Challenges

Data Fusion - integrating complex data from multiple sources

The Technical Challenges

Making Better Use of Complex Data: It is relatively easy to obtain data from multiple, similar sources. Some common mechanisms include:

  • Using ODBC to access data in multiple relational databases.
  • Using remote invocation mechanisms such as CORBA or RMI to call functions that manipulate or transfer small amounts of data.
  • Using XML to transfer object definitions and data between systems.
  • Adopting a common structured file format, such as a JPEG or GIF bitmap.
  • Using database triggers to snapshot data out to files or messages that are sent to other systems.
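As a small illustration of the XML mechanism in the list above, the following Python sketch serializes a simple record to XML and rebuilds it on the receiving side. The `Reading` type, its fields, and the element names are hypothetical and chosen only for this example; a real exchange would follow whatever schema the participating systems agree on.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Reading:
    """Hypothetical record type, used only for this illustration."""
    reading_id: str
    source: str
    latitude: float
    longitude: float


def to_xml(reading: Reading) -> str:
    """Serialize a Reading to an XML string for transfer to another system."""
    elem = ET.Element("reading", id=reading.reading_id, source=reading.source)
    # repr() of a float round-trips exactly through float() in Python 3.
    ET.SubElement(elem, "latitude").text = repr(reading.latitude)
    ET.SubElement(elem, "longitude").text = repr(reading.longitude)
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> Reading:
    """Rebuild a Reading from the XML produced by to_xml."""
    elem = ET.fromstring(text)
    return Reading(
        reading_id=elem.get("id"),
        source=elem.get("source"),
        latitude=float(elem.findtext("latitude")),
        longitude=float(elem.findtext("longitude")),
    )


original = Reading("R-17", "sensor-a", 37.42, -122.08)
received = from_xml(to_xml(original))
```

This works well for simple, flat records like the one shown. It says nothing about how well the approach scales to large, heavily interrelated datasets.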

Complex data, such as data from a CAD system, a bioinformatics system, or a remote sensor, is much more difficult to store and manipulate. Conventional query languages cannot search voiceprints, images, fingerprints, solid [mechanical] models, SONAR, RADAR, and many other kinds of complex data. Most of these data types can only be represented as opaque BLOBs in relational databases. They do not translate readily to generic data definition languages such as XML, and the datasets are often very large, with a huge number of interrelationships. The data generally resides in structured files that can only be indexed with the help of humans or by encoding knowledge of some of the key data types within the file. Many repositories resort to brute-force searches of these files in the hope of finding recognizable text strings.
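The opacity of BLOBs can be sketched in a few lines of Python using the standard `sqlite3` module as a stand-in for a relational database. The table name, column names, and payload bytes are all invented for the example. SQL can filter on the descriptive `kind` column, but finding anything inside the payload falls back to the brute-force scan for recognizable byte strings described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE capture (id INTEGER PRIMARY KEY, kind TEXT, payload BLOB)"
)

# Hypothetical payloads: arbitrary binary data, one with an embedded text tag.
payloads = [
    ("sonar", bytes([0x00, 0x9C, 0xFF, 0x10]) + b"CONTACT-ALPHA" + bytes([0x7F, 0x00])),
    ("radar", bytes(range(16))),
]
conn.executemany("INSERT INTO capture (kind, payload) VALUES (?, ?)", payloads)

# The database can filter on descriptive columns that someone filled in...
sonar_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM capture WHERE kind = ?", ("sonar",))
]


def brute_force_search(connection, needle):
    """Scan every payload for a byte string; no index can help here."""
    hits = []
    for row_id, payload in connection.execute(
        "SELECT id, payload FROM capture ORDER BY id"
    ):
        if needle in payload:
            hits.append(row_id)
    return hits


# ...but looking inside the data means reading every BLOB in full.
hits = brute_force_search(conn, b"CONTACT")
```

The scan reads every row regardless of how few rows match, so its cost grows with the total volume of stored data.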

The “Needle in a Haystack” Problem: There is a big difference between owning data and being able to find pertinent and timely information hidden within it. Many organizations routinely collect and file masses of information but lack the means to interpret its real meaning or to extract all of the facts relating to a known item or event. Furthermore, the rate at which data is being collected or generated is rising exponentially. This is often called the “needle in a haystack” problem.

One part of the problem is that legacy databases can only store and query relatively simple information. They were not built to manipulate the many kinds of complex data that are being collected and processed in today’s systems. Increasing demands, particularly in the areas of Homeland Security and national defense, will place even greater strains on the legacy systems.

Another major issue is that most data mining tools were built to deal with relatively simple business data. They are not good at following and exploring relationships between data items, particularly where there are many types of relationships. In many cases, the relationships between items only become clear as users explore existing data to solve a problem. Users must be able to define and record newly discovered types of relationships. Storing these relationships and building increasing numbers of indices over frequently used data puts an added burden on legacy systems.

Each new relationship may add a new JOIN table, adding to the complexity and overhead of queries that explore the relationships between data items. Each new index adds to the processing and I/O required to insert new data. Once an index is added, it must cover every instance of that type of data, even when only a subset is queried frequently, which may be overkill.
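The JOIN-table cost can be made concrete with a minimal Python sketch, again using `sqlite3` as a stand-in relational database. The `item`, `owns`, and `seen_near` tables and their contents are hypothetical. Each relationship type gets its own join table, so following a chain of just two relationships already takes four JOINs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One join table per relationship type; each newly discovered
-- relationship type would add another table like these.
CREATE TABLE owns (
    owner_id INTEGER REFERENCES item(id),
    owned_id INTEGER REFERENCES item(id)
);
CREATE TABLE seen_near (
    item_id INTEGER REFERENCES item(id),
    place_id INTEGER REFERENCES item(id)
);
""")

conn.executemany(
    "INSERT INTO item (id, name) VALUES (?, ?)",
    [(1, "person A"), (2, "vehicle B"), (3, "warehouse C")],
)
conn.execute("INSERT INTO owns VALUES (1, 2)")       # person A owns vehicle B
conn.execute("INSERT INTO seen_near VALUES (2, 3)")  # vehicle B seen near warehouse C

# Following two relationship types (owns, then seen_near) needs four JOINs:
# two join tables plus two more lookups back into the item table.
rows = conn.execute(
    """
    SELECT person.name, vehicle.name, place.name
    FROM item AS person
    JOIN owns ON owns.owner_id = person.id
    JOIN item AS vehicle ON vehicle.id = owns.owned_id
    JOIN seen_near ON seen_near.item_id = vehicle.id
    JOIN item AS place ON place.id = seen_near.place_id
    WHERE person.name = ?
    """,
    ("person A",),
).fetchall()
```

Adding a third relationship type to the chain would mean one more table definition and two more JOINs in every query that follows it.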

