MapReduce implementations are not a low-latency solution
It's important to note that MapReduce implementations are not a low-latency solution. In fact, MapReduce is not particularly adept at being "query-able." There is generally something on top that makes it so: Google has BigTable on top of its MapReduce solution, and HBase (columnar) and Hive (data warehouse) live above Hadoop files (HDFS).
Hadoop is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result, it should not be compared with systems like Objectivity/DB.
Hadoop thinks in hours. Objectivity/DB thinks in seconds or less.
From an Objectivity/DB application development perspective, Hadoop potentially takes a lot of the grunt work out of processing raw data for us, namely parsing unstructured or semi-structured data. SSAFE (a customer implementation for space situational awareness) had to build its own processing pipeline for raw data (Rogue Wave Hydra and custom code). MapReduce may have been more appropriate for the raw data processing.
If you still want to "walk a graph" ad-hoc in a low-latency, real-time fashion - which is what we do very well - then you'll still want Objectivity/DB.
For Hadoop integration, Cloudera is our choice (Cloudera is to Hadoop what Red Hat is to Linux). The conventional integration points don't work for us, as they are primarily relationally oriented. So no Sqoop (only one-way anyway), nor DBInputFormat (bidirectional but JDBC-based).
For us, I think integrating directly via the map and reduce functions is the best approach. Thrift is also a possibility; however, as it stands now, you have to go through a proxy service, which makes me nervous.
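To make the "integrate directly via the map and reduce functions" idea a little more concrete, here is a rough sketch in plain Python. It is not Hadoop code and not an Objectivity/DB API: GraphSink is a hypothetical stand-in for a database session, the "source -> target" line format is invented for illustration, and run_job merely simulates the shuffle step locally. The point is only the shape of the integration: the map function parses raw lines, and the reduce function writes each group straight into the graph store rather than going through a relational connector.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


class GraphSink:
    """Hypothetical stand-in for an Objectivity/DB session; not a real API."""

    def __init__(self):
        self.adjacency = defaultdict(set)

    def add_edges(self, source, targets):
        self.adjacency[source].update(targets)


def map_record(line):
    """Map step: parse one raw 'source -> target' line into a (key, value) pair."""
    parts = [p.strip() for p in line.split("->")]
    if len(parts) == 2 and all(parts):
        yield parts[0], parts[1]
    # Malformed lines are silently skipped in this sketch.


def reduce_edges(source, targets, sink):
    """Reduce step: write all targets for one source directly into the sink."""
    sink.add_edges(source, targets)


def run_job(lines, sink):
    """Simulate the shuffle locally: sort mapped pairs by key, then reduce per key."""
    pairs = sorted(pair for line in lines for pair in map_record(line))
    for source, group in groupby(pairs, key=itemgetter(0)):
        reduce_edges(source, [target for _, target in group], sink)
    return sink


# Invented sample input: three well-formed lines and one malformed line.
raw = ["sat-1 -> station-A", "sat-1 -> station-B", "garbage", "sat-2 -> station-A"]
sink = run_job(raw, GraphSink())
```

In a real deployment the reduce side would presumably open a database session per task and batch its writes; that part is left out here because it depends on details this post does not cover.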
From a scalability perspective, Objectivity/DB could scale side-by-side with Hadoop nodes (i.e. horizontal scaling via commodity hardware). Traditional relational systems are sensitive to the load that Hadoop could place on them. In fact, bulk exports are often used (which means loading delays and potentially stale data) to protect relational processing performance on the live system. Objectivity/DB could stream in processed Hadoop information in parallel as quickly as it was computed and still support low-latency query. That's not too shabby.
Todd Stavish is an Objectivity, Inc. Systems Engineer with expertise in distributed computing applications. He advises customers about complex modeling, performance optimization and building fault tolerant applications.
Labels: Cloudera, Data, Hadoop, Latency, MapReduce, Todd Stavish
