Monday, 3 December 2012

Google Dremel vs Apache Hadoop: Big Data Analytics Tools and Techniques

This article looks at some useful and popular big data analytics tools and techniques. Two of the big names are Apache Hadoop and Google Dremel. There are also other open source big data analytics tools, such as Storm and Apache S4. We will look at the difference between Google Dremel and Apache Hadoop.

What is Google Dremel?

Google Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.
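To make the idea of a multi-level execution tree more concrete, here is a minimal sketch in Python. It is not Dremel's actual implementation: the partitions, the function names and the fanout value are all assumptions made up for illustration. Leaf servers aggregate their own slice of the data, and each level above merges the partial results of its children until a single answer reaches the root.

```python
# Hypothetical data: each leaf server holds one partition of (country, clicks) rows.
PARTITIONS = [
    [("US", 3), ("DE", 1), ("US", 2)],
    [("DE", 4), ("FR", 5)],
    [("US", 1), ("FR", 2)],
    [("DE", 2)],
]


def leaf_aggregate(rows):
    """Leaf server: scan its own partition and compute partial sums per key."""
    partial = {}
    for country, clicks in rows:
        partial[country] = partial.get(country, 0) + clicks
    return partial


def merge(partials):
    """Intermediate or root server: combine the partial sums of its children."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def run_query(partitions, fanout=2):
    """Aggregate at the leaves, then merge results level by level up to the root."""
    level = [leaf_aggregate(p) for p in partitions]
    if not level:
        return {}
    while len(level) > 1:
        # Each parent merges up to `fanout` children (fanout is an assumed parameter).
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


result = run_query(PARTITIONS)
print(result)
```

Because every server only handles a small share of the work, adding more leaves and more tree levels is what would let a query of this shape spread over many machines.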

Apache Hadoop vs Google Dremel: Difference between Apache Hadoop and Google Dremel

Dremel is a data analysis tool designed to quickly run queries on massive, structured datasets (such as log or event files). It supports a SQL-like syntax, but apart from table appends, it is read-only. It doesn't support update or create functions, nor does it feature table indexes. Data is organized in a "columnar" format, which contributes to very fast query speed. Google's BigQuery product is an implementation of Dremel accessible via a RESTful API.
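A small Python sketch can show why a columnar layout helps aggregation queries. The table, field names and values below are invented for illustration, and real columnar formats add compression and nested-data encoding that this sketch ignores. The point is only that a query such as SELECT SUM(latency_ms) needs to read one column, not every field of every record.

```python
# Hypothetical log table with three fields per record.
rows = [
    {"url": "/home", "status": 200, "latency_ms": 12},
    {"url": "/search", "status": 200, "latency_ms": 48},
    {"url": "/home", "status": 500, "latency_ms": 30},
]

# Row-oriented scan: every record (with all of its fields) is visited.
row_total = sum(record["latency_ms"] for record in rows)

# Column-oriented layout: the same data stored as one list per field.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# Columnar scan: only the "latency_ms" column is read; "url" and "status" are untouched.
col_total = sum(columns["latency_ms"])

print(row_total, col_total)
```

Both scans give the same answer, but on disk the columnar version would skip the bytes of the unused columns entirely, which matters a great deal when a table has many wide fields.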

Hadoop (an open source implementation of MapReduce), in conjunction with the "Hive" data warehouse software, also allows data analysis for massive datasets using a SQL-style syntax. Hive essentially turns queries into MapReduce jobs. Rather than using a columnar format such as Dremel's ColumnIO, Hive attempts to make queries quick by using techniques such as table indexing.
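As a rough illustration of what "turning a query into MapReduce" means, the sketch below expresses a query like SELECT status, COUNT(*) FROM logs GROUP BY status as a map step, a shuffle step and a reduce step in plain Python. This is not Hive's real query plan or API; the data and function names are hypothetical and everything runs in one process.

```python
from collections import defaultdict

# Hypothetical input records, standing in for rows of a "logs" table.
logs = [
    {"url": "/home", "status": 200},
    {"url": "/search", "status": 404},
    {"url": "/home", "status": 200},
    {"url": "/cart", "status": 500},
]


def map_phase(record):
    """Map: emit (group key, 1) for each input record."""
    yield record["status"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: COUNT(*) for one group is the sum of the emitted ones."""
    return key, sum(values)


mapped = [pair for record in logs for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce steps run on many machines and the intermediate pairs are written out between phases, which is one reason such queries are slower than Dremel-style interactive queries.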

Hadoop is for batch processing, meaning that queries are run on a set of data that you already have. Streaming engines process data as it comes in. The terms “streaming” and “real time” are often used interchangeably, which can cause confusion about Dremel and Drill: they are also described as real time, but in the sense of answering queries quickly, not of processing data as it arrives.
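The batch versus streaming distinction can be sketched in a few lines of Python. The event list, the function and the class below are hypothetical and only illustrate the two processing styles; they do not use the APIs of Hadoop or of any streaming engine.

```python
# Hypothetical stream of page-view events.
EVENTS = ["/home", "/search", "/home", "/cart", "/home"]


def batch_count(events):
    """Batch style: the full dataset already exists and is processed in one run."""
    counts = {}
    for page in events:
        counts[page] = counts.get(page, 0) + 1
    return counts


class StreamingCounter:
    """Streaming style: state is updated as each event arrives."""

    def __init__(self):
        self.counts = {}

    def on_event(self, page):
        self.counts[page] = self.counts.get(page, 0) + 1
        # An up-to-date result is available after every single event.
        return dict(self.counts)


batch_result = batch_count(EVENTS)

counter = StreamingCounter()
snapshots = [counter.on_event(page) for page in EVENTS]

print(batch_result)
print(snapshots[-1])
```

Both styles end with the same totals, but the streaming version can report a current answer after every event, while the batch version only answers once the whole run has finished.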

It should be noted that Google intends Dremel as a complement to, not a replacement for, MapReduce and Hadoop. According to the Dremel paper, Dremel is frequently used to analyze MapReduce results or to serve as a test run for large scale computations. Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time. In the experiments reported in the paper, Dremel outperformed MapReduce by orders of magnitude.

Google Dremel vs Apache Drill

Apache Drill is an attempt to build an open source version of Google Dremel. Another project in the works, OpenDremel, also aims to create an open source version of Dremel. Other projects working on speedy queries for big data include Apache CouchDB and the Cloudant-backed variant BigCouch.

Other open source Big Data Analytics Tools and Techniques

1. Storm, which was developed at Backtype and open sourced by Twitter.

2. Apache S4, which was open sourced by Yahoo.
 
The difference between Dremel and other real-time big data systems such as Storm and S4 is that the latter are streaming engines, while Dremel is designed for ad hoc querying, i.e. returning query results very quickly.

Comments:

  1. Nice introduction!

    There's another open source alternative for Google Dremel, released this year by Cloudera. It's Impala, and you can get more information here: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/