Monday 3 December 2012

Google Dremel vs Apache Hadoop: Big Data Analytics Tools and Techniques

This article covers some useful and popular big data analytics tools and techniques, including real-time ones. Two of the big names are Apache Hadoop and Google Dremel. There are also other open source big data analytics tools such as Storm and Apache S4. We will look at the difference between Google Dremel and Apache Hadoop.

What is Google Dremel?

Google Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.
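To make the idea of a multi-level execution tree more concrete, here is a minimal Python sketch. It is only an illustration, not Dremel's actual implementation: the shard contents, the `leaf_scan` and `merge` helpers, and the two-level tree shape are all assumptions made up for this example. Each leaf computes a partial aggregate over its own shard, and the servers above only merge those partial results.

```python
from collections import Counter


def leaf_scan(rows):
    """Leaf server: partial COUNT(*) GROUP BY country over its own shard."""
    return Counter(row["country"] for row in rows)


def merge(partials):
    """Intermediate or root server: combine partial aggregates from children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Four hypothetical shards, each held by one leaf server.
shards = [
    [{"country": "US"}, {"country": "DE"}],
    [{"country": "US"}],
    [{"country": "IN"}, {"country": "IN"}],
    [{"country": "DE"}, {"country": "US"}],
]

# Level 1: every leaf scans only its own shard.
leaf_results = [leaf_scan(shard) for shard in shards]

# Level 2: two intermediate servers, each merging two leaves.
mid_results = [merge(leaf_results[:2]), merge(leaf_results[2:])]

# Level 3: the root merges the intermediate results into the final answer.
root_result = merge(mid_results)
print(dict(root_result))
```

Because the leaves return small partial aggregates instead of raw rows, the data moving up the tree stays small even when the shards themselves are large.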

Apache Hadoop vs Google Dremel: Difference between Apache Hadoop and Google Dremel

Dremel is a data analysis tool designed to quickly run queries on massive, structured datasets (such as log or event files). It supports a SQL-like syntax, but apart from table appends, it is read-only. It doesn't support update or create functions, nor does it feature table indexes. Data is organized in a "columnar" format, which contributes to very fast query speed. Google's BigQuery product is an implementation of Dremel accessible via RESTful API.
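The effect of a columnar layout can be sketched in a few lines of Python. This is a simplified illustration with made-up log records and hypothetical helper names. It uses flat records only, whereas Dremel's real format also handles nested data. The point is that an aggregation over one field only has to touch that field's column.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"url": "/home", "status": 200, "latency_ms": 12},
    {"url": "/cart", "status": 500, "latency_ms": 340},
    {"url": "/home", "status": 200, "latency_ms": 15},
]

# Column-oriented layout: one list per field, holding that field for all records.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_latency_rows(rows):
    """Visits every full record, although only one field is needed."""
    return sum(row["latency_ms"] for row in rows) / len(rows)


def avg_latency_columns(columns):
    """Reads only the latency column; url and status are never touched."""
    col = columns["latency_ms"]
    return sum(col) / len(col)


print(avg_latency_rows(rows), avg_latency_columns(columns))
```

Both functions return the same average. On disk, the columnar version would read only the bytes of the `latency_ms` column instead of every record in full.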

Hadoop (an open source implementation of MapReduce), in conjunction with the Hive data warehouse software, also allows analysis of massive datasets using a SQL-style syntax. Hive essentially turns queries into MapReduce jobs. Rather than relying on a columnar format such as Dremel's ColumnIO, Hive attempts to make queries quick by using techniques such as table indexing.
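As a rough illustration of what "turning a query into MapReduce" means, the sketch below expresses a query like SELECT status, COUNT(*) FROM logs GROUP BY status as a map step, a shuffle step and a reduce step in plain Python. This is not Hive's real query planner or Hadoop's API. The `logs` data and the function names are assumptions invented for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, 1) pair per record, keyed by the GROUP BY column."""
    yield record["status"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input table.
logs = [{"status": 200}, {"status": 500}, {"status": 200}, {"status": 404}]

pairs = [pair for record in logs for pair in map_phase(record)]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(result)
```

On a real cluster the map and reduce steps run on many machines and the shuffle moves data between them. That is one reason this model suits batch jobs better than interactive queries.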

Hadoop is for batch processing, meaning that queries are run on a set of data that you already have. Streaming engines, by contrast, process data as it comes in. The terms “streaming” and “real time” are often used interchangeably, which can cause confusion about Dremel and Drill: they are also described as real time, but in the sense of answering queries quickly rather than processing data as it arrives.

Note that Google intends Dremel as a complement to, not a replacement for, MapReduce and Hadoop. According to the Dremel paper, Dremel is frequently used to analyze MapReduce results or to serve as a test run for large-scale computations. Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time. In the paper's experiments, Dremel outperformed MapReduce by orders of magnitude.

Google Dremel vs Apache Drill

Apache Drill is an attempt to build an open source version of Google Dremel. Another project in the works, OpenDremel, has the same goal. Other projects working on speedy queries for big data include Apache CouchDB and the Cloudant-backed variant BigCouch.

Other open source Big Data Analytics Tools and Techniques

1. Storm, which was developed at BackType and open sourced by Twitter.

2. Apache S4, which was open sourced by Yahoo.

The difference between Dremel and other real-time big data systems such as Storm and S4 is that the latter are streaming engines, while Dremel is designed for ad-hoc querying, i.e. returning query results really fast.
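The contrast between a streaming engine and an ad-hoc query system can be sketched as follows. This is a toy Python illustration under stated assumptions. `StreamingCounter` and `ad_hoc_query` are hypothetical names, not part of Storm, S4 or Dremel. The streaming side updates a fixed computation as each event arrives, while the ad-hoc side answers an arbitrary question over data that is already stored.

```python
from collections import Counter


class StreamingCounter:
    """Streaming style: a fixed computation updated once per incoming event."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["type"]] += 1


def ad_hoc_query(stored_events, predicate):
    """Ad-hoc style: any question, asked later, over already stored data."""
    return sum(1 for event in stored_events if predicate(event))


# Hypothetical events; in the streaming case they arrive one at a time.
events = [
    {"type": "click", "user": "a"},
    {"type": "view", "user": "b"},
    {"type": "click", "user": "b"},
]

stream = StreamingCounter()
for event in events:
    stream.on_event(event)

# A question nobody planned for when the events were being collected.
clicks_by_b = ad_hoc_query(
    events, lambda e: e["type"] == "click" and e["user"] == "b"
)
print(dict(stream.counts), clicks_by_b)
```

The streaming counter can only answer the question it was built for, but its answer is always up to date. The ad-hoc query can answer any question, provided it can scan the stored data fast enough.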
