Tuesday, 25 December 2018

Kinesis, Firehose and MapReduce: AWS Data Analytics Service

Kinesis, Firehose and Elastic MapReduce are very useful data analytics offerings from AWS. 

You can capture real time data and analyze it in parallel using Kinesis and Firehose. No need to wait to take data in warehouse and then run analytics. Below are some basic and important points about Kinesis and Firehose to remember:

1. Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. 

2. With Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. 

3. Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.

4. With Kinesis, you can perform real-time analytics on data that has been traditionally analyzed using batch processing in data warehouses. The most common use cases include data lakes, data science and machine learning. 

5. No need to first save data into warehouse and then run analytics. No need of batch processes. All is done real-time.

6. Types: Kinesis Data and Video Streams, Firehose (also has processing capacity unlike Kinesis), Kinesis Analytics (takes data from Kinesis and Firehose and run SQL queries on it, pay only for the queries you run)

“Kinesis Video/Data Streams” vs “Firehose”

1. Firehose is fully managed whereas Kinesis Streams is manually managed.

2. Firehose PREPARE and LOAD data streams to S3, RedShift, ElasticSearch, Kinesis Data Analytics and Splunk whereas Kinesis Streams just STORES (for 1-7 days) the data streams and you need to write application using Lambda, EC2, Kinesis Data Analytics and Spark to PROCESS it.

For more details, please visit documentation.

EMR (Elastic MapReduce)

1. Big data analysis service

2. Used by data scientist for log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

3. EMR provides a managed Hadoop framework using which you can process vast amounts of data across dynamically scalable Amazon EC2 instances. 

4. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

No comments:

Post a Comment