When to use Hadoop, HBase, Hive and Pig?
Solution 1
MapReduce is just a computing framework; HBase does not depend on it. That said, you can efficiently put data into or fetch data from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java API, to put or fetch the data. But we use Hadoop, HBase etc. to deal with gigantic amounts of data, so that doesn't make much sense: normal sequential programs are highly inefficient when the data is that large.
Coming back to the first part of your question, Hadoop is basically 2 things: a distributed file system (HDFS) + a computation or processing framework (MapReduce). Like any other FS, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
Coming to Hive. It provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them.
Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has 2 parts: the Pig interpreter and the language, PigLatin. You write Pig scripts in PigLatin and process them using the Pig interpreter. Pig makes our life a lot easier, because writing MapReduce is not always easy. In fact, in some cases it can really become a pain.
I had written an article on a short comparison of different tools of the Hadoop ecosystem some time ago. It's not an in depth comparison, but a short intro to each of these tools which can help you to get started. (Just to add on to my answer. No self promotion intended)
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
HTH
Solution 2
I implemented a Hive Data platform recently in my firm and can speak to it in first person since I was a one man team.
Objective
- To make the web log files collected daily from 350+ servers queryable through an SQL-like language
- To replace the daily aggregation data generated through MySQL with Hive
- To build custom reports through queries in Hive
Architecture Options
I benchmarked the following options:
- Hive+HDFS
- Hive+HBase - queries were too slow so I dumped this option
Design
- Daily log Files were transported to HDFS
- MR jobs parsed these log files and wrote their output files to HDFS
- Hive tables were created with partitions and locations pointing to the HDFS locations
- Hive query scripts (call it HQL if you like, as distinct from SQL) were created; these in turn ran MR jobs in the background and generated the aggregation data
- All these steps were put into an Oozie workflow, scheduled with a daily Oozie Coordinator
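The partitioned layout in the design above can be sketched as follows. This is a minimal illustration, assuming one partition directory per day named in Hive's `key=value` style; the `/data/weblogs` path, the `dt` partition key and the helper names are hypothetical and not taken from the original setup.

```python
from pathlib import PurePosixPath

# Hypothetical warehouse location on HDFS (an assumption for this sketch).
WAREHOUSE = PurePosixPath("/data/weblogs")


def partition_path(day: str) -> PurePosixPath:
    """Directory that would hold the parsed logs for one day (YYYY-MM-DD)."""
    return WAREHOUSE / f"dt={day}"


def prune_partitions(all_days, start, end):
    """Keep only the partitions a date-filtered query would need to read.

    ISO dates compare correctly as plain strings, so no date parsing is needed.
    """
    return [partition_path(d) for d in sorted(all_days) if start <= d <= end]


days = ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]
to_scan = prune_partitions(days, "2013-01-02", "2013-01-03")
print([str(p) for p in to_scan])
```

The point of the layout is that a query restricted to a date range only has to touch the matching directories instead of the whole data set.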
Summary
HBase is like a Map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in HBase are between 1000000 and 2000000, that is not a job for HBase alone.
If you have data that needs to be aggregated, rolled up, analyzed across rows then consider Hive.
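The contrast in this summary can be sketched with a toy table. This is only an analogy: a plain Python dict stands in for the HBase table, the function names are hypothetical, and none of it is the HBase client API.

```python
# Toy stand-in for an HBase table: row key -> value (a plain dict, not HBase).
table = {key: f"row-{key}" for key in range(0, 3_000_000, 500_000)}


def get(key):
    """Point lookup: one key in, one value out, no other rows touched."""
    return table.get(key)


def count_keys_between(low, high):
    """Cross-row question: answered here by visiting every key in the table."""
    return sum(1 for key in table if low <= key <= high)


print(get(1_500_000))
print(count_keys_between(1_000_000, 2_000_000))
```

The first call is the map-like access the answer describes; the second is the kind of work that grows with the table and fits a batch-oriented tool such as Hive better.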
Hopefully this helps.
Hive actually rocks ...I know, I have lived it for 12 months now... So does HBase...
Solution 3
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: The data does not have any structure and can be in any form: web server logs, e-mail, images etc.
Semi-structured: The data is not strictly structured but has some structure, e.g. XML files.
Depending on the type of data to be processed, we have to choose the right technology.
Some more projects, which are part of Hadoop:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Hive Vs PIG comparison can be found at this article and my other post at this SE question.
HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data. MapReduce may act on data in HBase during processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.
You can use Sqoop to import structured data from a traditional RDBMS (Oracle, SQL Server etc.) and process it with Hadoop MapReduce.
You can use Flume to collect unstructured data and process it with Hadoop MapReduce.
Have a look at: Hadoop Use Cases.
Hive should be used for analytical querying of data collected over a period of time, e.g. to calculate trends or summarize website logs, but it can't be used for real-time queries.
HBase fits real-time querying of Big Data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. It is good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
Solution 4
Consider that you work with RDBMS and have to select what to use - full table scans, or index access - but only one of them.
If you select full table scans, use Hive. If index access, use HBase.
Solution 5
Understanding in depth
Hadoop
Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's Map Reduce and Google File System technologies as its foundation.
Features of Hadoop
- It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
- It has shared nothing architecture.
- It replicates its data into multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
- Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
- It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for an RDBMS.
- It is not good when work cannot be parallelized or when there are dependencies within the data.
- It is not good for processing small files. It works best with huge data files and data sets.
Versions of Hadoop
There are two versions of Hadoop available:
- Hadoop 1.0
- Hadoop 2.0
Hadoop 1.0
It has two main parts :
1. Data Storage Framework
It is a general-purpose file system called Hadoop Distributed File System (HDFS).
HDFS is schema-less.
It simply stores data files and these data files can be in just about any format.
The idea is to store files as close to their original form as possible.
This in turn provides the business units and the organization the much needed flexibility and agility without being overly worried by what it can implement.
2. Data Processing Framework
This is a simple functional programming model initially popularized by Google as MapReduce.
It essentially uses two functions, MAP and REDUCE, to process data.
The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs).
The "Reducers" then act on this intermediate data to produce the output data.
The two functions seemingly work in isolation from one another, thus enabling the processing to be distributed in a highly parallel, fault-tolerant and scalable way.
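The mapper/reducer split described above can be sketched on a single machine. This is a minimal word-count illustration in plain Python, not the Hadoop API; the `map_phase`, `shuffle` and `reduce_phase` names are hypothetical, and the grouping step stands in for the shuffle that the framework performs between the two functions.

```python
from collections import defaultdict


def map_phase(records):
    """Mapper: take (key, value) records and emit intermediate (word, 1) pairs."""
    for _offset, line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse each key's list of values into one output value."""
    return {key: sum(values) for key, values in groups.items()}


records = [(0, "Hive runs on Hadoop"), (1, "Pig runs on Hadoop")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Because each mapper call only sees its own record and each reducer call only sees one key's values, both phases can be spread over many machines.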
Limitations of Hadoop 1.0
The first limitation was the requirement of MapReduce programming expertise.
It supported only batch processing, which is suitable for tasks such as log analysis and large-scale data mining projects but pretty much unsuitable for other kinds of projects.
One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options:
- Either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or
- Extract data from HDFS and process it outside of Hadoop.
Neither option was viable, as it led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.
Hadoop 2.0
In Hadoop 2.0, HDFS continues to be the data storage framework.
However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added.
Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.
It works by having an Application Master in place of the Job Tracker, running applications on resources governed by the new Node Manager.
The Application Master is able to run any application and not just MapReduce.
This means it supports not only batch processing but also real-time processing. MapReduce is no longer the only data processing option.
Advantages of Hadoop
It stores data in its native form. There is no structure imposed while keying in data or storing data. HDFS is schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.
It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
It is resilient to failure. Hadoop is fault tolerant. It practices replication of data diligently, which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of node failure there will always be another copy of the data available for use.
It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, the processing is extremely fast in Hadoop owing to the "move code to data" paradigm.
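The replication argument above can be illustrated with a small simulation. This is a sketch under stated assumptions: the round-robin placement, the node and block names, and the replication factor of 3 used as a default parameter are choices made for the example, and the placement rule is deliberately simpler than what HDFS actually does.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); a simplification, not HDFS's own policy.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a node that is up."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


nodes = ["n0", "n1", "n2", "n3"]
blocks = ["b0", "b1", "b2", "b3", "b4"]
placement = place_replicas(blocks, nodes)

print(sorted(readable_blocks(placement, failed={"n1"})))
print(sorted(readable_blocks(placement, failed={"n0", "n1", "n2"})))
```

With three copies of every block, losing a single node leaves all data readable; data only becomes unavailable once every node holding a given block is down.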
Hadoop Ecosystem
Following are the components of the Hadoop ecosystem:
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.
Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.
Pig: It is an easy to understand data flow language. It helps with analysis of large datasets, which is quite the order with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
ZooKeeper: It is a coordination service for distributed applications.
Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
Mahout: It is a scalable machine learning and data mining library.
Chukwa: It is a data collection system for managing large distributed systems.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web based tool for provisioning, managing and monitoring Hadoop clusters.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive is not
- A relational database
- A design for Online Transaction Processing (OLTP)
- A language for real-time queries and row-level updates
Features of Hive
- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides an SQL-type language for querying called HiveQL or HQL.
- It is familiar, fast, scalable and extensible.
Hive Architecture
The following components are contained in Hive Architecture:
User Interface: Hive is a data warehouse infrastructure that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive Command line and Hive HD Insight (in Windows Server).
MetaStore: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types and HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.
Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase: Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
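How these components fit together can be sketched in a few lines. This is a toy model under loudly stated assumptions: the `METASTORE` dictionary, the `weblogs` table, its columns, the file path and the tab delimiter are all invented for illustration, and an in-memory string stands in for a file on HDFS. It is not Hive's actual implementation.

```python
import csv
import io

# Hypothetical metastore entry: schema plus the storage location of the raw data.
METASTORE = {
    "weblogs": {
        "columns": [("ip", str), ("status", int), ("bytes", int)],
        "location": "/data/weblogs",
        "delimiter": "\t",
    }
}

# Stand-in for raw files on HDFS: untyped text, stored as it arrived.
RAW_FILES = {
    "/data/weblogs": "10.0.0.1\t200\t512\n10.0.0.2\t404\t0\n10.0.0.1\t200\t128\n",
}


def read_table(name):
    """Apply the metastore schema to the raw text only when it is read."""
    meta = METASTORE[name]
    raw = io.StringIO(RAW_FILES[meta["location"]])
    for fields in csv.reader(raw, delimiter=meta["delimiter"]):
        yield {col: cast(value) for (col, cast), value in zip(meta["columns"], fields)}


def bytes_by_status(name):
    """Roughly what a query summing bytes grouped by status would compute."""
    totals = {}
    for row in read_table(name):
        totals[row["status"]] = totals.get(row["status"], 0) + row["bytes"]
    return totals


print(bytes_by_status("weblogs"))
```

The data file itself carries no types; the column names and types live only in the metastore entry and are applied at query time.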
Khalefa
Updated on July 29, 2021
Comments
-
Khalefa almost 3 years
What are the benefits of using either Hadoop or HBase or Hive ?
From my understanding, HBase avoids using map-reduce and has a column oriented storage on top of HDFS. Hive is a sql-like interface for Hadoop and HBase.
I would also like to know how Hive compares with Pig.
-
dbustosp almost 6 years
Hadoop: Hadoop Distributed File System + computational processing model MapReduce. HBase: Key-Value storage, good for reading and writing in near real time. Hive: Used for data extraction from the HDFS using SQL-like syntax. Pig: is a data flow language for creating ETL.
-
FrostNovaZzz about 9 years
Actually you can build Hive on HBase so that you can use HQL to full scan hbase while being able to do indexed query on hbase directly. But I doubt this gives you slower performance on full scan.
-
David Gruzman about 9 years
HBase is a write-oriented system; it is not optimal on scans, although data is stored sorted. So while scanning some ranges can be a good choice, full scans will be much slower than directly from HDFS.
-
Kenry Sanchez about 5 years
You forget to talk about yarn on Hadoop ecosystem :(.
-
Root Loop over 4 years
HBase is a NonSQL database that stores data in HDFS. It is used when you need random, real-time read/write access to your big data.
-
PPK about 4 years
Facebook no longer use open source HBase for real time messaging systems. They replaced it with their in-house Myrocks database (engineering.fb.com/core-data/…).
-
Guy Coder over 3 years
Your link is dead. Can you update?