MySQL Cluster vs. Hadoop for handling big data

14,172

Solution 1

Both of the above answers miss a huge differentiation between mySQL and Hadoop. mySQL requires you to store data in a certain format. It likes heavily structured data - you declare the data type of each column in a table etc. Hadoop doesn't care about this at all.

Example - if you have a billion text log files, to make analysis even possible for mySQL you'd need to parse and load the data first into a mySQL table, typeing each column along the way. With hadoop and mapreduce, you define the function that is to scan/analyze/return the data from its raw source - you don't need pre-processing ETL to get it pre-structured.

If the data is already structured and in mySQL - then (hopefully) its well structured - why export it for hadoop to analyze? If it isn't, why spend the time to ETL the data?

Solution 2

Hadoop is not a replacement of MySQL, so I think they have their own scenario。

Every one know hadoop is better for batch job or offline compute, but there also have many related real time product, such as hbase.

If you wanna choose a offline compute & storage arch.

I suggest hadoop not MySQL cluster for offline compute & storage, because of :

  1. Cost : obviously, hadoop cluster is more cheap than MySQL cluster
  2. Scalability : hadoop support more than ten thousands machine in a cluster
  3. Ecosystem : mapreduce, hive, pig, sqoop and etc.

So you can choose hadoop as offline compute & storage and MySQL as online compute & storage, you also can learn more from lambda architecture.

Solution 3

The other answer is good, but doesn't really explain why hadoop is more scalable for offline data crunching than MySQL Clusters. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.

MySQL clusters use auto-sharding, and it's designed to randomly distribute the data so no one machine gets hit with more of the load. On the other hand, Hadoop allows you to explicitly define the data partition so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.

The answer to this question has a good explanation of this distinction.

Share:
14,172
Tobi Weißhaar
Author by

Tobi Weißhaar

Updated on June 04, 2022

Comments

  • Tobi Weißhaar
    Tobi Weißhaar almost 2 years

    I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework. What is the better solution. I would like to read your opinion.

    I think the advantages of using a MySQL Cluster are:

    1. high availability
    2. good scalability
    3. high performance / real time data access
    4. you can use commodity hardware

    And I don't see a disadvantage! Are there any disadvantages that Hadoop do not has?

    The advantages of Hadoop with Hive on top of it are:

    1. also good scalability
    2. you can also use commodity hardware
    3. the ability to run in heterogenous environments
    4. parallel computing with the MapReduce framework
    5. Hive with HiveQL

    and the disadvantage is:

    1. no real time data access. It may takes minutes or hours to analyze the data.

    So in my opinion for handling big data a MySQL cluster is the better solution. Why Hadoop is the holy grail of handling big data? What is your opinion?