What is the relationship between Spark, Hadoop, and Cassandra?


Solution 1

Spark is a distributed in-memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses HDFS (the Hadoop Distributed File System) to store its data, so Spark can read data from HDFS and save results back to HDFS.
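
As a rough illustration, here is a minimal PySpark sketch of reading from and writing to HDFS. The namenode address and paths are hypothetical placeholders, not anything from the original answer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Read a text file stored in HDFS into a DataFrame
logs = spark.read.text("hdfs://namenode:8020/data/input/events.log")

# ... transformations on the data would go here ...

# Save the results back into HDFS
logs.write.mode("overwrite").text("hdfs://namenode:8020/data/output/events")
```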

For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
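
That load, cache, transform, save cycle might look like the following sketch in PySpark; the column names and paths are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Load from durable storage (HDFS here; HBase or Cassandra work similarly)
events = spark.read.parquet("hdfs://namenode:8020/data/events")

# Keep the data set in memory so repeated passes avoid re-reading from disk
events.cache()

# Run several transformations over the same in-memory data set
by_user = events.groupBy("user_id").count()
by_day = events.groupBy(F.to_date("timestamp").alias("day")).count()

# Write the final results back to durable storage
by_user.write.mode("overwrite").parquet("hdfs://namenode:8020/results/by_user")
by_day.write.mode("overwrite").parquet("hdfs://namenode:8020/results/by_day")
```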

In terms of it being an alternative to Hadoop, Spark can be much faster than Hadoop at certain operations. For example, a multi-pass MapReduce job can be dramatically faster in Spark than with Hadoop MapReduce, since most of Hadoop's intermediate disk I/O is avoided. Spark can also read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
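
For instance, here is a hedged sketch of querying a Hive-managed table through Spark SQL rather than HQL. It assumes a reachable Hive metastore, and the web_logs table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()  # lets Spark read tables registered in the Hive metastore
         .getOrCreate())

# The same query one might run in HQL, executed by Spark's engine instead
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```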

Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature-rich query language and allows you to do data analytics that native CQL doesn't provide.
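
As a sketch, pairing Spark with Cassandra through the DataStax spark-cassandra-connector might look like this. The keyspace, table, and column names are invented, and the connector package has to be supplied when launching Spark (e.g. via --packages):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-example")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Load a Cassandra table as a DataFrame via the connector
orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="shop", table="orders")
          .load())

# An ad hoc aggregation across the whole table, which native CQL can't express
revenue = orders.groupBy("country").sum("amount")
revenue.show()
```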

Another use case for Spark is stream processing. Spark can be set up to ingest incoming real-time data, process it in micro-batches, and save the results to durable storage, such as HDFS, Cassandra, etc.
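
A minimal sketch of that micro-batch pattern with Spark Structured Streaming, assuming a simple socket source standing in for a real feed; the console sink here would be swapped for HDFS or Cassandra in practice:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Treat incoming lines of text as an unbounded table
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Count words, updated with each micro-batch
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit each micro-batch result; a real job would target a durable sink
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```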

So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.

Solution 2

I'm writing a paper about Hadoop for university and stumbled over your question. Spark uses Hadoop only for persistence, and only if you want it to; it's possible to use other persistence layers, for example Amazon S3 when running on EC2.

On the other hand, Spark runs in memory, and it is not primarily built for MapReduce-style use cases the way Hadoop was/is.

I can recommend this article if you'd like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

Solution 3

The README.md file in the Spark repository can solve your puzzle:

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.


Comments

  • Scott Cumming
    Scott Cumming almost 2 years

    My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.

    Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming that this is a brand-new project with no existing installation?

  • Daniel Darabos
    Daniel Darabos almost 9 years
    Fantastic answer! On the Hive vs Spark SQL front it may be insightful to mention that Hive is in the process of adopting Spark as its execution backend (as an alternative to MapReduce). I think at that point the difference between Hive and Spark SQL will just be the query execution planner implementation.
  • Scott Cumming
    Scott Cumming almost 9 years
    I was assuming that Spark's RDDs are stored on HDFS and that it probably uses Hadoop's ZooKeeper and other infrastructure. You (@Jim Meyer) seem to be implying that Spark doesn't have a hard dependency and has its own counterparts to those components?
  • Scott Cumming
    Scott Cumming almost 9 years
    Also, I keep reading about Spark being an in-memory system. I'm looking at a system to handle around two terabytes of (compressed) data every day. There is no way I can keep it in memory, even when using a cluster of computers. If I need to bring disks into play, where does that leave Spark? Does it lose its edge over Hadoop/Cassandra/Hive, or does it still have something to offer?
  • Scott Cumming
    Scott Cumming almost 9 years
    Nice article. You mention that Spark can run in standalone mode; however, their own download page doesn't give me that option. That's what started the confusion: all download options reference Hadoop!
  • Scott Cumming
    Scott Cumming almost 9 years
    You also mention that Spark should have memory equal to the data being processed. However, Spark's landing page claims a 10x improvement over Hadoop for disk-based processing (100x for memory-based). Did you find that they had something interesting to offer for disk-based data as well? For massive data, do they have an alternative to HDFS?
  • Jim Meyer
    Jim Meyer almost 9 years
    Spark likes to have a lot of memory to work with. If your data doesn't all fit into memory, Spark will have to evict some data from memory, which of course will reduce performance. To process 2TB/day, you'd usually break it up into smaller processing chunks than a day (e.g. process one hour at a time; see the sketch after these comments).
  • sascha10000
    sascha10000 almost 9 years
    I know it's a bit late for an answer, but I was stressed out. My topic was more the Hadoop side, and Spark just popped up, so I'm not really sure whether it loads the whole data set into memory or only parts of it, but I think the second is more reasonable. From my conclusions, I think you don't need memory equal to the full data set, but you do need a lot. I think it's interesting to take a closer look at this topic.
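
A minimal sketch of the hourly chunking Jim Meyer suggests above, in PySpark; the partitioned HDFS layout, date, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-chunks").getOrCreate()

# Process one hour's partition at a time instead of the whole 2 TB day
for hour in range(24):
    chunk = spark.read.parquet(
        f"hdfs://namenode:8020/events/day=2015-06-01/hour={hour:02d}")
    result = chunk.groupBy("event_type").count()
    result.write.mode("overwrite").parquet(
        f"hdfs://namenode:8020/results/day=2015-06-01/hour={hour:02d}")
```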