How to load Spark Cassandra Connector in the shell?


Solution 1

I got it. Below is what I did:

$ git clone https://github.com/datastax/spark-cassandra-connector.git
$ cd spark-cassandra-connector
$ sbt/sbt assembly
$ $SPARK_HOME/bin/spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/connector-assembly-1.2.0-SNAPSHOT.jar 

At the Scala prompt:

scala> sc.stop
scala> import com.datastax.spark.connector._
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> import org.apache.spark.SparkConf
scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "my cassandra host")
scala> val sc = new SparkContext("spark://spark host:7077", "test", conf)
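To sanity-check the connector from the new context, a minimal sketch like the following should work (the keyspace test, the table kv, and its key/value columns are placeholders for a table you have already created in your cluster):

scala> val rdd = sc.cassandraTable("test", "kv")   // read the placeholder table as an RDD
scala> rdd.count                                   // force a query against Cassandra
scala> sc.parallelize(Seq(("key3", 3), ("key4", 4))).saveToCassandra("test", "kv", SomeColumns("key", "value"))   // write a couple of rows back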

Solution 2

Edit: Things are a bit easier now

For in-depth instructions, check out the project documentation: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/13_spark_shell.md

Or use Spark Packages to load the library (not all versions are published): http://spark-packages.org/package/datastax/spark-cassandra-connector

> $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3-s_2.10
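The --packages flag can also be combined with --conf to set the Cassandra host at launch, so the default sc never needs to be stopped and restarted (a sketch; the host address below is a placeholder):

> $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1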

The following assumes you are running with open-source Apache Cassandra (OSS C*).

You'll want to start the shell with --driver-class-path set to include all of your connector libs.

I'll quote a blog post from the illustrious Amy Tobey

The easiest way I’ve found is to set the classpath, then restart the context in the REPL with the necessary classes imported to make sc.cassandraTable() visible. The newly loaded methods will not show up in tab completion. I don’t know why.

  /opt/spark/bin/spark-shell --driver-class-path $(echo /path/to/connector/*.jar |sed 's/ /:/g')

It will print a bunch of log information and then present a scala> prompt.

scala> sc.stop

Now that the context is stopped, it’s time to import the connector.

scala> import com.datastax.spark.connector._
scala> import org.apache.spark.{SparkConf, SparkContext}
scala> val conf = new SparkConf()
scala> conf.set("spark.cassandra.connection.host", "node1.pc.datastax.com")
scala> val sc = new SparkContext("local[2]", "Cassandra Connector Test", conf)
scala> val table = sc.cassandraTable("keyspace", "table")
scala> table.count

If you are running with DSE < 4.5.1

There is a slight issue with the DSE classloader and previous package naming conventions that will prevent you from finding the new spark-connector libraries. You should be able to get around this by removing the line that specifies the DSE classloader in the scripts that start spark-shell.

Solution 3

If you want to avoid stopping/starting the context in the shell, you can also add it to your Spark properties in:

{spark_install}/conf/spark-defaults.conf

spark.cassandra.connection.host=192.168.10.10
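With the host configured there (and the connector jar still supplied via --jars or --packages), the sc that the shell creates on startup is already pointed at Cassandra, so a session can be as short as this sketch (the keyspace and table names are placeholders):

scala> import com.datastax.spark.connector._
scala> sc.cassandraTable("my_keyspace", "my_table").count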

Solution 4

To access Cassandra from the spark-shell, I've built an assembly out of the cassandra-spark-driver with all of its dependencies (an "uberjar") and provided it to the spark-shell using the --jars option, like this:

spark-shell --jars spark-cassandra-assembly-1.0.0-SNAPSHOT-jar-with-dependencies.jar

I was facing the same issue described here, and this method is both simple and convenient compared to loading the long list of dependencies individually.

I've created a gist with the POM file that you can download. To build the uberjar with that POM, run:

mvn package

If you're using sbt, look into the sbt-assembly plugin.
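If you go the sbt route instead, a minimal sbt-assembly setup might look like the following sketch; the Scala, Spark, and connector versions here are assumptions and should be matched to your cluster:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "spark-cassandra-assembly"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  // Spark itself is provided by spark-shell, so keep it out of the fat jar
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M3"
)

Running sbt assembly then produces a fat jar under target/scala-2.10/ that you can pass to spark-shell --jars, just like the Maven-built one above.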


Updated on April 30, 2020

Comments

  • egerhard (about 4 years ago)

    I am trying to use Spark Cassandra Connector in Spark 1.1.0.

    I have successfully built the jar file from the master branch on GitHub and have gotten the included demos to work. However, when I try to load the jar files into the spark-shell I can't import any of the classes from the com.datastax.spark.connector package.

    I have tried using the --jars option on spark-shell and adding the directory with the jar file to Java's CLASSPATH. Neither of these options works. In fact, when I use the --jars option, the logging output shows that the Datastax jar is getting loaded, but I still cannot import anything from com.datastax.

    I have been able to load the Tuplejump Calliope Cassandra connector into the spark-shell using --jars, so I know that approach works. It's just the Datastax connector that is failing for me.